Information theory

From WikiPapers

Information theory is included as a keyword or extra keyword in 0 datasets, 0 tools and 68 publications.

Datasets

There are no datasets for this keyword.

Tools

There are no tools for this keyword.


Publications

Title | Author(s) | Published in | Language | Date | Abstract | R | C
Comparing expert and non-expert conceptualisations of the land: An analysis of crowdsourced land cover data Comber A.
Brunsdon C.
Linda See
Steffen Fritz
Ian McCallum
Lecture Notes in Computer Science English 2013 This research compares expert and non-expert conceptualisations of land cover data collected through a Google Earth web-based interface. In so doing it seeks to determine the impacts of varying landscape conceptualisations held by different groups of VGI contributors on decisions that may be made using crowdsourced data, in this case to select the best global land cover dataset in each location. Whilst much other work has considered the quality of VGI, as yet little research has considered the impact of varying semantics and conceptualisations on the use of VGI in formal scientific analyses. This study found that conceptualisation of cropland varies between experts and non-experts. A number of areas for further research are outlined. 0 0
How much is said in a tweet? A multilingual, information-theoretic perspective Neubig G.
Kevin Duh
AAAI Spring Symposium - Technical Report English 2013 This paper describes a multilingual study on how much information is contained in a single post of microblog text from Twitter in 26 different languages. In order to answer this question in a quantitative fashion, we take an information-theoretic approach, using entropy as our criterion for quantifying "how much is said" in a tweet. Our results find that, as expected, languages with larger character sets such as Chinese and Japanese contain more information per character than other languages. However, we also find that, somewhat surprisingly, information per character does not have a strong correlation with information per microblog post, as authors of microblog posts in languages with more information per character do not necessarily use all of the space allotted to them. Finally, we examine the relative importance of a number of factors that contribute to whether a language has more or less information content in each character or post, and also compare the information content of microblog text with more traditional text from Wikipedia. 0 0
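The entropy criterion described in this abstract can be illustrated with a small sketch. The following is a minimal, illustrative computation only (not the authors' estimator, which handles modelling far more carefully): it measures bits per character of a pooled text sample under a unigram character model and multiplies by the average post length to get a rough bits-per-post figure.

```python
# Minimal illustration (not the authors' estimator): bits per character under a
# unigram character model, and a rough bits-per-post figure derived from it.
import math
from collections import Counter

def entropy_per_char(text):
    """Shannon entropy in bits per character of `text` under a unigram model."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_per_post(posts):
    """Rough information per post: bits/char of the pooled text times the mean post length."""
    pooled = "".join(posts)
    mean_len = sum(len(p) for p in posts) / len(posts)
    return entropy_per_char(pooled) * mean_len

tweets = ["just landed, great flight", "coffee first, then the meeting", "short"]
print(f"{entropy_per_char(''.join(tweets)):.2f} bits/char, {info_per_post(tweets):.1f} bits/post")
```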
MIM: A minimum information model vocabulary and framework for scientific linked data Gamble M.
Goble C.
Klyne G.
Jun Zhao
2012 IEEE 8th International Conference on E-Science, e-Science 2012 English 2012 Linked Data holds great promise in the Life Sciences as a platform to enable an interoperable data commons, supporting new opportunities for discovery. Minimum Information Checklists have emerged within the Life Sciences as a means of standardising the reporting of experiments in an effort to increase the quality and reusability of the reported data. Existing tooling built around these checklists is aimed at supporting experimental scientists in the production of experiment reports that are compliant. It remains a challenge to quickly and easily assess an arbitrary set of data against these checklists. We present the MIM (Minimum Information Model) vocabulary and framework which aims to provide a practical, and scalable approach to describing and assessing Linked Data against minimum information checklists. The MIM framework aims to support three core activities: (1) publishing well described minimum information checklists in RDF as Linked Data; (2) publishing Linked Data against these checklists; and (3) validating existing "in the wild" Linked Data against a published checklist. We discuss the design considerations of the vocabulary and present its main classes. We demonstrate the utility of the framework with a checklist designed for the publishing of Chemical Structure Linked Data using data extracted from Wikipedia as an example. 0 0
Text segmentation by language using minimum description length Yamaguchi H.
Tanaka-Ishii K.
50th Annual Meeting of the Association for Computational Linguistics, ACL 2012 - Proceedings of the Conference English 2012 The problem addressed in this paper is to segment a given multilingual document into segments for each language and then identify the language of each segment. The problem was motivated by an attempt to collect a large amount of linguistic data for non-major languages from the web. The problem is formulated in terms of obtaining the minimum description length of a text, and the proposed solution finds the segments and their languages through dynamic programming. Empirical results demonstrating the potential of this approach are presented for experiments using texts taken from the Universal Declaration of Human Rights and Wikipedia, covering more than 200 languages. 0 0
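As a rough illustration of the formulation in this abstract, the sketch below runs a dynamic program over split points, charging each candidate segment its negative log-likelihood under a per-language character model plus a fixed per-segment cost. The coding scheme, models and segment-length cap here are simplifying assumptions, not the paper's actual method.

```python
# Simplified sketch of minimum-description-length segmentation by language (not the
# paper's exact coding scheme): dynamic programming over split points, charging each
# segment its bits under a per-language character model plus a fixed per-segment cost.
import math

FLOOR = math.log2(1e-6)  # log-probability floor for characters unseen by a model

def segment_bits(segment, char_logp):
    """Bits to encode `segment` with one language's character model {char: log2 prob}."""
    return -sum(char_logp.get(ch, FLOOR) for ch in segment)

def segment_by_language(text, models, seg_cost=16.0, max_seg=80):
    """models: {lang: {char: log2 prob}}. Returns [(start, end, lang)] minimizing total bits."""
    n = len(text)
    best = [math.inf] * (n + 1)   # best[i] = minimum bits to encode text[:i]
    back = [None] * (n + 1)       # backpointers: (previous split index, language)
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_seg), i):
            seg = text[j:i]
            for lang, logp in models.items():
                cost = best[j] + seg_cost + segment_bits(seg, logp)
                if cost < best[i]:
                    best[i], back[i] = cost, (j, lang)
    segments, i = [], n
    while i > 0:
        j, lang = back[i]
        segments.append((j, i, lang))
        i = j
    return list(reversed(segments))
```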
A lightweight approach to enterprise architecture modeling and documentation Buckl S.
Florian Matthes
Christian Neubert
Schweda C.M.
Lecture Notes in Business Information Processing English 2011 Not a few enterprise architecture (EA) management endeavors start with the design of an information model covering the EA-related interests of the various stakeholders. In the design of this model, the enterprise architects resort to prominent frameworks, but often create what would be called an "ivory tower" model. This model in the best case misses, if not ignores, the knowledge of the people that are responsible for business processes, applications, services, etc. In this paper, we describe how the wisdom of the crowds can be used to develop information models. Making use of Web 2.0 techniques, wikis, and an open templating mechanism, our approach ties together the EA-relevant information in a way that is accessible to both humans and applications. We demonstrate how the ivory tower syndrome can be cured, typical pitfalls can be avoided, and employees can be empowered to contribute their expert knowledge to EA modeling and documentation. 0 0
Analysis on multilingual discussion for Wikipedia translation Linsi Xia
Naomi Yamashita
Toru Ishida
Proceedings - 2011 2nd International Conference on Culture and Computing, Culture and Computing 2011 English 2011 In current Wikipedia translation activities, most translation tasks are performed by bilingual speakers who have high language skills and specialized knowledge of the articles. Unfortunately, compared to the large number of Wikipedia articles, the number of such qualified translators is very small. Thus the success of Wikipedia translation activities hinges on the contributions from non-bilingual speakers. In this paper, we report on a study investigating the effects of introducing a machine translation mediated BBS that enables monolinguals to collaboratively translate Wikipedia articles using their mother tongues. From our experiment using this system, we found that users made heavy use of the system and communicated actively across different languages. Furthermore, most such multilingual discussions seemed to be successful in transferring knowledge between different languages. Such success appeared to be made possible by a distinctive communication pattern which emerged as the users tried to avoid misunderstandings from machine translation errors. These findings suggest that there is a fair chance of non-bilingual speakers being capable of effectively contributing to Wikipedia translation activities with the assistance of machine translation. 0 0
Calculating Wikipedia article similarity using machine translation evaluation metrics Maike Erdmann
Andrew Finch
Kotaro Nakayama
Eiichiro Sumita
Takahiro Hara
Shojiro Nishio
Proceedings - 25th IEEE International Conference on Advanced Information Networking and Applications Workshops, WAINA 2011 English 2011 Calculating the similarity of Wikipedia articles in different languages is helpful for bilingual dictionary construction and various other research areas. However, standard methods for document similarity calculation are usually very simple. Therefore, we describe an approach of translating one Wikipedia article into the language of the other article, and then calculating article similarity with standard machine translation evaluation metrics. An experiment revealed that our approach is effective for identifying Wikipedia articles in different languages that are covering the same concept. 0 0
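A toy version of the idea in this abstract, assuming the translation step has already been done by some MT system: score the translated article against the other article with a BLEU-style clipped n-gram precision, used here only as a stand-in for whichever MT evaluation metrics the authors employed.

```python
# Toy stand-in for the paper's setup: assume article A has already been machine-translated
# into article B's language, then score the pair with a BLEU-style clipped n-gram precision.
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_similarity(translated_a, article_b, max_n=2):
    """Geometric mean of clipped n-gram precisions (brevity penalty omitted for brevity)."""
    cand, ref = translated_a.lower().split(), article_b.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        c, r = ngram_counts(cand, n), ngram_counts(ref, n)
        clipped = sum(min(cnt, r[g]) for g, cnt in c.items())
        precisions.append(max(clipped / max(sum(c.values()), 1), 1e-9))
    return math.prod(precisions) ** (1.0 / max_n)

print(ngram_similarity("the capital of japan is a large city",
                       "the capital city of japan is very large"))
```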
No free lunch: Brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity Ture F.
Elsayed T.
Lin J.
SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval English 2011 This work explores the problem of cross-lingual pairwise similarity, where the task is to extract similar pairs of documents across two different languages. Solutions to this problem are of general interest for text mining in the multilingual context and have specific applications in statistical machine translation. Our approach takes advantage of cross-language information retrieval (CLIR) techniques to project feature vectors from one language into another, and then uses locality-sensitive hashing (LSH) to extract similar pairs. We show that effective cross-lingual pairwise similarity requires working with similarity thresholds that are much lower than in typical monolingual applications, making the problem quite challenging. We present a parallel, scalable MapReduce implementation of the sort-based sliding window algorithm, which is compared to a brute-force approach on German and English Wikipedia collections. Our central finding can be summarized as "no free lunch": there is no single optimal solution. Instead, we characterize effectiveness-efficiency tradeoffs in the solution space, which can guide the developer to locate a desirable operating point based on application- and resource-specific constraints. 0 0
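The LSH ingredient of this approach can be sketched as follows (the CLIR projection into a shared term space and the MapReduce sliding-window machinery are omitted): random-hyperplane signatures give compact bit vectors whose Hamming distance approximates the angle, and hence the cosine similarity, between the projected document vectors. The vectors and dimensions below are illustrative placeholders.

```python
# Sketch of the LSH building block only: random-hyperplane bit signatures whose Hamming
# distance approximates the angle between document vectors (cosine LSH).
import math
import random

def make_hyperplanes(dim, n_bits, seed=0):
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def signature(vec, planes):
    """One bit per hyperplane: the sign of the dot product with the document vector."""
    return [1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0 for plane in planes]

def estimated_cosine(sig_a, sig_b):
    """P(bits differ) = angle/pi, so cos(pi * hamming_fraction) estimates cosine similarity."""
    hamming = sum(a != b for a, b in zip(sig_a, sig_b)) / len(sig_a)
    return math.cos(math.pi * hamming)

planes = make_hyperplanes(dim=5, n_bits=512)
doc_de = [0.20, 0.00, 0.70, 0.10, 0.00]   # e.g. a German article projected into English terms
doc_en = [0.25, 0.05, 0.60, 0.10, 0.00]
print(estimated_cosine(signature(doc_de, planes), signature(doc_en, planes)))
```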
Rapid rule-based machine translation between Dutch and Afrikaans Otte P.
Tyers F.M.
Proceedings of the 15th International Conference of the European Association for Machine Translation, EAMT 2011 English 2011 This paper describes the design, development and evaluation of a machine translation system between Dutch and Afrikaans developed over a period of around a month and a half. The system relies heavily on the re-use of existing publicly available resources such as Wiktionary, Wikipedia and the Apertium machine translation platform. A method of translating compound words between the languages by means of left-to-right longest match lookup is also introduced and evaluated. 0 0
Supporting multilingual discussion for Wikipedia translation Noriyuki Ishida
Toshiyuki Takasaki
Masanobu Ishimatsu
Toru Ishida
Proceedings - 2011 2nd International Conference on Culture and Computing, Culture and Computing 2011 English 2011 Nowadays Wikipedia has become a useful source of content on the Web. However, the number of articles differs greatly from language to language. Some people try to increase these numbers through translation, which requires discussion about the translation itself because articles contain language-specific words or phrases. Translators can make use of machine translation in order to participate in such discussions in their own language, but this leads to some problems. In this paper, we present the "Meta Translation" algorithm, which keeps designated segments untranslated and adds a description to them. 0 0
An N-gram-and-wikipedia joint approach to natural language identification Yang X.
Liang W.
2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings English 2010 Natural Language Identification is the process of detecting and determining in which language or languages a given piece of text is written. As one of the key steps in Computational Linguistics/Natural Language Processing (NLP) tasks, such as Machine Translation, Multi-lingual Information Retrieval and Processing of Language Resources, Natural Language Identification has drawn widespread attention and extensive research, making it one of the few relatively well studied sub-fields in the whole NLP field. However, various problems remain far from resolved in this field. Current noncomputational approaches require that researchers possess sufficient prior linguistic knowledge about the languages to be identified, while current computational (statistical) approaches demand a large-scale training set for each to-be-identified language. The drawbacks of both are apparent: few computer scientists are equipped with sufficient knowledge of linguistics, and the size of the training set may grow endlessly in pursuit of higher accuracy and the ability to process more languages. Also, faced with multi-lingual documents on the Internet, neither approach can render satisfactory results. To address these problems, this paper proposes a new approach to Natural Language Identification. It exploits N-Gram frequency statistics to segment a piece of text in a language-specific fashion, and then takes advantage of Wikipedia to determine the language used in each segment. Multiple experiments have demonstrated that satisfactory results can be rendered by this approach, especially with multi-lingual documents. 0 0
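A minimal character n-gram profile classifier in the spirit of the N-gram step described in this abstract (the Wikipedia-based verification of each segment is not shown). The rank-distance scheme and the tiny training samples below are illustrative choices, not the paper's.

```python
# Minimal character n-gram language identifier; training texts are placeholders and the
# Wikipedia-based check of each segment is omitted.
from collections import Counter

def ngram_profile(text, n=3, top_k=300):
    """Most frequent character n-grams, in rank order."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top_k)]

def out_of_place(profile_a, profile_b):
    """Cavnar-Trenkle style rank distance; missing n-grams get the maximum penalty."""
    rank_b = {g: r for r, g in enumerate(profile_b)}
    penalty = len(profile_b)
    return sum(abs(r - rank_b[g]) if g in rank_b else penalty
               for r, g in enumerate(profile_a))

def identify(text, training):
    """training: {lang: sample_text}. Returns the language whose profile is closest."""
    probe = ngram_profile(text)
    return min(training, key=lambda lang: out_of_place(probe, ngram_profile(training[lang])))

samples = {"en": "the quick brown fox jumps over the lazy dog " * 20,
           "nl": "de snelle bruine vos springt over de luie hond " * 20}
print(identify("the dog jumps over the fox", samples))
```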
Automatically acquiring a semantic network of related concepts Szumlanski S.
Gomez F.
International Conference on Information and Knowledge Management, Proceedings English 2010 We describe the automatic construction of a semantic network, in which over 3000 of the most frequently occurring monosemous nouns in Wikipedia (each appearing between 1,500 and 100,000 times) are linked to their semantically related concepts in the WordNet noun ontology. Relatedness between nouns is discovered automatically from co-occurrence in Wikipedia texts using an information theoretic inspired measure. Our algorithm then capitalizes on salient sense clustering among related nouns to automatically disambiguate them to their appropriate senses (i.e., concepts). Through the act of disambiguation, we begin to accumulate relatedness data for concepts denoted by polysemous nouns, as well. The resultant concept-to-concept associations, covering 17,543 nouns, and 27,312 distinct senses among them, constitute a large-scale semantic network of related concepts that can be conceived of as augmenting the WordNet noun ontology with related-to links. 0 0
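The abstract refers to an "information theoretic inspired measure" over Wikipedia co-occurrence counts without naming it; pointwise mutual information (PMI) is one common measure of that kind and is shown here purely as an illustration, with toy counts.

```python
# Purely illustrative: PMI as one common information-theoretic relatedness measure over
# co-occurrence counts (the paper's actual measure is not specified in the abstract).
import math

def pmi(count_xy, count_x, count_y, total):
    """log2 [ p(x,y) / (p(x) p(y)) ] from raw counts over `total` observation windows."""
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    return math.log2(p_xy / (p_x * p_y))

# toy counts: the noun pair co-occurs far more often than its marginals would predict
print(round(pmi(count_xy=120, count_x=5000, count_y=800, total=1_000_000), 2))
```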
Cross-language retrieval using link-based language models Benjamin Roth
Dietrich Klakow
SIGIR 2010 Proceedings - 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval English 2010 We propose a cross-language retrieval model that is solely based on Wikipedia as a training corpus. The main contributions of our work are: 1. A translation model based on linked text in Wikipedia and a term weighting method associated with it. 2. A combination scheme to interpolate the link translation model with retrieval based on Latent Dirichlet Allocation. On the CLEF 2000 data we achieve improvement with respect to the best German-English system at the bilingual track (non-significant) and improvement against a baseline based on machine translation (significant). 0 0
Model-driven research in human-centric computing Chi E.H. Proceedings - 2010 IEEE Symposium on Visual Languages and Human-Centric Computing, VL/HCC 2010 English 2010 How can we build systems that enable users to mix and match tools together? How will we know whether we have done a good job in creating usable visual interactive systems that help users accomplish a wide variety of goals? How can people share the results of their explorations with each other, and how can innovative tools be remixed? Widely-used tools such as Web Browsers, Wikis, spreadsheets, and analytics environments like R all contain models of how people mix and combine operators and functionalities. In my own research, system developments are very much informed by models such as information scent, sensemaking, information theory, probabilistic models, and more recently, evolutionary dynamic models. These models have been used to understand a wide variety of user behaviors in human-centric computing, from individuals interacting with a search system like MrTaggy.com to groups of people working on articles in Wikipedia. These models range in complexity from a simple set of assumptions to complex equations describing human and group behavior. In this talk, I will attempt to illustrate how a model-driven approach to answering the above questions should help to illuminate the path forward for Human-Centric Computing. 0 0
Query translation using Wikipedia-based resources for analysis and disambiguation Gaillard B.
Boualem M.
Collin O.
EAMT 2010 - 14th Annual Conference of the European Association for Machine Translation English 2010 This work investigates query translation using only Wikipedia-based resources in a two step approach: analysis and disambiguation. After arguing that data mined from Wikipedia is particularly relevant to query translation, both from a lexical and a semantic perspective, we detail the implementation of the approach. In the analysis phase, lexical units are extracted from queries and associated to several possible translations using a Wikipedia-based bilingual dictionary. During the second phase, one translation is chosen amongst the many candidates, based on topic homogeneity, asserted with the help of semantic information carried by categories of Wikipedia articles. We report promising results regarding translation accuracy. 0 0
When to cross over? Cross-language linking using Wikipedia for VideoCLEF 2009 Gyarmati A.
Jones G.J.F.
Lecture Notes in Computer Science English 2010 We describe Dublin City University (DCU)'s participation in the VideoCLEF 2009 Linking Task. Two approaches were implemented using the Lemur information retrieval toolkit. Both approaches first extracted a search query from the transcriptions of the Dutch TV broadcasts. One method first performed search on a Dutch Wikipedia archive, then followed links to corresponding pages in the English Wikipedia. The other method first translated the extracted query using machine translation and then searched the English Wikipedia collection directly. We found that using the original Dutch transcription query for searching the Dutch Wikipedia yielded better results. 0 0
WikiBABEL: A system for multilingual Wikipedia content Kumaran A.
Datha N.
Ashok B.
Saravanan K.
Ande A.
Sharma A.
Vedantham S.
Natampally V.
Dendi V.
Maurice S.
AMTA 2010 - 9th Conference of the Association for Machine Translation in the Americas English 2010 This position paper outlines our project - WikiBABEL - which will be released as an open source project for the creation of multi-lingual Wikipedia content, and has potential to produce parallel data as a by-product for Machine Translation systems research. We discuss its architecture, functionality and the user-experience components, and briefly present an analysis that emphasizes the resonance that the WikiBABEL design and the planned involvement with Wikipedia has with the open source communities in general and Wikipedians in particular. 0 0
A wiki-based approach to enterprise architecture documentation and analysis Buckl S.
Florian Matthes
Christian Neubert
Schweda C.M.
17th European Conference on Information Systems, ECIS 2009 English 2009 Enterprise architecture (EA) management is a challenging task that modern enterprises have to face. This task is often addressed via organization-specific methodologies, which are implemented or derived from a respective EA management tool, or are at least partially aligned and supported by such tools. Nevertheless, especially when starting an EA management endeavor, the documentation of the EA is often not likely to satisfy the level of formalization which is needed to employ an EA management tool. This paper addresses the issue of starting EA management, more precisely EA documentation and analysis, by utilizing a wiki-based approach. From there, we discuss which functions commonly implemented in wiki-systems could be used in this context, which augmentations and extensions would be needed, and which potential impediments exist. 0 0
An agent- based semantic web service discovery framework Neiat A.G.
Mohsenzadeh M.
Forsati R.
Rahmani A.M.
Proceedings - 2009 International Conference on Computer Modeling and Simulation, ICCMS 2009 English 2009 Web services have changed the Web from a database of static documents to a service provider. To improve the automation of Web services interoperation, a lot of technologies are recommended, such as semantic Web services and agents. In this paper we propose a framework for semantic Web service discovery based on semantic Web services and FIPA multi agents. This paper provides a broker which provides semantic interoperability between semantic Web service providers and agents by translating WSDL to DF descriptions for semantic Web services and DF descriptions to WSDL for FIPA multi agents. We describe how the proposed architecture analyzes the request and matches the search query. The ontology management in the broker creates the user ontology and merges it with a general ontology (e.g. WordNet, Yago, Wikipedia ⋯). We also describe the recommendation component that recommends the WSDL to the Web service provider to increase their retrieval probability in the related queries. 0 0
Building knowledge base for Vietnamese information retrieval Nguyen T.C.
Le H.M.
Phan T.T.
IiWAS2009 - The 11th International Conference on Information Integration and Web-based Applications and Services English 2009 At present, the Vietnamese knowledge base (vnKB) is one of the most important focuses of Vietnamese researchers because of its applications in wide areas such as Information Retrieval (IR), Machine Translation (MT), etc. There have been several separate projects developing vnKB in various domains. The training of the vnKB is the most difficult part because of the quantity and quality of training data, and the lack of an available Vietnamese corpus with acceptable quality. This paper introduces an approach, which first extracts semantic information from the Vietnamese Wikipedia (vnWK), then trains the proposed vnKB by applying the support vector machine (SVM) technique. The experimentation of the proposed approach shows that it is a potential solution because of its good results and proves that it can provide more valuable benefits when applied to our Vietnamese Semantic Information Retrieval system. 0 0
Personal knowledge management for knowledge workers using social semantic technologies Hyeoncheol Kim
Breslin J.G.
Stefan Decker
Choi J.
International Journal of Intelligent Information and Database Systems English 2009 Knowledge workers have different applications and resources in heterogeneous environments for doing their knowledge tasks and they often need to solve a problem through combining several resources. Typical personal knowledge management (PKM) systems do not provide effective ways for representing a knowledge worker's unstructured knowledge or ideas. In order to provide better knowledge activity for them, we implement the Wiki-based social Network Thin client (WANT), a wiki-based semantic tagging system for collaborative and communicative knowledge creation and maintenance for a knowledge worker. We also suggest the social semantic cloud of tags (SCOT) ontology to represent tag data at a semantic level and combine this ontology in WANT. WANT supports a wide scope of social activities through online mash-up services and interlinks resources with desktop and web environments. Our approach provides basic functionalities such as creating, organising and searching knowledge at the individual level, as well as enhances social connections among knowledge workers based on their activities. 0 0
Terabytes of tobler: Evaluating the first law in a massive, domain-neutral representation of world knowledge Brent Hecht
Moxley E.
Lecture Notes in Computer Science English 2009 The First Law of Geography states, "everything is related to everything else, but near things are more related than distant things." Despite the fact that it is to a large degree what makes "spatial special," the law has never been empirically evaluated on a large, domain-neutral representation of world knowledge. We address the gap in the literature about this critical idea by statistically examining the multitude of entities and relations between entities present across 22 different language editions of Wikipedia. We find that, at least according to the myriad authors of Wikipedia, the First Law is true to an overwhelming extent regardless of language-defined cultural domain. 0 0
The effect of using a semantic wiki for metadata management: A controlled experiment Huner K.M.
Boris Otto
Proceedings of the 42nd Annual Hawaii International Conference on System Sciences, HICSS English 2009 A coherent and consistent understanding of corporate data is an important factor for effective management of diversified companies and implies a need for company-wide unambiguous data definitions. Inspired by the success of Wikipedia, wiki software has become a broadly discussed alternative for corporate metadata management. However, in contrast to the performance and sustainability of wikis in general, benefits of using semantic wikis have not been investigated sufficiently. The paper at hand presents results of a controlled experiment that investigates effects of using a semantic wiki for metadata management in comparison to a classical wiki. Considering threats to validity, the analysis (i.e. 74 subjects using both a classical and a semantic wiki) shows that the semantic wiki is superior to the classical variant regarding information retrieval tasks. At the same time, the results indicate that more effort is needed to build up the semantically annotated wiki content in the semantic wiki. 0 0
Trdlo, an open source tool for building transducing dictionary Grac M. Lecture Notes in Computer Science English 2009 This paper describes the development of an open-source tool named Trdlo. Trdlo was developed as part of our effort to build a machine translation system between very close languages. These languages usually do not have available pre-processed linguistic resources or dictionaries suitable for computer processing. Bilingual dictionaries have a big impact on the quality of translation. The proposed methods described in this paper attempt to extend existing dictionaries with inferable translation pairs. Our approach requires only 'cheap' resources: a list of lemmata for each language and rules for inferring words from one language to another. It is also possible to use other resources like annotated corpora or Wikipedia. Results show that this approach greatly improves the effectiveness of building a Czech-Slovak dictionary. 0 0
Using Wikipedia and Wiktionary in domain-specific information retrieval Muller C.
Iryna Gurevych
Lecture Notes in Computer Science English 2009 The main objective of our experiments in the domain-specific track at CLEF 2008 is utilizing semantic knowledge from collaborative knowledge bases such as Wikipedia and Wiktionary to improve the effectiveness of information retrieval. While Wikipedia has already been used in IR, the application of Wiktionary in this task is new. We evaluate two retrieval models, i.e. SR-Text and SR-Word, based on semantic relatedness by comparing their performance to a statistical model as implemented by Lucene. We refer to Wikipedia article titles and Wiktionary word entries as concepts and map query and document terms to concept vectors which are then used to compute the document relevance. In the bilingual task, we translate the English topics into the document language, i.e. German, by using machine translation. For SR-Text, we alternatively perform the translation process by using cross-language links in Wikipedia, whereby the terms are directly mapped to concept vectors in the target language. The evaluation shows that the latter approach especially improves the retrieval performance in cases where the machine translation system incorrectly translates query terms. 0 0
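The concept-vector retrieval idea sketched in this abstract can be illustrated roughly as follows. This is an ESA-style toy assumed for illustration; the paper's SR-Text and SR-Word models differ in their details, and the term-concept weights below are placeholders: terms map to weighted vectors of Wikipedia/Wiktionary concepts, a text's vector is the sum of its term vectors, and relevance is their cosine similarity.

```python
# Rough ESA-style toy of the concept-vector idea; the term -> concept index is a placeholder
# (in practice it would be derived from Wikipedia article titles and Wiktionary entries).
import math
from collections import defaultdict

TERM_CONCEPTS = {
    "jaguar": {"Jaguar_(animal)": 0.7, "Jaguar_Cars": 0.6},
    "speed":  {"Velocity": 0.8, "Jaguar_Cars": 0.2},
    "forest": {"Forest": 0.9, "Jaguar_(animal)": 0.3},
}

def concept_vector(text):
    vec = defaultdict(float)
    for term in text.lower().split():
        for concept, weight in TERM_CONCEPTS.get(term, {}).items():
            vec[concept] += weight
    return vec

def cosine(a, b):
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

print(cosine(concept_vector("jaguar speed"), concept_vector("jaguar forest")))
```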
An empirical research on extracting relations from Wikipedia text Huang J.-X.
Ryu P.-M.
Choi K.-S.
Lecture Notes in Computer Science English 2008 A feature based relation classification approach is presented, in which probabilistic and semantic relatedness features between patterns and relation types are employed with other linguistic information. The importance of each feature set is evaluated with a Chi-square estimator, and the experiments show that the relatedness features have a big impact on the relation classification performance. A series of experiments is also performed to evaluate the different machine learning approaches on relation classification, among which Bayesian outperformed other approaches including Support Vector Machine (SVM). 0 0
Augmented Social Cognition Chi E.H.
Peter Pirolli
Bongwon Suh
Aniket Kittur
Pendleton B.
Mytkowicz T.
AAAI Spring Symposium - Technical Report English 2008 Research in Augmented Social Cognition is aimed at enhancing the ability of a group of people to remember, think, and reason; to augment their speed and capacity to acquire, produce, communicate, and use knowledge; and to advance collective and individual intelligence in socially mediated information environments. In this paper, we describe the emergence of this research endeavor, and summarize some results from the research. In particular, we have found that (1) analyses of conflicts and coordination in Wikipedia have shown us the scientific need to understand social sensemaking environments; and (2) information theoretic analyses of social tagging behavior in del.icio.us show the need to understand human vocabulary systems. 0 0
Catriple: Extracting triples from wikipedia categories Qiaoling Liu
Kaifeng Xu
Lei Zhang
Haofen Wang
Yiqin Yu
Yue Pan
Lecture Notes in Computer Science English 2008 As an important step towards bootstrapping the Semantic Web, many efforts have been made to extract triples from Wikipedia because of its wide coverage, good organization and rich knowledge. One kind of important triples is about Wikipedia articles and their non-isa properties, e.g. (Beijing, country, China). Previous work has tried to extract such triples from Wikipedia infoboxes, article text and categories. The infobox-based and text-based extraction methods depend on the infoboxes and suffer from a low article coverage. In contrast, the category-based extraction methods exploit the widespread categories. However, they rely on predefined properties, which is too effort-consuming and explores only very limited knowledge in the categories. This paper automatically extracts properties and triples from the less explored Wikipedia categories so as to achieve a wider article coverage with less manual effort. We manage to realize this goal by utilizing the syntax and semantics brought by super-sub category pairs in Wikipedia. Our prototype implementation outputs about 10M triples with a 12-level confidence ranging from 47.0% to 96.4%, which cover 78.2% of Wikipedia articles. Among them, 1.27M triples have confidence of 96.4%. Applications can on demand use the triples with suitable confidence. 0 0
Collaborative knowledge semantic graph image search Shieh J.-R.
Yeh Y.-T.
Lin C.-H.
Lin C.-Y.
Wu J.-L.
Proceeding of the 17th International Conference on World Wide Web 2008, WWW'08 English 2008 In this paper, we propose a Collaborative Knowledge Semantic Graphs Image Search (CKSGIS) system. It provides a novel way to conduct image search by utilizing the collaborative nature of Wikipedia and by performing network analysis to form semantic graphs for search-term expansion. The collaborative article editing process used by Wikipedia's contributors is formalized as bipartite graphs that are folded into networks between terms. When a user types in a search term, CKSGIS automatically retrieves an interactive semantic graph of related terms that allows users to easily find related images not limited to a specific search term. The interactive semantic graph then serves as an interface to retrieve images through existing commercial search engines. This method significantly saves users' time by avoiding multiple search keywords that are usually required in generic search engines. It benefits both naive users who do not possess a large vocabulary and professionals who look for images on a regular basis. In our experiments, 85% of the participants favored the CKSGIS system over commercial search engines. 0 0
Constructing a global ontology by concept mapping using Wikipedia thesaurus Minghua Pei
Kotaro Nakayama
Takahiro Hara
Shojiro Nishio
Proceedings - International Conference on Advanced Information Networking and Applications, AINA English 2008 Recently, the importance of semantics on the WWW has been widely recognized and a lot of semantic information (RDF, OWL etc.) is being built/published on the WWW. However, the lack of ontology mappings becomes a serious problem for the Semantic Web since it needs well-defined relations to retrieve information correctly by inferring the meaning of information. One-to-one mapping is not an efficient method due to the nature of the distributed environment. Therefore, a reasonable method is to map the concepts by using a large-scale intermediate ontology. On the other hand, Wikipedia is a large-scale concept network covering almost all concepts in the real world. In this paper, we propose an intermediate ontology construction method using Wikipedia Thesaurus, an association thesaurus extracted from Wikipedia. Since Wikipedia Thesaurus provides associated concepts without explicit relation types, we propose an approach of concept mapping using two sub-methods: "name mapping" and "logic-based mapping". 0 0
Decoding Wikipedia categories for knowledge acquisition Vivi Nastase
Michael Strube
Proceedings of the National Conference on Artificial Intelligence English 2008 This paper presents an approach to acquire knowledge from Wikipedia categories and the category network. Many Wikipedia categories have complex names which reflect human classification and organizing instances, and thus encode knowledge about class attributes, taxonomic and other semantic relations. We decode the names and refer back to the network to induce relations between concepts in Wikipedia represented through pages or categories. The category structure allows us to propagate a relation detected between constituents of a category name to numerous concept links. The results of the process are evaluated against ResearchCyc and a subset also by human judges. The results support the idea that Wikipedia category names are a rich source of useful and accurate knowledge. Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. 0 0
Dublin City University at CLEF 2007: Cross-language speech retrieval experiments YanChun Zhang
Jones G.J.F.
Zhang K.
Lecture Notes in Computer Science English 2008 The Dublin City University participation in the CLEF 2007 CL-SR English task concentrated primarily on issues of topic translation. Our retrieval system used the BM25F model and pseudo relevance feedback. Topics were translated into English using the Yahoo! BabelFish free online service combined with domain-specific translation lexicons gathered automatically from Wikipedia. We explored alternative topic translation methods using these resources. Our results indicate that extending machine translation tools using automatically generated domain-specific translation lexicons can provide improved CLIR effectiveness for this task. 0 0
Employing a domain specific ontology to perform semantic search Morneau M.
Mineau G.W.
Lecture Notes in Computer Science English 2008 Increasing the relevancy of Web search results has been a major concern in research over the last years. Boolean search, metadata, natural language based processing and various other techniques have been applied to improve the quality of search results sent to a user. Ontology-based methods were proposed to refine the information extraction process but they have not yet achieved wide adoption by search engines. This is mainly due to the fact that the ontology building process is time consuming. An all inclusive ontology for the entire World Wide Web might be difficult if not impossible to construct, but a specific domain ontology can be automatically built using statistical and machine learning techniques, as done with our tool: SeseiOnto. In this paper, we describe how we adapted the SeseiOnto software to perform Web search on the Wikipedia page on climate change. SeseiOnto, by using conceptual graphs to represent natural language and an ontology to extract links between concepts, manages to properly answer natural language queries about climate change. Our tests show that SeseiOnto has the potential to be used in domain specific Web search as well as in corporate intranets. 0 0
Importance of semantic representation: Dataless classification Chang M.-W.
Lev Ratinov
Dan Roth
Srikumar V.
Proceedings of the National Conference on Artificial Intelligence English 2008 Traditionally, text categorization has been studied as the problem of training of a classifier using labeled data. However, people can categorize documents into named categories without any explicit training because we know the meaning of category names. In this paper, we introduce Dataless Classification, a learning protocol that uses world knowledge to induce classifiers without the need for any labeled data. Like humans, a dataless classifier interprets a string of words as a set of semantic concepts. We propose a model for dataless classification and show that the label name alone is often sufficient to induce classifiers. Using Wikipedia as our source of world knowledge, we get 85.29% accuracy on tasks from the 20 Newsgroup dataset and 88.62% accuracy on tasks from a Yahoo! Answers dataset without any labeled or unlabeled data from the datasets. With unlabeled data, we can further improve the results and show quite competitive performance to a supervised learning algorithm that uses 100 labeled examples. Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. 0 0
Improving interaction with virtual globes through spatial thinking: Helping users ask "Why?" Schoming J.
Raubal M.
Marsh M.
Brent Hecht
Antonio Kruger
Michael Rohs
International Conference on Intelligent User Interfaces, Proceedings IUI English 2008 Virtual globes have progressed from little-known technology to broadly popular software in a mere few years. We investigated this phenomenon through a survey and discovered that, while virtual globes are en vogue, their use is restricted to a small set of tasks so simple that they do not involve any spatial thinking. Spatial thinking requires that users ask "what is where" and "why"; the most common virtual globe tasks only include the "what". Based on the results of this survey, we have developed a multi-touch virtual globe derived from an adapted virtual globe paradigm designed to widen the potential uses of the technology by helping its users to inquire about both the "what is where" and "why" of spatial distribution. We do not seek to provide users with full GIS (geographic information system) functionality, but rather we aim to facilitate the asking and answering of simple "why" questions about general topics that appeal to a wide virtual globe user base. Copyright 2008 ACM. 0 0
Information extraction from Wikipedia: Moving down the long tail Fei Wu
Raphael Hoffmann
Weld D.S.
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining English 2008 Not only is Wikipedia a comprehensive source of quality information, it has several kinds of internal structure (e.g., relational summaries known as infoboxes), which enable self-supervised information extraction. While previous efforts at extraction from Wikipedia achieve high precision and recall on well-populated classes of articles, they fail in a larger number of cases, largely because incomplete articles and infrequent use of infoboxes lead to insufficient training data. This paper presents three novel techniques for increasing recall from Wikipedia's long tail of sparse classes: (1) shrinkage over an automatically-learned subsumption taxonomy, (2) a retraining technique for improving the training data, and (3) supplementing results by extracting from the broader Web. Our experiments compare design variations and show that, used in concert, these techniques increase recall by a factor of 1.76 to 8.71 while maintaining or increasing precision. 0 0
Instanced-based mapping between thesauri and folksonomies Wartena C.
Brussee R.
Lecture Notes in Computer Science English 2008 The emergence of web based systems in which users can annotate items, raises the question of the semantic interoperability between vocabularies originating from collaborative annotation processes, often called folksonomies, and keywords assigned in a more traditional way. If collections are annotated according to two systems, e.g. with tags and keywords, the annotated data can be used for instance based mapping between the vocabularies. The basis for this kind of matching is an appropriate similarity measure between concepts, based on their distribution as annotations. In this paper we propose a new similarity measure that can take advantage of some special properties of user generated metadata. We have evaluated this measure with a set of articles from Wikipedia which are both classified according to the topic structure of Wikipedia and annotated by users of the bookmarking service del.icio.us. The results using the new measure are significantly better than those obtained using standard similarity measures proposed for this task in the literature, i.e., it correlates better with human judgments. We argue that the measure also has benefits for instance based mapping of more traditionally developed vocabularies. 0 0
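As a point of reference for the kind of distributional similarity this abstract discusses, the sketch below computes a standard measure, Jensen-Shannon similarity, between the annotation distributions of a folksonomy tag and a thesaurus keyword. It is a baseline illustration only, not the new measure the paper proposes, and the counts are toy data.

```python
# Baseline illustration only (not the paper's proposed measure): Jensen-Shannon similarity
# between the annotation distributions of two vocabulary terms over the same set of items.
import math

def _kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_similarity(counts_a, counts_b):
    """1 - JSD (base 2, so in [0, 1]); 1.0 means the two terms annotate items identically."""
    items = sorted(set(counts_a) | set(counts_b))
    eps = 1e-12
    p = [counts_a.get(i, 0) + eps for i in items]
    q = [counts_b.get(i, 0) + eps for i in items]
    sp, sq = sum(p), sum(q)
    p, q = [x / sp for x in p], [x / sq for x in q]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 1.0 - (_kl(p, m) + _kl(q, m)) / 2

# tag "ml" (del.icio.us) vs keyword "machine learning" (Wikipedia topics), over article ids
print(js_similarity({"a1": 10, "a2": 3, "a3": 1}, {"a1": 8, "a2": 5, "a4": 1}))
```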
Knowledge supervised text classification with no labeled documents Zhang C.
Xue G.-R.
Yiqin Yu
Lecture Notes in Computer Science English 2008 In traditional text classification approaches, the semantic meanings of the classes are described by the labeled documents. Since labeling documents is often time consuming and expensive, it is a promising idea to ask users to provide some keywords to depict the classes, instead of labeling any documents. However, short pieces of keywords may not contain enough information and therefore may lead to an unreliable classifier. Fortunately, there is a large amount of public data easily available in web directories, such as ODP, Wikipedia, etc. We are interested in exploring the enormous crowd intelligence contained in such public data to enhance text classification. In this paper, we propose a novel text classification framework called "Knowledge Supervised Learning" (KSL), which utilizes the knowledge in keywords and the crowd intelligence to learn the classifier without any labeled documents. We design a two-stage risk minimization (TSRM) approach for the KSL problem. It can optimize the expected prediction risk and build a high quality classifier. Empirical results verify our claim: our algorithm can achieve above 0.9 on Micro-F1 on average, which is much better than baselines and even comparable against an SVM classifier supervised by labeled documents. 0 0
Knowledge-supervised learning by co-clustering based approach Congle Z.
Dikan X.
Proceedings - 7th International Conference on Machine Learning and Applications, ICMLA 2008 English 2008 Traditional text learning algorithms need labeled documents to supervise the learning process, but labeling documents of a specific class is often expensive and time consuming. We observe that it is sometimes convenient to use some keywords (i.e. class descriptions) to describe a class. However, a short class description usually does not contain enough information to guide classification. Fortunately, a large amount of public data is easily acquired, e.g. ODP, Wikipedia and so on, which contains enormous knowledge. In this paper, we address the text classification problem with such knowledge rather than any labeled documents and propose a co-clustering based knowledge-supervised learning algorithm (CoCKSL) in an information theoretic framework, which effectively applies the knowledge to classification tasks. 0 0
L3S at INEX 2007: Query expansion for entity ranking using a highly accurate ontology Gianluca Demartini
Firan C.S.
Tereza Iofciu
Lecture Notes in Computer Science English 2008 Entity ranking on Web scale datasets is still an open challenge. Several resources, as for example Wikipedia-based ontologies, can be used to improve the quality of the entity ranking produced by a system. In this paper we focus on the Wikipedia corpus and propose algorithms for finding entities based on query relaxation using category information. The main contribution is a methodology for expanding the user query by exploiting the semantic structure of the dataset. Our approach focuses on constructing queries using not only keywords from the topic, but also information about relevant categories. This is done leveraging on a highly accurate ontology which is matched to the character strings of the topic. The evaluation is performed using the INEX 2007 Wikipedia collection and entity ranking topics. The results show that our approach performs effectively, especially for early precision metrics. 0 0
Lexical and semantic resources for NLP: From words to meanings Gentile A.L.
Pierpaolo Basile
Iaquinta L.
Giovanni Semeraro
Lecture Notes in Computer Science English 2008 A user expresses her information need through words with a precise meaning, but from the machine point of view this meaning does not come with the word. A further step is needful to automatically associate it to the words. Techniques that process human language are required and also linguistic and semantic knowledge, stored within distinct and heterogeneous resources, which play an important role during all Natural Language Processing (NLP) steps. Resources management is a challenging problem, together with the correct association between URIs coming from the resources and meanings of the words. This work presents a service that, given a lexeme (an abstract unit of morphological analysis in linguistics, which roughly corresponds to a set of words that are different forms of the same word), returns all syntactic and semantic information collected from a list of lexical and semantic resources. The proposed strategy consists in merging data with origin from stable resources, such as WordNet, with data collected dynamically from evolving sources, such as the Web or Wikipedia. That strategy is implemented in a wrapper to a set of popular linguistic resources that provides a single point of access to them, in a transparent way to the user, to accomplish the computational linguistic problem of getting a rich set of linguistic and semantic annotations in a compact way. 0 0
Making More Wikipedians: Facilitating Semantics Reuse for Wikipedia Authoring Linyun Fu
Haofen Wang
Haiping Zhu
Huajie Zhang
Yang Wang
Yong Yu
The Semantic Web English 2008 Wikipedia, a killer application in Web 2.0, has embraced the power of collaborative editing to harness collective intelligence. It can also serve as an ideal Semantic Web data source due to its abundance, influence, high quality and good structure. However, the heavy burden of building up and maintaining such an enormous and ever-growing online encyclopedic knowledge base still rests on a very small group of people. Many casual users may still find it difficult to write high quality Wikipedia articles. In this paper, we use RDF graphs to model the key elements in Wikipedia authoring, and propose an integrated solution to make Wikipedia authoring easier based on RDF graph matching, with the aim of making more Wikipedians. Our solution facilitates semantics reuse and provides users with: 1) a link suggestion module that suggests and auto-completes internal links between Wikipedia articles for the user; 2) a category suggestion module that helps the user place her articles in correct categories. A prototype system is implemented and experimental results show significant improvements over existing solutions to link and category suggestion tasks. The proposed enhancements can be applied to attract more contributors and relieve the burden of professional editors, thus enhancing the current Wikipedia to make it an even better Semantic Web data source. 0 0
Managing requirements knowledge (MaRK'08) Maalej W.
Thurimella A.K.
Happel H.-J.
Bjorn Decker
2008 1st International Workshop on Managing Requirements Knowledge, MARK'08 English 2008 MaRK'08 focuses on potentials and benefits of lightweight knowledge management approaches, such as ontology-based annotation, Semantic Wikis and rationale management techniques, applied to requirements engineering. Methodologies, processes and tools for capturing, externalizing, sharing and reusing of knowledge in (distributed) requirements engineering processes are discussed. Furthermore, the workshop is an interactive exchange platform between the knowledge management community, the requirements engineering community and industrial practitioners. These proceedings include selected and refereed contributions. 0 0
Method for building sentence-aligned corpus from wikipedia Yasuda K.
Eiichiro Sumita
AAAI Workshop - Technical Report English 2008 We propose the framework of a Machine Translation (MT) bootstrapping method by using multilingual Wikipedia articles. This novel method can simultaneously generate a statistical machine translation (SMT) and a sentence-aligned corpus. In this study, we perform two types of experiments. The aim of the first type of experiments is to verify the sentence alignment performance by comparing the proposed method with a conventional sentence alignment approach. For the first type of experiments, we use JENAAD, which is a sentence-aligned corpus built by the conventional sentence alignment method. The second type of experiments uses actual English and Japanese Wikipedia articles for sentence alignment. The result of the first type of experiments shows that the performance of the proposed method is comparable to that of the conventional sentence alignment method. Additionally, the second type of experiments shows that we can obtain the English translation of 10% of Japanese sentences while maintaining high alignment quality (rank-A ratio of over 0.8). 0 1
NAGA: Harvesting, searching and ranking knowledge Gjergji Kasneci
Suchanek F.M.
Ifrim G.
Elbassuoni S.
Maya Ramanath
Gerhard Weikum
Proceedings of the ACM SIGMOD International Conference on Management of Data English 2008 The presence of encyclopedic Web sources, such as Wikipedia, the Internet Movie Database (IMDB), World Factbook, etc. calls for new querying techniques that are simple and yet more expressive than those provided by standard keyword-based search engines. Searching for explicit knowledge needs to consider inherent semantic structures involving entities and relationships. In this demonstration proposal, we describe a semantic search system named NAGA. NAGA operates on a knowledge graph, which contains millions of entities and relationships derived from various encyclopedic Web sources, such as the ones above. NAGA's graph-based query language is geared towards expressing queries with additional semantic information. Its scoring model is based on the principles of generative language models, and formalizes several desiderata such as confidence, informativeness and compactness of answers. We propose a demonstration of NAGA which will allow users to browse the knowledge base through a user interface, enter queries in NAGA's query language and tune the ranking parameters to test various ranking aspects. 0 0
Named entity disambiguation: A hybrid statistical and rule-based incremental approach Nguyen H.T.
Cao T.H.
Lecture Notes in Computer Science English 2008 The rapidly increasing use of large-scale data on the Web makes named entity disambiguation one of the main challenges for research in Information Extraction and the development of the Semantic Web. This paper presents a novel method for detecting proper names in a text and linking them to the right entities in Wikipedia. The method is hybrid, containing two phases of which the first one utilizes some heuristics and patterns to narrow down the candidates, and the second one employs the vector space model to rank the ambiguous cases to choose the right candidate. The novelty is that the disambiguation process is incremental and includes several rounds that filter the candidates, by exploiting previously identified entities and extending the text by those entity attributes every time they are successfully resolved in a round. We test the performance of the proposed method in disambiguation of names of people, locations and organizations in texts of the news domain. The experiment results show that our approach achieves high accuracy and can be used to construct a robust named entity disambiguation system. 0 0
On visualizing heterogeneous semantic networks from multiple data sources Maureen
Aixin Sun
Lim E.-P.
Anwitaman Datta
Kuiyu Chang
Lecture Notes in Computer Science English 2008 In this paper, we focus on the visualization of heterogeneous semantic networks obtained from multiple data sources. A semantic network comprising a set of entities and relationships is often used for representing knowledge derived from textual data or database records. Although the semantic networks created for the same domain at different data sources may cover a similar set of entities, these networks could also be very different because of naming conventions, coverage, view points, and other reasons. Since digital libraries often contain data from multiple sources, we propose a visualization tool to integrate and analyze the differences among multiple social networks. Through a case study on two terrorism-related semantic networks derived from Wikipedia and Terrorism Knowledge Base (TKB) respectively, the effectiveness of our proposed visualization tool is demonstrated. 0 0
Semantic full-text search with ESTER: Scalable, easy, fast Holger Bast
Fabian Suchanek
Ingmar Weber
Proceedings - IEEE International Conference on Data Mining Workshops, ICDM Workshops 2008 English 2008 We present a demo of ESTER, a search engine that combines the ease of use, speed and scalability of full-text search with the powerful semantic capabilities of ontologies. ESTER supports full-text queries, ontological queries and combinations of these, yet its interface is as easy as can be: A standard search field with semantic information provided interactively as one types. ESTER works by reducing all queries to two basic operations: prefix search and join, which can be implemented very efficiently in terms of both processing time and index space. We demonstrate the capabilities of ESTER on a combination of the English Wikipedia with the Yago ontology, with response times below 100 milliseconds for most queries, and an index size of about 4 GB. The system can be run both stand-alone and as a Web application. 0 0
Semantic modelling of user interests based on cross-folksonomy analysis Szomszor M.
Alani H.
Ivan Cantador
O'Hara K.
Shadbolt N.
Lecture Notes in Computer Science English 2008 The continued increase in Web usage, in particular participation in folksonomies, reveals a trend towards a more dynamic and interactive Web where individuals can organise and share resources. Tagging has emerged as the de-facto standard for the organisation of such resources, providing a versatile and reactive knowledge management mechanism that users find easy to use and understand. It is common nowadays for users to have multiple profiles in various folksonomies, thus distributing their tagging activities. In this paper, we present a method for the automatic consolidation of user profiles across two popular social networking sites, and subsequent semantic modelling of their interests utilising Wikipedia as a multi-domain model. We evaluate how much can be learned from such sites, and in which domains the knowledge acquired is focussed. Results show that far richer interest profiles can be generated for users when multiple tag-clouds are combined. 0 0
Semantically enhanced entity ranking Gianluca Demartini
Firan C.S.
Tereza Iofciu
Wolfgang Nejdl
Lecture Notes in Computer Science English 2008 Users often want to find entities instead of just documents, i.e., documents entirely about specific real-world entities rather than general documents where the entities are merely mentioned. Searching for entities in Web-scale repositories is still an open challenge, as the effectiveness of ranking is usually not satisfactory. Semantics can be used in this context to improve the results by leveraging entity-driven ontologies. In this paper we propose three categories of algorithms for query adaptation, using (1) semantic information, (2) NLP techniques, and (3) link structure, to rank entities in Wikipedia. Our approaches focus on constructing queries using not only keywords but also additional syntactic information, while semantically relaxing the query by relying on a highly accurate ontology. The results show that our approaches perform effectively, and that the combination of simple NLP, link analysis and semantic techniques improves the retrieval performance of entity search. 0 0
Simultaneous multilingual search for translingual information retrieval Parton K.
McKeown K.R.
Allan J.
Henestroza E.
International Conference on Information and Knowledge Management, Proceedings English 2008 We consider the problem of translingual information retrieval, where monolingual searchers issue queries in a different language than the document language(s) and the results must be returned in the language they know, the query language. We present a framework for translingual IR that integrates document translation and query translation into the retrieval model. The corpus is represented as an aligned, jointly indexed "pseudo-parallel" corpus, where each document contains the text of the document along with its translation into the query language. The queries are formulated as multilingual structured queries, where each query term and its translations into the document language(s) are treated as synonym sets. This model leverages simultaneous search in multiple languages against jointly indexed documents to improve the accuracy of results over search using document translation or query translation alone. For query translation, we compared a statistical machine translation (SMT) approach to a dictionary-based approach. We found that using a Wikipedia-derived dictionary for named entities combined with an SMT-based dictionary worked better than SMT alone. Simultaneous multilingual search also has other important features suited to translingual search, since it can provide an indication of poor document translation when a match with the source document is found. We show how close integration of CLIR and SMT allows us to improve result translation in addition to IR results. Copyright 2008 ACM. 0 0
Using Semantic Graphs for Image Search Shieh J.-R.
Yeh Y.-T.
Lin C.-H.
Lin C.-Y.
Wu J.-L.
2008 IEEE International Conference on Multimedia and Expo, ICME 2008 - Proceedings English 2008 In this paper, we propose a Semantic Graphs for Image Search (SGIS) system, which provides a novel way to search for images by utilizing collaborative knowledge in Wikipedia and network analysis to form semantic graphs for search-term suggestion. The collaborative article editing process of Wikipedia's contributors is formalized as bipartite graphs that are folded into networks between terms. When a user types in a search term, SGIS automatically retrieves an interactive semantic graph of related terms that allows users to easily find related images not limited to a specific search term. The interactive semantic graph then serves as an interface to retrieve images through existing commercial search engines. This method significantly saves users' time by avoiding the multiple search keywords that are usually required in generic search engines. It benefits both naive users who do not possess a large vocabulary (e.g., students) and professionals who look for images on a regular basis. In our experiments, 85% of the participants favored the SGIS system over commercial search engines. 0 0
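The folding of bipartite article-term graphs into term networks mentioned above can be sketched as a one-mode projection. A minimal sketch, assuming a toy `article_terms` mapping in place of the Wikipedia editing data that SGIS actually uses:

```python
# Minimal sketch: fold a bipartite article-term graph into a weighted term-term graph.
# The toy `article_terms` mapping is an assumption; SGIS builds it from Wikipedia.
from collections import defaultdict
from itertools import combinations

article_terms = {
    "Jaguar": ["cat", "animal", "car"],
    "Leopard": ["cat", "animal"],
    "Jaguar Cars": ["car", "company"],
}

term_graph = defaultdict(int)  # (term_a, term_b) -> number of shared articles
for terms in article_terms.values():
    for a, b in combinations(sorted(set(terms)), 2):
        term_graph[(a, b)] += 1


def related_terms(query: str):
    """Suggest terms related to a query term, ranked by co-occurrence weight."""
    scores = defaultdict(int)
    for (a, b), w in term_graph.items():
        if a == query:
            scores[b] += w
        elif b == query:
            scores[a] += w
    return sorted(scores.items(), key=lambda kv: -kv[1])


print(related_terms("cat"))  # -> [('animal', 2), ('car', 1)]
```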
Using semantic Wikis to support software reuse Shiva S.G.
Shala L.A.
Journal of Software English 2008 It has been almost four decades since the idea of software reuse was proposed. Many success stories have been told, yet it is believed that software reuse is still in the development phase and has not reached its full potential. How far are we with software reuse research? What have we learned from previous software reuse efforts? This paper is an attempt to answer these questions and propose a software reuse repository system based on semantic wikis. In addition to supporting the general collaboration among users offered by regular wikis, semantic wikis provide a means of adding metadata about the concepts and relations contained within the wiki. This underlying model of domain knowledge enhances repository navigation and search performance, and results in a system that is easy to use for non-expert users while being powerful in the way in which new artifacts can be created and located. 0 0
Using wiktionary for computing semantic relatedness Torsten Zesch
Muller C.
Iryna Gurevych
Proceedings of the National Conference on Artificial Intelligence English 2008 We introduce Wiktionary as an emerging lexical semantic resource that can be used as a substitute for expert-made resources in AI applications. We evaluate Wiktionary on the pervasive task of computing semantic relatedness for English and German by means of correlation with human rankings and solving word choice problems. For the first time, we apply a concept vector based measure to a set of different concept representations like Wiktionary pseudo glosses, the first paragraph of Wikipedia articles, English WordNet glosses, and GermaNet pseudo glosses. We show that: (i) Wiktionary is the best lexical semantic resource in the ranking task and performs comparably to other resources in the word choice task, and (ii) the concept vector based approach yields the best results on all datasets in both evaluations. Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. 0 1
Utilizing phrase based semantic information for term dependency Yang X.
Fan D.
Bin W.
ACM SIGIR 2008 - 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Proceedings English 2008 Previous work on term dependency has not taken into account the semantic information underlying query phrases. In this work, we study the impact of utilizing phrase-based concepts for term dependency. We use Wikipedia to separate important and less important term dependencies, and treat them accordingly as features in a linear feature-based retrieval model. We compare our method with a Markov Random Field (MRF) model on four TREC document collections. Our experimental results show that utilizing phrase-based concepts improves the retrieval effectiveness of term dependency and reduces the size of the feature set to a large extent. 0 0
Vispedia*: Interactive visual exploration of wikipedia data via search-based integration Bryan Chan
Wu L.
Justin Talbot
Mike Cammarano
Pat Hanrahan
IEEE Transactions on Visualization and Computer Graphics English 2008 Wikipedia is an example of the collaborative, semi-structured data sets emerging on the Web. These data sets have large, non-uniform schema that require costly data integration into structured tables before visualization can begin. We present Vispedia, a Web-based visualization system that reduces the cost of this data integration. Users can browse Wikipedia, select an interesting data table, then use a search interface to discover, integrate, and visualize additional columns of data drawn from multiple Wikipedia articles. This interaction is supported by a fast path search algorithm over DBpedia, a semantic graph extracted from Wikipedia's hyperlink structure. Vispedia can also export the augmented data tables produced for use in traditional visualization systems. We believe that these techniques begin to address the "long tail" of visualization by allowing a wider audience to visualize a broader class of data. We evaluated this system in a first-use formative lab study. Study participants were able to quickly create effective visualizations for a diverse set of domains, performing data integration as needed. 0 0
WikiBABEL: Community creation of multilingual data Kumaran A.
Saravanan K.
Maurice S.
WikiSym 2008 - The 4th International Symposium on Wikis, Proceedings English 2008 In this paper, we present wikiBABEL, a collaborative framework for the efficient and effective creation of multilingual content by a community of users. The wikiBABEL framework leverages the availability of fairly stable content in a source language (typically, English) and a reasonable, not necessarily perfect, machine translation system between the source language and a given target language to create rough initial content in the target language, which is published on a collaborative platform. The platform provides an intuitive user interface and a set of linguistic tools for collaborative correction of the rough content by a community of users, aiding the creation of clean content in the target language. We describe the architectural components implementing the wikiBABEL framework, namely, the systems for source and target language content management, the mechanisms for coordination and collaboration, and the intuitive user interface for multilingual editing and review. Importantly, we discuss the integrated linguistic resources and tools, such as bilingual dictionaries, machine translation and transliteration systems, that help the users during the content correction and creation process. We also analyze and present the prime factors, whether user-interface features or linguistic tools and resources, that significantly influence the user experience in multilingual content creation. Beyond the creation of multilingual content, another significant motivation for the wikiBABEL framework is the creation of parallel corpora as a by-product. Parallel linguistic corpora are very valuable resources for both Statistical Machine Translation (SMT) and Cross-lingual Information Retrieval (CLIR) research, and may be mined effectively from multilingual data with significant content overlap, such as the data created in the wikiBABEL framework. Creation of parallel corpora by professional translators is very expensive, and hence SMT and CLIR research have been largely confined to a handful of languages. Our attempt to engage the large and diverse Internet user population may aid the creation of such linguistic resources economically, and may make computational linguistics research possible and practical in many languages of the world. 0 0
Wikipedia in action: Ontological knowledge in text categorization Maciej Janik
Kochut K.J.
Proceedings - IEEE International Conference on Semantic Computing 2008, ICSC 2008 English 2008 We present a new, ontology-based approach to automatic text categorization. An important and novel aspect of this approach is that our categorization method does not require a training set, in contrast to traditional statistical and probabilistic methods. In the presented method, the ontology, including the domain concepts organized into hierarchies of categories and interconnected by relationships, as well as instances and connections among them, effectively becomes the classifier. Our method focuses on (i) converting a text document into a thematic graph of entities occurring in the document, (ii) ontological classification of the entities in the graph, and (iii) determining the overall categorization of the thematic graph and, as a result, of the document itself. In the presented experiments, we used an RDF ontology constructed from the full English version of Wikipedia. Our experiments, conducted on corpora of Reuters news articles, showed that our training-less categorization method achieved very good overall accuracy. 0 0
A comparison of dimensionality reduction techniques for Web structure mining Chikhi N.F.
Rothenburger B.
Aussenac-Gilles N.
Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, WI 2007 English 2007 In many domains, dimensionality reduction techniques have been shown to be very effective for elucidating the underlying semantics of data. Thus, in this paper we investigate the use of various dimensionality reduction techniques (DRTs) to extract the implicit structures hidden in web hyperlink connectivity. We apply and compare four DRTs, namely Principal Component Analysis (PCA), Non-negative Matrix Factorization (NMF), Independent Component Analysis (ICA) and Random Projection (RP). Experiments conducted on three datasets allow us to assert the following: NMF outperforms PCA and ICA in terms of stability and interpretability of the discovered structures, and the well-known WebKB dataset used in many works on hyperlink connectivity analysis appears to be ill-suited for this task; we suggest instead using the more recent Wikipedia dataset, which is better adapted. 0 0
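All four dimensionality reduction techniques compared in this paper are available in standard libraries. The following is a minimal sketch using scikit-learn on a hypothetical hyperlink adjacency matrix; the toy matrix and the number of components are placeholders, not the paper's experimental setup.

```python
# Sketch: applying PCA, NMF, ICA and Random Projection to a small
# hyperlink adjacency matrix. Toy data; not the paper's experimental setup.
import numpy as np
from sklearn.decomposition import PCA, NMF, FastICA
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
# Hypothetical 100-page web graph: X[i, j] = 1 if page i links to page j.
X = (rng.random((100, 100)) < 0.05).astype(float)

k = 5  # number of latent structures to extract
reducers = {
    "PCA": PCA(n_components=k),
    "NMF": NMF(n_components=k, init="nndsvd", max_iter=500),
    "ICA": FastICA(n_components=k, max_iter=1000),
    "RP":  GaussianRandomProjection(n_components=k, random_state=0),
}

for name, reducer in reducers.items():
    Z = reducer.fit_transform(X)   # pages embedded in a k-dimensional space
    print(name, Z.shape)           # each prints (100, 5)
```

Interpreting the resulting factors (e.g., inspecting which pages load heavily on each NMF component) is where the stability and interpretability differences reported in the paper become visible.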
DBpedia: A nucleus for a Web of open data Sören Auer
Christian Bizer
Georgi Kobilarov
Jens Lehmann
Richard Cyganiak
Zachary Ives
Lecture Notes in Computer Science English 2007 DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against datasets derived from Wikipedia and to link other datasets on the Web to Wikipedia data. We describe the extraction of the DBpedia datasets, and how the resulting information is published on the Web for human- and machine-consumption. We describe some emerging applications from the DBpedia community and show how website authors can facilitate DBpedia content within their sites. Finally, we present the current status of interlinking DBpedia with other open datasets on the Web and outline how DBpedia could serve as a nucleus for an emerging Web of open data. 0 2
Determining factors behind the PageRank log-log plot Yana Volkovich
Litvak N.
Debora Donato
Lecture Notes in Computer Science English 2007 We study the relation between PageRank and other parameters of information networks such as in-degree, out-degree, and the fraction of dangling nodes. We model this relation through a stochastic equation inspired by the original definition of PageRank. Further, we use the theory of regular variation to prove that PageRank and in-degree follow power laws with the same exponent. The difference between these two power laws is in a multiplicative constant, which depends mainly on the fraction of dangling nodes, average in-degree, the power law exponent, and the damping factor. The out-degree distribution has a minor effect, which we explicitly quantify. Finally, we propose a ranking scheme which does not depend on out-degrees. 0 0
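For reference, the stochastic recursion that inspires this analysis follows the standard PageRank definition; the sketch below uses our own notation and states the paper's main result only informally.

```latex
% Standard PageRank recursion (illustrative notation, not necessarily the paper's):
%   c = damping factor, n = number of pages, d_j = out-degree of page j;
%   dangling nodes (d_j = 0) redistribute their mass uniformly.
R(i) \;=\; \frac{1-c}{n} \;+\; c \sum_{j \,:\, j \to i} \frac{R(j)}{d_j}
% Informal statement of the paper's main result: PageRank R and in-degree
% D_in have power-law tails with the same exponent,
%   \Pr(R > x) \;\sim\; C \cdot \Pr(D_{\mathrm{in}} > x) \quad (x \to \infty),
% where the multiplicative constant C depends mainly on the fraction of dangling
% nodes, the average in-degree, the power-law exponent, and the damping factor c.
```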
Efficient interactive query expansion with CompleteSearch Holger Bast
Debapriyo Majumdar
Ingmar Weber
International Conference on Information and Knowledge Management, Proceedings English 2007 We present an efficient realization of the following interactive search engine feature: as the user is typing the query, words that are related to the last query word and that would lead to good hits are suggested, along with a selection of such hits. The realization has three parts: (i) building clusters of related terms, (ii) adding this information as artificial words to the index such that (iii) the described feature reduces to an instance of prefix search and completion. An efficient solution for the latter is provided by the CompleteSearch engine, with which we have integrated the proposed feature. For building the clusters of related terms we propose a variant of latent semantic indexing that, unlike standard approaches, is completely transparent to the user. Through experiments on two large test collections, we demonstrate that the feature is provided at only a slight increase in query processing time and index size. Copyright 2007 ACM. 0 0
PORE: Positive-only relation extraction from wikipedia text Gang Wang
Yiqin Yu
Haiping Zhu
Lecture Notes in Computer Science English 2007 Extracting semantic relations is of great importance for the creation of Semantic Web content. It is of great benefit to semi-automatically extract relations from the free text of Wikipedia using the structured content readily available in it. Pattern matching methods that employ information redundancy cannot work well, since there is not much redundant information in Wikipedia compared to the Web. Multi-class classification methods are not reasonable, since no classification of relation types is available in Wikipedia. In this paper, we propose PORE (Positive-Only Relation Extraction) for relation extraction from Wikipedia text. The core algorithm B-POL extends a state-of-the-art positive-only learning algorithm using bootstrapping, strong negative identification, and transductive inference to work with fewer positive training examples. We conducted experiments on several relations with different amounts of training data. The experimental results show that B-POL works effectively given only a small number of positive training examples and that it significantly outperforms the original positive-only learning approaches and a multi-class SVM. Furthermore, although PORE is applied in the context of Wikipedia, the core algorithm B-POL is a general approach for Ontology Population and can be adapted to other domains. 0 0
Semplore: An IR approach to scalable hybrid query of Semantic Web data Lei Zhang
Qiaoling Liu
Jinghua Zhang
Haofen Wang
Yue Pan
Yiqin Yu
Lecture Notes in Computer Science English 2007 As an extension to the current Web, the Semantic Web will contain not only structured data with machine-understandable semantics but also textual information. While structured queries can be used to find information more precisely on the Semantic Web, keyword searches are still needed to help exploit textual information. It thus becomes very important to combine precise structured queries with imprecise keyword searches into a hybrid query capability. In addition, due to the huge volume of information on the Semantic Web, the hybrid query must be processed in a very scalable way. In this paper, we define such a hybrid query capability that combines unary tree-shaped structured queries with keyword searches. We show how existing information retrieval (IR) index structures and functions can be reused to index Semantic Web data and its textual information, and how the hybrid query is evaluated on the index structure using IR engines in an efficient and scalable manner. We implemented this IR approach in an engine called Semplore. Comprehensive experiments on its performance show that it is a promising approach, and lead us to believe that it may be possible to evolve current web search engines to query and search the Semantic Web. Finally, we briefly describe how Semplore is used for searching Wikipedia and an IBM customer's product information. 0 0
Wikify! Linking documents to encyclopedic knowledge Rada Mihalcea
Andras Csomai
International Conference on Information and Knowledge Management, Proceedings English 2007 This paper introduces the use of Wikipedia as a resource for automatic keyword extraction and word sense disambiguation, and shows how this online encyclopedia can be used to achieve state-of-the-art results on both these tasks. The paper also shows how the two methods can be combined into a system able to automatically enrich a text with links to encyclopedic knowledge. Given an input document, the system identifies the important concepts in the text and automatically links these concepts to the corresponding Wikipedia pages. Evaluations of the system show that the automatic annotations are reliable and hardly distinguishable from manual annotations. Copyright 2007 ACM. 0 0
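A common way to realize the keyword extraction step described here is a link-probability ("keyphraseness") score, estimated from how often a phrase appears as a link anchor in Wikipedia relative to how often it appears at all. The sketch below is illustrative only; the counts, threshold, and function names are invented, not taken from the paper.

```python
# Sketch of link-probability ("keyphraseness") based keyword selection.
# The counts and threshold below are invented for illustration.
link_counts = {"information theory": 940, "machine translation": 780, "the paper": 3}
occurrence_counts = {"information theory": 1200, "machine translation": 1100, "the paper": 250000}

THRESHOLD = 0.05  # keep phrases that are linked often enough when they occur


def keyphraseness(phrase: str) -> float:
    """Fraction of occurrences of `phrase` in Wikipedia that are link anchors."""
    occ = occurrence_counts.get(phrase, 0)
    return link_counts.get(phrase, 0) / occ if occ else 0.0


def select_keywords(phrases):
    return [p for p in phrases if keyphraseness(p) >= THRESHOLD]


print(select_keywords(["information theory", "machine translation", "the paper"]))
# -> ['information theory', 'machine translation']
```

A selected phrase would then be disambiguated and linked to the corresponding Wikipedia page, which is the second step the abstract describes.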
Wikipedia-based Kernels for text categorization Zsolt Minier
Zalan Bodo
Lehel Csato
Proceedings - 9th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2007 English 2007 In recent years several models have been proposed for text categorization. Among these, one of the most widely applied is the vector space model (VSM), where independence between indexing terms, usually words, is assumed. Since training corpora are relatively small compared to what would be required for a realistic number of words, the generalization power of the learning algorithms is low. It is assumed that a bigger text corpus can boost the representation and hence the learning process. Based on the work of Gabrilovich and Markovitch [6], we incorporate Wikipedia articles into the system to give a word-distributional representation for documents. The extension with this new corpus increases the dimensionality, so clustering of features is needed. We use Latent Semantic Analysis (LSA), Kernel Principal Component Analysis (KPCA) and Kernel Canonical Correlation Analysis (KCCA), and present results for these experiments on the Reuters corpus. 0 0
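Of the three feature-clustering techniques listed, LSA is the simplest to sketch; the example below applies truncated SVD to a toy document collection using scikit-learn. The corpus, the vectorizer settings, and the two-dimensional latent space are placeholders, not the paper's setup (KPCA and KCCA would be configured differently).

```python
# Minimal LSA sketch: truncated SVD over a toy document collection.
# Toy corpus and dimensionality are placeholders, not the paper's setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "wikipedia article about information theory and entropy",
    "entropy and coding in information theory",
    "football match report and league results",
    "league standings after the football season",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)          # documents x terms (sparse)

lsa = TruncatedSVD(n_components=2, random_state=0)
Z = lsa.fit_transform(X)               # documents in a 2-dimensional latent space

print(Z.shape)                         # (4, 2); similar topics end up close together
```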
NNexus: Towards an automatic linker for a massively-distributed collaborative corpus Gardner J.
Krowne A.
Xiong L.
2006 International Conference on Collaborative Computing: Networking, Applications and Worksharing, CollaborateCom English 2006 Collaborative online encyclopedias such as Wikipedia and PlanetMath are becoming increasingly popular. In order to understand an article in a corpus, a user must understand the related and underlying concepts through linked articles. In this paper, we introduce NNexus, a generalization of the automatic linking component of PlanetMath.org and the first system that automates the process of linking encyclopedia entries into a semantic network of concepts. We discuss the challenges, present the conceptual models as well as specific mechanisms of the NNexus system, and discuss some of our ongoing and completed work. 0 0
Scalable information sharing utilizing decentralized P2P networking integrated with centralized personal and group media tools Guozhen Z.
Qun J.
Proceedings - International Conference on Advanced Information Networking and Applications, AINA English 2006 In our previous work, we proposed a collaborative information sharing environment based on P2P networking technology to support communication among special groups with given tasks, ensure fast information exchange, increase the productivity of working groups, and reduce maintenance and administration costs. However, for a growing social community, information exchange and sharing functions alone are not sufficient; solutions that provide users with tools for publishing ideas and knowledge, for private or public use, are also essential. Some personal message posting tools for personal ideas and experiences (e.g., weblogs) and group collaborative knowledge editing tools (e.g., wikis) are used in practice, and the merits of these tools have been recognized. In this paper, we propose a scalable information sharing solution that integrates decentralized P2P networking with centralized personal and group media tools. This solution combines effective tools such as weblogs and wikis into a P2P-based collaborative groupware system to facilitate continuously growing and scalable information management and sharing for individuals and groups. 0 0