Data mining

From WikiPapers

Data mining is included as a keyword or extra keyword in 0 datasets, 3 tools and 199 publications.


There are no datasets for this keyword.


Tool Operating System(s) Language(s) Programming language(s) License Description Image
Dump-downloader Cross-platform Perl Apache License 2.0 dump-downloader Script to request and download the full history dump of all the pages in a MediaWiki. Meant to work for Wikia's wikis but it could work with other wikis. Source code here:
Wikipedia Miner
Wikokit Cross-platform Java EPLv1.0
New BSD License
wikokit (wiki tool kit) - several projects related to wiki.

wiwordik - machine-readable Wiktionary. A visual interface to the parsed English Wiktionary and Russian Wiktionary databases.
Java WebStart application + JavaFX, English interface.
742 languages extracted from the English Wiktionary.

423 languages extracted from the Russian Wiktionary.


Title Author(s) Published in Language Date Abstract R C
A generic framework and methodology for extracting semantics from co-occurrences Rachakonda A.R.
Srinivasa S.
Sayali Kulkarni
Srinivasan M.S.
Data and Knowledge Engineering English 2014 Extracting semantic associations from text corpora is an important problem with several applications. It is well understood that semantic associations from text can be discerned by observing patterns of co-occurrences of terms. However, much of the work in this direction has been piecemeal, addressing specific kinds of semantic associations. In this work, we propose a generic framework, using which several kinds of semantic associations can be mined. The framework comprises a co-occurrence graph of terms, along with a set of graph operators. A methodology for using this framework is also proposed, where the properties of a given semantic association can be hypothesized and tested over the framework. To show the generic nature of the proposed model, four different semantic associations are mined over a corpus comprising of Wikipedia articles. The design of the proposed framework is inspired from cognitive science - specifically the interplay between semantic and episodic memory in humans. © 2014 Elsevier B.V. All rights reserved. 0 0
A methodology based on commonsense knowledge and ontologies for the automatic classification of legal cases Capuano N.
De Maio C.
Salerno S.
Toti D.
ACM International Conference Proceeding Series English 2014 We describe a methodology for the automatic classification of legal cases expressed in natural language, which relies on existing legal ontologies and a commonsense knowledge base. This methodology is founded on a process consisting of three phases: an enrichment of a given legal ontology by associating its terms with topics retrieved from the Wikipedia knowledge base; an extraction of relevant concepts from a given textual legal case; and a matching between the enriched ontological terms and the extracted concepts. Such a process has been successfully implemented in a corresponding tool that is part of a larger framework for self-litigation and legal support for the Italian law. 0 0
A piece of my mind: A sentiment analysis approach for online dispute detection Lei Wang
Cardie C.
52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference English 2014 We investigate the novel task of online dispute detection and propose a sentiment analysis solution to the problem: we aim to identify the sequence of sentence-level sentiments expressed during a discussion and to use them as features in a classifier that predicts the DISPUTE/NON-DISPUTE label for the discussion as a whole. We evaluate dispute detection approaches on a newly created corpus of Wikipedia Talk page disputes and find that classifiers that rely on our sentiment tagging features outperform those that do not. The best model achieves a very promising F1 score of 0.78 and an accuracy of 0.80. 0 0
Approach for building high-quality domain ontology based on the Chinese Wikipedia Wu T.
Tang Z.
Xiao K.
ICIC Express Letters English 2014 In this paper, we propose a new approach for building high-quality domain ontology based on the Chinese Wikipedia. In contrast to traditional Wikipedia ontologies, such as DBpedia and YAGO, the domain ontology built in this paper consists of high-quality articles. We make use of the C4.5 algorithm to hunt high-quality articles from a specific domain in Wikipedia. As a result, a domain ontology is built accordingly. 0 0
Augmenting concept definition in gloss vector semantic relatedness measure using wikipedia articles Pesaranghader A.
Rezaei A.
Lecture Notes in Electrical Engineering English 2014 Semantic relatedness measures are widely used in text mining and information retrieval applications. Considering these automated measures, in this research paper we attempt to improve Gloss Vector relatedness measure for more accurate estimation of relatedness between two given concepts. Generally, this measure, by constructing concepts definitions (Glosses) from a thesaurus, tries to find the angle between the concepts' gloss vectors for the calculation of relatedness. Nonetheless, this definition construction task is challenging as thesauruses do not provide full coverage of expressive definitions for the particularly specialized concepts. By employing Wikipedia articles and other external resources, we aim at augmenting these concepts' definitions. Applying both definition types to the biomedical domain, using MEDLINE as corpus, UMLS as the default thesaurus, and a reference standard of 68 concept pairs manually rated for relatedness, we show exploiting available resources on the Web would have positive impact on final measurement of semantic relatedness. 0 0
Collective memory in Poland: A reflection in street names Radoslaw Nielek
Wawer A.
Adam Wierzbicki
Lecture Notes in Computer Science English 2014 Our article starts with an observation that street names fall into two general types: generic and historically inspired. We analyse street names distributions (of the second type) as a window to nation-level collective memory in Poland. The process of selecting street names is determined socially, as the selections reflect the symbols considered important to the nation-level society, but has strong historical motivations and determinants. In the article, we seek for these relationships in the available data sources. We use Wikipedia articles to match street names with their textual descriptions and assign them to the time points. We then apply selected text mining and statistical techniques to reach quantitative conclusions. We also present a case study: the geographical distribution of two particular street names in Poland to demonstrate the binding between history and political orientation of regions. 0 0
Conceptual clustering Boubacar A.
Niu Z.
Lecture Notes in Electrical Engineering English 2014 Traditional clustering methods are unable to describe the generated clusters. Conceptual clustering is an important and active research area that aims to efficiently cluster and explain the data. Previous conceptual clustering approaches provide descriptions that do not use a human comprehensible knowledge. This paper presents an algorithm which uses Wikipedia concepts to process a clustering method. The generated clusters overlap each other and serve as a basis for an information retrieval system. The method has been implemented in order to improve the performance of the system. It reduces the computation cost. 0 0
Experimental comparison of semantic word clouds Barth L.
Kobourov S.G.
Pupyrev S.
Lecture Notes in Computer Science English 2014 We study the problem of computing semantics-preserving word clouds in which semantically related words are close to each other. We implement three earlier algorithms for creating word clouds and three new ones. We define several metrics for quantitative evaluation of the resulting layouts. Then the algorithms are compared according to these metrics, using two data sets of documents from Wikipedia and research papers. We show that two of our new algorithms outperform all the others by placing many more pairs of related words so that their bounding boxes are adjacent. Moreover, this improvement is not achieved at the expense of significantly worsened measurements for the other metrics. 0 0
Exploratory search with semantic transformations using collaborative knowledge bases Yegin Genc WSDM 2014 - Proceedings of the 7th ACM International Conference on Web Search and Data Mining English 2014 Sometimes we search for simple facts. Other times we search for relationships between concepts. While existing information retrieval systems work well for simple searches, they are less satisfying for complex inquiries because of the ill-structured nature of many searches and the cognitive load involved in the search process. Search can be improved by leveraging the network of concepts that are maintained by collaborative knowledge bases such as Wikipedia. By treating exploratory search inquiries as networks of concepts, and then mapping documents to these concepts, exploratory search performance can be improved. This method is applied to an exploratory search task: given a journal abstract, abstracts are ranked based on their relevancy to the seed abstract. The results show comparable relevancy scores to state of the art techniques while at the same time providing better diversity. 0 0
Graph-based domain-specific semantic relatedness from Wikipedia Sajadi A. Lecture Notes in Computer Science English 2014 Human made ontologies and lexicons are promising resources for many text mining tasks in domain specific applications, but they do not exist for most domains. We study the suitability of Wikipedia as an alternative resource for ontologies regarding the Semantic Relatedness problem. We focus on the biomedical domain because (1) high quality manually curated ontologies are available and (2) successful graph based methods have been proposed for semantic relatedness in this domain. Because Wikipedia is not hierarchical and links do not convey defined semantic relationships, the same methods used on lexical resources (such as WordNet) cannot be applied here straightforwardly. Our contributions are (1) Demonstrating that Wikipedia based methods outperform state of the art ontology based methods on most of the existing ontologies in the biomedical domain (2) Adapting and evaluating the effectiveness of a group of bibliometric methods of various degrees of sophistication on Wikipedia for the first time (3) Proposing a new graph-based method that is outperforming existing methods by considering some specific features of Wikipedia structure. 0 0
Heterogeneous graph-based intent learning with queries, web pages and Wikipedia concepts Ren X.
Yafang Wang
Yu X.
Yan J.
Zheng Chen
Jangwhan Han
WSDM 2014 - Proceedings of the 7th ACM International Conference on Web Search and Data Mining English 2014 The problem of learning user search intents has attracted intensive attention from both industry and academia. However, state-of-the-art intent learning algorithms suffer from different drawbacks when only using a single type of data source. For example, query text has difficulty in distinguishing ambiguous queries; the search log is biased by the order of search results and users' noisy click behaviors. In this work, we for the first time leverage three types of objects, namely queries, web pages and Wikipedia concepts collaboratively for learning generic search intents and construct a heterogeneous graph to represent multiple types of relationships between them. A novel unsupervised method called heterogeneous graph-based soft-clustering is developed to derive an intent indicator for each object based on the constructed heterogeneous graph. With the proposed co-clustering method, one can enhance the quality of intent understanding by taking advantage of different types of data, which complement each other, and make the implicit intents easier to interpret with explicit knowledge from Wikipedia concepts. Experiments on two real-world datasets demonstrate the power of the proposed method where it achieves a 9.25% improvement in terms of NDCG on search ranking task and a 4.67% enhancement in terms of Rand index on object co-clustering task compared to the best state-of-the-art method. 0 0
Identifying the topic of queries based on domain specify ontology ChienTa D.C.
Thi T.P.
WIT Transactions on Information and Communication Technologies English 2014 In order to identify the topic of queries, a large number of past researches have relied on lexicon-syntactic and handcrafted knowledge sources in Machine Learning and Natural Language Processing (NLP). Conversely, in this paper, we introduce the application system that detects the topic of queries based on domain-specific ontology. On this system, we work hard on building this domain-specific ontology, which is composed of instances automatically extracted from available resources such as Wikipedia, WordNet, and ACM Digital Library. The experimental evaluation with many cases of queries related to information technology area shows that this system considerably outperforms a matching and identifying approach. 0 0
Inferring attitude in online social networks based on quadratic correlation Chao Wang
Bulatov A.A.
Lecture Notes in Computer Science English 2014 The structure of an online social network in most cases cannot be described just by links between its members. We study online social networks, in which members may have certain attitude, positive or negative, toward each other, and so the network consists of a mixture of both positive and negative relationships. Our goal is to predict the sign of a given relationship based on the evidences provided in the current snapshot of the network. More precisely, using machine learning techniques we develop a model that after being trained on a particular network predicts the sign of an unknown or hidden link. The model uses relationships and influences from peers as evidences for the guess, however, the set of peers used is not predefined but rather learned during the training process. We use quadratic correlation between peer members to train the predictor. The model is tested on popular online datasets such as Epinions, Slashdot, and Wikipedia. In many cases it shows almost perfect prediction accuracy. Moreover, our model can also be efficiently updated as the underlying social network evolves. 0 0
Leveraging open source tools for Web mining Pennete K.C. Lecture Notes in Electrical Engineering English 2014 Web mining is the most pursued research area and often the most challenging one. Using web mining, corporates and individuals alike are inquisitively pursuing to unravel the hidden knowledge underneath the diverse gargantuan volumes of web data. This paper tries to present how a researcher can leverage the colossal knowledge available in open access sites such as Wikipedia as a source of information rather than subscribing to closed networks of knowledge and use open source tools rather than prohibitively priced commercial mining tools to do web mining. The paper illustrates a step-by-step usage of R and RapidMiner in web mining to enable a novice to understand the concepts as well as apply it in real world. 0 0
Mining knowledge on relationships between objects from the web Xiaodan Zhang
Yasuhito Asano
Masatoshi Yoshikawa
IEICE Transactions on Information and Systems English 2014 How do global warming and agriculture influence each other? It is possible to answer the question by searching knowledge about the relationship between global warming and agriculture. As exemplified by this question, strong demands exist for searching relationships between objects. Mining knowledge about relationships on Wikipedia has been studied. However, it is desired to search more diverse knowledge about relationships on the Web. By utilizing the objects constituting relationships mined from Wikipedia, we propose a new method to search images with surrounding text that include knowledge about relationships on the Web. Experimental results show that our method is effective and applicable in searching knowledge about relationships. We also construct a relationship search system named "Enishi" based on the proposed new method. Enishi supplies a wealth of diverse knowledge including images with surrounding text to help users to understand relationships deeply, by complementarily utilizing knowledge from Wikipedia and the Web. 0 0
Research on XML data mining model based on multi-level technology Zhu J.-X. Advanced Materials Research English 2014 The era of Web 2.0 has arrived, and more and more Web 2.0 applications, such as social networks and Wikipedia, have come up. As an industrial standard of the Web 2.0, the XML technique has also attracted more and more researchers. However, how to mine valuable information from massive XML documents is still in its infancy. In this paper, we study the basic problem of XML data mining: the XML data mining model. We design a multi-level XML data mining model, propose a multi-level data mining method, and list some research issues in the implementation of XML data mining systems. 0 0
The last click: Why users give up information network navigation Scaria A.T.
Philip R.M.
Robert West
Leskovec J.
WSDM 2014 - Proceedings of the 7th ACM International Conference on Web Search and Data Mining English 2014 An important part of finding information online involves clicking from page to page until an information need is fully satisfied. This is a complex task that can easily be frustrating and force users to give up prematurely. An empirical analysis of what makes users abandon click-based navigation tasks is hard, since most passively collected browsing logs do not specify the exact target page that a user was trying to reach. We propose to overcome this problem by using data collected via Wikispeedia, a Wikipedia-based human-computation game, in which users are asked to navigate from a start page to an explicitly given target page (both Wikipedia articles) by only tracing hyperlinks between Wikipedia articles. Our contributions are two-fold. First, by analyzing the differences between successful and abandoned navigation paths, we aim to understand what types of behavior are indicative of users giving up their navigation task. We also investigate how users make use of back clicks during their navigation. We find that users prefer backtracking to high-degree nodes that serve as landmarks and hubs for exploring the network of pages. Second, based on our analysis, we build statistical models for predicting whether a user will finish or abandon a navigation task, and if the next action will be a back click. Being able to predict these events is important as it can potentially help us design more human-friendly browsing interfaces and retain users who would otherwise have given up navigating a website. 0 0
Tibetan-Chinese named entity extraction based on comparable corpus Sun Y.
Zhao Q.
Applied Mechanics and Materials English 2014 Tibetan-Chinese named entity extraction is the foundation of Tibetan-Chinese information processing, which provides the basis for machine translation and cross-language information retrieval research. We used the multi-language links of Wikipedia to obtain Tibetan-Chinese comparable corpus, and combined sentence length, word matching and entity boundary words together to carry out sentence alignment. Then we extracted Tibetan-Chinese named entity from the aligned comparable corpus in three ways: (1) Natural labeling information extraction. (2) The links of Tibetan entries and Chinese entries extraction. (3) The method of sequence intersection. It contained taking the sentence as words sequence, recognizing Chinese named entity from Chinese sentences and intersecting aligned Tibetan sentences. Finally, through the experiment, the results prove the extraction method based on comparable corpus is effective. 0 0
Topic ontology-based efficient tag recommendation approach for blogs Subramaniyaswamy V.
Pandian S.C.
International Journal of Computational Science and Engineering English 2014 Efficient tag recommendation systems are required to help users in the task of searching, indexing and browsing appropriate blog content. Tag generation has become more popular to annotate web content, other blogs, photos, videos and music. Tag recommendation is an action of signifying valuable and informative tags to a budding item based on the content. We propose a novel approach based on topic ontology for tag recommendation. The proposed approach intelligently generates tag suggestions to blogs. In this paper, we effectively construct the technology entitled Ontology based on Wikipedia categories and WordNet semantic relationship to make the ontology more meaningful and reliable. Spreading activation algorithm is applied to assign interest scores to existing blog content and tags. High quality tags are suggested based on the significance of the interest score. Evaluation proves that the applicability of topic ontology with spreading activation algorithm helps tag recommendation more effective when compared to collaborative tag recommendations. Our proposed approach offers several solutions to tag spamming, sentiment analysis and popularity. Finally, we report the results of an experiment which improves the performance of tag recommendation approach. 0 0
Towards linking libraries and Wikipedia: Automatic subject indexing of library records with Wikipedia concepts Joorabchi A.
Mahdi A.E.
Journal of Information Science English 2014 In this article, we first argue the importance and timely need of linking libraries and Wikipedia for improving the quality of their services to information consumers, as such linkage will enrich the quality of Wikipedia articles and at the same time increase the visibility of library resources which are currently overlooked to a large degree. We then describe the development of an automatic system for subject indexing of library metadata records with Wikipedia concepts as an important step towards library-Wikipedia integration. The proposed system is based on first identifying all Wikipedia concepts occurring in the metadata elements of library records. This is then followed by training and deploying generic machine learning algorithms to automatically select those concepts which most accurately reflect the core subjects of the library materials whose records are being indexed. We have assessed the performance of the developed system using standard information retrieval measures of precision, recall and F-score on a dataset consisting of 100 library metadata records manually indexed with a total of 469 Wikipedia concepts. The evaluation results show that the developed system is capable of achieving an averaged F-score as high as 0.92. 0 0
Trendspedia: An Internet observatory for analyzing and visualizing the evolving web Kang W.
Tung A.K.H.
Chen W.
Li X.
Song Q.
Zhang C.
Fei Zhao
Xiaofeng Zhou
Proceedings - International Conference on Data Engineering English 2014 The popularity of social media services has been innovating the way of information acquisition in modern society. Meanwhile, mass information is generated in every single day. To extract useful knowledge, much effort has been invested in analyzing social media contents, e.g., (emerging) topic discovery. With these findings, however, users may still find it hard to obtain knowledge of great interest in conformity with their preference. In this paper, we present a novel system which brings proper context to continuously incoming social media contents, such that mass information can be indexed, organized and analyzed around Wikipedia entities. Four data analytics tools are employed in the system. Three of them aim to enrich each Wikipedia entity by analyzing the relevant contents while the other one builds an information network among the most relevant Wikipedia entities. With our system, users can easily pinpoint valuable information and knowledge they are interested in, as well as navigate to other closely related entities through the information network for further exploration. 0 0
Trust, but verify: Predicting contribution quality for knowledge base construction and curation Tan C.H.
Agichtein E.
Ipeirotis P.
Evgeniy Gabrilovich
WSDM 2014 - Proceedings of the 7th ACM International Conference on Web Search and Data Mining English 2014 The largest publicly available knowledge repositories, such as Wikipedia and Freebase, owe their existence and growth to volunteer contributors around the globe. While the majority of contributions are correct, errors can still creep in, due to editors' carelessness, misunderstanding of the schema, malice, or even lack of accepted ground truth. If left undetected, inaccuracies often degrade the experience of users and the performance of applications that rely on these knowledge repositories. We present a new method, CQUAL, for automatically predicting the quality of contributions submitted to a knowledge base. Significantly expanding upon previous work, our method holistically exploits a variety of signals, including the user's domains of expertise as reflected in her prior contribution history, and the historical accuracy rates of different types of facts. In a large-scale human evaluation, our method exhibits precision of 91% at 80% recall. Our model verifies whether a contribution is correct immediately after it is submitted, significantly alleviating the need for post-submission human reviewing. 0 0
Using linked data to mine RDF from Wikipedia's tables Munoz E.
Hogan A.
Mileo A.
WSDM 2014 - Proceedings of the 7th ACM International Conference on Web Search and Data Mining English 2014 The tables embedded in Wikipedia articles contain rich, semi-structured encyclopaedic content. However, the cumulative content of these tables cannot be queried against. We thus propose methods to recover the semantics of Wikipedia tables and, in particular, to extract facts from them in the form of RDF triples. Our core method uses an existing Linked Data knowledge-base to find pre-existing relations between entities in Wikipedia tables, suggesting the same relations as holding for other entities in analogous columns on different rows. We find that such an approach extracts RDF triples from Wikipedia's tables at a raw precision of 40%. To improve the raw precision, we define a set of features for extracted triples that are tracked during the extraction phase. Using a manually labelled gold standard, we then test a variety of machine learning methods for classifying correct/incorrect triples. One such method extracts 7.9 million unique and novel RDF triples from over one million Wikipedia tables at an estimated precision of 81.5%. 0 0
A cloud of FAQ: A highly-precise FAQ retrieval system for the Web 2.0 Romero M.
Moreo A.
Castro J.L.
Knowledge-Based Systems English 2013 FAQ (Frequently Asked Questions) lists have attracted increasing attention for companies and organizations. There is thus a need for high-precision and fast methods able to manage large FAQ collections. In this context, we present a FAQ retrieval system as part of a FAQ exploiting project. Following the growing trend towards Web 2.0, we aim to provide users with mechanisms to navigate through the domain of knowledge and to facilitate both learning and searching, beyond classic FAQ retrieval algorithms. To this purpose, our system involves two different modules: an efficient and precise FAQ retrieval module and a tag cloud generation module designed to help users to complete the comprehension of the retrieved information. Empirical results evidence the validity of our approach with respect to a number of state-of-the-art algorithms in terms of the most popular metrics in the field. © 2013 Elsevier B.V. All rights reserved. 0 0
A generalized flow-based method for analysis of implicit relationships on wikipedia Xiaodan Zhang
Yasuhito Asano
Masatoshi Yoshikawa
IEEE Transactions on Knowledge and Data Engineering English 2013 We focus on measuring relationships between pairs of objects in Wikipedia whose pages can be regarded as individual objects. Two kinds of relationships between two objects exist: in Wikipedia, an explicit relationship is represented by a single link between the two pages for the objects, and an implicit relationship is represented by a link structure containing the two pages. Some of the previously proposed methods for measuring relationships are cohesion-based methods, which underestimate objects having high degrees, although such objects could be important in constituting relationships in Wikipedia. The other methods are inadequate for measuring implicit relationships because they use only one or two of the following three important factors: distance, connectivity, and cocitation. We propose a new method using a generalized maximum flow which reflects all the three factors and does not underestimate objects having high degree. We confirm through experiments that our method can measure the strength of a relationship more appropriately than these previously proposed methods do. Another remarkable aspect of our method is mining elucidatory objects, that is, objects constituting a relationship. We explain that mining elucidatory objects would open a novel way to deeply understand a relationship. 0 0
A method for recommending the most appropriate expansion of acronyms using wikipedia Choi D.
Shin J.
Lee E.
Kim P.
Proceedings - 7th International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing, IMIS 2013 English 2013 Over the years, many researchers have studied how to detect expansions of acronyms in texts by using linguistic and syntactical approaches in order to overcome disambiguation problems. An acronym is an abbreviation composed of the initial components of single or multiple words. These initial components cause huge mistakes when a machine conducts experiments to find meaning from given texts. Detecting expansions of acronyms is not a big issue nowadays. The problem is polysemous acronyms. In order to solve this problem, this paper proposes a method to recommend the most related expansion of an acronym through analyzing co-occurrence words by using Wikipedia. Our goal is not finding acronym definition or expansion but recommending the most appropriate expansion of given acronyms. 0 0
A new approach to detecting content anomalies in Wikipedia Sinanc D.
Yavanoglu U.
Proceedings - 2013 12th International Conference on Machine Learning and Applications, ICMLA 2013 English 2013 The rapid growth of the web has made data widely available, but such data is useful only if its content is well organized. Despite the fact that Wikipedia is the biggest encyclopedia on the web, its quality is suspect due to its Open Editing Schemas (OES). In this study, zoology and botany pages are selected in English Wikipedia and their HTML contents are converted to text, then an Artificial Neural Network (ANN) is used for classification to prevent disinformation or misinformation. After the training phase, some irrelevant words about politics or terrorism are added to the content in proportion to the size of the text. In the interval between unsuitable content being added to a page and the moderators' intervention, the proposed system detects the error via wrong categorization. The results have shown that, when words amounting to 2% of the content are added, the anomaly rate begins to cross the 50% border. 0 0
An automatic approach for ontology-based feature extraction from heterogeneous textual resources Vicient C.
Sanchez D.
Moreno A.
Engineering Applications of Artificial Intelligence English 2013 Data mining algorithms such as data classification or clustering methods exploit features of entities to characterise, group or classify them according to their resemblance. In the past, many feature extraction methods focused on the analysis of numerical or categorical properties. In recent years, motivated by the success of the Information Society and the WWW, which has made available enormous amounts of textual electronic resources, researchers have proposed semantic data classification and clustering methods that exploit textual data at a conceptual level. To do so, these methods rely on pre-annotated inputs in which text has been mapped to their formal semantics according to one or several knowledge structures (e.g. ontologies, taxonomies). Hence, they are hampered by the bottleneck introduced by the manual semantic mapping process. To tackle this problem, this paper presents a domain-independent, automatic and unsupervised method to detect relevant features from heterogeneous textual resources, associating them to concepts modelled in a background ontology. The method has been applied to raw text resources and also to semi-structured ones (Wikipedia articles). It has been tested in the Tourism domain, showing promising results. © 2012 Elsevier Ltd. All rights reserved. 0 0
Analyzing multi-dimensional networks within mediawikis Brian C. Keegan
Ceni A.
Smith M.A.
Proceedings of the 9th International Symposium on Open Collaboration, WikiSym + OpenSym 2013 English 2013 The MediaWiki platform supports popular socio-technical systems such as Wikipedia as well as thousands of other wikis. This software encodes and records a variety of relationships about the content, history, and editors of its articles such as hyperlinks between articles, discussions among editors, and editing histories. These relationships can be analyzed using standard techniques from social network analysis, however, extracting relational data from Wikipedia has traditionally required specialized knowledge of its API, information retrieval, network analysis, and data visualization that has inhibited scholarly analysis. We present a software library called the NodeXL MediaWiki Importer that extracts a variety of relationships from the MediaWiki API and integrates with the popular NodeXL network analysis and visualization software. This library allows users to query and extract a variety of multidimensional relationships from any MediaWiki installation with a publicly-accessible API. We present a case study examining the similarities and differences between different relationships for the Wikipedia articles about "Pope Francis" and "Social media." We conclude by discussing the implications this library has for both theoretical and methodological research as well as community management and outline future work to expand the capabilities of the library. Categories and Subject Descriptors H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics - complexity measures, performance measures General Terms System. Copyright 2010 ACM. 0 0
Automatic extraction of Polish language errors from text edition history Grundkiewicz R. Lecture Notes in Computer Science English 2013 There are no large error corpora for a number of languages, despite the fact that they have multiple applications in natural language processing. The main reason underlying this situation is the high cost of manual corpora creation. In this paper we present methods for the automatic extraction of various kinds of errors such as spelling, typographical, grammatical, syntactic, semantic, and stylistic ones from text edition histories. By applying these methods to Wikipedia's article revision history, we created a large and publicly available corpus of naturally-occurring language errors for Polish, called PlEWi. Finally, we analyse and evaluate the detected error categories in our corpus. 0 0
Automatic keyphrase annotation of scientific documents using Wikipedia and genetic algorithms Joorabchi A.
Mahdi A.E.
Journal of Information Science English 2013 Topical annotation of documents with keyphrases is a proven method for revealing the subject of scientific and research documents to both human readers and information retrieval systems. This article describes a machine learning-based keyphrase annotation method for scientific documents that utilizes Wikipedia as a thesaurus for candidate selection from documents' content. We have devised a set of 20 statistical, positional and semantical features for candidate phrases to capture and reflect various properties of those candidates that have the highest keyphraseness probability. We first introduce a simple unsupervised method for ranking and filtering the most probable keyphrases, and then evolve it into a novel supervised method using genetic algorithms. We have evaluated the performance of both methods on a third-party dataset of research papers. Reported experimental results show that the performance of our proposed methods, measured in terms of consistency with human annotators, is on a par with that achieved by humans and outperforms rival supervised and unsupervised methods. 0 0
Automatic topic ontology construction using semantic relations from wordnet and wikipedia Subramaniyaswamy V. International Journal of Intelligent Information Technologies English 2013 Due to the explosive growth of web technology, a huge amount of information is available as web resources over the Internet. Therefore, in order to access the relevant content from the web resources effectively, considerable attention is paid on the semantic web for efficient knowledge sharing and interoperability. Topic ontology is a hierarchy of a set of topics that are interconnected using semantic relations, which is being increasingly used in the web mining techniques. Reviews of the past research reveal that semiautomatic ontology is not capable of handling high usage. This shortcoming prompted the authors to develop an automatic topic ontology construction process. However, in the past many attempts have been made by other researchers to utilize the automatic construction of ontology, which turned out to be challenging due to time, cost and maintenance. In this paper, the authors have proposed a corpus based novel approach to enrich the set of categories in the ODP by automatically identifying the concepts and their associated semantic relationship with corpus based external knowledge resources, such as Wikipedia and WordNet. This topic ontology construction approach relies on concept acquisition and semantic relation extraction. A Jena API framework has been developed to organize the set of extracted semantic concepts, while Protégé provides the platform to visualize the automatically constructed topic ontology. To evaluate the performance, web documents were classified using SVM classifier based on ODP and topic ontology. The topic ontology based classification produced better accuracy than ODP. Copyright 0 0
Characterizing and curating conversation threads: Expansion, focus, volume, re-entry Backstrom L.
Kleinberg J.
Lena Lee
Cristian Danescu-Niculescu-Mizil
WSDM 2013 - Proceedings of the 6th ACM International Conference on Web Search and Data Mining English 2013 Discussion threads form a central part of the experience on many Web sites, including social networking sites such as Facebook and Google Plus and knowledge creation sites such as Wikipedia. To help users manage the challenge of allocating their attention among the discussions that are relevant to them, there has been a growing need for the algorithmic curation of on-line conversations: the development of automated methods to select a subset of discussions to present to a user. Here we consider two key sub-problems inherent in conversational curation: length prediction (predicting the number of comments a discussion thread will receive) and the novel task of re-entry prediction (predicting whether a user who has participated in a thread will later contribute another comment to it). The first of these sub-problems arises in estimating how interesting a thread is, in the sense of generating a lot of conversation; the second can help determine whether users should be kept notified of the progress of a thread to which they have already contributed. We develop and evaluate a range of approaches for these tasks, based on an analysis of the network structure and arrival pattern among the participants, as well as a novel dichotomy in the structure of long threads. We find that for both tasks, learning-based approaches using these sources of information. 0 0
Computing semantic relatedness using word frequency and layout information of wikipedia Chan P.
Hijikata Y.
Nishida S.
Proceedings of the ACM Symposium on Applied Computing English 2013 Computing the semantic relatedness between two words or phrases is an important problem for fields such as information retrieval and natural language processing. One state-of-the-art approach to solve the problem is Explicit Semantic Analysis (ESA). ESA uses the word frequency in Wikipedia articles to estimate the relevance, so the relevance of words with low frequency cannot always be well estimated. To improve the relevance estimate of the low frequency words, we use not only word frequency but also layout information in Wikipedia articles. Empirical evaluation shows that on the low frequency words, our method achieves a better estimate of semantic relatedness than ESA. Copyright 2013 ACM. 0 0
Crawling deep web entity pages He Y.
Xin D.
Ganti V.
Rajaraman S.
Shah N.
WSDM 2013 - Proceedings of the 6th ACM International Conference on Web Search and Data Mining English 2013 Deep-web crawl is concerned with the problem of surfacing hidden content behind search interfaces on the Web. While many deep-web sites maintain document-oriented textual content (e.g., Wikipedia, PubMed, Twitter, etc.), which has traditionally been the focus of the deep-web literature, we observe that a significant portion of deep-web sites, including almost all online shopping sites, curate structured entities as opposed to text documents. Although crawling such entity-oriented content is clearly useful for a variety of purposes, existing crawling techniques optimized for document oriented content are not best suited for entity-oriented sites. In this work, we describe a prototype system we have built that specializes in crawling entity-oriented deep-web sites. We propose techniques tailored to tackle important subproblems including query generation, empty page filtering and URL deduplication in the specific context of entity oriented deep-web sites. These techniques are experimentally evaluated and shown to be effective. 0 0
Cross language prediction of vandalism on wikipedia using article views and revisions Tran K.-N.
Christen P.
Lecture Notes in Computer Science English 2013 Vandalism is a major issue on Wikipedia, accounting for about 2% (350,000+) of edits in the first 5 months of 2012. The majority of vandalism is caused by humans, who can leave traces of their malicious behaviour through access and edit logs. We propose detecting vandalism using a range of classifiers in a monolingual setting, and evaluate their performance when using them across languages on two data sets: the relatively unexplored hourly count of views of each Wikipedia article, and the commonly used edit history of articles. Within the same language (English and German), these classifiers achieve up to 87% precision, 87% recall, and F1-score of 87%. Applying these classifiers across languages achieves similarly high results of up to 83% precision, recall, and F1-score. These results show characteristic vandal traits can be learned from view and edit patterns, and models built in one language can be applied to other languages. 0 0
Detection of article qualities in the chinese wikipedia based on c4.5 decision tree Xiao K.
Li B.
He P.
Yang X.-H.
Lecture Notes in Computer Science English 2013 The number of articles in Wikipedia is growing rapidly. It is important for Wikipedia to provide users with high quality and reliable articles. However, the quality assessment metrics provided by Wikipedia are inefficient, and other mainstream quality detection methods focus only on the quality of English Wikipedia articles and usually analyze the text contents of articles, which is also a time-consuming process. In this paper, we propose a method for detecting the article qualities of the Chinese Wikipedia based on the C4.5 decision tree. The problem of quality detection is transformed into a classification problem of high-quality and low-quality articles. By using the fields from the tables in the Chinese Wikipedia database, we built decision trees to distinguish high-quality articles from low-quality ones. 0 0
Discovering stakeholders' interests in Wiki-based architectural documentation Nicoletti M.
Diaz-Pace J.A.
Schiaffino S.
CIbSE 2013: 16th Ibero-American Conference on Software Engineering - Memorias de la 16th Conferencia Iberoamericana de Ingenieria de Software, CIbSE 2013 English 2013 The Software Architecture Document (SAD) is an important artifact in the early stages of software development, as it serves to share and discuss key design and quality-attribute concerns among the stakeholders of the project. Nowadays, architectural documentation is commonly hosted in Wikis in order to favor communication and interactions among stakeholders. However, the SAD is still a large and complex document, in which stakeholders often have difficulties in finding information that is relevant to their interests or daily tasks. We argue that the discovery of stakeholders' interests is helpful to tackle this information overload problem, because a recommendation tool can leverage on those interests to provide each stakeholder with SAD sections that match his/her profile. In this work, we propose an approach to infer stakeholders' interests, based on applying a combination of Natural Language Processing and User Profiling techniques. The interests are partially inferred by monitoring the stakeholders' behavior as they browse a Wiki-based SAD. A preliminary evaluation of our approach has shown its potential for making recommendations to stakeholders with different profiles and support them in architectural tasks. 0 0
Document analytics through entity resolution Santos J.
Martins B.
Batista D.S.
Lecture Notes in Computer Science English 2013 We present a prototype system for resolving named entities, mentioned in textual documents, into the corresponding Wikipedia entities. This prototype can aid in document analysis, by using the disambiguated references to provide useful information in context. 0 0
Extraction of biographical data from Wikipedia Viseur R. DATA 2013 - Proceedings of the 2nd International Conference on Data Technologies and Applications English 2013 Using the content of Wikipedia articles is common in academic research. However the practicalities are rarely analysed. Our research focuses on extracting biographical information about personalities from Belgium. Our research is divided into three sections. The first section describes the state of the art for data extraction from Wikipedia. A second section presents the case study about data extraction for biographies of Belgian personalities. Different solutions are discussed and the solution adopted is implemented. In the third section, the quality of the extraction is discussed. Practical recommendations for researchers wishing to use Wikipedia are also proposed on the basis of our case study. 0 0
From Machu-Picchu to "rafting the urubamba river": Anticipating information needs via the entity-query graph Bordino I.
De Francisci Morales G.
Ingmar Weber
Bonchi F.
WSDM 2013 - Proceedings of the 6th ACM International Conference on Web Search and Data Mining English 2013 We study the problem of anticipating user search needs, based on their browsing activity. Given the current web page p that a user is visiting we want to recommend a small and diverse set of search queries that are relevant to the content of p, but also non-obvious and serendipitous. We introduce a novel method that is based on the content of the page visited, rather than on past browsing patterns as in previous literature. Our content-based approach can be used even for previously unseen pages. We represent the topics of a page by the set of Wikipedia entities extracted from it. To obtain useful query suggestions for these entities, we exploit a novel graph model that we call EQGraph (Entity-Query Graph), containing entities, queries, and transitions between entities, between queries, as well as from entities to queries. We perform Personalized PageRank computation on such a graph to expand the set of entities extracted from a page into a richer set of entities, and to associate these entities with relevant query suggestions. We develop an efficient implementation to deal with large graph instances and suggest queries from a large and diverse pool. We perform a user study that shows that our method produces relevant and interesting recommendations, and outperforms an alternative method based on reverse IR. 0 0
Knowledge base population and visualization using an ontology based on semantic roles Siahbani M.
Vadlapudi R.
Whitney M.
Sarkar A.
AKBC 2013 - Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, Co-located with CIKM 2013 English 2013 This paper extracts facts using "micro-reading" of text in contrast to approaches that extract common-sense knowledge using "macro-reading" methods. Our goal is to extract detailed facts about events from natural language using a predicate-centered view of events (who did what to whom, when and how). We exploit semantic role labels in order to create a novel predicate-centric ontology for entities in our knowledge base. This allows users to find uncommon facts easily. To this end, we tightly couple our knowledge base and ontology to an information visualization system that can be used to explore and navigate events extracted from a large natural language text collection. We use our methodology to create a web-based visual browser of history events in Wikipedia. 0 0
Measuring semantic relatedness using wikipedia signed network Yang W.-T.
Kao H.-Y.
Journal of Information Science and Engineering English 2013 Identifying the semantic relatedness of two words is an important task for information retrieval, natural language processing, and text mining. However, due to the diversity of meaning for a word, the semantic relatedness of two words is still hard to precisely evaluate under the limited corpora. Wikipedia is now a huge wiki-based encyclopedia on the internet that has become a valuable resource for research work. Wikipedia articles, written by a live collaboration of user editors, contain a high volume of reference links, URL identification for concepts and a complete revision history. Moreover, each Wikipedia article represents an individual concept that simultaneously contains other concepts that are hyperlinks of other articles embedded in its content. Through this, we believe that the semantic relatedness between two words can be found through the semantic relatedness between two Wikipedia articles. Therefore, we propose an Editor-Contribution-based Rank (ECR) algorithm for ranking the concepts in the article's content through all revisions and take the ranked concepts as a vector representing the article. We classify four types of relationship in which the behavior of addition and deletion maps appropriate and inappropriate concepts. ECR also extends the concept semantics by the editor-concept network. ECR ranks those concepts depending on the mutual signed-reinforcement relationship between the concepts and the editors. The results reveal that our method leads to prominent performance improvement and increases the correlation coefficient by a factor ranging from 4% to 23% over previous methods that calculate the relatedness between two articles. 0 0
Monitoring network structure and content quality of signal processing articles on wikipedia Lee T.C.
Unnikrishnan J.
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings English 2013 Wikipedia has become a widely-used resource on signal processing. However, the freelance-editing model of Wikipedia makes it challenging to maintain a high content quality. We develop techniques to monitor the network structure and content quality of Signal Processing (SP) articles on Wikipedia. Using metrics to quantify the importance and quality of articles, we generate a list of SP articles on Wikipedia arranged in the order of their need for improvement. The tools we use include the HITS and PageRank algorithms for network structure, crowdsourcing for quantifying article importance and known heuristics for article quality. 0 0
Semantic relatedness estimation using the layout information of wikipedia articles Chan P.
Hijikata Y.
Kuramochi T.
Nishida S.
International Journal of Cognitive Informatics and Natural Intelligence English 2013 Computing the semantic relatedness between two words or phrases is an important problem in fields such as information retrieval and natural language processing. Explicit Semantic Analysis (ESA), a state-of-the-art approach to solve the problem, uses word frequency to estimate relevance. Therefore, the relevance of words with low frequency cannot always be well estimated. To improve the relevance estimate of low-frequency words and concepts, the authors apply regression to word frequency, its location in an article, and its text style to calculate the relevance. The relevance value is subsequently used to compute semantic relatedness. Empirical evaluation shows that, for low-frequency words, the authors' method achieves a better estimate of semantic relatedness than ESA. Furthermore, when all words of the dataset are considered, the combination of the authors' proposed method and the conventional approach outperforms the conventional approach alone. Copyright 0 0
Tell me more: An actionable quality model for wikipedia Morten Warncke-Wang
Dan Cosley
John Riedl
Proceedings of the 9th International Symposium on Open Collaboration, WikiSym + OpenSym 2013 English 2013 In this paper we address the problem of developing actionable quality models for Wikipedia, models whose features directly suggest strategies for improving the quality of a given article. We first survey the literature in order to understand the notion of article quality in the context of Wikipedia and existing approaches to automatically assess article quality. We then develop classification models with varying combinations of more or less actionable features, and find that a model that only contains clearly actionable features delivers solid performance. Lastly we discuss the implications of these results in terms of how they can help improve the quality of articles across Wikipedia. Categories and Subject Descriptors H.5 [Information Interfaces and Presentation]: Group and Organization Interfaces - Collaborative computing, Computer-supported cooperative work, Web-based interaction. Copyright 2010 ACM. 0 0
Term extraction from sparse, ungrammatical domain-specific documents Ittoo A.
Gosse Bouma
Expert Systems with Applications English 2013 Existing term extraction systems have predominantly targeted large and well-written document collections, which provide reliable statistical and linguistic evidence to support term extraction. In this article, we address the term extraction challenges posed by sparse, ungrammatical texts with domain-specific contents, such as customer complaint emails and engineers' repair notes. To this aim, we present ExtTerm, a novel term extraction system. Specifically, as our core innovations, we accurately detect rare (low frequency) terms, overcoming the issue of data sparsity. These rare terms may denote critical events, but they are often missed by extant TE systems. ExtTerm also precisely detects multi-word terms of arbitrary length, e.g. with more than 2 words. This is achieved by exploiting fundamental theoretical notions underlying term formation, and by developing a technique to compute the collocation strength between any number of words. Thus, we address the limitation of existing TE systems, which are primarily designed to identify terms with 2 words. Furthermore, we show that open-domain (general) resources, such as Wikipedia, can be exploited to support domain-specific term extraction. Thus, they can be used to compensate for the unavailability of domain-specific knowledge resources. Our experimental evaluations reveal that ExtTerm outperforms a state-of-the-art baseline in extracting terms from a domain-specific, sparse and ungrammatical real-life text collection. © 2012 Elsevier B.V. All rights reserved. 0 0
Unsupervised gazette creation using information distance Patil S.
Pawar S.
Palshikar G.K.
Bhat S.
Srivastava R.
Lecture Notes in Computer Science English 2013 The Named Entity extraction (NEX) problem consists of automatically constructing a gazette containing instances for each NE of interest. NEX is important for domains which lack a corpus with tagged NEs. In this paper, we propose a new unsupervised (bootstrapping) NEX technique, based on a new variant of the Multiword Expression Distance (MED) [1] and information distance [2]. The efficacy of our method is shown by comparison with BASILISK and PMI in the agriculture domain. Our method discovered 8 new diseases which are not found in Wikipedia. 0 0
Wiki3C: Exploiting wikipedia for context-aware concept categorization Jiang P.
Hou H.
Long Chen
Shun-ling Chen
Conglei Yao
Chenliang Li
Wang M.
WSDM 2013 - Proceedings of the 6th ACM International Conference on Web Search and Data Mining English 2013 Wikipedia is an important human generated knowledge base containing over 21 million articles organized by millions of categories. In this paper, we exploit Wikipedia for a new task of text mining: Context-aware Concept Categorization. In the task, we focus on categorizing concepts according to their context. We exploit article link feature and category structure in Wikipedia, followed by introducing Wiki3C, an unsupervised and domain independent concept categorization approach based on context. In the approach, we investigate two strategies to select and filter Wikipedia articles for the category representation. Besides, a probabilistic model is employed to compute the semantic relatedness between two concepts in Wikipedia. Experimental evaluation using manually labeled ground truth shows that our proposed Wiki3C can achieve a noticeable improvement over the baselines without considering contextual information. 0 0
Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning Bing L.
Lam W.
Wong T.-L.
WSDM 2013 - Proceedings of the 6th ACM International Conference on Web Search and Data Mining English 2013 We develop a new framework to achieve the goal of Wikipedia entity expansion and attribute extraction from the Web. Our framework takes a few existing entities that are automatically collected from a particular Wikipedia category as seed input and explores their attribute infoboxes to obtain clues for the discovery of more entities for this category and the attribute content of the newly discovered entities. One characteristic of our framework is to conduct discovery and extraction from desirable semi-structured data record sets which are automatically collected from the Web. A semi-supervised learning model with Conditional Random Fields is developed to deal with the issues of extraction learning and limited number of labeled examples derived from the seed entities. We make use of a proximate record graph to guide the semi-supervised learning process. The graph captures alignment similarity among data records. Then the semi-supervised learning process can leverage the unlabeled data in the record set by controlling the label regularization under the guidance of the proximate record graph. Extensive experiments on different domains have been conducted to demonstrate its superiority for discovering new entities and extracting attribute content. 0 0
Reverts Revisited: Accurate Revert Detection in Wikipedia Fabian Flöck
Denny Vrandečić
Elena Simperl
Hypertext and Social Media 2012 English June 2012 Wikipedia is commonly used as a proving ground for research in collaborative systems. This is likely due to its popularity and scale, but also to the fact that large amounts of data about its formation and evolution are freely available to inform and validate theories and models of online collaboration. As part of the development of such approaches, revert detection is often performed as an important pre-processing step in tasks as diverse as the extraction of implicit networks of editors, the analysis of edit or editor features and the removal of noise when analyzing the emergence of the content of an article. The current state of the art in revert detection is based on a rather naïve approach, which identifies revision duplicates based on MD5 hash values. This is an efficient, but not very precise technique that forms the basis for the majority of research based on revert relations in Wikipedia. In this paper we prove that this method has a number of important drawbacks - it only detects a limited number of reverts, while simultaneously misclassifying too many edits as reverts, and not distinguishing between complete and partial reverts. This is very likely to hamper the accurate interpretation of the findings of revert-related research. We introduce an improved algorithm for the detection of reverts, based on word tokens added or deleted, that addresses these drawbacks. We report on the results of a user study and other tests demonstrating the considerable gains in accuracy and coverage by our method, and argue for a positive trade-off, in certain research scenarios, between these improvements and our algorithm’s increased runtime. 13 0
A study of social behavior in collaborative user generated services Yao P.
Hu Z.
Zhao Z.
Crespi N.
Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, ICUIMC'12 English 2012 User-generated content has become more and more popular. The success of collaborative content creation such as Wikipedia shows the level of user's accomplishments in knowledge sharing and socialization. In this paper we extend this research in the service domain, to explore users' social behavior in Collaborative User-Generated Services (Co-UGS). We create a model which is derived from a real social network with its behavior being similar to that of Co-UGS. The centrality approach of social network analysis is used to analyze Co-UGS simulation on this model. Three Co-UGS network actors are identified to distinguish users according to their reactions to a service, i.e. ignoring users, sharing users and co-creating users. Moreover, six hypotheses are proposed to keep the Co-UGS simulation. The results show that the Co-UGS network constructed by the sharing and co-creating users is a connected group superimposed on the basis of the social network of users. In addition, the feasibility of this simulation method is demonstrated along with the validity of applying social network analysis to the study of users' social behavior in Co-UGS. 0 0
A supervised method for lexical annotation of schema labels based on wikipedia Sorrentino S.
Bergamaschi S.
Parmiggiani E.
Lecture Notes in Computer Science English 2012 Lexical annotation is the process of explicit assignment of one or more meanings to a term w.r.t. a sense inventory (e.g., a thesaurus or an ontology). We propose an automatic supervised lexical annotation method, called ALA TK (Automatic Lexical Annotation -Topic Kernel), based on the Topic Kernel function for the annotation of schema labels extracted from structured and semi-structured data sources. It exploits Wikipedia as sense inventory and as resource of training data. 0 0
Adding semantics to microblog posts Edgar Meij
Weerkamp W.
Maarten de Rijke
WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining English 2012 Microblogs have become an important source of information for the purpose of marketing, intelligence, and reputation management. Streams of microblogs are of great value because of their direct and real-time nature. Determining what an individual microblog post is about, however, can be non-trivial because of creative language usage, the highly contextualized and informal nature of microblog posts, and the limited length of this form of communication. We propose a solution to the problem of determining what a microblog post is about through semantic linking: we add semantics to posts by automatically identifying concepts that are semantically related to it and generating links to the corresponding Wikipedia articles. The identified concepts can subsequently be used for, e.g., social media mining, thereby reducing the need for manual inspection and selection. Using a purpose-built test collection of tweets, we show that recently proposed approaches for semantic linking do not perform well, mainly due to the idiosyncratic nature of microblog posts. We propose a novel method based on machine learning with a set of innovative features and show that it is able to achieve significant improvements over all other methods, especially in terms of precision. Copyright 2012 ACM. 0 0
Annotation of adversarial and collegial social actions in discourse Bracewell D.B.
Tomlinson M.T.
Brunson M.
Plymale J.
Bracewell J.
Boerger D.
LAW 2012 - 6th Linguistic Annotation Workshop, In Conjunction with ACL 2012 - Proceedings English 2012 We posit that determining the social goals and intentions of dialogue participants is crucial for understanding discourse taking place on social media. In particular, we examine the social goals of being collegial and being adversarial. Through our early experimentation, we found that speech and dialogue acts are not able to capture the complexities and nuances of the social intentions of discourse participants. Therefore, we introduce a set of 9 social acts specifically designed to capture intentions related to being collegial and being adversarial. Social acts are pragmatic speech acts that signal a dialogue participant's social intentions. We annotate social acts in discourses communicated in English and Chinese taken from Wikipedia talk pages, public forums, and chat transcripts. Our results show that social acts can be reliably understood by annotators with a good level of inter-rater agreement. 0 0
Automatic Document Topic Identification using Wikipedia Hierarchical Ontology Hassan M.M.
Fakhri Karray
Kamel M.S.
2012 11th International Conference on Information Science, Signal Processing and their Applications, ISSPA 2012 English 2012 The rapid growth in the number of documents available to end users from around the world has led to a greatly-increased need for machine understanding of their topics, as well as for automatic grouping of related documents. This constitutes one of the main current challenges in text mining. In this work, a novel technique is proposed, to automatically construct a background knowledge structure in the form of a hierarchical ontology, using one of the largest online knowledge repositories: Wikipedia. Then, a novel approach is presented to automatically identify the documents' topics based on the proposed Wikipedia Hierarchical Ontology (WHO). Results show that the proposed model is efficient in identifying documents' topics, and promising, as it outperforms the accuracy of the other conventional algorithms for document clustering. 0 0
Automatic subject metadata generation for scientific documents using wikipedia and genetic algorithms Joorabchi A.
Mahdi A.E.
Lecture Notes in Computer Science English 2012 Topical annotation of documents with keyphrases is a proven method for revealing the subject of scientific and research documents. However, scientific documents that are manually annotated with keyphrases are in the minority. This paper describes a machine learning-based automatic keyphrase annotation method for scientific documents, which utilizes Wikipedia as a thesaurus for candidate selection from documents' content and deploys genetic algorithms to learn a model for ranking and filtering the most probable keyphrases. Reported experimental results show that the performance of our method, evaluated in terms of inter-consistency with human annotators, is on a par with that achieved by humans and outperforms rival supervised methods. 0 0
Automatic taxonomy extraction in different languages using wikipedia and minimal language-specific information Dominguez Garcia R.
Schmidt S.
Rensing C.
Steinmetz R.
Lecture Notes in Computer Science English 2012 Knowledge bases extracted from Wikipedia are particularly useful for various NLP and Semantic Web applications due to their coverage, actuality and multilingualism. This has led to many approaches for automatic knowledge base extraction from Wikipedia. Most of these approaches rely on the English Wikipedia as it is the largest Wikipedia version. However, each Wikipedia version contains socio-cultural knowledge, i.e. knowledge with relevance for a specific culture or language. In this work, we describe a method for extracting a large set of hyponymy relations from the Wikipedia category system that can be used to acquire taxonomies in multiple languages. More specifically, we describe a set of 20 features that can be used for Hyponymy Detection without using additional language-specific corpora. Finally, we evaluate our approach on Wikipedia in five different languages and compare the results with the WordNet taxonomy and a multilingual approach based on interwiki links of the Wikipedia. 0 0
BiCWS: Mining cognitive differences from bilingual web search results Xiangji Huang
Wan X.
Jie Xiao
Lecture Notes in Computer Science English 2012 In this paper we propose a novel comparative web search system - BiCWS, which can mine cognitive differences from web search results in a multi-language setting. Given a topic represented by two queries (they are the translations of each other) in two languages, the corresponding web search results for the two queries are firstly retrieved by using a general web search engine, and then the bilingual facets for the topic are mined by using a bilingual search results clustering algorithm. The semantics in Wikipedia are leveraged to improve the bilingual clustering performance. After that, the semantic distributions of the search results over the mined facets are visually presented, which can reflect the cognitive differences in the bilingual communities. Experimental results show the effectiveness of our proposed system. 0 0
Classifying trust/distrust relationships in online social networks Bachi G.
Coscia M.
Monreale A.
Giannotti F.
Proceedings - 2012 ASE/IEEE International Conference on Privacy, Security, Risk and Trust and 2012 ASE/IEEE International Conference on Social Computing, SocialCom/PASSAT 2012 English 2012 Online social networks are increasingly being used as places where communities gather to exchange information, form opinions, and collaborate in response to events. An aspect of this information exchange is how to determine if a source of social information can be trusted or not. Data mining literature addresses this problem. However, it usually employs social balance theories, by looking at small structures in complex networks known as triangles. This has proven effective in some cases, but it underperforms when context information about the relation is lacking and in more complex interactive structures. In this paper we address the problem of creating a framework for trust inference, able to infer the trust/distrust relationships in those relational environments that cannot be described by using the classical social balance theory. We do so by decomposing a trust network into its ego network components and mining the trust relationships on this ego network set, extending a well-known graph mining algorithm. We test our framework on three public datasets describing trust relationships in the real world (from the social media Epinions, Slashdot and Wikipedia) and compare our results with the trust inference state of the art, showing better performance where the social balance theory fails. 0 0
Collaboratively constructed knowledge repositories as a resource for domain independent concept extraction Kerschbaumer J.
Reichhold M.
Winkler C.
Fliedl G.
Proceedings of the 10th Terminology and Knowledge Engineering Conference: New Frontiers in the Constructive Symbiosis of Terminology and Knowledge Engineering, TKE 2012 English 2012 To achieve a domain independent text management, a flexible and adaptive knowledge repository is indispensable and represents the key resource for solving many challenges in natural language processing. Especially for real world applications, the needed resources cannot be provided for technical disciplines, like engineering in the energy or the automotive domain. We therefore propose in this paper, a new approach for knowledge (concept) acquisition based on collaboratively constructed knowledge repositories like Wikipedia and enterprise Wikis. 0 0
Discovery of novel term associations in a document collection Hynonen T.
Mahler S.
Toivonen H.
Lecture Notes in Computer Science English 2012 We propose a method to mine novel, document-specific associations between terms in a collection of unstructured documents. We believe that documents are often best described by the relationships they establish. This is also evidenced by the popularity of conceptual maps, mind maps, and other similar methodologies to organize and summarize information. Our goal is to discover term relationships that can be used to construct conceptual maps or so called BisoNets. The model we propose, tpf-idf-tpu, looks for pairs of terms that are associated in an individual document. It considers three aspects, two of which have been generalized from tf-idf to term pairs: term pair frequency (tpf; importance for the document), inverse document frequency (idf; uniqueness in the collection), and term pair uncorrelation (tpu; independence of the terms). The last component is needed to filter out statistically dependent pairs that are not likely to be considered novel or interesting by the user. We present experimental results on two collections of documents: one extracted from Wikipedia, and one containing text mining articles with manually assigned term associations. The results indicate that the tpf-idf-tpu method can discover novel associations, that they are different from just taking pairs of tf-idf keywords, and that they match better the subjective associations of a reader. 0 0
English-to-traditional Chinese cross-lingual link discovery in articles with wikipedia corpus Chen L.-P.
Shih Y.-L.
Chen C.-T.
Ku T.
Hsieh W.-T.
Chiu H.-S.
Yang R.-D.
Proceedings of the 24th Conference on Computational Linguistics and Speech Processing, ROCLING 2012 English 2012 In this paper, we design a processing flow to produce linked data in articles, providing additional information for anchor-based terms and related terms in different languages (English to Chinese). Wikipedia has been a very important corpus and knowledge bank. Although Wikipedia describes itself as not a dictionary or encyclopedia, it is of high potential value in applications and data mining research. Link discovery is a useful IR application, based on Data Mining and NLP algorithms, and has been used in several fields. According to the results of our experiment, this method does improve the results. 0 0
Explanatory semantic relatedness and explicit spatialization for exploratory search Brent Hecht
Carton S.H.
Mahmood Quaderi
Johannes Schoning
Raubal M.
Darren Gergle
Doug Downey
SIGIR'12 - Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval English 2012 Exploratory search, in which a user investigates complex concepts, is cumbersome with today's search engines. We present a new exploratory search approach that generates interactive visualizations of query concepts using thematic cartography (e.g. choropleth maps, heat maps). We show how the approach can be applied broadly across both geographic and non-geographic contexts through explicit spatialization, a novel method that leverages any figure or diagram - from a periodic table, to a parliamentary seating chart, to a world map - as a spatial search environment. We enable this capability by introducing explanatory semantic relatedness measures. These measures extend frequently-used semantic relatedness measures to not only estimate the degree of relatedness between two concepts, but also generate human-readable explanations for their estimates by mining Wikipedia's text, hyperlinks, and category structure. We implement our approach in a system called Atlasify, evaluate its key components, and present several use cases. 0 0
Exploiting Turkish Wikipedia as a semantic resource for text classification Poyraz M.
Ganiz M.C.
Akyokus S.
Gorener B.
Kilimci Z.H.
INISTA 2012 - International Symposium on INnovations in Intelligent SysTems and Applications English 2012 The majority of the existing text classification algorithms are based on the "bag of words" (BOW) approach, in which the documents are represented as weighted occurrence frequencies of individual terms. However, semantic relations between terms are ignored in this representation. There are several studies which address this problem by integrating background knowledge such as WordNet, ODP or Wikipedia as a semantic source. However, the vast majority of these studies are applied to English texts and to date there are no similar studies on classification of Turkish documents. We empirically analyze the effect of using Turkish Wikipedia (Vikipedi) as a semantic resource in classification of Turkish documents. Our results demonstrate that the performance of classification algorithms can be improved by exploiting Vikipedi concepts. Additionally, we show that Vikipedi concepts have surprisingly large coverage in our datasets, which mostly consist of Turkish newspaper articles. 0 0
Extracting knowledge from web search engine results Kanavos A.
Theodoridis E.
Tsakalidis A.
Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI English 2012 Nowadays, people frequently use search engines in order to find the information they need on the web. However, web search engines usually return web page references in a global ranking, making it difficult for users to browse the different topics captured in the result set and thus to quickly find the desired web pages. There is a need for special computational systems that will discover knowledge in these web search results, providing the user with the possibility to browse different topics contained in a given result set. In this paper, we focus on the problem of determining different thematic groups in the results that existing web search engines provide. We propose a novel system that exploits a set of reformulation strategies so as to help users gain more relevant results to their desired query. It additionally tries to discover different topic groups among the result set, according to the various meanings of the provided query. The proposed method utilizes a number of semantic annotation techniques using Knowledge Bases, like WordNet and Wikipedia, in order to perceive the different senses of each query term. Finally, the method annotates the extracted topics using information derived from the clusters and presents them to the end user. 0 0
Extraction of historical events from Wikipedia Hienert D.
Luciano F.
CEUR Workshop Proceedings English 2012 The DBpedia project extracts structured information from Wikipedia and makes it available on the web. Information is gathered mainly with the help of infoboxes that contain structured information of the Wikipedia article. A lot of information is only contained in the article body and is not yet included in DBpedia. In this paper we focus on the extraction of historical events from Wikipedia articles that are available for about 2,500 years for different languages. We have extracted about 121,000 events with more than 325,000 links to DBpedia entities and provide access to this data via a Web API, SPARQL endpoint, Linked Data Interface and in a timeline application. 0 0
Generalized optimization framework for graph-based semi-supervised learning Avrachenkov K.
Goncalves P.
Mishenin A.
Sokol M.
Proceedings of the 12th SIAM International Conference on Data Mining, SDM 2012 English 2012 We develop a generalized optimization framework for graph-based semi-supervised learning. The framework gives as particular cases the Standard Laplacian, Normalized Laplacian and PageRank based methods. We have also provided new probabilistic interpretation based on random walks and characterized the limiting behaviour of the methods. The random walk based interpretation allows us to explain differences between the performances of methods with different smoothing kernels. It appears that the PageRank based method is robust with respect to the choice of the regularization parameter and the labelled data. We illustrate our theoretical results with two realistic datasets, characterizing different challenges: Les Miserables characters social network and Wikipedia hyper-link graph. The graph-based semi-supervised learning classifies the Wikipedia articles with very good precision and perfect recall employing only the information about the hyper-text links. 0 0
Happy or not: Generating topic-based emotional heatmaps for culturomics using CyberGIS Shook E.
Kalev Leetaru
Cao G.
Padmanabhan A.
Se Wang
2012 IEEE 8th International Conference on E-Science, e-Science 2012 English 2012 The field of Culturomics exploits "big data" to explore human society at population scale. Culturomics increasingly needs to consider geographic contexts and, thus, this research develops a geospatial visual analytical approach that transforms vast amounts of textual data into emotional heatmaps with fine-grained spatial resolution. Fulltext geocoding and sentiment mining extract locations and latent "tone" from text-based data, which are combined with spatial analysis methods - kernel density estimation and spatial interpolation - to generate heatmaps that capture the interplay of location, topic, and tone toward narrative impacts. To demonstrate the effectiveness of the approach, the complete English edition of Wikipedia is processed using a supercomputer to extract all locations and tone associated with the year of 2003. An emotional heatmap of Wikipedia's discussion of "armed conflict" for that year is created using the spatial analysis methods. Unlike previous research, our approach is designed for exploratory spatial analysis of topics in text archives by incorporating multiple attributes including the prominence of each location mentioned in the text, the density of a topic at each location compared to other topics, and the tone of the topics of interest into a single analysis. The generation of such fine-grained emotional heatmaps is computationally intensive particularly when accounting for the multiple attributes at fine scales. Therefore a CyberGIS platform based on national cyberinfrastructure in the United States is used to enable the computationally intensive visual analytics. 0 0
Hidden community detection based on microblog by opinion-consistent analysis Fu M.-H.
Peng C.-H.
Kuo Y.-H.
Lee K.-R.
International Conference on Information Society, i-Society 2012 English 2012 The content or topic of posts on social networks such as microblogs and forums usually reflects users' interests. Traditional community detection methods consider only explicit information about users, so data analysis is limited to user-predefined attributes. To solve this problem, this paper proposes a hidden community detection framework called the opinion-consistent hidden community (OCHC) framework. First, we collect and process post comments on Facebook. Then, the topics of the posts in which the target user participated are defined through topic identification using the selected ontology, Wikipedia. Moreover, opinion-consistency between users and the target user is discovered by sentiment analysis. In brief, opinion mining and sentiment analysis are used to track the users who have similar opinions on specific topics. In addition, users who focus on different features with different scopes on Facebook can be found by the multi-level OCHC framework proposed in this paper. Communities of opinion-consistent users are clustered by the multi-level OCHC model. The OCHC framework offers two major improvements: the post topic is decided by topic identification instead of by the users themselves, and user opinions are also considered during the analysis phase. In the experimental results, the accuracy of topic identification improved by 5.5% over other methods and the running time was 26 times faster than that of the other method. In quantitative measurements, the Polarity and Multi-Dimension sentiment analysis methods also performed well. 0 0
Improving cross-document knowledge discovery using explicit semantic analysis Yan P.
Jin W.
Lecture Notes in Computer Science English 2012 Cross-document knowledge discovery is dedicated to exploring meaningful (but maybe unapparent) information from a large volume of textual data. The sparsity and high dimensionality of text data present great challenges for representing the semantics of natural language. Our previously introduced Concept Chain Queries (CCQ) were specifically designed to discover semantic relationships between two concepts across documents, where the relationships found reveal semantic paths linking two concepts across multiple text units. However, our previous solution answered such queries using only the Bag of Words (BOW) representation, and therefore terms not appearing literally in the text were not taken into consideration. Explicit Semantic Analysis (ESA) is a novel method proposed to represent the meaning of texts in a higher dimensional space of concepts which are derived from large-scale human built repositories such as Wikipedia. In this paper, we propose to integrate the ESA technique into our query processing, which is capable of using vast knowledge from Wikipedia to complement existing information from the text corpus and alleviate the limitations resulting from the BOW representation. The experiments demonstrate that the search quality is greatly improved when incorporating ESA into answering CCQ, compared with using a BOW-based approach. 0 0
InfoExtractor-A Tool for Social Media Data Mining File C.
Shah C.
Journal of Information Technology and Politics English 2012 In this workbench note, we present InfoExtractor, a Web-based tool for collecting data and metadata from focused social media content. InfoExtractor then provides these data in various structured and unstructured formats for manipulation and analysis. The tool allows social science researchers to collect data for quantitative analysis, and is designed to deliver data from popular and influential social media sites in a useful and easy-to-access format. InfoExtractor was designed to replace traditional means of content aggregation, such as page scraping and brute-force copying. 0 0
Language models for keyword search over data graphs Mass Y.
Sagiv Y.
WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining English 2012 In keyword search over data graphs, an answer is a nonredundant subtree that includes the given keywords. This paper focuses on improving the effectiveness of that type of search. A novel approach that combines language models with structural relevance is described. The proposed approach consists of three steps. First, language models are used to assign dynamic, query-dependent weights to the graph. Those weights complement static weights that are pre-assigned to the graph. Second, an existing algorithm returns candidate answers based on their weights. Third, the candidate answers are re-ranked by creating a language model for each one. The effectiveness of the proposed approach is verified on a benchmark of three datasets: IMDB, Wikipedia and Mondial. The proposed approach outperforms all existing systems on the three datasets, which is a testament to its robustness. It is also shown that the effectiveness can be further improved by augmenting keyword queries with very basic knowledge about the structure. Copyright 2012 ACM. 0 0
Man-machine collaboration to acquire cooking adaptation knowledge for the TAAABLE case-based reasoning system Cordier A.
Gaillard E.
Nauer E.
WWW'12 - Proceedings of the 21st Annual Conference on World Wide Web Companion English 2012 This paper shows how humans and machines can better collaborate to acquire adaptation knowledge (AK) in the framework of a case-based reasoning (CBR) system whose knowledge is encoded in a semantic wiki. Automatic processes like the CBR reasoning process itself, or specific tools for acquiring AK are integrated as wiki extensions. These tools and processes are combined on purpose to collect AK. Users are at the center of our approach, as they are in a classical wiki, but they will now benefit from automatic tools for helping them to feed the wiki. In particular, the CBR system, which is currently only a consumer for the knowledge encoded in the semantic wiki, will also be used for producing knowledge for the wiki. A use case in the domain of cooking is given to exemplify the man-machine collaboration. Copyright is held by the International World Wide Web Conference Committee (IW3C2). 0 0
Mining Wikipedia's snippets graph: First step to build a new knowledge base Wira-Alam A.
Mathiak B.
CEUR Workshop Proceedings English 2012 In this paper, we discuss the aspects of mining links and text snippets from Wikipedia as a new knowledge base. Current knowledge bases, e.g. DBpedia [1], cover mainly the structured part of Wikipedia, but not the content as a whole. Acting as a complement, we focus on extracting information from the text of the articles. We extract a database of the hyperlinks between Wikipedia articles and populate them with the textual context surrounding each hyperlink. This would be useful for network analysis, e.g. to measure the influence of one topic on another, or for question-answering directly (for stating the relationship between two entities). First, we describe the technical parts related to extracting the data from Wikipedia. Second, we specify how to represent the extracted data as an extended triple through a Web service. Finally, we discuss the expected usage possibilities and also the challenges. 0 0
Mining semantic relations between research areas Osborne F.
Motta E.
Lecture Notes in Computer Science English 2012 For a number of years now we have seen the emergence of repositories of research data specified using OWL/RDF as representation languages, and conceptualized according to a variety of ontologies. This class of solutions promises both to facilitate the integration of research data with other relevant sources of information and also to support more intelligent forms of querying and exploration. However, an issue which has only been partially addressed is that of generating and characterizing semantically the relations that exist between research areas. This problem has been traditionally addressed by manually creating taxonomies, such as the ACM classification of research topics. However, this manual approach is inadequate for a number of reasons: these taxonomies are very coarse-grained and they do not cater for the fine-grained research topics, which define the level at which typically researchers (and even more so, PhD students) operate. Moreover, they evolve slowly, and therefore they tend not to cover the most recent research trends. In addition, as we move towards a semantic characterization of these relations, there is arguably a need for a more sophisticated characterization than a homogeneous taxonomy, to reflect the different ways in which research areas can be related. In this paper we propose Klink, a new approach to i) automatically generating relations between research areas and ii) populating a bibliographic ontology, which combines both machine learning methods and external knowledge, which is drawn from a number of resources, including Google Scholar and Wikipedia. We have tested a number of alternative algorithms and our evaluation shows that a method relying on both external knowledge and the ability to detect temporal relations between research areas performs best with respect to a manually constructed standard. 0 0
Mining spatio-temporal patterns in the presence of concept hierarchies Anh L.V.Q.
Gertz M.
Proceedings - 12th IEEE International Conference on Data Mining Workshops, ICDMW 2012 English 2012 In the past, approaches to mining spatial and spatio-temporal data for interesting patterns have mainly concentrated on data obtained through observations and simulations where positions of objects, such as areas, vehicles, or persons, are collected over time. In the past couple of years, however, new datasets have been built by automatically extracting facts, as subject-predicate-object triples, from semistructured information sources such as Wikipedia. Recently some approaches, for example, in the context of YAGO2, have extended such facts by adding temporal and spatial information. The presence of such new data sources gives rise to new approaches for discovering spatio-temporal patterns. In this paper, we present a framework in support of the discovery of interesting spatio-temporal patterns from knowledge base datasets. Different from traditional approaches to mining spatio-temporal data, we focus on mining patterns at different levels of granularity by exploiting concept hierarchies, which are a key ingredient in knowledge bases. We introduce a pattern specification language and outline an algorithmic approach to efficiently determine complex patterns. We demonstrate the utility of our framework using two different real-world datasets from YAGO2 and the Website 0 0
Mining web query logs to analyze political issues Ingmar Weber
Garimella V.R.K.
Borra E.
Proceedings of the 3rd Annual ACM Web Science Conference, WebSci'12 English 2012 We present a novel approach to using anonymized web search query logs to analyze and visualize political issues. Our starting point is a list of politically annotated blogs (left vs. right). We use this list to assign a numerical political leaning to queries leading to clicks on these blogs. Furthermore, we map queries to Wikipedia articles and to fact-checked statements from, as well as applying sentiment analysis to search results. With this rich, multi-faceted data set we obtain novel graphical visualizations of issues and discover connections between the different variables. Our findings include (i) an interest in "the other side" where queries about Democrat politicians have a right leaning and vice versa, (ii) evidence that "lies are catchy" and that queries pertaining to false statements are more likely to attract large volumes, and (iii) the observation that the more right-leaning a query is, the more negative sentiments can be found in its search results. 0 0
Pattern for python De Smedt T.
Daelemans W.
Journal of Machine Learning Research English 2012 Pattern is a package for Python 2.4+ with functionality for web mining (Google + Twitter + Wikipedia, web spider, HTML DOM parser), natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM classifiers) and network analysis (graph centrality and visualization). It is well documented and bundled with 30+ examples and 350+ unit tests. The source code is licensed under BSD and available from pattern. 0 0
Predicting quality flaws in user-generated content: The case of wikipedia Maik Anderka
Benno Stein
Nedim Lipka
SIGIR'12 - Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval English 2012 The detection and improvement of low-quality information is a key concern in Web applications that are based on user-generated content; a popular example is the online encyclopedia Wikipedia. Existing research on quality assessment of user-generated content deals with the classification as to whether the content is high-quality or low-quality. This paper goes one step further: it targets the prediction of quality flaws, this way providing specific indications in which respects low-quality content needs improvement. The prediction is based on user-defined cleanup tags, which are commonly used in many Web applications to tag content that has some shortcomings. We apply this approach to the English Wikipedia, which is the largest and most popular user-generated knowledge source on the Web. We present an automatic mining approach to identify the existing cleanup tags, which provides us with a training corpus of labeled Wikipedia articles. We argue that common binary or multiclass classification approaches are ineffective for the prediction of quality flaws and hence cast quality flaw prediction as a one-class classification problem. We develop a quality flaw model and employ a dedicated machine learning approach to predict Wikipedia's most important quality flaws. Since in the Wikipedia setting the acquisition of significant test data is intricate, we analyze the effects of a biased sample selection. In this regard we illustrate the classifier effectiveness as a function of the flaw distribution in order to cope with the unknown (real-world) flaw-specific class imbalances. The flaw prediction performance is evaluated with 10,000 Wikipedia articles that have been tagged with the ten most frequent quality flaws: provided test data with little noise, four flaws can be detected with a precision close to 1. 0 0
Query suggestion by constructing term-transition graphs Song Y.
Zhou D.
He L.-W.
WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining English 2012 Query suggestion is an interactive approach for search engines to better understand users' information needs. In this paper, we propose a novel query suggestion framework which leverages user re-query feedback from search engine logs. Specifically, we mined user query reformulation activities where the user only modifies part of the query by (1) adding terms after the query, (2) deleting terms within the query, or (3) modifying terms to new terms. We build a term-transition graph based on the mined data. Two models are proposed which address topic-level and term-level query suggestions, respectively. In the first topic-based unsupervised Pagerank model, we perform a random walk on each of the topic-based term-transition graphs and calculate the Pagerank for each term within a topic. Given a new query, we suggest relevant queries based on its topic distribution and term-transition probability within each topic. Our second model resembles the supervised learning-to-rank (LTR) framework, in which term modifications are treated as documents so that each query reformulation is treated as a training instance. A rich set of features is constructed for each (query, document) pair from Pagerank, Wikipedia, Ngram, ODP and so on. This supervised model is capable of suggesting new queries on a term level, which addresses the limitation of previous methods. Experiments are conducted on a large data set from a commercial search engine. In comparison with state-of-the-art query suggestion methods [4, 2], our proposals exhibit a significant performance increase for all categories of queries. Copyright 2012 ACM. 0 0
Self-organization with additional learning based on category mapping and its application to dynamic news clustering Toyota T.
Nobuhara H.
IEEJ Transactions on Electronics, Information and Systems Japanese; English 2012 Internet news consists of texts drawn from various fields; therefore, when text data is added that rapidly increases the number of dimensions of the feature vectors of a Self-Organizing Map (SOM), the new data cannot be reflected in the learning. Furthermore, it is difficult for users to interpret the learning results because SOM cannot produce any label information for each cluster. In order to solve these problems, we propose SOM with additional learning and dimensional by category mapping, which is based on the category structure of Wikipedia. In this method, an input vector is generated from each text and the corresponding Wikipedia categories extracted from Wikipedia articles. Input vectors are formed in the common category, taking the hierarchical structure of Wikipedia categories into consideration. By using the proposed method, the problem of reconfiguration of vector elements caused by dynamic changes in the text can be solved. Moreover, information loss in newly obtained index terms can be prevented. 0 0
Snip! Andrew Trotman
Crane M.
Lecture Notes in Computer Science English 2012 The University of Otago submitted runs to the Snippet Retrieval Track and the Relevance Feedback tracks at INEX 2011. Snippets were generated using vector space ranking functions, taking into account or ignoring structural hints, and using word clouds. We found that using passages made better snippets than XML elements and that word clouds make bad snippets. In our runs in the Relevance Feedback track we were testing the INEX gateway to C/C++ and blind relevance feedback (with and without stemming). We found that blind relevance feedback with stemming does improve precision in the INEX framework. 0 0
Some approaches to the development of information influence and hidden communications detection systems in wiki-environment Alekperova I. 2012 4th International Conference "Problems of Cybernetics and Informatics", PCI 2012 - Proceedings English 2012 The overall structure of a system for detecting hidden connections and information influences in a wiki-environment is presented in this article. Several concepts are available for detecting information influences, as well as hidden connections between users, including OLAP and Data Mining; the advantages of these mechanisms are also listed. 0 0
Study of ontology or thesaurus based document clustering and information retrieval Bharathi G.
Venkatesan D.
Journal of Theoretical and Applied Information Technology
Journal of Engineering and Applied Sciences
English 2012 Document clustering generates clusters from the whole document collection automatically and is used in many fields, including data mining and information retrieval. Clustering text data faces a number of new challenges. Among others, the volume of text data, dimensionality, sparsity and complex semantics are the most important ones. These characteristics of text data require clustering techniques to be scalable to large and high dimensional data, and able to handle sparsity and semantics. In the traditional vector space model, the unique words occurring in the document set are used as the features. But because of the synonym problem and the polysemous problem, such a bag of original words cannot represent the content of a document precisely. Most of the existing text clustering methods use clustering techniques which depend only on term strength and document frequency where single terms are used as features for representing the documents and they are treated independently which can be easily applied to non-ontological clustering. To overcome the above issues, this paper makes a survey of recent research done on ontology or thesaurus based document clustering.
0 0
Supporting wiki users with natural language processing Bahar Sateli
René Witte
WikiSym 2012 English 2012 We present a "self-aware" wiki system, based on the MediaWiki engine, that can develop and organize its content using state-of-the-art techniques from the Natural Language Processing (NLP) and Semantic Computing domains. This is achieved with an architecture that integrates novel NLP solutions within the MediaWiki environment to allow wiki users to benefit from modern text mining techniques. As concrete applications, we present how the enhanced MediaWiki engine can be used for biomedical literature curation, cultural heritage data management, and software requirements engineering. 0 0
TCSST: Transfer classification of short & sparse text using external data Long G.
Long Chen
Zhu X.
Zhang C.
ACM International Conference Proceeding Series English 2012 Short & sparse text, such as search snippets, micro-blogs and product reviews, is becoming more prevalent on the web. Accurately classifying short & sparse text has emerged as an important yet challenging task. Existing work has considered utilizing external data (e.g. Wikipedia) to alleviate data sparseness, by appending topics detected from external data as new features. However, training a classifier on features concatenated from different spaces is not easy, considering the features have different physical meanings and different significance to the classification task. Moreover, it exacerbates the "curse of dimensionality" problem. In this study, we propose a transfer classification method, TCSST, to exploit the external data to tackle the data sparsity issue. The transfer classifier will be learned in the original feature space. Considering that the labels of the external data may not be readily available or sufficient, TCSST further exploits the unlabeled external data to aid the transfer classification. We develop novel strategies to allow TCSST to iteratively select high quality unlabeled external data to help with the classification. We evaluate the performance of TCSST on both benchmark and real-world data sets. Our experimental results demonstrate that the proposed method is effective in classifying very short & sparse text, consistently outperforming existing and baseline methods. 0 0
Tapping into knowledge base for concept feedback: Leveraging ConceptNet to improve search results for difficult queries Kotov A.
Zhai C.X.
WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining English 2012 Query expansion is an important and commonly used technique for improving Web search results. Existing methods for query expansion have mostly relied on global or local analysis of document collection, click-through data, or simple ontologies such as WordNet. In this paper, we present the results of a systematic study of the methods leveraging the ConceptNet knowledge base, an emerging new Web resource, for query expansion. Specifically, we focus on the methods leveraging ConceptNet to improve the search results for poorly performing (or difficult) queries. Unlike other lexico-semantic resources, such as WordNet and Wikipedia, which have been extensively studied in the past, ConceptNet features a graph-based representation model of commonsense knowledge, in which the terms are conceptually related through rich relational ontology. Such representation structure enables complex, multi-step inferences between the concepts, which can be applied to query expansion. We first demonstrate through simulation experiments that expanding queries with the related concepts from ConceptNet has great potential for improving the search results for difficult queries. We then propose and study several supervised and unsupervised methods for selecting the concepts from ConceptNet for automatic query expansion. The experimental results on multiple data sets indicate that the proposed methods can effectively leverage ConceptNet to improve the retrieval performance of difficult queries both when used in isolation as well as in combination with pseudo-relevance feedback. Copyright 2012 ACM. 0 0
Towards building a global oracle: A physical mashup using artificial intelligence technology Fortuna C.
Vucnik M.
Blaz Fortuna
Kenda K.
Moraru A.
Mladenic D.
ACM International Conference Proceeding Series English 2012 In this paper, we describe Videk - a physical mashup which uses artificial intelligence technology. We make an analogy between human senses and sensors, and between the human brain and artificial intelligence technology, respectively. This analogy leads to the concept of Global Oracle. We introduce a mashup system which automatically collects data from sensors. The data is processed and stored by SenseStream while the meta-data is fed into ResearchCyc. SenseStream indexes aggregates, performs clustering and learns rules which it then exports as RuleML. ResearchCyc performs logical inference on the meta-data and transliterates logical sentences. The GUI mashes up sensor data with SenseStream output, ResearchCyc output and other external data sources: GoogleMaps, Geonames, Wikipedia and Panoramio. 0 0
Using Wikipedia anchor text and weighted clustering coefficient to enhance the traditional multi-document summarization Kumar N.
Srinathan K.
Vasudeva Varma
Lecture Notes in Computer Science English 2012 Similar to the traditional approach, we consider the task of summarization as the selection of top-ranked sentences from ranked sentence clusters. To achieve this goal, we rank the sentence clusters by using the importance of words, calculated with the PageRank algorithm on a reverse directed word graph of sentences. Next, to rank the sentences in every cluster, we introduce the use of a weighted clustering coefficient. We use the PageRank score of words to calculate the weighted clustering coefficient. Finally, the most important issue is the presence of a lot of noisy entries in the text, which degrades the performance of most text mining algorithms. To solve this problem, we introduce the use of a Wikipedia anchor text based phrase mapping scheme. Our experimental results on the DUC-2002 and DUC-2004 datasets show that our system performs better than unsupervised systems and better than or comparable with novel supervised systems in this area. 0 0
Validation and discovery of genotype-phenotype associations in chronic diseases using linked data Pathak J.
Kiefer R.
Freimuth R.
Chute C.
Studies in Health Technology and Informatics English 2012 This study investigates federated SPARQL queries over Linked Open Data (LOD) in the Semantic Web to validate existing, and potentially discover new genotype-phenotype associations from public datasets. In particular, we report our preliminary findings for identifying such associations for commonly occurring chronic diseases using the Online Mendelian Inheritance in Man (OMIM) and Database for SNPs (dbSNP) within the LOD knowledgebase and compare them with Gene Wiki for coverage and completeness. Our results indicate that Semantic Web technologies can play an important role for in-silico identification of novel disease-gene-SNP associations, although additional verification is required before such information can be applied and used effectively. © 2012 European Federation for Medical Informatics and IOS Press. All rights reserved. 0 0
ViDaX: An interactive semantic data visualisation and exploration tool Dumas B.
Broche T.
Hoste L.
Signer B.
Proceedings of the Workshop on Advanced Visual Interfaces AVI English 2012 We present the Visual Data Explorer (ViDaX), a tool for visualising and exploring large RDF data sets. ViDaX enables the extraction of information from RDF data sources and offers functionality for the analysis of various data characteristics as well as the exploration of the corresponding ontology graph structure. In addition to some basic data mining features, our interactive semantic data visualisation and exploration tool offers various types of visualisations based on the type of data. In contrast to existing semantic data visualisation solutions, ViDaX also offers non-expert users the possibility to explore semantic data based on powerful automatic visualisation and interaction techniques without the need for any low-level programming. To illustrate some of ViDaX's functionality, we present a use case based on semantic data retrieved from DBpedia, a semantic version of the well-known Wikipedia online encyclopedia, which forms a major component of the emerging linked data initiative. 0 0
WikiSent: Weakly supervised sentiment analysis through extractive summarization with Wikipedia Saswati Mukherjee
Prantik Bhattacharyya
Lecture Notes in Computer Science English 2012 This paper describes a weakly supervised system for sentiment analysis in the movie review domain. The objective is to classify a movie review into a polarity class, positive or negative, based on those sentences bearing opinion on the movie alone, leaving out other irrelevant text. Wikipedia incorporates the world knowledge of movie-specific features in the system which is used to obtain an extractive summary of the review, consisting of the reviewer's opinions about the specific aspects of the movie. This filters out the concepts which are irrelevant or objective with respect to the given movie. The proposed system, WikiSent, does not require any labeled data for training. It achieves a better or comparable accuracy to the existing semi-supervised and unsupervised systems in the domain, on the same dataset. We also perform a general movie review trend analysis using WikiSent. 0 0
Wikipedia-based efficient sampling approach for topic model Zhao T.
Chenliang Li
Li M.
Proceedings of the 9th International Network Conference, INC 2012 English 2012 In this paper, we propose a novel approach called Wikipedia-based Collapsed Gibbs sampling (Wikipedia-based CGS) to improve the efficiency of collapsed Gibbs sampling (CGS), which has been widely used in the Latent Dirichlet Allocation (LDA) model. The conventional CGS method treats each word in the documents as having equal status for topic modeling. Moreover, sampling all the words in the documents always leads to high computational complexity. Considering this crucial drawback of LDA, we propose the Wikipedia-based CGS approach, which aims to extract more meaningful topics and improve the efficiency of the sampling process in LDA by distinguishing different statuses of words in the documents when sampling topics, with Wikipedia as the background knowledge. The experiments on real-world datasets show that our Wikipedia-based approach for collapsed Gibbs sampling can significantly improve efficiency and achieve better perplexity compared to existing approaches. 0 0
A resource-based method for named entity extraction and classification Gamallo P.
Garcia M.
Lecture Notes in Computer Science English 2011 We propose a resource-based Named Entity Classification (NEC) system, which combines named entity extraction with simple language-independent heuristics. Large lists (gazetteers) of named entities are automatically extracted using semi-structured information from Wikipedia, namely infoboxes and category trees. Language-independent heuristics are used to disambiguate and classify entities that have already been identified (or recognized) in text. We compare the performance of our resource-based system with that of a supervised NEC module implemented for the FreeLing suite, which was the winning system in the CoNLL-2002 competition. Experiments were performed over Portuguese text corpora taking into account several domains and genres. 0 0
A statistical approach for automatic keyphrase extraction Abulaish M.
Dey L.
Proceedings of the 5th Indian International Conference on Artificial Intelligence, IICAI 2011 English 2011 Due to the availability of voluminous textual data, either on the World Wide Web or in textual databases, automatic keyphrase extraction has gained increasing popularity in the recent past as a way to summarize and characterize text documents. Consequently, a number of machine learning techniques, mostly supervised, have been proposed to mine keyphrases in an automatic way. However, the non-availability of annotated corpora for training such systems is the main hindrance to their success. In this paper, we propose the design of an automatic keyphrase extraction system which uses NLP and a statistical approach to mine keyphrases from unstructured text documents. The efficacy of the proposed system is established over texts crawled from the Wikipedia server. On evaluation we found that the proposed method outperforms KEA, which uses the naïve Bayes classification technique for keyphrase extraction. 0 0
Automatic acquisition of taxonomies in different languages from multiple Wikipedia versions Garcia R.D.
Rensing C.
Steinmetz R.
ACM International Conference Proceeding Series English 2011 In recent years, the vision of the Semantic Web has led to many approaches that aim to automatically derive knowledge bases from Wikipedia. These approaches rely mostly on the English Wikipedia, as it is the largest Wikipedia version, and have led to valuable knowledge bases. However, each Wikipedia version contains socio-cultural knowledge, i.e. knowledge with specific relevance for a culture or language. One difficulty in applying existing approaches to multiple Wikipedia versions is the use of additional corpora. In this paper, we describe the adaptation of existing heuristics that makes it possible to extract large sets of hyponymy relations from multiple Wikipedia versions with little information about each language. Further, we evaluate our approach with Wikipedia versions in four different languages and compare results with GermaNet for German and WordNet for English. 0 0
Automatic knowledge extraction from manufacturing research publications Boonyasopon P.
Riel A.
Uys W.
Louw L.
Tichkiewitch S.
Du Preez N.
CIRP Annals - Manufacturing Technology English 2011 Knowledge mining is a young and rapidly growing discipline aiming at automatically identifying valuable knowledge in digital documents. This paper presents the results of a study of the application of document retrieval and text mining techniques to extract knowledge from CIRP research papers. The target is to find out if and how such tools can help researchers to find relevant publications in a cluster of papers and increase the citation indices of their own papers. Two different approaches to automatic topic identification are investigated. One is based on Latent Dirichlet Allocation of a huge document set, the other uses Wikipedia to discover significant words in papers. The study uses a combination of both approaches to propose a new approach to efficient and intelligent knowledge mining. 0 0
Baudenkmalnetz - Creating a semantically annotated web resource of historical buildings Dumitrache A.
Christoph Lange
CEUR Workshop Proceedings English 2011 BauDenkMalNetz ("listed buildings web") deals with creating a semantically annotated website of urban historical landmarks. The annotations cover the most relevant information about the landmarks (e.g. the buildings' architects, architectural style or construction details), for the purpose of extended accessibility and smart querying. BauDenkMalNetz is based on a series of touristic books on architectural landscape. After a thorough analysis of the requirements that our website should meet, we processed these books using automated tools for text mining, which led to an ontology that allows for expressing all relevant architectural and historical information. In preparation for publishing the books on a website powered by this ontology, we analyze how well Semantic MediaWiki and the RDF-aware Drupal 7 content management system satisfy our requirements. 0 0
Calculating Wikipedia article similarity using machine translation evaluation metrics Maike Erdmann
Andrew Finch
Kotaro Nakayama
Eiichiro Sumita
Takahiro Hara
Shojiro Nishio
Proceedings - 25th IEEE International Conference on Advanced Information Networking and Applications Workshops, WAINA 2011 English 2011 Calculating the similarity of Wikipedia articles in different languages is helpful for bilingual dictionary construction and various other research areas. However, standard methods for document similarity calculation are usually very simple. Therefore, we describe an approach of translating one Wikipedia article into the language of the other article, and then calculating article similarity with standard machine translation evaluation metrics. An experiment revealed that our approach is effective for identifying Wikipedia articles in different languages that are covering the same concept. 0 0
Citizens as database: Conscious ubiquity in data collection Richter K.-F.
Winter S.
Lecture Notes in Computer Science English 2011 Crowd sourcing [1], citizens as sensors [2], user-generated content [3,4], or volunteered geographic information [5] describe a relatively recent phenomenon that points to dramatic changes in our information economy. Users of a system, who often are not trained in the matter at hand, contribute data that they collected without a central authority managing or supervising the data collection process. The individual approaches vary and cover a spectrum from conscious user actions ('volunteered') to passive modes ('citizens as sensors'). Volunteered user-generated content is often used to replace existing commercial or authoritative datasets, for example, Wikipedia as an open encyclopaedia, or OpenStreetMap as an open topographic dataset of the world. Other volunteered content exploits the rapid update cycles of such mechanisms to provide improved services. For example, reports damages related to streets; Google, TomTom and other dataset providers encourage their users to report updates of their spatial data. In some cases, the database itself is the service; for example, Flickr allows users to upload and share photos. At the passive end of the spectrum, data mining methods can be used to further elicit hidden information out of the data. Researchers identified, for example, landmarks defining a town from Flickr photo collections [6], and commercial services track anonymized mobile phone locations to estimate traffic flow and enable real-time route planning. 0 0
Cross lingual text classification by mining multilingual topics from Wikipedia Xiaochuan Ni
Sun J.-T.
Jian Hu
Zheng Chen
Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011 English 2011 This paper investigates how to effectively do cross lingual text classification by leveraging a large scale and multilingual knowledge base, Wikipedia. Based on the observation that each Wikipedia concept is described by documents of different languages, we adapt existing topic modeling algorithms for mining multilingual topics from this knowledge base. The extracted topics have multiple types of representations, with each type corresponding to one language. In this work, we regard such topics extracted from Wikipedia documents as universal-topics, since each topic corresponds to the same semantic information in different languages. Thus new documents of different languages can be represented in a space using a group of universal-topics. We use these universal-topics to do cross lingual text classification. Given the training data labeled for one language, we can train a text classifier to classify the documents of another language by mapping all documents of both languages into the universal-topic space. This approach does not require any additional linguistic resources, like bilingual dictionaries, machine translation tools, or labeling data for the target language. The evaluation results indicate that our topic modeling approach is effective for building a cross lingual text classifier. Copyright 2011 ACM. 0 0
D-cores: Measuring collaboration of directed graphs based on degeneracy Giatsidis C.
Thilikos D.M.
Vazirgiannis M.
Proceedings - IEEE International Conference on Data Mining, ICDM English 2011 Community detection and evaluation is an important task in graph mining. In many cases, a community is defined as a subgraph characterized by dense connections or interactions among its nodes. A large variety of measures have been proposed to evaluate the quality of such communities - in most cases ignoring the directed nature of edges. In this paper, we introduce novel metrics for evaluating the collaborative nature of directed graphs - a property not captured by the single node metrics or by other established community evaluation metrics. In order to accomplish this objective, we capitalize on the concept of graph degeneracy and define a novel D-core framework, extending the classic graph-theoretic notion of k-cores for undirected graphs to directed ones. Based on the D-core, which essentially can be seen as a measure of the robustness of a community under degeneracy, we devise a wealth of novel metrics used to evaluate graph collaboration features of directed graphs. We applied the D-core approach to large real-world graphs such as Wikipedia and DBLP and report interesting results at the graph level as well as at the node level. 0 0
Detecting community kernels in large social networks Lei Wang
Lou T.
Tang J.
Hopcroft J.E.
Proceedings - IEEE International Conference on Data Mining, ICDM English 2011 In many social networks, there exist two types of users that exhibit different influence and different behavior. For instance, statistics have shown that less than 1% of the Twitter users (e.g. entertainers, politicians, writers) produce 50% of its content [1], while the others (e.g. fans, followers, readers) have much less influence and completely different social behavior. In this paper, we define and explore a novel problem called community kernel detection in order to uncover the hidden community structure in large social networks. We discover that influential users pay closer attention to those who are more similar to them, which leads to a natural partition into different community kernels. We propose GREEDY and WEBA, two efficient algorithms for finding community kernels in large social networks. GREEDY is based on maximum cardinality search, while WEBA formalizes the problem in an optimization framework. We conduct experiments on three large social networks: Twitter, Wikipedia, and Coauthor, which show that WEBA achieves an average 15%-50% performance improvement over the other state-of-the-art algorithms, and WEBA is on average 6-2,000 times faster in detecting community kernels. 0 0
Distributed tuning of machine learning algorithms using MapReduce Clusters Yasser Ganjisaffar
Debeauvais T.
Sara Javanmardi
Caruana R.
Lopes C.V.
Proceedings of the 3rd Workshop on Large Scale Data Mining: Theory and Applications, LDMTA 2011 - Held in Conjunction with ACM SIGKDD 2011 English 2011 Obtaining the best accuracy in machine learning usually requires carefully tuning learning algorithm parameters for each problem. Parameter optimization is computationally challenging for learning methods with many hyperparameters. In this paper we show that MapReduce Clusters are particularly well suited for parallel parameter optimization. We use MapReduce to optimize regularization parameters for boosted trees and random forests on several text problems: three retrieval ranking problems and a Wikipedia vandalism problem. We show how model accuracy improves as a function of the percent of parameter space explored, that accuracy can be hurt by exploring parameter space too aggressively, and that there can be significant interaction between parameters that appear to be independent. Our results suggest that MapReduce is a two-edged sword: it makes parameter optimization feasible on a massive scale that would have been unimaginable just a few years ago, but also creates a new opportunity for overfitting that can reduce accuracy and lead to inferior learning parameters. 0 0
Entity disambiguation with hierarchical topic models Kataria S.S.
Kumar K.S.
Rastogi R.
Sen P.
Sengamedu S.H.
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining English 2011 Disambiguating entity references by annotating them with unique ids from a catalog is a critical step in the enrichment of unstructured content. In this paper, we show that topic models, such as Latent Dirichlet Allocation (LDA) and its hierarchical variants, form a natural class of models for learning accurate entity disambiguation models from crowd-sourced knowledge bases such as Wikipedia. Our main contribution is a semi-supervised hierarchical model called Wikipedia-based Pachinko Allocation Model (WPAM) that exploits: (1) All words in the Wikipedia corpus to learn word-entity associations (unlike existing approaches that only use words in a small fixed window around annotated entity references in Wikipedia pages), (2) Wikipedia annotations to appropriately bias the assignment of entity labels to annotated (and co-occurring unannotated) words during model learning, and (3) Wikipedia's category hierarchy to capture co-occurrence patterns among entities. We also propose a scheme for pruning spurious nodes from Wikipedia's crowd-sourced category hierarchy. In our experiments with multiple real-life datasets, we show that WPAM outperforms state-of-the-art baselines by as much as 16% in terms of disambiguation accuracy. Copyright 2011 ACM. 0 0
Extracting multi-dimensional relations: A generative model of groups of entities in a corpus Au Yeung C.-M.
Iwata T.
International Conference on Information and Knowledge Management, Proceedings English 2011 Extracting relations among different entities from various data sources has been an important topic in data mining. While many methods focus only on a single type of relations, real world entities maintain relations that contain much richer information. We propose a hierarchical Bayesian model for extracting multi-dimensional relations among entities from a text corpus. Using data from Wikipedia, we show that our model can accurately predict the relevance of an entity given the topic of the document as well as the set of entities that are already mentioned in that document. 0 0
Geodesic distances for web document clustering Tekir S.
Mansmann F.
Keim D.
IEEE SSCI 2011: Symposium Series on Computational Intelligence - CIDM 2011: 2011 IEEE Symposium on Computational Intelligence and Data Mining English 2011 While traditional distance measures are often capable of properly describing similarity between objects, in some application areas there is still potential to fine-tune these measures with additional information provided in the data sets. In this work we combine such traditional distance measures for document analysis with link information between documents to improve clustering results. In particular, we test the effectiveness of geodesic distances as similarity measures under the space assumption of spherical geometry in a 0-sphere. Our proposed distance measure is thus a combination of the cosine distance of the term-document matrix and some curvature values in the geodesic distance formula. To estimate these curvature values, we calculate clustering coefficient values for every document from the link graph of the data set and increase their distinctiveness by means of a heuristic as these clustering coefficient values are rough estimates of the curvatures. To evaluate our work, we perform clustering tests with the k-means algorithm on the English Wikipedia hyperlinked data set with both traditional cosine distance and our proposed geodesic distance. The effectiveness of our approach is measured by computing micro-precision values of the clusters based on the provided categorical information of each article. 0 0
High-order co-clustering text data on semantics-based representation model Liping Jing
Jiali Yun
Jian Yu
Jiao-Sheng Huang
Lecture Notes in Computer Science English 2011 The language modeling approach has been widely used in recent years to improve the performance of text mining because of its solid theoretical foundation and empirical effectiveness. In essence, this approach centers on the issue of estimating an accurate model by choosing appropriate language models as well as smoothing techniques. Semantic smoothing, which incorporates semantic and contextual information into the language models, is effective and potentially significant to improve the performance of text mining. In this paper, we propose a high-order structure to represent text data by incorporating background knowledge from Wikipedia. The proposed structure consists of three types of objects: term, document and concept. Moreover, we first combine the high-order co-clustering algorithm with the proposed model to simultaneously cluster documents, terms and concepts. Experimental results on benchmark data sets (20Newsgroups and Reuters-21578) have shown that our proposed high-order co-clustering on high-order structure outperforms the general co-clustering algorithm on bipartite text data, such as document-term, document-concept and document-(term+concept). 0 0
ITEM: Extract and integrate entities from tabular data to RDF knowledge base Guo X.
Yirong Chen
Jilin Chen
Du X.
Lecture Notes in Computer Science English 2011 Many RDF Knowledge Bases are created and enlarged by mining and extracting web data. Hence their data sources are limited to social tagging networks, such as Wikipedia, WordNet, IMDB, etc., and their precision is not guaranteed. In this paper, we propose a new system, ITEM, for extracting and integrating entities from tabular data into an RDF knowledge base. ITEM can efficiently compute the schema mapping between a table and a KB, and inject novel entities into the KB. Therefore, ITEM can enlarge and improve an RDF KB by employing tabular data, which is assumed to be of high quality. ITEM detects the schema mapping between a table and an RDF KB only from tuples, rather than from the table's schema information. Experimental results show that our system has high precision and good performance. 0 0
Identifying task-based sessions in search engine query logs Lucchese C.
Orlando S.
Perego R.
Silvestri F.
Tolomei G.
Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011 English 2011 The research challenge addressed in this paper is to devise effective techniques for identifying task-based sessions, i.e. sets of possibly non contiguous queries issued by the user of a Web Search Engine for carrying out a given task. In order to evaluate and compare different approaches, we built, by means of a manual labeling process, a ground-truth where the queries of a given query log have been grouped in tasks. Our analysis of this ground-truth shows that users tend to perform more than one task at the same time, since about 75% of the submitted queries involve a multi-tasking activity. We formally define the Task-based Session Discovery Problem (TSDP) as the problem of best approximating the manually annotated tasks, and we propose several variants of well known clustering algorithms, as well as a novel efficient heuristic algorithm, specifically tuned for solving the TSDP. These algorithms also exploit the collaborative knowledge collected by Wiktionary and Wikipedia for detecting query pairs that are not similar from a lexical content point of view, but actually semantically related. The proposed algorithms have been evaluated on the above groundtruth, and are shown to perform better than state-of-the-art approaches, because they effectively take into account the multi-tasking behavior of users. Copyright 2011 ACM. 0 0
Information quality in wikipedia: The effects of group composition and task conflict Ofer Arazy
Oded Nov
Raymond Patterson
Lisa Yeo
Journal of Management Information Systems English 2011 The success of Wikipedia demonstrates that self-organizing production communities can produce high-quality information-based products. Research on Wikipedia has proceeded largely atheoretically, focusing on (1) the diversity in members' knowledge bases as a determinant of Wikipedia's content quality, (2) the task-related conflicts that occur during the collaborative authoring process, and (3) the different roles members play in Wikipedia. We develop a theoretical model that explains how these three factors interact to determine the quality of Wikipedia articles. The results from the empirical study of 96 Wikipedia articles suggest that (1) diversity should be encouraged, as the creative abrasion that is generated when cognitively diverse members engage in task-related conflict leads to higher-quality articles, (2) task conflict should be managed, as conflict, notwithstanding its contribution to creative abrasion, can negatively affect group output, and (3) groups should maintain a balance of both administrative- and content-oriented members, as both contribute to the collaborative process. © 2011 M.E. Sharpe, Inc. 0 2
Knowledge transfer across multilingual corpora via latent topics De Smet W.
Tang J.
Moens M.-F.
Lecture Notes in Computer Science English 2011 This paper explores bridging the content of two different languages via latent topics. Specifically, we propose a unified probabilistic model to simultaneously model latent topics from bilingual corpora that discuss comparable content and use the topics as features in a cross-lingual, dictionary-less text categorization task. Experimental results on multilingual Wikipedia data show that the proposed topic model effectively discovers the topic information from the bilingual corpora, and the learned topics successfully transfer classification knowledge to other languages, for which no labeled training data are available. 0 0
Linking online news and social media Tsagkias M.
Maarten de Rijke
Weerkamp W.
Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011 English 2011 Much of what is discussed in social media is inspired by events in the news and, vice versa, social media provide us with a handle on the impact of news events. We address the following linking task: given a news article, find social media utterances that implicitly reference it. We follow a three-step approach: we derive multiple query models from a given source news article, which are then used to retrieve utterances from a target social media index, resulting in multiple ranked lists that we then merge using data fusion techniques. Query models are created by exploiting the structure of the source article and by using explicitly linked social media utterances that discuss the source article. To combat query drift resulting from the large volume of text, either in the source news article itself or in social media utterances explicitly linked to it, we introduce a graph-based method for selecting discriminative terms. For our experimental evaluation, we use data from Twitter, Digg, Delicious, the New York Times Community, Wikipedia, and the blogosphere to generate query models. We show that different query models, based on different data sources, provide complementary information and manage to retrieve different social media utterances from our target index. As a consequence, data fusion methods manage to significantly boost retrieval performance over individual approaches. Our graph-based term selection method is shown to help improve both effectiveness and efficiency. Copyright 2011 ACM. 0 0
ListOPT: Learning to optimize for XML ranking Gao N.
Deng Z.-H.
Yu H.
Jiang J.-J.
Lecture Notes in Computer Science English 2011 Many machine learning classification technologies such as boosting, support vector machine or neural networks have been applied to the ranking problem in information retrieval. However, since the purpose of these learning-to-rank methods is to directly acquire the sorted results based on the features of documents, they are unable to combine and utilize the existing ranking methods proven to be effective such as BM25 and PageRank. To solve this defect, we conducted a study on learning-to-optimize, which is to construct a learning model or method for optimizing the free parameters in ranking functions. This paper proposes a listwise learning-to-optimize process ListOPT and introduces three alternative differentiable query-level loss functions. The experimental results on the XML dataset of Wikipedia English show that these approaches can be successfully applied to tuning the parameters used in an existing highly cited ranking function BM25. Furthermore, we found that the formulas with optimized parameters indeed improve the effectiveness compared with the original ones. 0 0
Mining Fuzzy Domain Ontology Based on Concept Vector from Wikipedia Category Network Cheng-Yu Lu
Shou-Wei Ho
Jen-Ming Chung
Fu-Yuan Hsu
Hahn-Ming Lee
Jan-Ming Ho
WI-IAT English 2011 0 0
Mining fuzzy domain ontology based on concept vector from Wikipedia Category Network Lu C.-Y.
Ho S.-W.
Chung J.-M.
Hsu F.-Y.
Lee H.-M.
Ho J.-M.
Proceedings - 2011 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Workshops, WI-IAT 2011 English 2011 Ontology is essential in the formalization of domain knowledge for effective human-computer interactions (i.e., expert-finding). Many researchers have proposed approaches to measure the similarity between concepts by accessing fuzzy domain ontology. However, constructing domain ontologies turns out to be labor intensive and tedious. In this paper, we propose an approach to mine domain concepts from the Wikipedia Category Network, and to generate the fuzzy relation based on a concept vector extraction method to measure the relatedness between a single term and a concept. Our methodology can conceptualize domain knowledge by mining the Wikipedia Category Network. An empirical experiment is conducted to evaluate the robustness using a TREC dataset. Experimental results show that the fuzzy domain ontology constructed by the proposed approach is robust and achieves satisfactory accuracy in information retrieval tasks. 0 0
Ontology enhancement and concept granularity learning: Keeping yourself current and adaptive Jiang S.
Bing L.
Sun B.
YanChun Zhang
Lam W.
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining English 2011 As a well-known semantic repository, WordNet is widely used in many applications. However, due to costly edit and maintenance, WordNet's capability of keeping up with the emergence of new concepts is poor compared with on-line encyclopedias such as Wikipedia. To keep WordNet current with folk wisdom, we propose a method to enhance WordNet automatically by merging Wikipedia entities into WordNet, and construct an enriched ontology, named as WorkiNet. WorkiNet keeps the desirable structure of WordNet. At the same time, it captures abundant information from Wikipedia. We also propose a learning approach which is able to generate a tailor-made semantic concept collection for a given document collection. The learning process takes the characteristics of the given document collection into consideration and the semantic concepts in the tailor-made collection can be used as new features for document representation. The experimental results show that the adaptively generated feature space can outperform a static one significantly in text mining tasks, and WorkiNet dominates WordNet most of the time due to its high coverage. Copyright 2011 ACM. 1 0
Ontology-based feature extraction Vicient C.
Sanchez D.
Moreno A.
Proceedings - 2011 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Workshops, WI-IAT 2011 English 2011 Knowledge-based data mining and classification algorithms require systems that are able to extract textual attributes contained in raw text documents, and map them to structured knowledge sources (e.g. ontologies) so that they can be semantically analyzed. The system presented in this paper performs this task automatically, relying on a predefined ontology which states the concepts on which the subsequent data analysis will focus. As features, our system focuses on extracting relevant Named Entities from textual resources describing a particular entity. Those are evaluated by means of linguistic and Web-based co-occurrence analyses to map them to ontological concepts, thereby discovering relevant features of the object. The system has been preliminarily tested with tourist destinations and Wikipedia textual resources, showing promising results. 0 0
Processing Wikipedia dumps: A case-study comparing the XGrid and mapreduce approaches Thiebaut D.
Yanyan Li
Jaunzeikare D.
Cheng A.
Recto E.R.
Riggs G.
Zhao X.T.
Stolpestad T.
Nguyen C.L.T.
CLOSER 2011 - Proceedings of the 1st International Conference on Cloud Computing and Services Science English 2011 We present a simple comparison of the performance of three different cluster platforms, Apple's XGrid and two deployments of Hadoop (the open-source version of Google's MapReduce), measured by the total execution time each takes to parse a 27-GByte XML dump of the English Wikipedia. A local Hadoop cluster of Linux workstations and an Elastic MapReduce cluster rented from Amazon are used. We show that for this specific workload, XGrid yields the fastest execution time, with the local Hadoop cluster a close second. The overhead of fetching data from Amazon's Simple Storage Service (S3), along with the inability to skip the reduce, sort, and merge phases on Amazon, penalizes this platform, which is targeted at much larger data sets. 0 0
Query relaxation for entity-relationship search Elbassuoni S.
Maya Ramanath
Gerhard Weikum
Lecture Notes in Computer Science English 2011 Entity-relationship-structured data is becoming more important on the Web. For example, large knowledge bases have been automatically constructed by information extraction from Wikipedia and other Web sources. Entities and relationships can be represented by subject-property-object triples in the RDF model, and can then be precisely searched by structured query languages like SPARQL. Because of their Boolean-match semantics, such queries often return too few or even no results. To improve recall, it is thus desirable to support users by automatically relaxing or reformulating queries in such a way that the intention of the original user query is preserved while returning a sufficient number of ranked results. In this paper we describe comprehensive methods to relax SPARQL-like triple-pattern queries in a fully automated manner. Our framework produces a set of relaxations by means of statistical language models for structured RDF data and queries. The query processing algorithms merge the results of different relaxations into a unified result list, with ranking based on any ranking function for structured queries over RDF-data. Our experimental evaluation, with two different datasets about movies and books, shows the effectiveness of the automatically generated relaxations and the improved quality of query results based on assessments collected on the Amazon Mechanical Turk platform. 0 0
Semantic relatedness measurement based on Wikipedia link co-occurrence analysis Masahiro Ito
Kotaro Nakayama
Takahiro Hara
Shojiro Nishio
International Journal of Web Information Systems English 2011 Purpose: Recently, the importance and effectiveness of Wikipedia Mining has been shown in several studies. One popular research area on Wikipedia Mining focuses on semantic relatedness measurement, and research in this area has shown that Wikipedia can be used for semantic relatedness measurement. However, previous methods face two problems: accuracy and scalability. To solve these problems, the purpose of this paper is to propose an efficient semantic relatedness measurement method that leverages global statistical information of Wikipedia. Furthermore, a new test collection is constructed based on Wikipedia concepts for evaluating semantic relatedness measurement methods. Design/methodology/approach: The authors' approach leverages global statistical information of the whole Wikipedia to compute semantic relatedness among concepts (disambiguated terms) by analyzing co-occurrences of link pairs in all Wikipedia articles. In Wikipedia, an article represents a concept and a link to another article represents a semantic relation between these two concepts. Thus, the co-occurrence of a link pair indicates the relatedness of a concept pair. Furthermore, the authors propose an integration method with tfidf as an improved method to additionally leverage local information in an article. Besides, for constructing a new test collection, the authors select a large number of concepts from Wikipedia. The relatedness of these concepts is judged by human test subjects. Findings: An experiment was conducted for evaluating calculation cost and accuracy of each method. The experimental results show that the calculation cost of this approach is very low compared to one of the previous methods and that it is more accurate than all previous methods for computing semantic relatedness. Originality/value: This is the first proposal of co-occurrence analysis of Wikipedia links for semantic relatedness measurement. The authors show that this approach is effective to measure semantic relatedness among concepts regarding calculation cost and accuracy. The findings may be useful to researchers who are interested in knowledge extraction, as well as ontology research. 0 0
Sentiment analysis of news titles: The role of entities and a new affective lexicon Loureiro D.
Marreiros G.
Neves J.
Lecture Notes in Computer Science English 2011 The growth of content on the web has been followed by increasing interest in opinion mining. This field of research relies on accurate recognition of emotion from textual data. There's been much research in sentiment analysis lately, but it always focuses on the same elements. Sentiment analysis traditionally depends on linguistic corpora, or common sense knowledge bases, to provide extra dimensions of information to the text being analyzed. Previous research hasn't yet explored a fully automatic method to evaluate how events associated to certain entities may impact each individual's sentiment perception. This project presents a method to assign valence ratings to entities, using information from their Wikipedia page, and considering user preferences gathered from the user's Facebook profile. Furthermore, a new affective lexicon is compiled entirely from existing corpora, without any intervention from the coders. 0 0
Supporting resource-based learning on the web using automatically extracted large-scale taxonomies from multiple wikipedia versions Garcia R.D.
Scholl P.
Rensing C.
Lecture Notes in Computer Science English 2011 CROKODIL is a platform for the support of collaborative resource-based learning with Web resources. It enables the building of learning communities in which learners annotate their relevant resources using tags. In this paper, we propose the use of automatically generated large-scale taxonomies in different languages to cope with two challenges in CROKODIL: The multilingualism of the resources, i.e. web resources are in different languages and the connectivity of the semantic network, i.e. learners do not tag resources on the same topic with identical tags. More specifically, we describe a set of features that can be used for detecting hyponymy relations from the category system of Wikipedia. 0 0
Understanding temporal query dynamics Kulkarni A.
Teevan J.
Svore K.M.
Dumais S.T.
Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011 English 2011 Web search is strongly influenced by time. The queries people issue change over time, with some queries occasionally spiking in popularity (e.g., earthquake) and others remaining relatively constant (e.g., youtube). Likewise, the documents indexed by a search engine change, with some documents always being about a particular query (e.g., the Wikipedia page on earthquakes is about the query earthquake) and others being about the query only at a particular point in time (e.g., the New York Times is only about earthquakes following a major seismic activity). The relationship between documents and queries can also change as people's intent changes (e.g., people sought different content for the query earthquake before the Haitian earthquake than they did after). In this paper, we explore how queries, their associated documents, and the query intent change over the course of 10 weeks by analyzing query log data, a daily Web crawl, and periodic human relevance judgments. We identify several interesting features by which changes to query popularity can be classified, and show that presence of these features, when accompanied by changes in result content, can be a good indicator of change in query intent. Copyright 2011 ACM. 0 0
Unsupervised feature weighting based on local feature relatedness Jiali Yun
Liping Jing
Jian Yu
Houkuan Huang
Lecture Notes in Computer Science English 2011 Feature weighting plays an important role in text clustering. Traditional feature weighting is determined by the syntactic relationship between feature and document (e.g. TF-IDF). In this paper, a semantically enriched feature weighting approach is proposed by introducing the semantic relationship between feature and document, which is implemented by taking account of the local feature relatedness - the relatedness between a feature and its contextual features within each individual document. Feature relatedness is measured by two methods: a document collection-based implicit relatedness measure and a Wikipedia link-based explicit relatedness measure. Experimental results on benchmark data sets show that the new feature weighting approach surpasses traditional syntactic feature weighting. Moreover, clustering quality can be further improved by linearly combining the syntactic and semantic factors. The new feature weighting approach is also compared with two existing feature relatedness-based approaches which consider the global feature relatedness (feature relatedness in the entire feature space) and the inter-document feature relatedness (feature relatedness between different documents) respectively. In the experiments, the new feature weighting approach outperforms these two related approaches in clustering quality and has much lower computational complexity. 0 0
View of the world according to Wikipedia: Are we all little Steinbergs? Overell S.E.
Stefan Ruger
Journal of Computational Science English 2011 Saul Steinberg's most famous cartoon "View of the world from 9th Avenue" depicts the world as seen by self-absorbed New Yorkers. By analysing wikipediae of a range of different languages, we find that this particular fish-eye world view is ubiquitous and inherently part of human nature. By measuring the skew in the distribution of locations in different languages we can confirm the validity of plausible quantitative models. These models demonstrate convincingly that people all have similar world views: "We are all little Steinbergs." Our Steinberg hypothesis allows the world view of specific people to be more accurately modelled; this will allow greater understanding of a person's discourse, either by someone else or automatically by a computer. © 2011 Elsevier B.V. 0 0
Voting Behavior Analysis in the Election of Wikipedia Admins Gerard Cabunducan
Ralph Castillo
John Boaz Lee
ASONAM English 2011 0 1
Wikipedia Sets: Context-Oriented Related Entity Acquisition from Multiple Words Masumi Shirakawa
Kotaro Nakayama
Takahiro Hara
Shojiro Nishio
WI-IAT English 2011 0 0
Wikipedia sets: Context-oriented related entity acquisition from multiple words Masumi Shirakawa
Kotaro Nakayama
Takahiro Hara
Shojiro Nishio
Proceedings - 2011 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2011 English 2011 In this paper, we propose a method which acquires related words (entities) from multiple words by naturally disambiguating their meaning and considering their contexts. In addition, we introduce a bootstrapping method for improving the coverage of association relations. Experimental results show that, compared to the ESA-based method, our method can acquire related words depending on the contexts of multiple words. 0 0
A content-based image retrieval system based on unsupervised topological learning Rogovschi N.
Grozavu N.
Proc. - 6th Intl. Conference on Advanced Information Management and Service, IMS2010, with ICMIA2010 - 2nd International Conference on Data Mining and Intelligent Information Technology Applications English 2010 The Internet offers its users an ever-increasing amount of information. Within it, multimodal data (images, text, video, sound) are widely requested by users, and there is a strong need for effective ways to process and manage them. Most existing algorithms and frameworks only annotate images and search by these annotations, sometimes combined with clustering results, but most of them do not allow quick browsing of the images. Even if the search is very fast, when the number of images is very large the system must let the user browse the data. In this paper, an image retrieval system is presented, including detailed descriptions of the lwo-SOM (local weighting observations Self-Organizing Map) approach used and a new interactive learning process using user information/response. We also show the use of unsupervised learning on an image dataset for which we have no labels and for which we do not take into account the text accompanying the images. The real dataset used contains 17812 images extracted from Wikipedia pages, each of which is characterized by its color and texture. 0 0
Adapting recommender systems to the requirements of personal health record systems Wiesner M.
Pfeifer D.
IHI'10 - Proceedings of the 1st ACM International Health Informatics Symposium English 2010 In the future many people in industrialized countries will manage their personal health data electronically in centralized, reliable and trusted repositories - so-called personal health record systems (PHR). At this stage PHR systems still fail to satisfy the individual medical information needs of their users. Personalized recommendations could solve this problem. A first approach of integrating recommender system (RS) methodology into personal health records - termed health recommender system (HRS) - is presented. By exploitation of existing semantic networks like Wikipedia a health graph data structure is obtained. The data kept within such a graph represent health related concepts and are used to compute semantic distances among pairs of such concepts. A ranking procedure based on the health graph is outlined which enables a match between entries of a PHR system and health information artifacts. This way a PHR user will obtain individualized health information he might be interested in. 0 0
Adaptive ranking of search results by considering user's comprehension Makoto Nakatani
Adam Jatowt
Katsumi Tanaka
Proceedings of the 4th International Conference on Ubiquitous Information Management and Communication ICUIMC 10 English 2010 Given a search query, conventional Web search engines provide users with the same ranking although users' comprehension levels can be different. It is often difficult, especially for non-expert users, to find comprehensible Web pages from the list of search results. In this paper, we propose a method of adaptively ranking search results by considering the user's comprehension level. The main issues are (a) estimating the comprehensibility of Web pages and (b) estimating the user's comprehension level. In our method, the comprehensibility of each search result is computed by using the readability index and technical terms extracted from Wikipedia. The user's comprehension level is estimated from the user's feedback about the difficulty of search results that they have viewed. We implement a prototype system and evaluate the usefulness of our approach by user experiments. 0 0
Analysis of implicit relations on wikipedia: Measuring strength through mining elucidatory objects Xiaodan Zhang
Yasuhito Asano
Masatoshi Yoshikawa
Lecture Notes in Computer Science English 2010 We focus on measuring relations between pairs of objects in Wikipedia whose pages can be regarded as individual objects. Two kinds of relations between two objects exist: in Wikipedia, an explicit relation is represented by a single link between the two pages for the objects, and an implicit relation is represented by a link structure containing the two pages. Previously proposed methods are inadequate for measuring implicit relations because they use only one or two of the following three important factors: distance, connectivity, and co-citation. We propose a new method reflecting all the three factors by using a generalized maximum flow. We confirm that our method can measure the strength of a relation more appropriately than these previously proposed methods do. Another remarkable aspect of our method is mining elucidatory objects, that is, objects constituting a relation. We explain that mining elucidatory objects opens a novel way to deeply understand a relation. 0 0
Analysis of implicit relations on wikipedia: measuring strength through mining elucidatory objects Xinpeng Zhang
Yasuhito Asano
Masatoshi Yoshikawa
DASFAA English 2010 0 0
Automatically suggesting topics for augmenting text documents Robert West
Doina Precup
Joelle Pineau
International Conference on Information and Knowledge Management, Proceedings English 2010 We present a method for automated topic suggestion. Given a plain-text input document, our algorithm produces a ranking of novel topics that could enrich the input document in a meaningful way. It can thus be used to assist human authors, who often fail to identify important topics relevant to the context of the documents they are writing. Our approach marries two algorithms originally designed for linking documents to Wikipedia articles, proposed by Milne and Witten [15] and West et al. [22]. While neither of them can suggest novel topics by itself, their combination does have this capability. The key step towards finding missing topics consists in generalizing from a large background corpus using principal component analysis. In a quantitative evaluation we conclude that our method achieves the precision of human editors when input documents are Wikipedia articles, and we complement this result with a qualitative analysis showing that the approach also works well on other types of input documents. 0 0
Computational Methods for Historical Research on Wikipedia's Archives Jonathan Cohen E-Research: A Journal of Undergraduate Work English 2010 This paper presents a novel study of geographic information implicit in the English Wikipedia archive. This project demonstrates a method to extract data from the archive with data mining, map the global distribution of Wikipedia editors through geocoding in GIS, and proceed with a spatial analysis of Wikipedia use in metropolitan cities. 0 0
Creating a Wikipedia-based Persian-English word association dictionary Rahimi Z.
Shakery A.
2010 5th International Symposium on Telecommunications, IST 2010 English 2010 One of the most important issues in cross language information retrieval is how to cross the language barrier between the query and the documents. Different translation resources have been studied for this purpose. In this research, we study using Wikipedia for query translation by constructing a Wikipedia-based bilingual association dictionary. We use English and Persian Wikipedia inter-language links to align related titles and then mine word by word associations between the two languages using the extracted alignments. We use the mined word association dictionary for translating queries in Persian-English cross language information retrieval. Our experimental results on Hamshahri corpus show that the proposed method is effective in extracting word associations and that Persian Wikipedia is a promising translation resource. Using the association dictionary, we can improve the pure dictionary-based method, where the only translation resource is a bilingual dictionary, by 33.6% and its recall by 26.2%. 0 0
Detecting task-based query sessions using collaborative knowledge Lucchese C.
Orlando S.
Perego R.
Silvestri F.
Tolomei G.
Proceedings - 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Workshops, WI-IAT 2010 English 2010 Our research challenge is to provide a mechanism for splitting into user task-based sessions a long-term log of queries submitted to a Web Search Engine (WSE). The hypothesis is that some query sessions entail the concept of user task. We present an approach that relies on a centroid-based and a density-based clustering algorithm, which consider queries inter-arrival times and use a novel distance function that takes care of query lexical content and exploits the collaborative knowledge collected by Wiktionary and Wikipedia. 0 0
Efficient Wikipedia-based semantic interpreter by exploiting top-k processing Kim J.W.
Ashwin Kashyap
Deyi Li
Sandilya Bhamidipati
International Conference on Information and Knowledge Management, Proceedings English 2010 Proper representation of the meaning of texts is crucial to enhancing many data mining and information retrieval tasks, including clustering, computing semantic relatedness between texts, and searching. Representing texts in the concept-space derived from Wikipedia has received growing attention recently, due to its comprehensiveness and expertise. This concept-based representation is capable of extracting semantic relatedness between texts that cannot be deduced with the bag of words model. A key obstacle, however, for using Wikipedia as a semantic interpreter is that the sheer size of the concepts derived from Wikipedia makes it hard to efficiently map texts into concept-space. In this paper, we develop an efficient algorithm which is able to represent the meaning of a text by using the concepts that best match it. In particular, our approach first computes the approximate top-k concepts that are most relevant to the given text. We then leverage these concepts for representing the meaning of the given text. The experimental results show that the proposed technique provides significant gains in execution time over current solutions to the problem. 0 0
Enhancing Short Text Clustering with Small External Repositories Petersen H.
Poon J.
Conferences in Research and Practice in Information Technology Series English 2010 The automatic clustering of textual data according to their semantic concepts is a challenging, yet important task. Choosing an appropriate method to apply when clustering text depends on the nature of the documents being analysed. For example, traditional clustering algorithms can struggle to correctly model collections of very short text due to their extremely sparse nature. In recent times, much attention has been directed to finding methods for adequately clustering short text. Many popular approaches employ large, external document repositories, such as Wikipedia or the Open Directory Project, to incorporate additional world knowledge into the clustering process. However the sheer size of many of these external collections can make these techniques difficult or time consuming to apply. This paper also employs external document collections to aid short text clustering performance. The external collections are referred to in this paper as Background Knowledge. In contrast to most previous literature a separate collection of Background Knowledge is obtained for each short text dataset. However, this Background Knowledge contains several orders of magnitude fewer documents than commonly used repositories like Wikipedia. A simple approach is described where the Background Knowledge is used to re-express short text in terms of a much richer feature space. A discussion of how best to cluster documents in this feature space is presented. A solution is proposed, and an experimental evaluation is performed that demonstrates significant improvement over clustering based on standard metrics with several publicly available datasets represented in the richer feature space. 0 0
Enishi: Searching knowledge about relations by complementarily utilizing wikipedia and the web Xiaodan Zhang
Yasuhito Asano
Masatoshi Yoshikawa
Lecture Notes in Computer Science English 2010 How do global warming and agriculture mutually influence each other? It is possible to answer the question by searching knowledge about the relation between global warming and agriculture. As exemplified by this question, strong demands exist for searching relations between objects. However, methods or systems for searching relations are not well studied. In this paper, we propose a relation search system named "Enishi." Enishi supplies a wealth of diverse multimedia information for deep understanding of relations between two objects by complementarily utilizing knowledge from Wikipedia and the Web. Enishi first mines elucidatory objects constituting relations between two objects from Wikipedia. We then propose new approaches for Enishi to search more multimedia information about relations on the Web using elucidatory objects. Finally, we confirm through experiments that our new methods can search useful information from the Web for deep understanding of relations. 0 0
Enishi: searching knowledge about relations by complementarily utilizing wikipedia and the web Xinpeng Zhang
Yasuhito Asano
Masatoshi Yoshikawa
WISE English 2010 0 0
Incorporating multi-partite networks and expertise to construct related-term graphs Shieh J.-R.
Lin C.-Y.
Wang S.-X.
Hsieh Y.-H.
Wu J.-L.
Proceedings - IEEE International Conference on Data Mining, ICDM English 2010 Term suggestion techniques recommend query terms to a user based on his initial query. Providing adequate term suggestions is a challenging task. Most existing commercial search engines suggest search terms based on the frequency of prior used terms that match the first few letters typed by the user. We present a novel mechanism to construct semantic term-relation graphs to suggest semantically relevant search terms. We build term relation graphs based on multi-partite networks of existing social media. These linkage networks are extracted from Wikipedia to eventually form term relation graphs. We propose incorporating contributor-category networks to model the contributor expertise. This step has been shown to significantly enhance the accuracy of the inferred relatedness of the term-semantic graphs. Experiments showed the obvious advantage of our algorithms over existing approaches. 0 0
Mining Wikipedia and Yahoo! Answers for question expansion in Opinion QA Yajie Miao
Chenliang Li
Lecture Notes in Computer Science English 2010 Opinion Question Answering (Opinion QA) is still a relatively new area in QA research. The achieved methods focus on combining sentiment analysis with the traditional Question Answering methods. Few attempts have been made to expand opinion questions with external background information. In this paper, we introduce the broad-mining and deep-mining strategies. Based on these two strategies, we propose four methods to exploit Wikipedia and Yahoo! Answers for enriching representation of questions in Opinion QA. The experimental results show that the proposed expansion methods perform effectively for improving existing Opinion QA models. 0 0
Mining and explaining relationships in Wikipedia Xiaodan Zhang
Yasuhito Asano
Masatoshi Yoshikawa
Lecture Notes in Computer Science English 2010 Mining and explaining relationships between objects are challenging tasks in the field of knowledge search. We propose a new approach for the tasks using disjoint paths formed by links in Wikipedia. To realize this approach, we propose a naive and a generalized flow based method, and a technique of avoiding flow confluences for forcing a generalized flow to be as disjoint as possible. We also apply the approach to classification of relationships. Our experiments reveal that the generalized flow based method can mine many disjoint paths important for a relationship, and the classification is effective for explaining relationships. 0 0
Mining and explaining relationships in wikipedia Xinpeng Zhang
Yasuhito Asano
Masatoshi Yoshikawa
DEXA English 2010 0 0
Mining the factors affecting the quality of Wikipedia articles Wu K.
Qinghua Zhu
Yang Zhao
Hua Zheng
Proceedings - 2010 International Conference of Information Science and Management Engineering, ISME 2010 English 2010 In order to observe the variation of factors affecting the quality of Wikipedia articles during the information quality improvement process, we proposed 28 metrics from four aspects, including lingual, structural, historical and reputational features, and then weighted each metric in different stages by using a neural network. We found lingual features weighted more in the lower quality stages, and structural features, along with historical features, became more important while article quality improved. However, reputational features were not as important as expected. The findings indicate that the information quality is mainly affected by completeness, and being well written is a basic requirement in the initial stage. Reputation of authors or editors is not so important in Wikipedia because of its horizontal structure. 0 0
Not so creepy crawler: Easy crawler generation with standard XML queries Von Dem Bussche F.
Weiand K.
Linse B.
Furche T.
Bry F.
Proceedings of the 19th International Conference on World Wide Web, WWW '10 English 2010 Web crawlers are increasingly used for focused tasks such as the extraction of data from Wikipedia or the analysis of social networks like In these cases, pages are far more uniformly structured than in the general Web and thus crawlers can use the structure of Web pages for more precise data extraction and more expressive analysis. In this demonstration, we present a focused, structure-based crawler generator, the "Not so Creepy Crawler" (nc2). What sets nc2 apart is that all analysis and decision tasks of the crawling process are delegated to an (arbitrary) XML query engine such as XQuery or Xcerpt. Customizing crawlers just means writing (declarative) XML queries that can access the currently crawled document as well as the metadata of the crawl process. We identify four types of queries that together suffice to realize a wide variety of focused crawlers. We demonstrate nc2 with two applications: The first extracts data about cities from Wikipedia with a customizable set of attributes for selecting and reporting these cities. It illustrates the power of nc2 where data extraction from Wiki-style, fairly homogeneous knowledge sites is required. In contrast, the second use case demonstrates how easy nc2 makes even complex analysis tasks on social networking sites, here exemplified by 0 0
Similarity search and locality sensitive hashing using ternary content addressable memories Shinde R.
Goel A.
Gupta P.
Dutta D.
Proceedings of the ACM SIGMOD International Conference on Management of Data English 2010 Similarity search methods are widely used as kernels in various data mining and machine learning applications including those in computational biology, web search/clustering. Nearest neighbor search (NNS) algorithms are often used to retrieve similar entries, given a query. While there exist efficient techniques for exact query lookup using hashing, similarity search using exact nearest neighbors suffers from a "curse of dimensionality", i.e. for high dimensional spaces, best known solutions offer little improvement over brute force search and thus are unsuitable for large scale streaming applications. Fast solutions to the approximate NNS problem include Locality Sensitive Hashing (LSH) based techniques, which need storage polynomial in n with exponent greater than 1, and query time sublinear, but still polynomial in n, where n is the size of the database. In this work we present a new technique of solving the approximate NNS problem in Euclidean space using a Ternary Content Addressable Memory (TCAM), which needs near linear space and has O(1) query time. In fact, this method also works around the best known lower bounds in the cell probe model for the query time using a data structure near linear in the size of the database. TCAMs are high performance associative memories widely used in networking applications such as address lookups and access control lists. A TCAM can query for a bit vector within a database of ternary vectors, where every bit position represents 0, 1 or *. The * is a wild card representing either a 0 or a 1. We leverage TCAMs to design a variant of LSH, called Ternary Locality Sensitive Hashing (TLSH) wherein we hash database entries represented by vectors in the Euclidean space into {0,1,*}.
By using the added functionality of a TLSH scheme with respect to the * character, we solve an instance of the approximate nearest neighbor problem with 1 TCAM access and storage nearly linear in the size of the database. We validate our claims with extensive simulations using both real world (Wikipedia) as well as synthetic (but illustrative) datasets. We observe that using a TCAM of width 288 bits, it is possible to solve the approximate NNS problem on a database of size 1 million points with high accuracy. Finally, we design an experiment with TCAMs within an enterprise ethernet switch (Cisco Catalyst 4500) to validate that TLSH can be used to perform 1.5 million queries per second per 1Gb/s port. We believe that this work can open new avenues in very high speed data mining. 0 0
Text clustering via term semantic units Liping Jing
Jiali Yun
Jian Yu
Houkuan Huang
Proceedings - 2010 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2010 English 2010 How best to represent text data is an important problem in text mining tasks including information retrieval, clustering, classification, etc. In this paper, we propose a compact document representation with term semantic units which are identified from the implicit and explicit semantic information. The implicit semantic information is extracted from syntactic content via statistical methods such as latent semantic indexing and information bottleneck. The explicit semantic information is mined from the external semantic resource (Wikipedia). The proposed compact representation model can map a document collection into a low-dimension space (term semantic units, which are far fewer than the number of all unique terms). Experimental results on real data sets have shown that the compact representation efficiently improves the performance of text clustering. 0 0
The study of individuation products customization systems based on Wiki Dameng D.
Wu C.
3rd International Conference on Knowledge Discovery and Data Mining, WKDD 2010 English 2010 With the globalization of the economy, market competition is more and more intense, and the individuation demands of customers have grown and changed. To meet more customers' requirements in product customization, individuation product customization design is an important component of customization marketing and has attracted many researchers' interest. In this paper, the individuation product customization design system is migrated to an open wiki, which is of great significance for satisfying customers' individuation requirements. 0 0
Towards community discovery in signed collaborative interaction networks Bogdanov P.
Larusso N.D.
Amit Singh
Proceedings - IEEE International Conference on Data Mining, ICDM English 2010 We propose a framework for discovery of collaborative community structure in Wiki-based knowledge repositories based on raw-content generation analysis. We leverage topic modelling in order to capture agreement and opposition of contributors and analyze these multi-modal relations to map communities in the contributor base. The key steps of our approach include (i) modeling of pair wise variable-strength contributor interactions that can be both positive and negative, (ii) synthesis of a global network incorporating all pair wise interactions, and (iii) detection and analysis of community structure encoded in such networks. The global community discovery algorithm we propose outperforms existing alternatives in identifying coherent clusters according to objective optimality criteria. Analysis of the discovered community structure reveals coalitions of common interest editors who back each other in promoting some topics and collectively oppose other coalitions or single authors. We couple contributor interactions with content evolution and reveal the global picture of opposing themes within the self-regulated community base for both controversial and featured articles in Wikipedia. 0 0
Using encyclopaedic knowledge for query classification Richard Khoury Proceedings of the 2010 International Conference on Artificial Intelligence, ICAI 2010 English 2010 Identifying the intended topic that underlies a user's query can benefit a large range of applications, from search engines to question-answering systems. However, query classification remains a difficult challenge due to the variety of queries a user can ask, the wide range of topics users can ask about, and the limited amount of information that can be mined from the query. In this paper, we develop a new query classification system that accounts for these three challenges. Our system relies on encyclopaedic knowledge to understand the user's query and fill in the gaps of missing information. Specifically, we use the freely-available online encyclopaedia Wikipedia as a natural-language knowledge base, and exploit Wikipedia's structure to infer the correct classification of any user query. 0 0
Wikipedia2Onto - building concept ontology automatically, experimenting with web image retrieval Haofen Wang
Xing Jiang
Chia L.-T.
Tan A.-H.
Informatica (Ljubljana) English 2010 Given its effectiveness to better understand data, ontology has been used in various domains including artificial intelligence, biomedical informatics and library science. What we have tried to promote is the use of ontology to better understand media (in particular, images) on the World Wide Web. This paper describes our preliminary attempt to construct a large-scale multi-modality ontology, called AutoMMOnto, for web image classification. Particularly, to enable the automation of text ontology construction, we take advantage of both structural and content features of Wikipedia and formalize real world objects in terms of concepts and relationships. For the visual part, we train classifiers according to both global and local features, and generate middle-level concepts from the training images. A variant of the association rule mining algorithm is further developed to refine the built ontology. Our experimental results show that our method allows automatic construction of a large-scale multi-modality ontology with high accuracy from a challenging web image data set. 0 0
A web recommender system based on dynamic sampling of user information access behaviors Jilin Chen
Shtykh R.Y.
Jin Q.
Proceedings - IEEE 9th International Conference on Computer and Information Technology, CIT 2009 English 2009 In this study, we propose a Gradual Adaption Model for a Web recommender system. This model is used to track users' focus of interests and its transition by analyzing their information access behaviors, and recommend appropriate information. A set of concept classes are extracted from Wikipedia. The pages accessed by users are classified by the concept classes, and grouped into three terms of short, medium and long periods, and two categories of remarkable and exceptional for each concept class, which are used to describe users' focus of interests, and to establish reuse probability of each concept class in each term for each user by Full Bayesian Estimation as well. According to the reuse probability and period, the information that a user is likely to be interested in is recommended. In this paper, we propose a new approach by which short and medium periods are determined based on dynamic sampling of user information access behaviors. We further present experimental simulation results, and show the validity and effectiveness of the proposed system. 0 0
An empirical study on criteria for assessing information quality in corporate wikis Friberg T.
Reinhardt W.
Proceedings of the 2009 International Conference on Information Quality, ICIQ 2009 English 2009 Wikis gain more and more attention as a tool for corporate knowledge management. The usage of corporate wikis differs from public wikis like the Wikipedia as there are hardly any wiki wars or copyright issues. Nevertheless the quality of the available articles is of high importance in corporate wikis as well as in public ones. This paper presents the results from an empirical study on criteria for assessing information quality of articles in corporate wikis. Therefore existing approaches for assessing information quality are evaluated and a specific wiki-set of criteria is defined. This wiki-set was examined in a study with participants from 21 different German companies using wikis as an essential part of their knowledge management toolbox. Furthermore this paper discusses various ways for the automatic and manual rating of information quality and the technical implementation of such an IQ-profile for wikis. 0 0
Automatic link detection: A sequence labeling approach Gardner J.J.
Xiong L.
International Conference on Information and Knowledge Management, Proceedings English 2009 The popularity of Wikipedia and other online knowledge bases has recently produced an interest in the machine learning community for the problem of automatic linking. Automatic hyperlinking can be viewed as two sub problems - link detection which determines the source of a link, and link disambiguation which determines the destination of a link. Wikipedia is a rich corpus with hyperlink data provided by authors. It is possible to use this data to train classifiers to be able to mimic the authors in some capacity. In this paper, we introduce automatic link detection as a sequence labeling problem. Conditional random fields (CRFs) are a probabilistic framework for labeling sequential data. We show that training a CRF with different types of features from the Wikipedia dataset can be used to automatically detect links with almost perfect precision and high recall. Copyright 2009 ACM. 0 0
Automatic multilingual lexicon generation using wikipedia as a resource Shahid A.R.
Kazakov D.
ICAART 2009 - Proceedings of the 1st International Conference on Agents and Artificial Intelligence English 2009 This paper proposes a method for creating a multilingual dictionary by taking the titles of Wikipedia pages in English and then finding the titles of the corresponding articles in other languages. The creation of such multilingual dictionaries has become possible as a result of exponential increase in the size of multilingual information on the web. Wikipedia is a prime example of such multilingual source of information on any conceivable topic in the world, which is edited by the readers. Here, a web crawler has been used to traverse Wikipedia following the links on a given page. The crawler takes out the title along with the titles of the corresponding pages in other targeted languages. The result is a set of words and phrases that are translations of each other. For efficiency, the URLs are organized using hash tables. A lexicon has been constructed which contains 7-tuples corresponding to 7 different languages, namely: English, German, French, Polish, Bulgarian, Greek and Chinese. 0 0
Completing Wikipedia's hyperlink structure through dimensionality reduction Robert West
Doina Precup
Joelle Pineau
International Conference on Information and Knowledge Management, Proceedings English 2009 Wikipedia is the largest monolithic repository of human knowledge. In addition to its sheer size, it represents a new encyclopedic paradigm by interconnecting articles through hyperlinks. However, since these links are created by human authors, links one would expect to see are often missing. The goal of this work is to detect such gaps automatically. In this paper, we propose a novel method for augmenting the structure of hyperlinked document collections such as Wikipedia. It does not require the extraction of any manually defined features from the article to be augmented. Instead, it is based on principal component analysis, a well-founded mathematical generalization technique, and predicts new links purely based on the statistical structure of the graph formed by the existing links. Our method does not rely on the textual content of articles; we are exploiting only hyperlinks. A user evaluation of our technique shows that it improves the quality of top link suggestions over the state of the art and that the best predicted links are significantly more valuable than the 'average' link already present in Wikipedia. Beyond link prediction, our algorithm can potentially be used to point out topics an article misses to cover and to cluster articles semantically. Copyright 2009 ACM. 0 0
Easiest-first search: Towards comprehension-based web search Makoto Nakatani
Adam Jatowt
Katsumi Tanaka
International Conference on Information and Knowledge Management, Proceedings English 2009 Although Web search engines have become information gateways to the Internet, for queries containing technical terms, search results often contain pages that are difficult to be understood by non-expert users. Therefore, re-ranking search results in a descending order of their comprehensibility should be effective for non-expert users. In our approach, the comprehensibility of Web pages is estimated considering both the document readability and the difficulty of technical terms in the domain of search queries. To extract technical terms, we exploit the domain knowledge extracted from Wikipedia. Our proposed method can be applied to general Web search engines as Wikipedia includes nearly every field of human knowledge. We demonstrate the usefulness of our approach by user experiments. Copyright 2009 ACM. 0 0
Entity extraction via ensemble semantics Pennacchiotti M.
Pantel P.
EMNLP 2009 - Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: A Meeting of SIGDAT, a Special Interest Group of ACL, Held in Conjunction with ACL-IJCNLP 2009 English 2009 Combining information extraction systems yields significantly higher quality resources than each system in isolation. In this paper, we generalize such a mixing of sources and features in a framework called Ensemble Semantics. We show very large gains in entity extraction by combining state-of-the-art distributional and pattern-based systems with a large set of features from a webcrawl, query logs, and Wikipedia. Experimental results on a webscale extraction of actors, athletes and musicians show significantly higher mean average precision scores (29% gain) compared with the current state of the art. 0 0
Exploiting Wikipedia as a knowledge base: Towards an ontology of movies Alarcon R.
Sanchez O.
Mijangos V.
CEUR Workshop Proceedings English 2009 Wikipedia is a huge knowledge base growing every day due to the contribution of people all around the world. Some part of the information of each article is kept in a special, consistently and formatted table called infobox. In this article, we analyze the Wikipedia infoboxes of movies articles; we describe some of the problems that can make extracting information from these tables a difficult task. We also present a methodology to automatically extract information that could be useful towards the building of an ontology of movies from Wikipedia in Spanish. 0 0
Exploring wikipedia and DMoz as knowledge bases for engineering a user interests hierarchy for social network applications Mandar Haridas
Doina Caragea
Lecture Notes in Computer Science English 2009 The outgrowth of social networks in the recent years has resulted in opportunities for interesting data mining problems, such as interest or friendship recommendations. A global ontology over the interests specified by the users of a social network is essential for accurate recommendations. We propose, evaluate and compare three approaches to engineering a hierarchical ontology over user interests. The proposed approaches make use of two popular knowledge bases, Wikipedia and Directory Mozilla, to extract interest definitions and/or relationships between interests. More precisely, the first approach uses Wikipedia to find interest definitions, the latent semantic analysis technique to measure the similarity between interests based on their definitions, and an agglomerative clustering algorithm to group similar interests into higher level concepts. The second approach uses the Wikipedia Category Graph to extract relationships between interests, while the third approach uses Directory Mozilla to extract relationships between interests. Our results show that the third approach, although the simplest, is the most effective for building a hierarchy over user interests. 0 0
Improving the extraction of bilingual terminology from Wikipedia Maike Erdmann
Kotaro Nakayama
Takahiro Hara
Shojiro Nishio
ACM Trans. Multimedia Comput. Commun. Appl. English 2009 Research on the automatic construction of bilingual dictionaries has achieved impressive results. Bilingual dictionaries are usually constructed from parallel corpora, but since these corpora are available only for selected text domains and language pairs, the potential of other resources is being explored as well. In this article, we want to further pursue the idea of using Wikipedia as a corpus for bilingual terminology extraction. We propose a method that extracts term-translation pairs from different types of Wikipedia link information. After that, an SVM classifier trained on the features of manually labeled training data determines the correctness of unseen term-translation pairs. 2009 ACM. 0 0
Measuring Wikipedia: A hands-on tutorial Luca de Alfaro
Felipe Ortega
WikiSym English 2009 This tutorial is an introduction to the best methodologies, tools and practices for Wikipedia research. The tutorial will be led by Luca de Alfaro (Wiki Lab at UCSC, California, USA) and Felipe Ortega (Libresoft, URJC, Madrid, Spain). Both have accumulated several years of practical experience exploring and processing Wikipedia data [1], [2], [3]. In addition, their respective research groups have led the development of two cutting-edge software tools (WikiTrust and WikiXRay) for analyzing Wikipedia. WikiTrust implements an author reputation system, and a text trust system, for wikis. WikiXRay is a tool automating the quantitative analysis of any language version of Wikipedia (in general, any wiki based on MediaWiki). 0 0
Measuring Wikipedia: a hands-on tutorial Luca de Alfaro
Felipe Ortega
WikiSym English 2009 0 0
Mining meaning from Wikipedia Olena Medelyan
David N. Milne
Catherine Legg
Ian H. Witten
Int. J. Hum.-Comput. Stud.
International Journal of Human Computer Studies
English 2009 Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks. This article provides a comprehensive description of this work. It focuses on research that extracts and makes use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval and information extraction; and as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced. 2009 Elsevier Ltd. All rights reserved. 0 4
Named entity resolution using automatically extracted semantic information Pilz A.
Paass G.
LWA 2009 - Workshop-Woche: Lernen-Wissen-Adaptivitat - Learning, Knowledge, and Adaptivity English 2009 One major problem in text mining and semantic retrieval is that detected entity mentions have to be assigned to the true underlying entity. The ambiguity of a name results from both the polysemy and synonymy problem, as the name of a unique entity may be written in variant ways and different unique entities may have the same name. The term "bush" for instance may refer to a woody plant, a mechanical fixing, a nocturnal primate, 52 persons and 8 places covered in Wikipedia and thousands of other persons. For the first time, according to our knowledge we apply a kernel entity resolution approach to the German Wikipedia as reference for named entities. We describe the context of named entities in Wikipedia and the context of a detected name phrase in a new document by a context vector of relevant features. These are designed from automatically extracted topic indicators generated by an LDA topic model. We use kernel classifiers, e.g. rank classifiers, to determine the right matching entity but also to detect uncovered entities. In comparison to a baseline approach using only text similarity the addition of topics approach gives a much higher f-value, which is comparable to the results published for English. It turns out that the procedure also is able to detect with high reliability if a person is not covered by the Wikipedia. 0 0
QuWi: Quality control in Wikipedia Alberto Cusinato
Vincenzo Della Mea
Francesco Di Salvatore
Stefano Mizzaro
WICOW'09 - Proceedings of the 3rd Workshop on Information Credibility on the Web, Co-located with WWW 2009 English 2009 We propose and evaluate QuWi (Quality in Wikipedia), a framework for quality control in Wikipedia. We build upon a previous proposal by Mizzaro [11], who proposed a method for substituting and/or complementing peer review in scholarly publishing. Since articles in Wikipedia are never finished, and their authors change continuously, we define a modified algorithm that takes into account the different domain, with particular attention to the fact that authors contribute identifiable pieces of information that can be further modified by other authors. The algorithm assigns quality scores to articles and contributors. The scores assigned to articles can be used, e.g., to let the reader understand how reliable the articles he or she is looking at are, or to help contributors in identifying low quality articles to be enhanced. The scores assigned to users measure the average quality of their contributions to Wikipedia and can be used, e.g., for conflict resolution policies based on the quality of involved users. Our proposed algorithm is experimentally evaluated by analyzing the obtained quality scores on articles for deletion and featured articles, also on six temporal Wikipedia snapshots. Preliminary results demonstrate that the proposed algorithm seems to appropriately identify high and low quality articles, and that high quality authors produce more long-lived contributions than low quality authors. Copyright 200X ACM. 0 0
Quality Evaluation of Search Results by Typicality and Speciality of Terms Extracted from Wikipedia Makoto Nakatani
Adam Jatowt
Hiroaki Ohshima
Katsumi Tanaka
DASFAA English 2009 0 0
Towards a universal text classifier: Transfer learning using encyclopedic knowledge Pu Wang
Carlotta Domeniconi
ICDM Workshops 2009 - IEEE International Conference on Data Mining English 2009 Document classification is a key task for many text mining applications. However, traditional text classification requires labeled data to construct reliable and accurate classifiers. Unfortunately, labeled data are seldom available. In this work, we propose a universal text classifier, which does not require any labeled document. Our approach simulates the capability of people to classify documents based on background knowledge. As such, we build a classifier that can effectively group documents based on their content, under the guidance of few words describing the classes of interest. Background knowledge is modeled using encyclopedic knowledge, namely Wikipedia. The universal text classifier can also be used to perform document retrieval. In our experiments with real data we test the feasibility of our approach for both the classification and retrieval tasks. 0 0
Wiki-enabled semantic data mining - Task design, evaluation and refinement Atzmueller M.
Lemmerich F.
Jochen Reutelshoefer
Frank Puppe
CEUR Workshop Proceedings English 2009 Complementing semantic data mining systems with wikis, and especially semantic wikis, yields a flexible knowledge-rich method. This paper describes a system architecture of a collaborative approach for semantic data mining. The goal is to enhance the design, evaluation and refinement of data mining tasks using semantic technology. Collaborative aspects are introduced by utilizing wiki technology. We present the components and describe their interaction and application in detail. 0 0
A Search Engine for Browsing the Wikipedia Thesaurus Kotaro Nakayama
Takahiro Hara
Sojiro Nishio
13th International Conference on Database Systems for Advanced Applications, Demo session (DASFAA) 2008 Wikipedia has become a huge phenomenon on the WWW. As a corpus for knowledge extraction, it has various impressive characteristics such as a huge amount of articles, live updates, a dense link structure, brief link texts and URL identification for concepts. In our previous work, we proposed link structure mining algorithms to extract a huge scale and accurate association thesaurus from Wikipedia. The association thesaurus covers almost 1.3 million concepts and the significant accuracy is proved in detailed experiments. To prove its practicality, we implemented three features on the association thesaurus; a search engine for browsing Wikipedia Thesaurus, an XML Web service for the thesaurus and a Semantic Web support feature. We show these features in this demonstration. 0 0
Constructing a Global Ontology by Concept Mapping using Wikipedia Thesaurus Minghua Pei
Kotaro Nakayama
Takahiro Hara
Sojiro Nishio
International Symposium on Mining And Web (IEEE MAW) conjunction with IEEE AINA 2008 0 0
Gazetiki: Automatic creation of a geographical gazetteer Adrian Popescu
Gregory Grefenstette
Moellic P.-A.
Proceedings of the ACM International Conference on Digital Libraries English 2008 Geolocalized databases are becoming necessary in a wide variety of application domains. Thus far, the creation of such databases has been a costly, manual process. This drawback has stimulated interest in automating their construction, for example, by mining geographical information from the Web. Here we present and evaluate a new automated technique for creating and enriching a geographical gazetteer, called Gazetiki. Our technique merges disparate information from Wikipedia, Panoramio, and web search engines in order to identify geographical names, categorize these names, find their geographical coordinates and rank them. The information produced in Gazetiki enhances and complements the Geonames database, using a similar domain model. We show that our method provides a richer structure and an improved coverage compared to another known attempt at automatically building a geographic database and, where possible, we compare our Gazetiki to Geonames. Copyright 2008 ACM. 0 0
Growing Fields of Interest - Using an Expand and Reduce Strategy for Domain Model Extraction Christopher Thomas
Pankaj Mehra
Roger Brooks
Amit Sheth
IEEE/WIC International Conference on Web Intelligence, Sydney, Australia 2008 Domain hierarchies are widely used as models underlying information retrieval tasks. Formal ontologies and taxonomies enrich such hierarchies further with properties and relationships associated with concepts and categories but require manual effort; therefore they are costly to maintain, and often stale. Folksonomies and vocabularies lack rich category structure and are almost entirely devoid of properties and relationships. Classification and extraction require the coverage of vocabularies and the alterability of folksonomies and can largely benefit from category relationships and other properties. With Doozer, a program for building conceptual models of information domains, we want to bridge the gap between the vocabularies and folksonomies on the one side and the rich, expert-designed ontologies and taxonomies on the other. Doozer mines Wikipedia to produce tight domain hierarchies, starting with simple domain descriptions. It also adds relevancy scores for use in automated classification of information. The output model is described as a hierarchy of domain terms that can be used immediately for classifiers and IR systems or as a basis for manual or semi-automatic creation of formal ontologies. 0 0
Handling implicit geographic evidence for geographic IR Nuno Cardoso
Silva M.J.
Diana Santos
International Conference on Information and Knowledge Management, Proceedings English 2008 Most geographic information retrieval systems depend on the detection and disambiguation of place names in documents, assuming that the documents with a specific geographic scope contain explicit place names in the text that are strongly related to the document scopes. However, some non-geographic names such as companies, monuments or sport events, may also provide indirect relevant evidence that can significantly contribute to the assignment of geographic scopes to documents. In this paper, we analyze the amount of implicit and explicit geographic evidence in newspaper documents, and measure its impact on geographic information retrieval by evaluating the performance of a retrieval system using the GeoCLEF evaluation data. 0 0
Information extraction from Wikipedia: Moving down the long tail Fei Wu
Raphael Hoffmann
Weld D.S.
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining English 2008 Not only is Wikipedia a comprehensive source of quality information, it has several kinds of internal structure (e.g., relational summaries known as infoboxes), which enable self-supervised information extraction. While previous efforts at extraction from Wikipedia achieve high precision and recall on well-populated classes of articles, they fail in a larger number of cases, largely because incomplete articles and infrequent use of infoboxes lead to insufficient training data. This paper presents three novel techniques for increasing recall from Wikipedia's long tail of sparse classes: (1) shrinkage over an automatically-learned subsumption taxonomy, (2) a retraining technique for improving the training data, and (3) supplementing results by extracting from the broader Web. Our experiments compare design variations and show that, used in concert, these techniques increase recall by a factor of 1.76 to 8.71 while maintaining or increasing precision. 0 0
Meliorated approach for extracting Bilingual terminology from wikipedia Ajay Gupta
Goyal A.
Bindal A.
Proceedings of 11th International Conference on Computer and Information Technology, ICCIT 2008 English 2008 With the demand for accurate and domain-specific bilingual dictionaries, research in the field of automatic dictionary extraction has become popular. Due to the lack of domain-specific terminology in parallel corpora, extraction of bilingual terminology from Wikipedia (a corpus for knowledge extraction having a huge amount of articles, links within different languages, a dense link structure and a number of redirect pages) has become a new research direction in the field of bilingual dictionary creation. Our method not only analyzes interlanguage links along with redirect page titles and linktext titles but also filters out inaccurate translation candidates using pattern matching. The score of each translation candidate is calculated using page parameters, and an appropriate threshold is then set, in contrast to the previous approach, which was based solely on backward links. In our experiment, we demonstrated the advantages of our approach compared to the traditional approach. 0 0
Mining Wikipedia Resources for Discovering Answers to List Questions in Web Snippets Alejandro Figueroa SKG English 2008 0 0
Mining Wikipedia for Discovering Multilingual Definitions on the Web Alejandro Figueroa SKG English 2008 0 0
Public chemical compound databases Williams A.J. Current Opinion in Drug Discovery and Development English 2008 The internet has rapidly become the first port of call for all information searches. The increasing array of chemistry-related resources that are now available provides chemists with a direct path to the information that was previously accessed via library services and was limited by commercial and costly resources. The diversity of the information that can be accessed online is expanding at a dramatic rate, and the support for publicly available resources offers significant opportunities in terms of the benefits to science and society. While the data online do not generally meet the quality standards of manually curated sources, there are efforts underway to gather scientists together and 'crowdsource' an improvement in the quality of the available data. This review discusses the types of public compound databases that are available online and provides a series of examples. Focus is also given to the benefits and disruptions associated with the increased availability of such data and the integration of technologies to data mine this information. 0 0
Remote sensing ontology development for data interoperability Nagai M.
Ono M.
Shibasaki R.
29th Asian Conference on Remote Sensing 2008, ACRS 2008 English 2008 Remote sensing ontology is developed not only for integrating earth observation data, but also for knowledge sharing and information transfer. Ontological information is used for data sharing services such as support of metadata design, structuring of data contents, and support of text mining. Remote sensing ontology is constructed based on Semantic MediaWiki. Ontological information is added to the dictionary by digitizing text-based dictionaries, developing a "knowledge writing tool" for experts, and extracting semantic relations from authoritative documents by applying natural language processing techniques. The ontology system containing the dictionary is developed as a lexicographic ontology. Also, the constructed ontological information is used for the reverse dictionary. 0 0
Tagpedia: A semantic reference to describe and search for Web resources Francesco Ronzano
Andrea Marchetti
Maurizio Tesconi
CEUR Workshop Proceedings English 2008 Nowadays the Web represents a growing collection of an enormous amount of content, where the need for better ways to find and organize the available data is becoming a fundamental issue in order to deal with information overload. Keyword-based Web searches are currently the preferred means of seeking content related to a specific topic. Search engines and collaborative tagging systems make the search for information possible thanks to the association of descriptive keywords with Web resources. All of them show problems of inconsistency and a consequent reduction of recall and precision of searches, due to polysemy, synonymy and in general all the different lexical forms that can be used to refer to a particular meaning. A possible way to face or at least reduce these problems is the introduction of semantics to characterize the contents of Web resources: each resource is described by one or more concepts instead of simple and often ambiguous keywords. To support this task, the availability of a global semantic resource of reference is fundamental. On the basis of our past experience with the semantic tagging of Web resources and the SemKey Project, we are developing Tagpedia, a general-domain "encyclopedia" of tags, semantically structured for generating semantic descriptions of contents over the Web, created by mining Wikipedia. In this paper, starting from an analysis of the weak points of non-semantic keyword-based Web searches, we introduce our idea of semantic characterization of Web resources, describing the structure and organization of Tagpedia. We introduce our first realization of Tagpedia, suggesting all the possible improvements that can be carried out in order to exploit its full potential. 0 0
Wikipedia Mining for Huge Scale Japanese Association Thesaurus Construction Kotaro Nakayama
Masahiro Ito
Takahiro Hara
Shojiro Nishio
AINAW English 2008 0 0
Wikipedia Mining: Wikipedia as a Corpus for Knowledge Extraction Kotaro Nakayama
Minghua Pei
Maike Erdmann
Masahiro Ito
Masumi Shirakawa
Takahiro Hara
Shojiro Nishio
Wikimania English 2008 Wikipedia, a collaborative Wiki-based encyclopedia, has become a huge phenomenon among Internet users. It covers a huge number of concepts of various fields such as Arts, Geography, History, Science, Sports and Games. As a corpus for knowledge extraction, Wikipedia's impressive characteristics are not limited to the scale, but also include the dense link structure, word sense disambiguation based on URL and brief anchor texts. Because of these characteristics, Wikipedia has become a promising corpus and a big frontier for researchers. A considerable amount of research on Wikipedia Mining, such as semantic relatedness measurement, bilingual dictionary construction, and ontology construction, has been conducted. In this paper, we take a comprehensive, panoramic view of Wikipedia as a Web corpus, since almost all previous studies exploit only parts of the Wikipedia characteristics. The contribution of this paper is threefold. First, we unveil the characteristics of Wikipedia as a corpus for knowledge extraction in detail. In particular, we describe the importance of anchor texts with special emphasis, since they provide helpful information for both disambiguation and synonym extraction. Second, we introduce some of our Wikipedia mining research as well as research conducted by other researchers in order to prove the worth of Wikipedia. Finally, we discuss possible directions of Wikipedia research. 0 0
Wikipedia link structure and text mining for semantic relation extraction towards a huge scale global web ontology Kotaro Nakayama
Takahiro Hara
Shojiro Nishio
CEUR Workshop Proceedings English 2008 Wikipedia, a collaborative Wiki-based encyclopedia, has become a huge phenomenon among Internet users. It covers a huge number of concepts of various fields such as Arts, Geography, History, Science, Sports and Games. Since it is becoming a database storing all human knowledge, Wikipedia mining is a promising approach that bridges the Semantic Web and the Social Web (a.k.a. Web 2.0). In fact, previous research on Wikipedia mining has strongly demonstrated that Wikipedia has a remarkable capability as a corpus for knowledge extraction, especially for relatedness measurement among concepts. However, semantic relatedness is just a numerical strength of a relation and does not have an explicit relation type. To extract inferable semantic relations with explicit relation types, we need to analyze not only the link structure but also texts in Wikipedia. In this paper, we propose a consistent approach of semantic relation extraction from Wikipedia. The method consists of three sub-processes highly optimized for Wikipedia mining: 1) fast preprocessing, 2) POS (Part Of Speech) tag tree analysis, and 3) mainstay extraction. Furthermore, our detailed evaluation proved that link structure mining improves both the accuracy and the scalability of semantic relations extraction. 0 0
A Knowledge-Based Search Engine Powered by Wikipedia David N. Milne
Ian H. Witten
David M. Nichols
CIKM ‘07 2007 This paper describes a new technique for obtaining measures of semantic relatedness. Like other recent approaches, it uses Wikipedia to provide a vast amount of structured world knowledge about the terms of interest. Our system, the Wikipedia Link Vector Model or WLVM, is unique in that it does so using only the hyperlink structure of Wikipedia rather than its full textual content. To evaluate the algorithm we use a large, widely used test set of manually defined measures of semantic relatedness as our bench-mark. This allows direct comparison of our system with other similar techniques. 0 1
A Thesaurus Construction Method from Large Scale Web Dictionaries Kotaro Nakayama
Takahiro Hara
Sojiro Nishio
21st IEEE International Conference on Advanced Information Networking and Applications (AINA) 2007 Web-based dictionaries, such as Wikipedia, have become dramatically popular among internet users in the past several years. The important characteristic of a Web-based dictionary is not only the huge number of articles, but also its hyperlinks. Hyperlinks carry various information beyond simply providing navigation between pages. In this paper, we propose an efficient method to analyze the link structure of Web-based dictionaries to construct an association thesaurus. We have already applied it to Wikipedia, a huge scale Web-based dictionary which has a dense link structure, as a corpus. We developed a search engine for evaluation, then conducted a number of experiments to compare our method with other traditional methods such as co-occurrence analysis. 0 0
Beyond Ubiquity: Co-creating Corporate Knowledge with a Wiki H. Hasan
J. A. Meloche
C. C. Pfaff
D. Willis
Proceedings - International Conference on Mobile Ubiquitous Computing, Systems, Services and Technologies, UBICOMM 2007 English 2007 Despite their reputation as an evolving shared knowledge repository, Wikis are often treated with suspicion in organizations for management, social and legal reasons. Following studies of unsuccessful Wiki projects, a field study was undertaken of a corporate Wiki that has been developed to capture, and make available, organizational knowledge for a large manufacturing company as an initiative of their Knowledge Management program. A Q Methodology research approach was selected to uncover employees' subjective attitudes to the Wiki so that the firm could more fully exploit the potential of the Wiki as a ubiquitous tool for tacit knowledge management. 0 1
Computing Semantic Relatedness using Wikipedia Link Structure David N. Milne Proc. of NZCSRSC, 2007 2007 This paper describes a new technique for obtaining measures of semantic relatedness. Like other recent approaches, it uses Wikipedia to provide a vast amount of structured world knowledge about the terms of interest. Our system, the Wikipedia Link Vector Model or WLVM, is unique in that it does so using only the hyperlink structure of Wikipedia rather than its full textual content. To evaluate the algorithm we use a large, widely used test set of manually defined measures of semantic relatedness as our benchmark. This allows direct comparison of our system with other similar techniques. 0 2
Computing semantic relatedness using Wikipedia Link structure Milne D. Proceedings of NZCSRSC 2007, the 5th New Zealand Computer Science Research Student Conference English 2007 This paper describes a new technique for obtaining measures of semantic relatedness. Like other recent approaches, it uses Wikipedia to provide a vast amount of structured world knowledge about the terms of interest. Our system, the Wikipedia Link Vector Model or WLVM, is unique in that it does so using only the hyperlink structure of Wikipedia rather than its full textual content. To evaluate the algorithm we use a large, widely used test set of manually defined measures of semantic relatedness as our benchmark. This allows direct comparison of our system with other similar techniques. 0 2
Exploiting web 2.0 for all knowledge-based information retrieval Milne D.N. International Conference on Information and Knowledge Management, Proceedings English 2007 This paper describes ongoing research into obtaining and using knowledge bases to assist information retrieval. These structures are prohibitively expensive to obtain manually, yet automatic approaches have been researched for decades with limited success. This research investigates a potential shortcut: a way to provide knowledge bases automatically, without expecting computers to replace expert human indexers. Instead we aim to replace the professionals with thousands or even millions of amateurs: with the growing community of contributors who form the core of Web 2.0. Specifically we focus on Wikipedia, which represents a rich tapestry of topics and semantics and a huge investment of human effort and judgment. We show how this can be directly exploited to provide manually-defined yet inexpensive knowledge-bases that are specifically tailored to expose the topics, terminology and semantics of individual document collections. We are also concerned with how best to make these structures available to users, and aim to produce a complete knowledge-based retrieval system (both the knowledge base and the tools to apply it) that can be evaluated by how well it assists real users in performing realistic and practical information retrieval tasks. To this end we have developed Koru, a new search engine that offers concrete evidence of the effectiveness of our Web 2.0 based techniques for assisting information retrieval. 0 0
Exploring wikipedia and query log's ability for text feature representation Li B.
Chen Q.-C.
Yeung D.S.
Ng W.W.Y.
Wang X.-L.
Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, ICMLC 2007 English 2007 The rapid growth of internet technology requires better management of web page contents. Much text mining research has been conducted, such as text categorization, information retrieval, and text clustering. When machine learning methods or statistical models are applied to such a large scale of data, the first problem we have to solve is to represent a text document in a way that computers can handle. Traditionally, single words are employed as features in the Vector Space Model, which make up the feature space for all text documents. The single-word based representation assumes word independence and doesn't consider relations between words, which may cause information loss. This paper proposes Wiki-Query segmented features for text classification, in hopes of better using the text information. The experimental results show that a much better F1 value has been achieved than that of classical single-word based text representation. This means that Wikipedia and query segmented features can better represent a text document. 0 0
Generating Educational Tourism Narratives from Wikipedia Brent Hecht
Nicole Starosielski
Drew Dara-Abrams
Association for the Advancement of Artificial Intelligence Fall Symposium on Intelligent Narrative Technologies (AAAI-INT) 2007 We present a narrative theory-based approach to data mining that generates cohesive stories from a Wikipedia corpus. This approach is based on a data mining-friendly view of narrative derived from narratology, and uses a prototype mining algorithm that implements this view. Our initial test case and focus is that of field-based educational tour narrative generation, for which we have successfully implemented a proof-of-concept system called Minotour. This system operates on a client-server model, in which the server mines a Wikipedia database dump to generate narratives between any two spatial features that have associated Wikipedia articles. The server then delivers those narratives to mobile device clients. 0 0
Generating educational tourism narratives from wikipedia Brent Hecht
Starosielski N.
Dara-Abrams D.
AAAI Fall Symposium - Technical Report English 2007 We present a narrative theory-based approach to data mining that generates cohesive stories from a Wikipedia corpus. This approach is based on a data mining-friendly view of narrative derived from narratology, and uses a prototype mining algorithm that implements this view. Our initial test case and focus is that of field-based educational tour narrative generation, for which we have successfully implemented a proof-of-concept system called Minotour. This system operates on a client-server model, in which the server mines a Wikipedia database dump to generate narratives between any two spatial features that have associated Wikipedia articles. The server then delivers those narratives to mobile device clients. 0 0
Improving text classification by using encyclopedia knowledge Pu Wang
Jian Hu
Zeng H.-J.
Long Chen
Zheng Chen
Proceedings - IEEE International Conference on Data Mining, ICDM English 2007 The exponential growth of text documents available on the Internet has created an urgent need for accurate, fast, and general purpose text classification algorithms. However, the "bag of words" representation used for these classification methods is often unsatisfactory as it ignores relationships between important terms that do not co-occur literally. In order to deal with this problem, we integrate background knowledge - in our application: Wikipedia - into the process of classifying text documents. The experimental evaluation on Reuters newsfeeds and several other corpora shows that our classification results with encyclopedia knowledge are much better than the baseline "bag of words" methods. 0 0
Relation extraction from Wikipedia using subtree mining Nguyen D.P.T.
Yutaka Matsuo
Mitsuru Ishizuka
Proceedings of the National Conference on Artificial Intelligence English 2007 The exponential growth and reliability of Wikipedia have made it a promising data source for intelligent systems. The first challenge of Wikipedia is to make the encyclopedia machine-processable. In this study, we address the problem of extracting relations among entities from Wikipedia's English articles, which in turn can serve for intelligent systems to satisfy users' information needs. Our proposed method first anchors the appearance of entities in Wikipedia articles using some heuristic rules that are supported by their encyclopedic style. Therefore, it uses neither the Named Entity Recognizer (NER) nor the Coreference Resolution tool, which are sources of errors for relation extraction. It then classifies the relationships among entity pairs using SVM with features extracted from the web structure and subtrees mined from the syntactic structure of text. The innovations behind our work are the following: a) our method makes use of Wikipedia characteristics for entity allocation and entity classification, which are essential for relation extraction; b) our algorithm extracts a core tree, which accurately reflects a relationship between a given entity pair, and subsequently identifies key features with respect to the relationship from the core tree. We demonstrate the effectiveness of our approach through evaluation of manually annotated data from actual Wikipedia articles. Copyright © 2007, Association for the Advancement of Artificial Intelligence. All rights reserved. 0 0