Text mining

From WikiPapers

Text mining is included as a keyword or extra keyword in 0 datasets, 0 tools and 45 publications.

Datasets

There are no datasets for this keyword.

Tools

There are no tools for this keyword.


Publications

Title Author(s) Published in Language Date Abstract R C
A generic framework and methodology for extracting semantics from co-occurrences Rachakonda A.R.
Srinivasa S.
Sayali Kulkarni
Srinivasan M.S.
Data and Knowledge Engineering English 2014 Extracting semantic associations from text corpora is an important problem with several applications. It is well understood that semantic associations from text can be discerned by observing patterns of co-occurrences of terms. However, much of the work in this direction has been piecemeal, addressing specific kinds of semantic associations. In this work, we propose a generic framework, using which several kinds of semantic associations can be mined. The framework comprises a co-occurrence graph of terms, along with a set of graph operators. A methodology for using this framework is also proposed, where the properties of a given semantic association can be hypothesized and tested over the framework. To show the generic nature of the proposed model, four different semantic associations are mined over a corpus comprising Wikipedia articles. The design of the proposed framework is inspired by cognitive science - specifically the interplay between semantic and episodic memory in humans. © 2014 Elsevier B.V. All rights reserved. 0 0
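The co-occurrence framework described in the abstract above can be illustrated with a small sketch (an assumption-laden toy, not the authors' implementation): it builds a weighted term co-occurrence graph from tokenized documents and applies one simple graph operator, a top-k neighbourhood query. The window size, weighting scheme and function names are illustrative choices.

from collections import defaultdict

def cooccurrence_graph(docs, window=10):
    """Build a weighted term co-occurrence graph.

    docs   : iterable of token lists
    window : co-occurrence window size (an assumed parameter)
    Returns a dict-of-dicts: graph[u][v] = co-occurrence count.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in docs:
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + window, len(tokens))):
                u, v = tokens[i], tokens[j]
                if u != v:
                    graph[u][v] += 1
                    graph[v][u] += 1
    return graph

def neighbours(graph, term, k=5):
    """A simple 'graph operator': the top-k co-occurring terms."""
    return sorted(graph[term].items(), key=lambda x: -x[1])[:k]

if __name__ == "__main__":
    docs = [["wikipedia", "is", "a", "free", "encyclopedia"],
            ["the", "free", "encyclopedia", "anyone", "can", "edit"]]
    g = cooccurrence_graph(docs, window=5)
    print(neighbours(g, "free"))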
Collective memory in Poland: A reflection in street names Radoslaw Nielek
Wawer A.
Adam Wierzbicki
Lecture Notes in Computer Science English 2014 Our article starts with an observation that street names fall into two general types: generic and historically inspired. We analyse street name distributions (of the second type) as a window to nation-level collective memory in Poland. The process of selecting street names is determined socially, as the selections reflect the symbols considered important to the nation-level society, but has strong historical motivations and determinants. In the article, we seek these relationships in the available data sources. We use Wikipedia articles to match street names with their textual descriptions and assign them to time points. We then apply selected text mining and statistical techniques to reach quantitative conclusions. We also present a case study: the geographical distribution of two particular street names in Poland, to demonstrate the binding between history and the political orientation of regions. 0 0
Towards linking libraries and Wikipedia: Automatic subject indexing of library records with Wikipedia concepts Joorabchi A.
Mahdi A.E.
Journal of Information Science English 2014 In this article, we first argue the importance and timely need of linking libraries and Wikipedia for improving the quality of their services to information consumers, as such linkage will enrich the quality of Wikipedia articles and at the same time increase the visibility of library resources which are currently overlooked to a large degree. We then describe the development of an automatic system for subject indexing of library metadata records with Wikipedia concepts as an important step towards library-Wikipedia integration. The proposed system is based on first identifying all Wikipedia concepts occurring in the metadata elements of library records. This is then followed by training and deploying generic machine learning algorithms to automatically select those concepts which most accurately reflect the core subjects of the library materials whose records are being indexed. We have assessed the performance of the developed system using standard information retrieval measures of precision, recall and F-score on a dataset consisting of 100 library metadata records manually indexed with a total of 469 Wikipedia concepts. The evaluation results show that the developed system is capable of achieving an averaged F-score as high as 0.92. 0 0
A method for recommending the most appropriate expansion of acronyms using wikipedia Choi D.
Shin J.
Lee E.
Kim P.
Proceedings - 7th International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing, IMIS 2013 English 2013 Over the years, many researchers have studied how to detect expansions of acronyms in texts using linguistic and syntactic approaches in order to overcome disambiguation problems. An acronym is an abbreviation composed of the initial components of single or multiple words. These initial components cause serious mistakes when a machine tries to derive meaning from a given text. Detecting expansions of acronyms is not a big issue nowadays; the problem is polysemous acronyms. In order to solve this problem, this paper proposes a method to recommend the most related expansion of an acronym by analyzing co-occurring words using Wikipedia. Our goal is not finding acronym definitions or expansions but recommending the most appropriate expansion of a given acronym. 0 0
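A minimal sketch of the kind of co-occurrence-based ranking the abstract above describes, assuming the caller supplies each candidate expansion together with descriptive text (for example, the lead of its Wikipedia article); the overlap scoring and helper names are illustrative, not the paper's method.

import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def recommend_expansion(context, candidates):
    """Rank candidate expansions of an acronym by co-occurrence overlap.

    context    : text surrounding the acronym mention
    candidates : dict mapping expansion -> descriptive text
                 (e.g. the lead of its Wikipedia article; supplied by caller)
    Returns candidates sorted by overlap score, best first.
    """
    ctx = Counter(tokenize(context))
    scores = {}
    for expansion, description in candidates.items():
        desc = Counter(tokenize(description))
        # score = number of shared context words, weighted by frequency
        scores[expansion] = sum(min(ctx[w], desc[w]) for w in ctx if w in desc)
    return sorted(scores.items(), key=lambda x: -x[1])

if __name__ == "__main__":
    context = "The ACL conference publishes research on parsing and translation."
    candidates = {
        "Association for Computational Linguistics":
            "computational linguistics natural language processing parsing translation",
        "Anterior cruciate ligament":
            "knee ligament injury surgery sports medicine",
    }
    print(recommend_expansion(context, candidates))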
Automatic keyphrase annotation of scientific documents using Wikipedia and genetic algorithms Joorabchi A.
Mahdi A.E.
Journal of Information Science English 2013 Topical annotation of documents with keyphrases is a proven method for revealing the subject of scientific and research documents to both human readers and information retrieval systems. This article describes a machine learning-based keyphrase annotation method for scientific documents that utilizes Wikipedia as a thesaurus for candidate selection from documents' content. We have devised a set of 20 statistical, positional and semantical features for candidate phrases to capture and reflect various properties of those candidates that have the highest keyphraseness probability. We first introduce a simple unsupervised method for ranking and filtering the most probable keyphrases, and then evolve it into a novel supervised method using genetic algorithms. We have evaluated the performance of both methods on a third-party dataset of research papers. Reported experimental results show that the performance of our proposed methods, measured in terms of consistency with human annotators, is on a par with that achieved by humans and outperforms rival supervised and unsupervised methods. 0 0
Discovering stakeholders' interests in Wiki-based architectural documentation Nicoletti M.
Diaz-Pace J.A.
Schiaffino S.
CIbSE 2013: 16th Ibero-American Conference on Software Engineering - Memorias de la 16th Conferencia Iberoamericana de Ingenieria de Software, CIbSE 2013 English 2013 The Software Architecture Document (SAD) is an important artifact in the early stages of software development, as it serves to share and discuss key design and quality-attribute concerns among the stakeholders of the project. Nowadays, architectural documentation is commonly hosted in Wikis in order to favor communication and interactions among stakeholders. However, the SAD is still a large and complex document, in which stakeholders often have difficulties in finding information that is relevant to their interests or daily tasks. We argue that the discovery of stakeholders' interests is helpful to tackle this information overload problem, because a recommendation tool can leverage on those interests to provide each stakeholder with SAD sections that match his/her profile. In this work, we propose an approach to infer stakeholders' interests, based on applying a combination of Natural Language Processing and User Profiling techniques. The interests are partially inferred by monitoring the stakeholders' behavior as they browse a Wiki-based SAD. A preliminary evaluation of our approach has shown its potential for making recommendations to stakeholders with different profiles and support them in architectural tasks. 0 0
Document analytics through entity resolution Santos J.
Martins B.
Batista D.S.
Lecture Notes in Computer Science English 2013 We present a prototype system for resolving named entities, mentioned in textual documents, into the corresponding Wikipedia entities. This prototype can aid in document analysis, by using the disambiguated references to provide useful information in context. 0 0
Extraction of biographical data from Wikipedia Viseur R. DATA 2013 - Proceedings of the 2nd International Conference on Data Technologies and Applications English 2013 Using the content of Wikipedia articles is common in academic research. However the practicalities are rarely analysed. Our research focuses on extracting biographical information about personalities from Belgium. Our research is divided into three sections. The first section describes the state of the art for data extraction from Wikipedia. A second section presents the case study about data extraction for biographies of Belgian personalities. Different solutions are discussed and the solution adopted is implemented. In the third section, the quality of the extraction is discussed. Practical recommendations for researchers wishing to use Wikipedia are also proposed on the basis of our case study. 0 0
Term extraction from sparse, ungrammatical domain-specific documents Ittoo A.
Gosse Bouma
Expert Systems with Applications English 2013 Existing term extraction systems have predominantly targeted large and well-written document collections, which provide reliable statistical and linguistic evidence to support term extraction. In this article, we address the term extraction challenges posed by sparse, ungrammatical texts with domain-specific contents, such as customer complaint emails and engineers' repair notes. To this end, we present ExtTerm, a novel term extraction system. Specifically, as our core innovations, we accurately detect rare (low frequency) terms, overcoming the issue of data sparsity. These rare terms may denote critical events, but they are often missed by extant TE systems. ExtTerm also precisely detects multi-word terms of arbitrary length, e.g. with more than 2 words. This is achieved by exploiting fundamental theoretical notions underlying term formation, and by developing a technique to compute the collocation strength between any number of words. Thus, we address the limitation of existing TE systems, which are primarily designed to identify terms with 2 words. Furthermore, we show that open-domain (general) resources, such as Wikipedia, can be exploited to support domain-specific term extraction. Thus, they can be used to compensate for the unavailability of domain-specific knowledge resources. Our experimental evaluations reveal that ExtTerm outperforms a state-of-the-art baseline in extracting terms from a domain-specific, sparse and ungrammatical real-life text collection. © 2012 Elsevier B.V. All rights reserved. 0 0
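The collocation-strength idea mentioned in the abstract above can be sketched as a pointwise-mutual-information style score generalized to n words. This is an illustration of the general notion, not ExtTerm's exact measure, and all counts are assumed to be supplied by the caller.

import math
from collections import Counter

def collocation_strength(ngram, unigram_counts, ngram_counts, total_tokens):
    """PMI-style collocation score for an n-gram of any length.

    ngram          : tuple of words
    unigram_counts : Counter of single-word frequencies
    ngram_counts   : Counter of observed n-gram frequencies
    total_tokens   : corpus size used to turn counts into probabilities
    Compares the observed n-gram probability with the probability expected
    if the words were independent.
    """
    p_ngram = ngram_counts[ngram] / total_tokens
    if p_ngram == 0:
        return float("-inf")
    p_independent = 1.0
    for w in ngram:
        p_independent *= unigram_counts[w] / total_tokens
    return math.log(p_ngram / p_independent)

if __name__ == "__main__":
    unigrams = Counter({"fuel": 40, "injection": 25, "pump": 30, "the": 900})
    ngrams = Counter({("fuel", "injection", "pump"): 12})
    print(collocation_strength(("fuel", "injection", "pump"),
                               unigrams, ngrams, total_tokens=10000))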
Wiki3C: Exploiting wikipedia for context-aware concept categorization Jiang P.
Hou H.
Long Chen
Shun-ling Chen
Conglei Yao
Chenliang Li
Wang M.
WSDM 2013 - Proceedings of the 6th ACM International Conference on Web Search and Data Mining English 2013 Wikipedia is an important human generated knowledge base containing over 21 million articles organized by millions of categories. In this paper, we exploit Wikipedia for a new task of text mining: Context-aware Concept Categorization. In the task, we focus on categorizing concepts according to their context. We exploit article link feature and category structure in Wikipedia, followed by introducing Wiki3C, an unsupervised and domain independent concept categorization approach based on context. In the approach, we investigate two strategies to select and filter Wikipedia articles for the category representation. Besides, a probabilistic model is employed to compute the semantic relatedness between two concepts in Wikipedia. Experimental evaluation using manually labeled ground truth shows that our proposed Wiki3C can achieve a noticeable improvement over the baselines without considering contextual information. 0 0
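Wiki3C, as described above, uses its own probabilistic model over Wikipedia's article links and category structure. As a stand-in illustration of link-based relatedness between two Wikipedia concepts, the sketch below computes the widely used Milne-Witten style measure from in-link sets; the in-link sets and article count are assumed inputs, and this is not the paper's model.

import math

def link_relatedness(inlinks_a, inlinks_b, n_articles):
    """Milne-Witten style relatedness from sets of Wikipedia in-links.

    inlinks_a, inlinks_b : sets of article ids linking to concepts a and b
    n_articles           : total number of Wikipedia articles
    Returns a value in [0, 1]; higher means more related.
    """
    overlap = len(inlinks_a & inlinks_b)
    if overlap == 0:
        return 0.0
    num = math.log(max(len(inlinks_a), len(inlinks_b))) - math.log(overlap)
    den = math.log(n_articles) - math.log(min(len(inlinks_a), len(inlinks_b)))
    return max(0.0, 1.0 - num / den)

if __name__ == "__main__":
    a = set(range(0, 400))        # toy in-link sets
    b = set(range(300, 650))
    print(round(link_relatedness(a, b, n_articles=21_000_000), 3))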
Automatic Document Topic Identification using Wikipedia Hierarchical Ontology Hassan M.M.
Fakhri Karray
Kamel M.S.
2012 11th International Conference on Information Science, Signal Processing and their Applications, ISSPA 2012 English 2012 The rapid growth in the number of documents available to end users from around the world has led to a greatly-increased need for machine understanding of their topics, as well as for automatic grouping of related documents. This constitutes one of the main current challenges in text mining. In this work, a novel technique is proposed, to automatically construct a background knowledge structure in the form of a hierarchical ontology, using one of the largest online knowledge repositories: Wikipedia. Then, a novel approach is presented to automatically identify the documents' topics based on the proposed Wikipedia Hierarchical Ontology (WHO). Results show that the proposed model is efficient in identifying documents' topics, and promising, as it outperforms the accuracy of the other conventional algorithms for document clustering. 0 0
Automatic subject metadata generation for scientific documents using wikipedia and genetic algorithms Joorabchi A.
Mahdi A.E.
Lecture Notes in Computer Science English 2012 Topical annotation of documents with keyphrases is a proven method for revealing the subject of scientific and research documents. However, scientific documents that are manually annotated with keyphrases are in the minority. This paper describes a machine learning-based automatic keyphrase annotation method for scientific documents, which utilizes Wikipedia as a thesaurus for candidate selection from documents' content and deploys genetic algorithms to learn a model for ranking and filtering the most probable keyphrases. Reported experimental results show that the performance of our method, evaluated in terms of inter-consistency with human annotators, is on a par with that achieved by humans and outperforms rival supervised methods. 0 0
BiCWS: Mining cognitive differences from bilingual web search results Xiangji Huang
Wan X.
Jie Xiao
Lecture Notes in Computer Science English 2012 In this paper we propose a novel comparative web search system - BiCWS, which can mine cognitive differences from web search results in a multi-language setting. Given a topic represented by two queries (they are the translations of each other) in two languages, the corresponding web search results for the two queries are firstly retrieved by using a general web search engine, and then the bilingual facets for the topic are mined by using a bilingual search results clustering algorithm. The semantics in Wikipedia are leveraged to improve the bilingual clustering performance. After that, the semantic distributions of the search results over the mined facets are visually presented, which can reflect the cognitive differences in the bilingual communities. Experimental results show the effectiveness of our proposed system. 0 0
Collaboratively constructed knowledge repositories as a resource for domain independent concept extraction Kerschbaumer J.
Reichhold M.
Winkler C.
Fliedl G.
Proceedings of the 10th Terminology and Knowledge Engineering Conference: New Frontiers in the Constructive Symbiosis of Terminology and Knowledge Engineering, TKE 2012 English 2012 To achieve a domain independent text management, a flexible and adaptive knowledge repository is indispensable and represents the key resource for solving many challenges in natural language processing. Especially for real world applications, the needed resources cannot be provided for technical disciplines, like engineering in the energy or the automotive domain. We therefore propose in this paper, a new approach for knowledge (concept) acquisition based on collaboratively constructed knowledge repositories like Wikipedia and enterprise Wikis. 0 0
Discovery of novel term associations in a document collection Hynonen T.
Mahler S.
Toivonen H.
Lecture Notes in Computer Science English 2012 We propose a method to mine novel, document-specific associations between terms in a collection of unstructured documents. We believe that documents are often best described by the relationships they establish. This is also evidenced by the popularity of conceptual maps, mind maps, and other similar methodologies to organize and summarize information. Our goal is to discover term relationships that can be used to construct conceptual maps or so called BisoNets. The model we propose, tpf-idf-tpu, looks for pairs of terms that are associated in an individual document. It considers three aspects, two of which have been generalized from tf-idf to term pairs: term pair frequency (tpf; importance for the document), inverse document frequency (idf; uniqueness in the collection), and term pair uncorrelation (tpu; independence of the terms). The last component is needed to filter out statistically dependent pairs that are not likely to be considered novel or interesting by the user. We present experimental results on two collections of documents: one extracted from Wikipedia, and one containing text mining articles with manually assigned term associations. The results indicate that the tpf-idf-tpu method can discover novel associations, that they are different from just taking pairs of tf-idf keywords, and that they match better the subjective associations of a reader. 0 0
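A rough rendering of the tpf-idf-tpu scoring described in the abstract above: term pair frequency within a document, inverse document frequency of the pair, and a term pair uncorrelation penalty for statistically dependent pairs. The exact formulas in the paper may differ; this sketch follows only the abstract's description.

import math
from collections import Counter
from itertools import combinations

def tpf_idf_tpu(docs):
    """Score term pairs per document.

    tpf : how often the pair occurs in the document (here: min of the counts)
    idf : rarity of the pair across the collection
    tpu : penalty for pairs that co-occur far more than independence predicts
    docs: list of token lists. Returns {doc_index: top pairs with scores}.
    """
    n_docs = len(docs)
    pair_df = Counter()          # in how many documents a pair occurs
    term_df = Counter()          # in how many documents a term occurs
    for tokens in docs:
        terms = set(tokens)
        term_df.update(terms)
        pair_df.update(frozenset(p) for p in combinations(sorted(terms), 2))

    results = {}
    for d, tokens in enumerate(docs):
        counts = Counter(tokens)
        scored = []
        for a, b in combinations(sorted(set(tokens)), 2):
            pair = frozenset((a, b))
            tpf = min(counts[a], counts[b])                          # pair frequency
            idf = math.log(n_docs / pair_df[pair])                   # pair rarity
            expected = term_df[a] * term_df[b] / n_docs              # if independent
            tpu = 1.0 / (1.0 + pair_df[pair] / max(expected, 1e-9))  # uncorrelation
            scored.append(((a, b), tpf * idf * tpu))
        results[d] = sorted(scored, key=lambda x: -x[1])[:5]
    return results

if __name__ == "__main__":
    docs = [["graph", "mining", "graph", "pattern"],
            ["text", "mining", "wikipedia", "corpus"],
            ["graph", "wikipedia", "link", "analysis"]]
    for d, pairs in tpf_idf_tpu(docs).items():
        print(d, pairs)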
Explanatory semantic relatedness and explicit spatialization for exploratory search Brent Hecht
Carton S.H.
Mahmood Quaderi
Johannes Schoning
Raubal M.
Darren Gergle
Doug Downey
SIGIR'12 - Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval English 2012 Exploratory search, in which a user investigates complex concepts, is cumbersome with today's search engines. We present a new exploratory search approach that generates interactive visualizations of query concepts using thematic cartography (e.g. choropleth maps, heat maps). We show how the approach can be applied broadly across both geographic and non-geographic contexts through explicit spatialization, a novel method that leverages any figure or diagram - from a periodic table, to a parliamentary seating chart, to a world map - as a spatial search environment. We enable this capability by introducing explanatory semantic relatedness measures. These measures extend frequently-used semantic relatedness measures to not only estimate the degree of relatedness between two concepts, but also generate human-readable explanations for their estimates by mining Wikipedia's text, hyperlinks, and category structure. We implement our approach in a system called Atlasify, evaluate its key components, and present several use cases. 0 0
Happy or not: Generating topic-based emotional heatmaps for culturomics using CyberGIS Shook E.
Kalev Leetaru
Cao G.
Padmanabhan A.
Se Wang
2012 IEEE 8th International Conference on E-Science, e-Science 2012 English 2012 The field of Culturomics exploits "big data" to explore human society at population scale. Culturomics increasingly needs to consider geographic contexts and, thus, this research develops a geospatial visual analytical approach that transforms vast amounts of textual data into emotional heatmaps with fine-grained spatial resolution. Fulltext geocoding and sentiment mining extract locations and latent "tone" from text-based data, which are combined with spatial analysis methods - kernel density estimation and spatial interpolation - to generate heatmaps that capture the interplay of location, topic, and tone toward narrative impacts. To demonstrate the effectiveness of the approach, the complete English edition of Wikipedia is processed using a supercomputer to extract all locations and tone associated with the year of 2003. An emotional heatmap of Wikipedia's discussion of "armed conflict" for that year is created using the spatial analysis methods. Unlike previous research, our approach is designed for exploratory spatial analysis of topics in text archives by incorporating multiple attributes including the prominence of each location mentioned in the text, the density of a topic at each location compared to other topics, and the tone of the topics of interest into a single analysis. The generation of such fine-grained emotional heatmaps is computationally intensive particularly when accounting for the multiple attributes at fine scales. Therefore a CyberGIS platform based on national cyberinfrastructure in the United States is used to enable the computationally intensive visual analytics. 0 0
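The tone-weighted heatmap step above can be sketched as a kernel density estimate over geocoded, sentiment-scored mentions. The sketch below uses a Gaussian kernel on a coarse one-degree grid with an assumed bandwidth; it is far simpler than the CyberGIS pipeline the paper describes.

import math

def tone_heatmap(points, bandwidth=2.0):
    """Tone-weighted Gaussian kernel density over a 1-degree lat/lon grid.

    points    : list of (lat, lon, tone) tuples, e.g. from fulltext
                geocoding and sentiment mining of an archive
    bandwidth : kernel bandwidth in degrees (assumed)
    Returns {(lat, lon): accumulated tone density}.
    """
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    heat = {}
    for glat in range(int(min(lats)) - 3, int(max(lats)) + 4):
        for glon in range(int(min(lons)) - 3, int(max(lons)) + 4):
            value = 0.0
            for lat, lon, tone in points:
                d2 = (glat - lat) ** 2 + (glon - lon) ** 2
                value += tone * math.exp(-d2 / (2 * bandwidth ** 2))
            heat[(glat, glon)] = value
    return heat

if __name__ == "__main__":
    mentions = [(48.8, 2.3, -0.6), (50.1, 8.7, -0.2), (40.7, -74.0, 0.4)]
    hm = tone_heatmap(mentions)
    print(max(hm.items(), key=lambda kv: abs(kv[1])))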
Monitoring propagations in the blogosphere for viral marketing Meihui Chen
Rubens N.
Anma F.
Okamoto T.
Journal of Emerging Technologies in Web Intelligence English 2012 Even though blog contents vary a lot in quality, the disclosure of personal opinions and the huge blogging population always attract marketers' attention to blog information. In this paper, we investigate how marketers can identify the degree of information propagation among blog communities. Topic similarity, relatedness, and word repetition between leaders' and followers' writing products are considered as the propagated information. The contribution of this paper is twofold. The work presented here shows how blog content can be economically and feasibly analyzed using existing internet sources such as the Wikipedia database and page returns from a Japanese search engine. To this extent, this system, which combines in-link algorithms and text mining analyses, tracing propagation channels and propagatable information, allows analyzing the power of influence in viral marketing. We demonstrate the effectiveness of the system by applying blogger identification, topic identification, and topic propagation. 0 0
Omnipedia: Bridging the Wikipedia Language Gap Patti Bao
Brent Hecht
Samuel Carton
Mahmood Quaderi
Michael Horn
Darren Gergle
International Conference on Human Factors in Computing Systems English 2012 We present Omnipedia, a system that allows Wikipedia readers to gain insight from up to 25 language editions of Wikipedia simultaneously. Omnipedia highlights the similarities and differences that exist among Wikipedia language editions, and makes salient information that is unique to each language as well as that which is shared more widely. We detail solutions to numerous front-end and algorithmic challenges inherent to providing users with a multilingual Wikipedia experience. These include visualizing content in a language-neutral way and aligning data in the face of diverse information organization strategies. We present a study of Omnipedia that characterizes how people interact with information using a multilingual lens. We found that users actively sought information exclusive to unfamiliar language editions and strategically compared how language editions defined concepts. Finally, we briefly discuss how Omnipedia generalizes to other domains facing language barriers. 0 0
Self-organization with additional learning based on category mapping and its application to dynamic news clustering Toyota T.
Nobuhara H.
IEEJ Transactions on Electronics, Information and Systems Japanese; English 2012 Internet news texts come from various fields; therefore, when text data are added that rapidly increase the number of dimensions of the feature vectors of a Self-Organizing Map (SOM), the results cannot be reflected in learning. Furthermore, it is difficult for users to interpret the learning results because SOM cannot produce label information for each cluster. In order to solve these problems, we propose SOM with additional learning and category mapping based on the category structure of Wikipedia. In this method, an input vector is generated from each text and the corresponding Wikipedia categories extracted from Wikipedia articles. Input vectors are formed in the common category, taking the hierarchical structure of the Wikipedia category system into consideration. By using the proposed method, the problem of reconfiguration of vector elements caused by dynamic changes in the text can be solved. Moreover, information loss in newly obtained index terms can be prevented. 0 0
TCSST: Transfer classification of short & sparse text using external data Long G.
Long Chen
Zhu X.
Zhang C.
ACM International Conference Proceeding Series English 2012 Short & sparse text is becoming more prevalent on the web, such as search snippets, micro-blogs and product reviews. Accurately classifying short & sparse text has emerged as an important while challenging task. Existing work has considered utilizing external data (e.g. Wikipedia) to alleviate data sparseness, by appending topics detected from external data as new features. However, training a classifier on features concatenated from different spaces is not easy considering the features have different physical meanings and different significance to the classification task. Moreover, it exacerbates the "curse of dimensionality" problem. In this study, we propose a transfer classification method, TCSST, to exploit the external data to tackle the data sparsity issue. The transfer classifier will be learned in the original feature space. Considering that the labels of the external data may not be readily available or sufficiently enough, TCSST further exploits the unlabeled external data to aid the transfer classification. We develop novel strategies to allow TCSST to iteratively select high quality unlabeled external data to help with the classification. We evaluate the performance of TCSST on both benchmark as well as real-world data sets. Our experimental results demonstrate that the proposed method is effective in classifying very short & sparse text, consistently outperforming existing and baseline methods. 0 0
WikiSent: Weakly supervised sentiment analysis through extractive summarization with Wikipedia Saswati Mukherjee
Prantik Bhattacharyya
Lecture Notes in Computer Science English 2012 This paper describes a weakly supervised system for sentiment analysis in the movie review domain. The objective is to classify a movie review into a polarity class, positive or negative, based on those sentences bearing opinion on the movie alone, leaving out other irrelevant text. Wikipedia incorporates the world knowledge of movie-specific features in the system which is used to obtain an extractive summary of the review, consisting of the reviewer's opinions about the specific aspects of the movie. This filters out the concepts which are irrelevant or objective with respect to the given movie. The proposed system, WikiSent, does not require any labeled data for training. It achieves a better or comparable accuracy to the existing semi-supervised and unsupervised systems in the domain, on the same dataset. We also perform a general movie review trend analysis using WikiSent. 0 0
A novel approach to sentence alignment from comparable corpora Li M.-H.
Vitaly Klyuev
Wu S.-H.
Proceedings of the 6th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, IDAACS'2011 English 2011 This paper introduces a new technique to select candidate sentences for alignment from bilingual comparable corpora. Tests were done utilizing Wikipedia as a source for bilingual data. Our test languages are English and Chinese. A high quality of sentence alignment is illustrated by a machine translation application. 0 0
A statistical approach for automatic keyphrase extraction Abulaish M.
Jahiruddin
Dey L.
Proceedings of the 5th Indian International Conference on Artificial Intelligence, IICAI 2011 English 2011 Due to the availability of voluminous textual data on the World Wide Web and in textual databases, automatic keyphrase extraction has gained increasing popularity in the recent past as a means to summarize and characterize text documents. Consequently, a number of machine learning techniques, mostly supervised, have been proposed to mine keyphrases in an automatic way. However, the non-availability of annotated corpora for training such systems is the main hindrance to their success. In this paper, we propose the design of an automatic keyphrase extraction system which uses NLP and statistical approaches to mine keyphrases from unstructured text documents. The efficacy of the proposed system is established over texts crawled from the Wikipedia server. On evaluation we found that the proposed method outperforms KEA, which uses a naïve Bayes classification technique for keyphrase extraction. 0 0
Combining multiple disambiguation methods for gene mention normalization Xia N.
Hong Lin
Zhenglu Yang
Yanyan Li
Expert Systems with Applications English 2011 The rapid growth of biomedical literature prompts pervasive concentrations of biomedical text mining community to explore methodology for accessing and managing this ever-increasing knowledge. One important task of text mining in biomedical literature is gene mention normalization which recognizes the biomedical entities in biomedical texts and maps each gene mention discussed in the text to unique organic database identifiers. In this work, we employ an information retrieval based method which extracts gene mention's semantic profile from PubMed abstracts for gene mention disambiguation. This disambiguation method focuses on generating a more comprehensive representation of gene mention rather than the organic clues such as gene ontology which has fewer co-occurrences with the gene mention. Furthermore, we use an existing biomedical resource as another disambiguation method. Then we extract features from gene mention detection system's outcome to build a false positive filter according to Wikipedia's retrieved documents. Our system achieved F-measure of 83.1% on BioCreative II GN test data. © 2011 Elsevier Ltd. All rights reserved. 0 0
Editing knowledge resources: The wiki way Francesco Ronzano
Andrea Marchetti
Maurizio Tesconi
International Conference on Information and Knowledge Management, Proceedings English 2011 The creation, customization, and maintenance of knowledge resources are essential for fostering the full deployment of Language Technologies. The definition and refinement of knowledge resources are time- and resource-consuming activities. In this paper we explore how the Wiki paradigm for online collaborative content editing can be exploited to gather massive social contributions from common Web users in editing knowledge resources. We discuss the Wikyoto Knowledge Editor, also called Wikyoto. Wikyoto is a collaborative Web environment that enables users with no knowledge engineering background to edit the multilingual network of knowledge resources exploited by KYOTO, a cross-lingual text mining system developed in the context of the KYOTO European Project. 0 0
Extracting multi-dimensional relations: A generative model of groups of entities in a corpus Au Yeung C.-M.
Iwata T.
International Conference on Information and Knowledge Management, Proceedings English 2011 Extracting relations among different entities from various data sources has been an important topic in data mining. While many methods focus only on a single type of relations, real world entities maintain relations that contain much richer information. We propose a hierarchical Bayesian model for extracting multi-dimensional relations among entities from a text corpus. Using data from Wikipedia, we show that our model can accurately predict the relevance of an entity given the topic of the document as well as the set of entities that are already mentioned in that document. 0 0
Graph-based named entity linking with Wikipedia Ben Hachey
Will Radford
Curran J.R.
Lecture Notes in Computer Science English 2011 Named entity linking (NEL) grounds entity mentions to their corresponding Wikipedia article. State-of-the-art supervised NEL systems use features over the rich Wikipedia document and link-graph structure. Graph-based measures have been effective over WordNet for word sense disambiguation (WSD). We draw parallels between NEL and WSD, motivating our unsupervised NEL approach that exploits the Wikipedia article and category link graphs. Our system achieves 85.5% accuracy on the TAC 2010 shared task - competitive with the best supervised and unsupervised systems. 0 0
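A toy illustration of unsupervised, graph-based disambiguation in the spirit of the abstract above: each ambiguous mention is resolved to the candidate article whose in-link set overlaps most with already-resolved context articles. The in-link data structure and the overlap score are assumptions for illustration, not the paper's actual measures.

def disambiguate(mention_candidates, context_articles, inlinks):
    """Pick, for each mention, the candidate article best connected to the context.

    mention_candidates : dict mention -> list of candidate article titles
    context_articles   : articles already resolved / unambiguous in the text
    inlinks            : dict article title -> set of in-linking article ids
    """
    resolved = {}
    for mention, candidates in mention_candidates.items():
        def score(cand):
            links = inlinks.get(cand, set())
            return sum(len(links & inlinks.get(c, set()))
                       for c in context_articles)
        resolved[mention] = max(candidates, key=score) if candidates else None
    return resolved

if __name__ == "__main__":
    inlinks = {
        "David Murray (saxophonist)": {1, 2, 3, 9},
        "David Murray (footballer)": {4, 5},
        "Jazz": {1, 2, 3, 7, 9},
    }
    print(disambiguate({"David Murray": ["David Murray (saxophonist)",
                                         "David Murray (footballer)"]},
                       context_articles=["Jazz"],
                       inlinks=inlinks))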
High-order co-clustering text data on semantics-based representation model Liping Jing
Jiali Yun
Jian Yu
Jiao-Sheng Huang
Lecture Notes in Computer Science English 2011 The language modeling approach is widely used to improve the performance of text mining in recent years because of its solid theoretical foundation and empirical effectiveness. In essence, this approach centers on the issue of estimating an accurate model by choosing appropriate language models as well as smooth techniques. Semantic smoothing, which incorporates semantic and contextual information into the language models, is effective and potentially significant to improve the performance of text mining. In this paper, we proposed a high-order structure to represent text data by incorporating background knowledge, Wikipedia. The proposed structure consists of three types of objects, term, document and concept. Moreover, we firstly combined the high-order co-clustering algorithm with the proposed model to simultaneously cluster documents, terms and concepts. Experimental results on benchmark data sets (20Newsgroups and Reuters-21578) have shown that our proposed high-order co-clustering on high-order structure outperforms the general co-clustering algorithm on bipartite text data, such as document-term, document-concept and document-(term+concept). 0 0
No free lunch: Brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity Ture F.
Elsayed T.
Lin J.
SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval English 2011 This work explores the problem of cross-lingual pairwise similarity, where the task is to extract similar pairs of documents across two different languages. Solutions to this problem are of general interest for text mining in the multilingual context and have specific applications in statistical machine translation. Our approach takes advantage of cross-language information retrieval (CLIR) techniques to project feature vectors from one language into another, and then uses locality-sensitive hashing (LSH) to extract similar pairs. We show that effective cross-lingual pairwise similarity requires working with similarity thresholds that are much lower than in typical monolingual applications, making the problem quite challenging. We present a parallel, scalable MapReduce implementation of the sort-based sliding window algorithm, which is compared to a brute-force approach on German and English Wikipedia collections. Our central finding can be summarized as "no free lunch": there is no single optimal solution. Instead, we characterize effectiveness-efficiency tradeoffs in the solution space, which can guide the developer to locate a desirable operating point based on application- and resource-specific constraints. 0 0
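A single-machine sketch of the signed-random-projection (LSH) signatures plus sort-based sliding window idea discussed above. It assumes the document vectors have already been projected into a common language space by CLIR, and the signature length and window size are arbitrary illustrative parameters.

import random

def signature(vector, planes):
    """Signed-random-projection LSH signature of a sparse vector
    (dict term -> weight) as a bit string."""
    bits = []
    for plane in planes:
        s = sum(w * plane.get(t, 0.0) for t, w in vector.items())
        bits.append("1" if s >= 0 else "0")
    return "".join(bits)

def similar_pairs(vectors, n_bits=32, window=1, seed=0):
    """Find candidate similar pairs with a sort-based sliding window over
    LSH signatures (a simplified sketch of the idea, not the paper's
    MapReduce implementation).

    vectors : dict doc_id -> sparse vector (term -> weight), assumed to be
              already projected into one language space by CLIR
    """
    rng = random.Random(seed)
    vocab = {t for v in vectors.values() for t in v}
    planes = [{t: rng.gauss(0, 1) for t in vocab} for _ in range(n_bits)]
    sigs = sorted((signature(v, planes), doc_id) for doc_id, v in vectors.items())
    pairs = set()
    for i, (_, a) in enumerate(sigs):
        for _, b in sigs[i + 1:i + 1 + window]:
            pairs.add(tuple(sorted((a, b))))
    return pairs

if __name__ == "__main__":
    docs = {"en1": {"berlin": 0.9, "wall": 0.7},
            "de1": {"berlin": 0.8, "wall": 0.6, "mauer": 0.1},
            "en2": {"football": 1.0, "league": 0.5}}
    print(similar_pairs(docs, n_bits=16, window=1))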
Ontology enhancement and concept granularity learning: Keeping yourself current and adaptive Jiang S.
Bing L.
Sun B.
YanChun Zhang
Lam W.
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining English 2011 As a well-known semantic repository, WordNet is widely used in many applications. However, due to costly edit and maintenance, WordNet's capability of keeping up with the emergence of new concepts is poor compared with on-line encyclopedias such as Wikipedia. To keep WordNet current with folk wisdom, we propose a method to enhance WordNet automatically by merging Wikipedia entities into WordNet, and construct an enriched ontology, named as WorkiNet. WorkiNet keeps the desirable structure of WordNet. At the same time, it captures abundant information from Wikipedia. We also propose a learning approach which is able to generate a tailor-made semantic concept collection for a given document collection. The learning process takes the characteristics of the given document collection into consideration and the semantic concepts in the tailor-made collection can be used as new features for document representation. The experimental results show that the adaptively generated feature space can outperform a static one significantly in text mining tasks, and WorkiNet dominates WordNet most of the time due to its high coverage. Copyright 2011 ACM. 1 0
Simultaneous joint and conditional modeling of documents tagged from two perspectives Das P.
Srihari R.
Fu Y.
International Conference on Information and Knowledge Management, Proceedings English 2011 This paper explores correspondence and mixture topic modeling of documents tagged from two different perspectives. There has been ongoing work in topic modeling of documents with tags (tag-topic models) where words and tags typically reflect a single perspective, namely document content. However, words in documents can also be tagged from different perspectives, for example, syntactic perspective as in part-of-speech tagging or an opinion perspective as in sentiment tagging. The models proposed in this paper are novel in: (i) the consideration of two different tag perspectives - a document level tag perspective that is relevant to the document as a whole and a word level tag perspective pertaining to each word in the document; (ii) the attribution of latent topics with word level tags and labeling latent topics with images in case of multimedia documents; and (iii) discovering the possible correspondence of the words to document level tags. The proposed correspondence tag-topic model shows better predictive power i.e. higher likelihood on heldout test data than all existing tag topic models and even a supervised topic model. To evaluate the models in practical scenarios, quantitative measures between the outputs of the proposed models and the ground truth domain knowledge have been explored. Manually assigned (gold standard) document category labels in Wikipedia pages are used to validate model-generated tag suggestions using a measure of pairwise concept similarity within an ontological hierarchy like WordNet. Using a news corpus, automatic relationship discovery between person names was performed and compared to a robust baseline. 0 0
Text clustering using a multiset model Takumi S.
Miyamoto S.
Proceedings - 2011 IEEE International Conference on Granular Computing, GrC 2011 English 2011 The aim of this paper is to study methods of agglomerative hierarchical clustering based on the bag-of-words model, with text mining applications. In particular, a multiset theoretical model is used, and an asymmetric similarity measure is studied in addition to two symmetric similarities. The dendrogram that is the output of hierarchical clustering often has reversals; when a reversal occurs, obtaining clusters from the dendrogram becomes difficult. We therefore show the condition under which dendrograms have no reversals, and prove that the proposed methods produce dendrograms without reversals. Examples based on Twitter and Wikipedia data show how the methods work. 0 0
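One plausible instance of an asymmetric similarity on multisets (bags of words) of the kind studied above: the intersection size of two bags divided by the size of the first bag, which in general differs depending on direction. This is an illustrative choice, not necessarily the measure used in the paper.

from collections import Counter

def multiset_intersection(a, b):
    """Multiset (bag-of-words) intersection size: sum of minimum counts."""
    return sum(min(c, b[t]) for t, c in a.items())

def asymmetric_similarity(a, b):
    """s(a -> b) = |a ∩ b| / |a| on multisets; generally s(a, b) != s(b, a)."""
    size_a = sum(a.values())
    return multiset_intersection(a, b) / size_a if size_a else 0.0

if __name__ == "__main__":
    a = Counter("wikipedia text mining text".split())
    b = Counter("text clustering".split())
    print(asymmetric_similarity(a, b), asymmetric_similarity(b, a))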
Enhancing Short Text Clustering with Small External Repositories Petersen H.
Poon J.
Conferences in Research and Practice in Information Technology Series English 2010 The automatic clustering of textual data according to their semantic concepts is a challenging, yet important task. Choosing an appropriate method to apply when clustering text depends on the nature of the documents being analysed. For example, traditional clustering algorithms can struggle to correctly model collections of very short text due to their extremely sparse nature. In recent times, much attention has been directed to finding methods for adequately clustering short text. Many popular approaches employ large, external document repositories, such as Wikipedia or the Open Directory Project, to incorporate additional world knowledge into the clustering process. However the sheer size of many of these external collections can make these techniques difficult or time consuming to apply. This paper also employs external document collections to aid short text clustering performance. The external collections are referred to in this paper as Background Knowledge. In contrast to most previous literature a separate collection of Background Knowledge is obtained for each short text dataset. However, this Background Knowledge contains several orders of magnitude fewer documents than commonly used repositories like Wikipedia. A simple approach is described where the Background Knowledge is used to re-express short text in terms of a much richer feature space. A discussion of how best to cluster documents in this feature space is presented. A solution is proposed, and an experimental evaluation is performed that demonstrates significant improvement over clustering based on standard metrics with several publicly available datasets represented in the richer feature space. 0 0
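The re-expression step described above can be sketched by representing each short text as a vector of similarities to the Background Knowledge documents, one feature per background document; the term-frequency weighting and cosine similarity here are illustrative assumptions rather than the paper's exact scheme.

import math
import re
from collections import Counter

def tf_vector(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    dot = sum(c * v[t] for t, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def enrich(short_texts, background_docs):
    """Re-express each short text in a richer feature space: one feature per
    Background Knowledge document, valued by similarity to that document."""
    bg_vectors = [tf_vector(d) for d in background_docs]
    return [[cosine(tf_vector(t), bg) for bg in bg_vectors] for t in short_texts]

if __name__ == "__main__":
    background = ["football league match goal team",
                  "stock market shares trading price"]
    snippets = ["late goal wins the match", "shares fall as trading opens"]
    for row in enrich(snippets, background):
        print([round(x, 2) for x in row])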
TAGME: on-the-fly annotation of short text fragments (by wikipedia entities) Paolo Ferragina
Ugo Scaiella
CIKM English 2010 0 0
Text clustering via term semantic units Liping Jing
Jiali Yun
Jian Yu
Houkuan Huang
Proceedings - 2010 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2010 English 2010 How best to represent text data is an important problem in text mining tasks including information retrieval, clustering and classification. In this paper, we propose a compact document representation with term semantic units, which are identified from implicit and explicit semantic information. The implicit semantic information is extracted from syntactic content via statistical methods such as latent semantic indexing and the information bottleneck. The explicit semantic information is mined from an external semantic resource (Wikipedia). The proposed compact representation model can map a document collection into a low-dimensional space (term semantic units, which are far fewer than the number of unique terms). Experimental results on real data sets have shown that the compact representation efficiently improves the performance of text clustering. 0 0
Mining meaning from Wikipedia Olena Medelyan
David N. Milne
Catherine Legg
Ian H. Witten
International Journal of Human Computer Studies
English 2009 Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks. This article provides a comprehensive description of this work. It focuses on research that extracts and makes use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval and information extraction; and as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced. 2009 Elsevier Ltd. All rights reserved. 0 4
Named entity resolution using automatically extracted semantic information Pilz A.
Paass G.
LWA 2009 - Workshop-Woche: Lernen-Wissen-Adaptivitat - Learning, Knowledge, and Adaptivity English 2009 One major problem in text mining and semantic retrieval is that detected entity mentions have to be assigned to the true underlying entity. The ambiguity of a name results from both the polysemy and synonymy problem, as the name of a unique entity may be written in variant ways and different unique entities may have the same name. The term "bush" for instance may refer to a woody plant, a mechanical fixing, a nocturnal primate, 52 persons and 8 places covered in Wikipedia and thousands of other persons. For the first time, according to our knowledge we apply a kernel entity resolution approach to the German Wikipedia as reference for named entities. We describe the context of named entities in Wikipedia and the context of a detected name phrase in a new document by a context vector of relevant features. These are designed from automatically extracted topic indicators generated by an LDA topic model. We use kernel classifiers, e.g. rank classifiers, to determine the right matching entity but also to detect uncovered entities. In comparison to a baseline approach using only text similarity the addition of topics approach gives a much higher f-value, which is comparable to the results published for English. It turns out that the procedure also is able to detect with high reliability if a person is not covered by the Wikipedia. 0 0
Revealing hidden community structures and identifying bridges in complex networks: An application to analyzing contents of web pages for browsing Zaidi F.
Sallaberry A.
Melancon G.
Proceedings - 2009 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2009 English 2009 The emergence of scale free and small world properties in real world complex networks has stimulated lots of activity in the field of network analysis. An example of such a network comes from the field of Content Analysis (CA) and Text Mining where the goal is to analyze the contents of a set of web pages. The network can be represented by the words appearing in the web pages as nodes and the edges representing a relation between two words if they appear in a document together. In this paper we present a CA system that helps users analyze these networks representing the textual contents of a set of web pages visually. Major contributions include a methodology to cluster complex networks based on duplication of nodes and identification of bridges i.e. words that might be of user interest but have a low frequency in the document corpus. We have tested this system with a number of data sets and users have found it very useful for the exploration of data. One of the case studies is presented in detail which is based on browsing a collection of web pages on Wikipedia. 0 0
Temporal analysis of text data using latent variable models Molgaard L.L.
Larsen J.
Goutte C.
Machine Learning for Signal Processing XIX - Proceedings of the 2009 IEEE Signal Processing Society Workshop, MLSP 2009 English 2009 Detecting and tracking of temporal data is an important task in multiple applications. In this paper we study temporal text mining methods for Music Information Retrieval. We compare two ways of detecting the temporal latent semantics of a corpus extracted from Wikipedia, using a stepwise Probabilistic Latent Semantic Analysis (PLSA) approach and a global multiway PLSA method. The analysis indicates that the global analysis method is able to identify relevant trends which are difficult to get using a step-by-step approach. Furthermore we show that inspection of PLSA models with different number of factors may reveal the stability of temporal clusters making it possible to choose the relevant number of factors. 0 0
Towards a universal text classifier: Transfer learning using encyclopedic knowledge Pu Wang
Carlotta Domeniconi
ICDM Workshops 2009 - IEEE International Conference on Data Mining English 2009 Document classification is a key task for many text mining applications. However, traditional text classification requires labeled data to construct reliable and accurate classifiers. Unfortunately, labeled data are seldom available. In this work, we propose a universal text classifier, which does not require any labeled document. Our approach simulates the capability of people to classify documents based on background knowledge. As such, we build a classifier that can effectively group documents based on their content, under the guidance of few words describing the classes of interest. Background knowledge is modeled using encyclopedic knowledge, namely Wikipedia. The universal text classifier can also be used to perform document retrieval. In our experiments with real data we test the feasibility of our approach for both the classification and retrieval tasks. 0 0
Remote sensing ontology development for data interoperability Nagai M.
Ono M.
Shibasaki R.
29th Asian Conference on Remote Sensing 2008, ACRS 2008 English 2008 A remote sensing ontology is developed not only for integrating earth observation data, but also for knowledge sharing and information transfer. Ontological information is used for data sharing services such as support of metadata design, structuring of data contents, and support of text mining. The remote sensing ontology is constructed based on Semantic MediaWiki. Ontological information is added to the dictionary by digitizing text-based dictionaries, developing a "knowledge writing tool" for experts, and extracting semantic relations from authoritative documents by applying natural language processing techniques. The ontology system containing the dictionary is developed as a lexicographic ontology. The constructed ontological information is also used for a reverse dictionary. 0 0
Extracting Named Entities and Relating Them over Time Based on Wikipedia A Bhole
B Fortuna
M Grobelnik
D Mladenic
Informatica 2007 This paper presents an approach to mining information relating people, places, organizations and events extracted from Wikipedia and linking them on a time scale. The approach consists of two phases: (1) identifying relevant articles and categorizing them as containing people, places or organizations; (2) generating a timeline by linking named entities and extracting events and their time frames. We illustrate the proposed approach on 1.7 million Wikipedia articles. 0 0