Natural Language Processing

From WikiPapers

Natural Language Processing is included as a keyword or extra keyword in 1 dataset, 1 tool and 54 publications.

Datasets

Dataset: Tamil Wikipedia word list
Language(s): English, Tamil

Tools

Tool: Wikokit
Operating system(s): Cross-platform
Programming language(s): Java
License(s): EPLv1.0, LGPLv2.1, GPLv2, ALv2.0, New BSD License
Description: wikokit (wiki tool kit) - several projects related to wiki. wiwordik - machine-readable Wiktionary: a visual interface to the parsed English Wiktionary and Russian Wiktionary databases (Java WebStart application + JavaFX, English interface). 742 languages extracted from the English Wiktionary; 423 languages extracted from the Russian Wiktionary.
Image: Wiwordik-en.0.09.1094 scrollbox.jpg


Publications

Title Author(s) Published in Language Date Abstract R C
Augmenting concept definition in gloss vector semantic relatedness measure using wikipedia articles Pesaranghader A.
Rezaei A.
Lecture Notes in Electrical Engineering English 2014 Semantic relatedness measures are widely used in text mining and information retrieval applications. Considering these automated measures, in this research paper we attempt to improve Gloss Vector relatedness measure for more accurate estimation of relatedness between two given concepts. Generally, this measure, by constructing concepts definitions (Glosses) from a thesaurus, tries to find the angle between the concepts' gloss vectors for the calculation of relatedness. Nonetheless, this definition construction task is challenging as thesauruses do not provide full coverage of expressive definitions for the particularly specialized concepts. By employing Wikipedia articles and other external resources, we aim at augmenting these concepts' definitions. Applying both definition types to the biomedical domain, using MEDLINE as corpus, UMLS as the default thesaurus, and a reference standard of 68 concept pairs manually rated for relatedness, we show exploiting available resources on the Web would have positive impact on final measurement of semantic relatedness. 0 0
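For readers unfamiliar with the Gloss Vector idea summarized above, the core computation is the cosine of the angle between two concepts' definition vectors. A minimal generic sketch, not the authors' implementation (the glosses and tokenization below are placeholder choices):

```python
from collections import Counter
from math import sqrt

def gloss_vector(gloss: str) -> Counter:
    """Build a simple bag-of-words vector from a concept definition (gloss)."""
    return Counter(gloss.lower().split())

def cosine_relatedness(gloss_a: str, gloss_b: str) -> float:
    """Cosine of the angle between two gloss vectors; 1.0 means identical wording."""
    a, b = gloss_vector(gloss_a), gloss_vector(gloss_b)
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical glosses; the paper augments such definitions with Wikipedia text
# before building the vectors.
print(cosine_relatedness("inflammation of the liver caused by a virus",
                         "viral infection affecting the liver"))
```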
Automatic extraction of property norm-like data from large text corpora Kelly C.
Devereux B.
Korhonen A.
Cognitive Science English 2014 Traditional methods for deriving property-based representations of concepts from text have focused on either extracting only a subset of possible relation types, such as hyponymy/hypernymy (e.g., car is-a vehicle) or meronymy/metonymy (e.g., car has wheels), or unspecified relations (e.g., car-petrol). We propose a system for the challenging task of automatic, large-scale acquisition of unconstrained, human-like property norms from large text corpora, and discuss the theoretical implications of such a system. We employ syntactic, semantic, and encyclopedic information to guide our extraction, yielding concept-relation-feature triples (e.g., car be fast, car require petrol, car cause pollution), which approximate property-based conceptual representations. Our novel method extracts candidate triples from parsed corpora (Wikipedia and the British National Corpus) using syntactically and grammatically motivated rules, then reweights triples with a linear combination of their frequency and four statistical metrics. We assess our system output in three ways: lexical comparison with norms derived from human-generated property norm data, direct evaluation by four human judges, and a semantic distance comparison with both WordNet similarity data and human-judged concept similarity ratings. Our system offers a viable and performant method of plausible triple extraction: Our lexical comparison shows comparable performance to the current state-of-the-art, while subsequent evaluations exhibit the human-like character of our generated properties. 0 0
Extracting and displaying temporal and geospatial entities from articles on historical events Chasin R.
Daryl Woodward
Jeremy Witmer
Jugal Kalita
Computer Journal English 2014 This paper discusses a system that extracts and displays temporal and geospatial entities in text. The first task involves identification of all events in a document followed by identification of important events using a classifier. The second task involves identifying named entities associated with the document. In particular, we extract geospatial named entities. We disambiguate the set of geospatial named entities and geocode them to determine the correct coordinates for each place name, often called grounding. We resolve ambiguity based on sentence and article context. Finally, we present a user with the key events and their associated people, places and organizations within a document in terms of a timeline and a map. For purposes of testing, we use Wikipedia articles about historical events, such as those describing wars, battles and invasions. We focus on extracting major events from the articles, although our ideas and tools can be easily used with articles from other sources such as news articles. We use several existing tools such as Evita, Google Maps, publicly available implementations of Support Vector Machines, Hidden Markov Model and Conditional Random Field, and the MIT SIMILE Timeline. 0 0
Extracting semantic concept relations from Wikipedia Arnold P.
Rahm E.
ACM International Conference Proceeding Series English 2014 Background knowledge as provided by repositories such as WordNet is of critical importance for linking or mapping ontologies and related tasks. Since current repositories are quite limited in their scope and currentness, we investigate how to automatically build up improved repositories by extracting semantic relations (e.g., is-a and part-of relations) from Wikipedia articles. Our approach uses a comprehensive set of semantic patterns, finite state machines and NLP-techniques to process Wikipedia definitions and to identify semantic relations between concepts. Our approach is able to extract multiple relations from a single Wikipedia article. An evaluation for different domains shows the high quality and effectiveness of the proposed approach. 0 0
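A heavily simplified sketch of the definition-pattern idea described above (illustrative only; the paper uses a comprehensive pattern set, finite state machines and full NLP, and the example sentences below are invented):

```python
import re

# Minimal lexico-syntactic patterns for is-a and part-of relations in definition sentences.
PATTERNS = [
    (re.compile(r"^(?P<x>[\w ]+?) is an? (?:\w+ ){0,2}?(?P<y>[\w ]+?)(?: that| which|\.|,)"), "is-a"),
    (re.compile(r"^(?P<x>[\w ]+?) is (?:a )?part of (?P<y>[\w ]+?)(?:\.|,)"), "part-of"),
]

def extract_relations(definition: str):
    """Return (concept, relation, concept) triples found in one definition sentence."""
    triples = []
    for pattern, relation in PATTERNS:
        m = pattern.search(definition)
        if m:
            triples.append((m.group("x").strip(), relation, m.group("y").strip()))
    return triples

# Hypothetical Wikipedia-style first sentences.
print(extract_relations("A violin is a wooden string instrument that is played with a bow."))
print(extract_relations("The cornea is part of the eye."))
```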
Keyword extraction using multiple novel features Yang S.
Bin Zhang
Li S.
Yu C.
Hao Q.
Journal of Computational Information Systems English 2014 In this paper, we propose a novel approach for keyword extraction. Different from previous keyword extraction methods, which identify keywords based on the document alone, this approach introduces Wikipedia knowledge and document genre to extract keywords from the document. Keyword extraction is accomplished by a classification model utilizing not only traditional word based features but also features based on Wikipedia knowledge and document genre. In our experiment, this novel keyword extraction approach outperforms previous models for keyword extraction in terms of precision-recall metric and breaks through the plateau previously reached in the field. © 2014 Binary Information Press. 0 0
Tagging Scientific Publications Using Wikipedia and Natural Language Processing Tools Lopuszynski M.
Bolikowski L.
Communications in Computer and Information Science English 2014 In this work, we compare two simple methods of tagging scientific publications with labels reflecting their content. As a first source of labels Wikipedia is employed; the second label set is constructed from the noun phrases occurring in the analyzed corpus. We examine the statistical properties and the effectiveness of both approaches on a dataset consisting of abstracts from 0.7 million scientific documents deposited in the ArXiv preprint collection. We believe that the obtained tags can later be applied as useful document features in various machine learning tasks (document similarity, clustering, topic modelling, etc.). 0 0
3D Wikipedia: Using online text to automatically label and navigate reconstructed geometry Russell B.C.
Martin-Brualla R.
Butler D.J.
Seitz S.M.
Zettlemoyer L.
ACM Transactions on Graphics English 2013 We introduce an approach for analyzing Wikipedia and other text, together with online photos, to produce annotated 3D models of famous tourist sites. The approach is completely automated, and leverages online text and photo co-occurrences via Google Image Search. It enables a number of new interactions, which we demonstrate in a new 3D visualization tool. Text can be selected to move the camera to the corresponding objects, 3D bounding boxes provide anchors back to the text describing them, and the overall narrative of the text provides a temporal guide for automatically flying through the scene to visualize the world as you read about it. We show compelling results on several major tourist sites. 0 0
COLLEAP - COntextual Language LEArning Pipeline Wloka B.
Werner Winiwarter
Lecture Notes in Computer Science English 2013 In this paper we present a concept as well as a prototype of a tool pipeline to utilize the abundant information available on the World Wide Web for contextual, user-driven creation and display of language learning material. The approach is to capture Wikipedia articles of the user's choice by crawling, to analyze the linguistic aspects of the text via natural language processing and to compile the gathered information into a visually appealing presentation of enriched language information. The tool is designed to address the Japanese language, with a focus on kanji, the pictographic characters used in the Japanese script. 0 0
Determining leadership in contentious discussions Jain S.
Hovy E.
Electronic Proceedings of the 2013 IEEE International Conference on Multimedia and Expo Workshops, ICMEW 2013 English 2013 Participants in online decision making environments assume different roles. Especially in contentious discussions, the outcome often depends critically on the discussion leader(s). Recent work on automated leadership analysis has focused on collaborations where all the participants have the same goal. In this paper we focus on contentious discussions, in which the participants have different goals based on their opinion, which makes the notion of leader very different. We analyze discussions on the Wikipedia Articles for Deletion (AfD) forum. We define two complementary models, Content Leader and SilentOut Leader. The models quantify the basic leadership qualities of participants and assign leadership points to them. We compare the correlation between the leaders' ranks produced by the two models using the Spearman Coefficient. We also propose a method to verify the quality of the leaders identified by each model. 0 0
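The rank comparison mentioned above amounts to a standard Spearman correlation between two rankings of the same participants. A minimal sketch using SciPy, with invented placeholder ranks rather than data from the paper:

```python
from scipy.stats import spearmanr

# Hypothetical leadership ranks assigned to the same five AfD participants
# by two different models (1 = strongest leader).
content_leader_rank = [1, 2, 3, 4, 5]
silentout_leader_rank = [2, 1, 3, 5, 4]

rho, p_value = spearmanr(content_leader_rank, silentout_leader_rank)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```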
Development and evaluation of an ensemble resource linking medications to their indications Wei W.-Q.
Cronin R.M.
Xu H.
Lasko T.A.
Bastarache L.
Denny J.C.
Journal of the American Medical Informatics Association English 2013 Objective: To create a computable MEDication Indication resource (MEDI) to support primary and secondary use of electronic medical records (EMRs). Materials and methods: We processed four public medication resources, RxNorm, Side Effect Resource (SIDER) 2, MedlinePlus, and Wikipedia, to create MEDI. We applied natural language processing and ontology relationships to extract indications for prescribable, single-ingredient medication concepts and all ingredient concepts as defined by RxNorm. Indications were coded as Unified Medical Language System (UMLS) concepts and International Classification of Diseases, 9th edition (ICD9) codes. A total of 689 extracted indications were randomly selected for manual review for accuracy using dual-physician review. We identified a subset of medication-indication pairs that optimizes recall while maintaining high precision. Results: MEDI contains 3112 medications and 63 343 medication-indication pairs. Wikipedia was the largest resource, with 2608 medications and 34 911 pairs. For each resource, estimated precision and recall, respectively, were 94% and 20% for RxNorm, 75% and 33% for MedlinePlus, 67% and 31% for SIDER 2, and 56% and 51% for Wikipedia. The MEDI high-precision subset (MEDI-HPS) includes indications found within either RxNorm or at least two of the three other resources. MEDI-HPS contains 13 304 unique indication pairs regarding 2136 medications. The mean±SD number of indications for each medication in MEDI-HPS is 6.22±6.09. The estimated precision of MEDI-HPS is 92%. Conclusions: MEDI is a publicly available, computable resource that links medications with their indications as represented by concepts and billing codes. MEDI may benefit clinical EMR applications and reuse of EMR data for research. 0 0
Perspectives on crowdsourcing annotations for natural language processing Wang A.
Hoang C.D.V.
Kan M.-Y.
Language Resources and Evaluation English 2013 Crowdsourcing has emerged as a new method for obtaining annotations for training models for machine learning. While many variants of this process exist, they largely differ in their methods of motivating subjects to contribute and the scale of their applications. To date, there has yet to be a study that helps the practitioner to decide what form an annotation application should take to best reach its objectives within the constraints of a project. To fill this gap, we provide a faceted analysis of crowdsourcing from a practitioner's perspective, and show how our facets apply to existing published crowdsourced annotation applications. We then summarize how the major crowdsourcing genres fill different parts of this multi-dimensional space, which leads to our recommendations on the potential opportunities crowdsourcing offers to future annotation efforts. © 2012 Springer Science+Business Media B.V. 0 0
Term extraction from sparse, ungrammatical domain-specific documents Ittoo A.
Gosse Bouma
Expert Systems with Applications English 2013 Existing term extraction systems have predominantly targeted large and well-written document collections, which provide reliable statistical and linguistic evidence to support term extraction. In this article, we address the term extraction challenges posed by sparse, ungrammatical texts with domain-specific contents, such as customer complaint emails and engineers' repair notes. To this aim, we present ExtTerm, a novel term extraction system. Specifically, as our core innovations, we accurately detect rare (low frequency) terms, overcoming the issue of data sparsity. These rare terms may denote critical events, but they are often missed by extant TE systems. ExtTerm also precisely detects multi-word terms of arbitrary length, e.g. with more than 2 words. This is achieved by exploiting fundamental theoretical notions underlying term formation, and by developing a technique to compute the collocation strength between any number of words. Thus, we address the limitation of existing TE systems, which are primarily designed to identify terms with 2 words. Furthermore, we show that open-domain (general) resources, such as Wikipedia, can be exploited to support domain-specific term extraction. Thus, they can be used to compensate for the unavailability of domain-specific knowledge resources. Our experimental evaluations reveal that ExtTerm outperforms a state-of-the-art baseline in extracting terms from a domain-specific, sparse and ungrammatical real-life text collection. © 2012 Elsevier B.V. All rights reserved. 0 0
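For intuition, collocation strength for a word pair can be illustrated with plain pointwise mutual information over a toy token list. This is only a two-word baseline; ExtTerm's actual measure generalizes to longer terms and combines further statistical evidence:

```python
from collections import Counter
from math import log2

def pmi(corpus_tokens, w1, w2):
    """Pointwise mutual information of the bigram (w1, w2) in a token list."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    n = len(corpus_tokens)
    p_w1 = unigrams[w1] / n
    p_w2 = unigrams[w2] / n
    p_bigram = bigrams[(w1, w2)] / max(n - 1, 1)
    if not (p_w1 and p_w2 and p_bigram):
        return float("-inf")
    return log2(p_bigram / (p_w1 * p_w2))

# Toy repair-note style text (made up); "coolant leak" should score as a strong collocation.
tokens = "detected coolant leak near pump replaced pump coolant leak persists".split()
print(pmi(tokens, "coolant", "leak"))
```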
The ReqWiki approach for collaborative software requirements engineering with integrated text analysis support Bahar Sateli
Angius E.
René Witte
Proceedings - International Computer Software and Applications Conference English 2013 The requirements engineering phase within a software project is a heavily knowledge-driven, collaborative process that typically involves the analysis and creation of a large number of textual artifacts. We know that requirements engineering has a large impact on the success of a project, yet sophisticated tool support, especially for small to mid-size enterprises, is still lacking. We present ReqWiki, a novel open source web-based approach based on a semantic wiki that includes natural language processing (NLP) assistants, which work collaboratively with humans on the requirements specification documents. We evaluated ReqWiki with a number of software engineers to investigate the impact of our novel semantic support on software requirements engineering. Our user studies prove that (i) software engineers unfamiliar with NLP can easily leverage these assistants and (ii) semantic assistants can help to significantly improve the quality of requirements specifications. 0 0
WikiDetect: Automatic vandalism detection for Wikipedia using linguistic features Cioiu D.
Rebedea T.
Lecture Notes in Computer Science English 2013 Vandalism of the content has always been one of the greatest problems for Wikipedia, yet only a few completely automatic solutions for solving it have been developed so far. Volunteers still spend large amounts of time correcting vandalized page edits, instead of using this time to improve the quality of the content of articles. The purpose of this paper is to introduce a new vandalism detection system that uses only natural language processing and machine learning techniques. The system has been evaluated on a corpus of real vandalized data in order to test its performance and justify the design choices. The same expert-annotated wikitext, extracted from the encyclopedia's database, is used to evaluate different vandalism detection algorithms. The paper presents a critical analysis of the obtained results, comparing them to existing solutions, and suggests different statistical classification methods that bring several improvements to the task at hand. 0 0
Annotating words using wordnet semantic glosses Szymanski J.
Duch W.
Lecture Notes in Computer Science English 2012 An approach to word sense disambiguation (WSD) relying on WordNet synsets is proposed. The method uses semantically tagged glosses to perform a process similar to spreading activation in a semantic network, creating a ranking of the most probable meanings for word annotation. Preliminary evaluation shows quite promising results. Comparison with state-of-the-art WSD methods indicates that the use of WordNet relations and semantically tagged glosses should enhance the accuracy of word disambiguation methods. 0 0
Automatic taxonomy extraction in different languages using wikipedia and minimal language-specific information Dominguez Garcia R.
Schmidt S.
Rensing C.
Steinmetz R.
Lecture Notes in Computer Science English 2012 Knowledge bases extracted from Wikipedia are particularly useful for various NLP and Semantic Web applications due to their coverage, actuality and multilingualism. This has led to many approaches for automatic knowledge base extraction from Wikipedia. Most of these approaches rely on the English Wikipedia as it is the largest Wikipedia version. However, each Wikipedia version contains socio-cultural knowledge, i.e. knowledge with relevance for a specific culture or language. In this work, we describe a method for extracting a large set of hyponymy relations from the Wikipedia category system that can be used to acquire taxonomies in multiple languages. More specifically, we describe a set of 20 features that can be used for Hyponymy Detection without using additional language-specific corpora. Finally, we evaluate our approach on Wikipedia in five different languages and compare the results with the WordNet taxonomy and a multilingual approach based on interwiki links of Wikipedia. 0 0
Fast and Accurate Annotation of Short Texts with Wikipedia Pages Paolo Ferragina
Ugo Scaiella
IEEE Softw. English 2012 0 0
Overview of the INEX 2011 question answering track (QA@INEX) SanJuan E.
Moriceau V.
Tannier X.
Bellot P.
Mothe J.
Lecture Notes in Computer Science English 2012 The INEX QA track aimed to evaluate complex question-answering tasks where answers are short texts generated from Wikipedia by extraction of relevant short passages and aggregation into a coherent summary. In such a task, question answering, XML/passage retrieval and automatic summarization are combined in order to get closer to real information needs. Based on the groundwork carried out in the 2009-2010 edition to determine the sub-tasks and a novel evaluation methodology, the 2011 edition experimented with contextualizing tweets using a recent cleaned dump of Wikipedia. Participants had to contextualize 132 tweets from the New York Times (NYT). Informativeness of answers has been evaluated, as well as their readability. 13 teams from 6 countries actively participated in this track. This tweet contextualization task will continue in 2012 as part of the CLEF INEX lab with the same methodology and baseline but on a much wider range of tweet types. 0 0
Pattern for python De Smedt T.
Daelemans W.
Journal of Machine Learning Research English 2012 Pattern is a package for Python 2.4+ with functionality for web mining (Google + Twitter + Wikipedia, web spider, HTML DOM parser), natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM classifiers) and network analysis (graph centrality and visualization). It is well documented and bundled with 30+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern. 0 0
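A small usage sketch of the package described above, assuming Pattern is installed (the paper targets Python 2.4+, while newer releases also run under Python 3; the outputs noted in the comments are approximate):

```python
from pattern.en import parse, sentiment

text = "The movie was surprisingly good."
print(parse(text))      # part-of-speech tags and chunk labels for each token
print(sentiment(text))  # (polarity, subjectivity) pair in roughly [-1, 1] x [0, 1]
```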
ReqWiki: A semantic system for collaborative software requirements engineering Bahar Sateli
Rajivelu S.S.
Angius E.
René Witte
WikiSym 2012 English 2012 The requirements engineering phase within a software project is a heavily knowledge-driven, collaborative process that typically involves the analysis and creation of a large number of textual artifacts. We know that requirements engineering has a large impact on the success of a project, yet sophisticated tool support, especially for small to mid-size enterprises, is still lacking. We present ReqWiki, a novel open source web-based approach based on a semantic wiki that includes Natural Language Processing (NLP) assistants, which work collaboratively with humans on the requirements specification documents. 0 0
Supporting wiki users with natural language processing Bahar Sateli
René Witte
WikiSym 2012 English 2012 We present a "self-aware" wiki system, based on the MediaWiki engine, that can develop and organize its content using state-of-art techniques from the Natural Language Processing (NLP) and Semantic Computing domains. This is achieved with an architecture that integrates novel NLP solutions within the MediaWiki environment to allow wiki users to benefit from modern text mining techniques. As concrete applications, we present how the enhanced MediaWiki engine can be used for biomedical literature curation, cultural heritage data management, and software requirements engineering. 0 0
The people's encyclopedia under the gaze of the sages: a systematic review of scholarly research on Wikipedia Chitu Okoli
Mohamad Mehdi
Mostafa Mesgari
Finn Årup Nielsen
Arto Lanamäki
English 2012 Wikipedia has become one of the ten most visited sites on the Web, and the world’s leading source of Web reference information. Its rapid success has inspired hundreds of scholars from various disciplines to study its content, communication and community dynamics from various perspectives. This article presents a systematic review of scholarly research on Wikipedia. We describe our detailed, rigorous methodology for identifying over 450 scholarly studies of Wikipedia. We present the WikiLit website (http://wikilit.referata.com), where most of the papers reviewed here are described in detail. In the major section of this article, we then categorize and summarize the studies. An appendix features an extensive list of resources useful for Wikipedia researchers. 15 1
Using Wikipedia and conceptual graph structures to generate questions for academic writing support Liu M.
Calvo R.A.
Aditomo A.
Pizzato L.A.
IEEE Transactions on Learning Technologies English 2012 In this paper, we present a novel approach for semiautomatic question generation to support academic writing. Our system first extracts key phrases from students' literature review papers. Each key phrase is matched with a Wikipedia article and classified into one of five abstract concept categories: Research Field, Technology, System, Term, and Other. Using the content of the matched Wikipedia article, the system then constructs a conceptual graph structure representation for each key phrase and the questions are then generated based on the structure. To evaluate the quality of the computer-generated questions, we conducted a version of the Bystander Turing test, which involved 20 research students who had written literature reviews for an IT methods course. The pedagogical values of generated questions were evaluated using a semiautomated process. The results indicate that the students had difficulty distinguishing between computer-generated and supervisor-generated questions. Computer-generated questions were also rated as being as pedagogically useful as supervisor-generated questions, and more useful than generic questions. The findings also suggest that the computer-generated questions were more useful for the first-year students than for second or third-year students. 0 0
Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features B. Thomas Adler
Luca de Alfaro
Santiago M. Mola Velasco
Paolo Rosso
Andrew G. West
Lecture Notes in Computer Science English February 2011 Wikipedia is an online encyclopedia which anyone can edit. While most edits are constructive, about 7% are acts of vandalism. Such behavior is characterized by modifications made in bad faith; introducing spam and other inappropriate content. In this work, we present the results of an effort to integrate three of the leading approaches to Wikipedia vandalism detection: a spatio-temporal analysis of metadata (STiki), a reputation-based system (WikiTrust), and natural language processing features. The performance of the resulting joint system improves the state-of-the-art from all previous methods and establishes a new baseline for Wikipedia vandalism detection. We examine in detail the contribution of the three approaches, both for the task of discovering fresh vandalism, and for the task of locating vandalism in the complete set of Wikipedia revisions. 0 1
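The feature-combination idea described above can be sketched generically as feature-level fusion feeding a single classifier. This is not the authors' system; the feature groups, data and labels below are invented placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-edit feature groups: metadata (e.g. time of day, anonymity),
# editor reputation, and language features (e.g. profanity ratio, caps ratio).
rng = np.random.default_rng(0)
metadata = rng.random((200, 3))
reputation = rng.random((200, 1))
language = rng.random((200, 4))
X = np.hstack([metadata, reputation, language])  # simple feature-level fusion
y = rng.integers(0, 2, size=200)                 # 1 = vandalism, 0 = legitimate (fake labels)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict_proba(X[:3])[:, 1])            # vandalism scores for three edits
```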
A statistical approach for automatic keyphrase extraction Abulaish M.
Jahiruddin
Dey L.
Proceedings of the 5th Indian International Conference on Artificial Intelligence, IICAI 2011 English 2011 Due to the availability of voluminous textual data either on the World Wide Web or in textual databases, automatic keyphrase extraction has gained increasing popularity in the recent past to summarize and characterize text documents. Consequently, a number of machine learning techniques, mostly supervised, have been proposed to mine keyphrases in an automatic way. However, the non-availability of an annotated corpus for training such systems is the main hindrance to their success. In this paper, we propose the design of an automatic keyphrase extraction system which uses NLP and a statistical approach to mine keyphrases from unstructured text documents. The efficacy of the proposed system is established over texts crawled from the Wikipedia server. On evaluation we found that the proposed method outperforms KEA, which uses a naïve Bayes classification technique for keyphrase extraction. 0 0
Clasificación de textos en lenguaje natural usando la wikipedia [Natural language text classification using Wikipedia] Quinteiro-Gonzalez J.M.
Martel-Jordan E.
Hernandez-Morera P.
Ligero-Fleitas J.A.
Lopez-Rodriguez A.
RISTI - Revista Iberica de Sistemas e Tecnologias de Informacao Spanish 2011 Automatic Text Classifiers are needed in environments where the amount of data to handle is so high that human classification would be ineffective. In our study, the proposed classifier takes advantage of the Wikipedia to generate the corpus defining each category. The text is then analyzed syntactically using Natural Language Processing software. The proposed classifier is highly accurate and outperforms Machine Learning trained classifiers. 0 0
Cultural Configuration of Wikipedia: measuring Autoreferentiality in Different Languages Marc Miquel
Horacio Rodríguez
Proceedings of Recent Advances in Natural Language Processing, pp. 316-322 2011 Among the motivations to write in Wikipedia given by the current literature there is often coincidence, but none of the studies presents the hypothesis of contributing to the visibility of one's own national or language-related content. Similar to topical coverage studies, we outline a method which allows collecting the articles of this content, to later analyse them in several dimensions. To prove its universality, the tests are repeated for up to twenty language editions of Wikipedia. Finally, through the best indicators from each dimension we obtain an index which represents the degree of autoreferentiality of the encyclopedia. Last, we point out the impact of this fact and the risk of not considering its existence in the design of applications based on user generated content. 0 0
Exploring wikipedia's category graph for query classification Milad Alemzadeh
Richard Khoury
Fakhri Karray
AIS English 2011 0 0
Extracting events from Wikipedia as RDF triples linked to widespread semantic web datasets Carlo Aliprandi
Francesco Ronzano
Andrea Marchetti
Maurizio Tesconi
Salvatore Minutoli
Lecture Notes in Computer Science English 2011 Many attempts have been made to extract structured data from Web resources, exposing them as RDF triples and interlinking them with other RDF datasets: in this way it is possible to create clouds of highly integrated Semantic Web data collections. In this paper we describe an approach to enhance the extraction of semantic contents from unstructured textual documents, in particular considering Wikipedia articles and focusing on event mining. Starting from the deep parsing of a set of English Wikipedia articles, we produce a semantic annotation compliant with the Knowledge Annotation Format (KAF). We extract events from the KAF semantic annotation and then we structure each event as a set of RDF triples linked to both DBpedia and WordNet. We point out examples of automatically mined events, providing some general evaluation of how our approach may discover new events and link them to existing contents. 0 0
Extracting events from wikipedia as RDF triples linked to widespread semantic web datasets Carlo Aliprandi
Francesco Ronzano
Andrea Marchetti
Maurizio Tesconi
Salvatore Minutoli
OCSC English 2011 0 0
Natural language processing neural network considering deep cases Sagara T.
Hagiwara M.
IEEJ Transactions on Electronics, Information and Systems Japanese 2011 In this paper, we propose a novel neural network considering deep cases. It can learn knowledge from natural language documents and can perform recall and inference. Various techniques of natural language processing using neural networks have been proposed. However, the natural language sentences used in these techniques consist of only a few words, and they cannot handle complicated sentences. In order to solve these problems, the proposed network divides natural language sentences into a sentence layer, a knowledge layer, ten kinds of deep case layers and a dictionary layer. It can learn the relations among sentences and among words by dividing sentences. The advantages of the method are as follows: (1) ability to handle complicated sentences; (2) ability to restructure sentences; (3) usage of the conceptual dictionary, Goi-Taikei, as the long-term memory in the brain. Two kinds of experiments were carried out using the goo dictionary and Wikipedia as knowledge sources. Superior performance of the proposed neural network has been confirmed. 0 0
Wikipedia vandalism detection Santiago M. Mola Velasco World Wide Web English 2011 0 0
Collaborative knowledge discovery & marshalling for intelligence & security applications Cowell A.J.
Jensen R.S.
Gregory M.L.
Ellis P.
Fligg K.
McGrath L.R.
O'Hara K.
Bell E.
ISI 2010 - 2010 IEEE International Conference on Intelligence and Security Informatics: Public Safety and Security English 2010 This paper discusses the Knowledge Encapsulation Framework, a flexible, extensible evidence-marshalling environment built upon a natural language processing pipeline and exposed to users via an open-source semantic wiki. We focus our discussion on applications of the framework to intelligence and security applications, specifically, an instantiation of the KEF environment for researching illicit trafficking in nuclear materials. 0 0
Information extraction from Wikipedia using pattern learning Mihaltz M. Acta Cybernetica English 2010 In this paper we present solutions for the crucial task of extracting structured information from massive free-text resources, such as Wikipedia, for the sake of semantic databases serving upcoming Semantic Web technologies. We demonstrate both a verb frame-based approach using deep natural language processing techniques with extraction patterns developed by human knowledge experts and machine learning methods using shallow linguistic processing. We also propose a method for learning verb frame-based extraction patterns automatically from labeled data. We show that labeled training data can be produced with only minimal human effort by utilizing existing semantic resources and the special characteristics of Wikipedia. Custom solutions for named entity recognition are also possible in this scenario. We present evaluation and comparison of the different approaches for several different relations. 0 0
Language homogeneity in the Japanese Wikipedia Skevik K.-A. PACLIC 24 - Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation English 2010 Wikipedia is a potentially very useful source of information, but intuitively it is difficult to have confidence in the quality of an encyclopedia that anyone can modify. One aspect of correctness is writing style, which we examine in a computer based study of the full Japanese Wikipedia. This is possible because Japanese is a language with clearly distinct writing styles using e.g., different verb forms. We find that the writing style of the Japanese Wikipedia is largely consistent with the style guidelines for the project. Exceptions appear to occur primarily in articles with a small number of changes and editors. 0 0
Morpheus: A deep web question answering system Grant C.
George C.P.
Gumbs J.-D.
Wilson J.N.
Dobbins P.J.
IiWAS2010 - 12th International Conference on Information Integration and Web-Based Applications and Services English 2010 When users search the deep web, the essence of their search is often found in a previously answered query. The Morpheus question answering system reuses prior searches to answer similar user queries. Queries are represented in a semistructured format that contains query terms and referenced classes within a specific ontology. Morpheus answers questions by using methods from prior successful searches. The system ranks stored methods based on a similarity quasimetric defined on assigned classes of queries. Similarity depends on the class heterarchy in an ontology and its associated text corpora. Morpheus revisits the prior search pathways of the stored searches to construct possible answers. Realm-based ontologies are created using Wikipedia pages, associated categories, and the synset heterarchy of WordNet. This paper describes the entire process with emphasis on the matching of user queries to stored answering methods. Copyright 2010 ACM. 0 0
Relation extraction between related concepts by combining Wikipedia and web information for Japanese language Masumi Shirakawa
Kotaro Nakayama
Eiji Aramaki
Takahiro Hara
Shojiro Nishio
Lecture Notes in Computer Science English 2010 Construction of a huge scale ontology covering many named entities, domain-specific terms and relations among these concepts is one of the essential technologies in the next generation Web based on semantics. Recently, a number of studies have proposed automated ontology construction methods using the wide coverage of concepts in Wikipedia. However, since they tried to extract formal relations such as is-a and a-part-of relations, generated ontologies have only a narrow coverage of the relations among concepts. In this work, we aim at automated ontology construction with a wide coverage of both concepts and these relations by combining information on the Web with Wikipedia. We propose a relation extraction method which receives pairs of co-related concepts from an association thesaurus extracted from Wikipedia and extracts their relations from the Web. 0 0
STiki: An anti-vandalism tool for wikipedia using spatio-temporal analysis of revision metadata West A.G.
Sampath Kannan
Insup Lee
WikiSym 2010 English 2010 STiki is an anti-vandalism tool for Wikipedia. Unlike similar tools, STiki does not rely on natural language processing (NLP) over the article or diff text to locate vandalism. Instead, STiki leverages spatio-temporal properties of revision metadata. The feasibility of utilizing such properties was demonstrated in our prior work, which found they perform comparably to NLP-efforts while being more efficient, robust to evasion, and language independent. STiki is a real-time, on-Wikipedia implementation based on these properties. It consists of, (1) a server-side processing engine that examines revisions, scoring the likelihood each is vandalism, and, (2) a client-side GUI that presents likely vandalism to end-users for definitive classification (and if necessary, reversion on Wikipedia). Our demonstration will provide an introduction to spatio-temporal properties, demonstrate the STiki software, and discuss alternative research uses for the open-source code. 0 0
Using encyclopaedic knowledge for query classification Richard Khoury Proceedings of the 2010 International Conference on Artificial Intelligence, ICAI 2010 English 2010 Identifying the intended topic that underlies a user's query can benefit a large range of applications, from search engines to question-answering systems. However, query classification remains a difficult challenge due to the variety of queries a user can ask, the wide range of topics users can ask about, and the limited amount of information that can be mined from the query. In this paper, we develop a new query classification system that accounts for these three challenges. Our system relies on encyclopaedic knowledge to understand the user's query and fill in the gaps of missing information. Specifically, we use the freely-available online encyclopaedia Wikipedia as a natural-language knowledge base, and exploit Wikipedia's structure to infer the correct classification of any user query. 0 0
Wisdom of crowds versus wisdom of linguists - Measuring the semantic relatedness of words Torsten Zesch
Iryna Gurevych
Natural Language Engineering English 2010 In this article, we present a comprehensive study aimed at computing semantic relatedness of word pairs. We analyze the performance of a large number of semantic relatedness measures proposed in the literature with respect to different experimental conditions, such as (i) the datasets employed, (ii) the language (English or German), (iii) the underlying knowledge source, and (iv) the evaluation task (computing scores of semantic relatedness, ranking word pairs, solving word choice problems). To our knowledge, this study is the first to systematically analyze semantic relatedness on a large number of datasets with different properties, while emphasizing the role of the knowledge source compiled either by the wisdom of linguists (i.e., classical wordnets) or by the wisdom of crowds (i.e., collaboratively constructed knowledge sources like Wikipedia). The article discusses benefits and drawbacks of different approaches to evaluating semantic relatedness. We show that results should be interpreted carefully to evaluate particular aspects of semantic relatedness. For the first time, we employ a vector based measure of semantic relatedness, relying on a concept space built from documents, applying it to the first paragraph of Wikipedia articles, to English WordNet glosses, and to GermaNet based pseudo glosses. Contrary to previous research (Strube and Ponzetto 2006; Gabrilovich and Markovitch 2007; Zesch et al. 2007), we find that wisdom of crowds based resources are not superior to wisdom of linguists based resources. We also find that using the first paragraph of a Wikipedia article as opposed to the whole article leads to better precision, but decreases recall. Finally, we present two systems that were developed to aid the experiments presented herein and are freely available for research purposes: (i) DEXTRACT, a software to semi-automatically construct corpus-driven semantic relatedness datasets, and (ii) JWPL, a Java-based high-performance Wikipedia Application Programming Interface (API) for building natural language processing (NLP) applications. 0 0
Workshop on current issues in predictive approaches to intelligence and security analytics: Fostering the creation of decision advantage through model integration and evaluation Sanfilippo A. ISI 2010 - 2010 IEEE International Conference on Intelligence and Security Informatics: Public Safety and Security English 2010 The increasing asymmetric nature of threats to the security, health and sustainable growth of our society requires that anticipatory reasoning become an everyday activity. Currently, the use of anticipatory reasoning is hindered by the lack of systematic methods for combining knowledge- and evidence-based models, integrating modeling algorithms, and assessing model validity, accuracy and utility. The workshop addresses these gaps with the intent of fostering the creation of a community of interest on model integration and evaluation that may serve as an aggregation point for existing efforts and a launch pad for new approaches. 0 0
An architecture to support intelligent user interfaces for Wikis by means of Natural Language Processing Johannes Hoffart
Torsten Zesch
Iryna Gurevych
WikiSym English 2009 0 0
Automatic multilingual lexicon generation using wikipedia as a resource Shahid A.R.
Kazakov D.
ICAART 2009 - Proceedings of the 1st International Conference on Agents and Artificial Intelligence English 2009 This paper proposes a method for creating a multilingual dictionary by taking the titles of Wikipedia pages in English and then finding the titles of the corresponding articles in other languages. The creation of such multilingual dictionaries has become possible as a result of exponential increase in the size of multilingual information on the web. Wikipedia is a prime example of such multilingual source of information on any conceivable topic in the world, which is edited by the readers. Here, a web crawler has been used to traverse Wikipedia following the links on a given page. The crawler takes out the title along with the titles of the corresponding pages in other targeted languages. The result is a set of words and phrases that are translations of each other. For efficiency, the URLs are organized using hash tables. A lexicon has been constructed which contains 7-tuples corresponding to 7 different languages, namely: English, German, French, Polish, Bulgarian, Greek and Chinese. 0 0
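The title-to-title mapping described above can also be obtained from the MediaWiki API's langlinks property rather than by crawling page links as the paper does; a sketch under that assumption (the example title and language codes are illustrative):

```python
import requests

def interlanguage_titles(english_title, languages=("de", "fr", "pl", "bg", "el", "zh")):
    """Map an English Wikipedia page title to its titles in other language editions."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "langlinks",
            "titles": english_title,
            "lllimit": "max",
            "format": "json",
        },
        timeout=10,
    )
    page = next(iter(resp.json()["query"]["pages"].values()))
    links = {ll["lang"]: ll["*"] for ll in page.get("langlinks", [])}
    return {lang: links.get(lang) for lang in languages}

print(interlanguage_titles("Computer"))
```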
Exploiting Wikipedia as a knowledge base: Towards an ontology of movies Alarcon R.
Sanchez O.
Mijangos V.
CEUR Workshop Proceedings English 2009 Wikipedia is a huge knowledge base growing every day due to the contribution of people all around the world. Part of the information of each article is kept in a special, consistently formatted table called an infobox. In this article, we analyze the Wikipedia infoboxes of movie articles; we describe some of the problems that can make extracting information from these tables a difficult task. We also present a methodology to automatically extract information that could be useful towards the building of an ontology of movies from Wikipedia in Spanish. 0 0
Inducing gazetteer for Chinese named entity recognition based on local high-frequent strings Pang W.
Fan X.
2009 2nd International Conference on Future Information Technology and Management Engineering, FITME 2009 English 2009 Gazetteers, or entity dictionaries, are important for named entity recognition (NER). Although the dictionaries extracted automatically by the previous methods from a corpus, the web or Wikipedia are very large, they still miss some entities, especially the domain-specific entities. We present a novel method of automatic entity dictionary induction, which is able to construct a dictionary more specific to the text being processed at a much lower computational cost than the previous methods. It extracts the local high-frequent strings in a document as candidate entities, and filters the invalid candidates with the accessor variety (AV) as our entity criterion. The experiments show that the obtained dictionary can effectively improve the performance of a high-precision baseline of NER. 0 0
Key phrase extraction: A hybrid assignment and extraction approach Nguyen C.Q.
Phan T.T.
IiWAS2009 - The 11th International Conference on Information Integration and Web-based Applications and Services English 2009 Automatic key phrase extraction is fundamental to the success of many recent digital library applications and semantic information retrieval techniques, and it is a difficult and essential problem in Vietnamese natural language processing (NLP). In this work, we propose a novel method for key phrase extraction from Vietnamese text that combines assignment and extraction approaches. We also explore NLP techniques that we propose for the analysis of Vietnamese texts, focusing on the advanced candidate phrase recognition phase as well as part-of-speech (POS) tagging. Then we propose a method that exploits specific characteristics of the Vietnamese language and exploits the Vietnamese Wikipedia as an ontology for key phrase ambiguity resolution. Finally, we show the results of several experiments that have examined the impacts of strategies chosen for Vietnamese key phrase extraction. 0 0
Mining meaning from Wikipedia Olena Medelyan
David N. Milne
Catherine Legg
Ian H. Witten
International Journal of Human-Computer Studies
English 2009 Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks. This article provides a comprehensive description of this work. It focuses on research that extracts and makes use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval and information extraction; and as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced. 2009 Elsevier Ltd. All rights reserved. 0 4
Semi-automatic extraction and modeling of ontologies using wikipedia XML corpus De Silva L.
Lakshman Jayaratne
2nd International Conference on the Applications of Digital Information and Web Technologies, ICADIWT 2009 English 2009 This paper introduces WikiOnto: a system that assists in the extraction and modeling of topic ontologies in a semi-automatic manner using a preprocessed document corpus derived from Wikipedia. Based on the Wikipedia XML Corpus, we present a three-tiered framework for extracting topic ontologies in quick time and a modeling environment to refine these ontologies. Using Natural Language Processing (NLP) and other Machine Learning (ML) techniques along with a very rich document corpus, this system proposes a solution to a task that is generally considered extremely cumbersome. The initial results of the prototype suggest strong potential of the system to become highly successful in ontology extraction and modeling and also inspire further research on extracting ontologies from other semi-structured document corpora as well. 0 0
WikiOnto: A system for semi-automatic extraction and modeling of ontologies using Wikipedia XML corpus De Silva L.
Lakshman Jayaratne
ICSC 2009 - 2009 IEEE International Conference on Semantic Computing English 2009 This paper introduces WikiOnto: a system that assists in the extraction and modeling of topic ontologies in a semi-automatic manner using a preprocessed document corpus of one of the largest knowledge bases in the world - the Wikipedia. Based on the Wikipedia XML Corpus, we present a three-tiered framework for extracting topic ontologies in quick time and a modeling environment to refine these ontologies. Using Natural Language Processing (NLP) and other Machine Learning (ML) techniques along with a very rich document corpus, this system proposes a solution to a task that is generally considered extremely cumbersome. The initial results of the prototype suggest strong potential of the system to become highly successful in ontology extraction and modeling and also inspire further research on extracting ontologies from other semi-structured document corpora as well. 0 0
A model for Ranking entities and its application to Wikipedia Gianluca Demartini
Firan C.S.
Tereza Iofciu
Ralf Krestel
Wolfgang Nejdl
Proceedings of the Latin American Web Conference, LA-WEB 2008 English 2008 Entity Ranking (ER) is a recently emerging search task in Information Retrieval, where the goal is not finding documents matching the query words, but instead finding entities which match types and attributes mentioned in the query. In this paper we propose a formal model to define entities as well as a complete ER system, providing examples of its application to enterprise, Web, and Wikipedia scenarios. Since searching for entities on Web scale repositories is an open challenge as the effectiveness of ranking is usually not satisfactory, we present a set of algorithms based on our model and evaluate their retrieval effectiveness. The results show that combining simple Link Analysis, Natural Language Processing, and Named Entity Recognition methods improves retrieval performance of entity search by over 53% for P@10 and 35% for MAP. 0 0
Extracting concept hierarchy knowledge from the Web based on Property Inheritance and Aggregation Hattori S.
Katsumi Tanaka
Proceedings - 2008 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2008 English 2008 Concept hierarchy knowledge, such as hyponymy and meronymy, is very important for various natural language processing systems. While WordNet and Wikipedia are being manually constructed and maintained as lexical ontologies, many researchers have tackled how to extract concept hierarchies from very large corpora of text documents such as the Web not manually but automatically. However, their methods are mostly based on lexico-syntactic patterns as not necessary but sufficient conditions of hyponymy and meronymy, so they can achieve high precision but low recall when using stricter patterns or they can achieve high recall but low precision when using looser patterns. Therefore, we need necessary conditions of hyponymy and meronymy to achieve high recall and not low precision. In this paper, not only "Property Inheritance" from a target concept to its hyponyms but also "Property Aggregation" from its hyponyms to the target concept is assumed to be necessary and sufficient conditions of hyponymy, and we propose a method to extract concept hierarchy knowledge from the Web based on property inheritance and property aggregation. 0 0
GeoSR: Geographically explore semantic relations in world knowledge Brent Hecht
Raubal M.
Lecture Notes in Geoinformation and Cartography English 2008 Methods to determine the semantic relatedness (SR) value between two lexically expressed entities abound in the field of natural language processing (NLP). The goal of such efforts is to identify a single measure that summarizes the number and strength of the relationships between the two entities. In this paper, we present GeoSR, the first adaptation of SR methods to the context of geographic data exploration. By combining the first use of a knowledge repository structure that is replete with non-classical relations, a new means of explaining those relations to users, and the novel application of SR measures to a geographic reference system, GeoSR allows users to geographically navigate and investigate the world knowledge encoded in Wikipedia. There are numerous visualization and interaction paradigms possible with GeoSR; we present one implementation as a proof-of-concept and discuss others. Although Wikipedia is used as the knowledge repository for our implementation, GeoSR will also work with any knowledge repository having a similar set of properties. 0 0
Lexical and semantic resources for NLP: From words to meanings Gentile A.L.
Pierpaolo Basile
Iaquinta L.
Giovanni Semeraro
Lecture Notes in Computer Science English 2008 A user expresses her information need through words with a precise meaning, but from the machine point of view this meaning does not come with the word. A further step is needful to automatically associate it to the words. Techniques that process human language are required and also linguistic and semantic knowledge, stored within distinct and heterogeneous resources, which play an important role during all Natural Language Processing (NLP) steps. Resources management is a challenging problem, together with the correct association between URIs coming from the resources and meanings of the words. This work presents a service that, given a lexeme (an abstract unit of morphological analysis in linguistics, which roughly corresponds to a set of words that are different forms of the same word), returns all syntactic and semantic information collected from a list of lexical and semantic resources. The proposed strategy consists in merging data with origin from stable resources, such as WordNet, with data collected dynamically from evolving sources, such as the Web or Wikipedia. That strategy is implemented in a wrapper to a set of popular linguistic resources that provides a single point of access to them, in a transparent way to the user, to accomplish the computational linguistic problem of getting a rich set of linguistic and semantic annotations in a compact way. 0 0
Analysis of the Wikipedia Category Graph for NLP Applications Iryna Gurevych
Torsten Zesch
Proceedings of the TextGraphs-2 Workshop (NAACL-HLT) 2007 In this paper, we discuss two graphs in Wikipedia: (i) the article graph, and (ii) the category graph. We perform a graph-theoretic analysis of the category graph, and show that it is a scale-free, small world graph like other well-known lexical semantic networks. We substantiate our findings by transferring semantic relatedness algorithms defined on WordNet to the Wikipedia category graph. To assess the usefulness of the category graph as an NLP resource, we analyze its coverage and the performance of the transferred semantic relatedness algorithms. 0 0