Information extraction

From WikiPapers

Information extraction is included as a keyword or extra keyword in 0 datasets, 0 tools, and 82 publications.


There are no datasets for this keyword.


There are no tools for this keyword.


Title Author(s) Published in Language Date Abstract R C
Building distant supervised relation extractors Nunes T.
Schwabe D.
Proceedings - 2014 IEEE International Conference on Semantic Computing, ICSC 2014 English 2014 A well-known drawback in building machine learning semantic relation detectors for natural language is the lack of a large number of qualified training instances for the target relations in multiple languages. Even when good results are achieved, the datasets used by the state-of-the-art approaches are rarely published. In order to address these problems, this work presents an automatic approach to build multilingual semantic relation detectors through distant supervision, combining two of the largest resources of structured and unstructured content available on the Web, DBpedia and Wikipedia. We map the DBpedia ontology back to the Wikipedia text to extract more than 100,000 training instances for more than 90 DBpedia relations for English and Portuguese without human intervention. First, we mine the Wikipedia articles to find candidate instances for relations described in the DBpedia ontology. Second, we preprocess and normalize the data, filtering out irrelevant instances. Finally, we use the normalized data to construct regularized logistic regression detectors that achieve more than 80% F-measure for both English and Portuguese. In this paper, we also compare the impact of different types of features on the accuracy of the trained detector, demonstrating significant performance improvements when combining lexical, syntactic and semantic features. Both the datasets and the code used in this research are available online. 0 0
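The core labeling step of distant supervision, as described in this abstract, can be sketched in a few lines: knowledge-base triples are matched against raw sentences, and any sentence mentioning both arguments of a triple becomes a (noisy) positive training instance for that relation. This is a minimal illustration with invented data, not the authors' code or the actual DBpedia/Wikipedia pipeline.

```python
# Minimal sketch of distant-supervision labeling: pair each sentence
# with every KB triple whose two arguments both occur in it; the
# triple's relation becomes the (noisy) label. All names and data
# here are illustrative.
KB = [
    ("Lisbon", "capitalOf", "Portugal"),
    ("Paris", "capitalOf", "France"),
]

sentences = [
    "Lisbon is the capital and largest city of Portugal.",
    "Paris hosted the 1900 Summer Olympics.",
    "Paris is the capital of France.",
]

def distant_label(kb, sentences):
    """Return (sentence, subject, object, relation) training instances."""
    instances = []
    for sent in sentences:
        for subj, rel, obj in kb:
            if subj in sent and obj in sent:
                instances.append((sent, subj, obj, rel))
    return instances

for sent, subj, obj, rel in distant_label(KB, sentences):
    print(f"{rel}({subj}, {obj}) <- {sent!r}")
```

The second sentence mentions Paris but not France, so it is correctly skipped; in practice such string matching is the main source of labeling noise, which the paper addresses with its filtering and normalization steps.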
Development of a semantic and syntactic model of natural language by means of non-negative matrix and tensor factorization Anisimov A.
Marchenko O.
Taranukha V.
Vozniuk T.
Lecture Notes in Computer Science English 2014 A method for developing a structural model of natural language syntax and semantics is proposed. Syntactic and semantic relations between parts of a sentence are presented in the form of a recursive structure called a control space. Numerical characteristics of these data are stored in multidimensional arrays. After factorization, the arrays serve as the basis for the development of procedures for analyses of natural language semantics and syntax. 0 0
Extracting and displaying temporal and geospatial entities from articles on historical events Chasin R.
Daryl Woodward
Jeremy Witmer
Jugal Kalita
Computer Journal English 2014 This paper discusses a system that extracts and displays temporal and geospatial entities in text. The first task involves identification of all events in a document followed by identification of important events using a classifier. The second task involves identifying named entities associated with the document. In particular, we extract geospatial named entities. We disambiguate the set of geospatial named entities and geocode them to determine the correct coordinates for each place name, often called grounding. We resolve ambiguity based on sentence and article context. Finally, we present a user with the key events and their associated people, places and organizations within a document in terms of a timeline and a map. For purposes of testing, we use Wikipedia articles about historical events, such as those describing wars, battles and invasions. We focus on extracting major events from the articles, although our ideas and tools can be easily used with articles from other sources such as news articles. We use several existing tools such as Evita, Google Maps, publicly available implementations of Support Vector Machines, Hidden Markov Model and Conditional Random Field, and the MIT SIMILE Timeline. 0 0
Extracting semantic concept relations from Wikipedia Arnold P.
Rahm E.
ACM International Conference Proceeding Series English 2014 Background knowledge as provided by repositories such as WordNet is of critical importance for linking or mapping ontologies and related tasks. Since current repositories are quite limited in their scope and currentness, we investigate how to automatically build up improved repositories by extracting semantic relations (e.g., is-a and part-of relations) from Wikipedia articles. Our approach uses a comprehensive set of semantic patterns, finite state machines and NLP-techniques to process Wikipedia definitions and to identify semantic relations between concepts. Our approach is able to extract multiple relations from a single Wikipedia article. An evaluation for different domains shows the high quality and effectiveness of the proposed approach. 0 0
Identifying the topic of queries based on domain specify ontology ChienTa D.C.
Thi T.P.
WIT Transactions on Information and Communication Technologies English 2014 In order to identify the topic of queries, much past research has relied on lexico-syntactic and handcrafted knowledge sources in Machine Learning and Natural Language Processing (NLP). Conversely, in this paper, we introduce an application system that detects the topic of queries based on a domain-specific ontology. In this system, we focus on building this domain-specific ontology, which is composed of instances automatically extracted from available resources such as Wikipedia, WordNet, and the ACM Digital Library. The experimental evaluation with many cases of queries related to the information technology area shows that this system considerably outperforms a matching and identifying approach. 0 0
A hybrid method for detecting outdated information in Wikipedia infoboxes Thanh Tran
Cao T.H.
Proceedings - 2013 RIVF International Conference on Computing and Communication Technologies: Research, Innovation, and Vision for Future, RIVF 2013 English 2013 Wikipedia has grown fast and become a major information resource for users as well as for many knowledge bases derived from it. However, it is still edited manually while the world is changing rapidly. In this paper, we propose a method to detect outdated attribute values in Wikipedia infoboxes by using facts extracted from the general Web. Our proposed method extracts new information by combining a pattern-based approach with an entity-search-based approach to deal with the diversity of natural language presentations of facts on the Web. Our experimental results show that the achieved accuracies of the proposed method are 70% and 82% respectively on the chief-executive-officer attribute and the number-of-employees attribute in company infoboxes. It significantly improves on the accuracy of the single pattern-based or entity-search-based method. The results also reveal the striking extent to which Wikipedia content is outdated. 0 0
A method for recommending the most appropriate expansion of acronyms using wikipedia Choi D.
Shin J.
Lee E.
Kim P.
Proceedings - 7th International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing, IMIS 2013 English 2013 Over the years, many researchers have studied how to detect expansions of acronyms in texts using linguistic and syntactic approaches in order to overcome disambiguation problems. An acronym is an abbreviation composed of the initial components of one or more words. These initial components cause considerable errors when a machine attempts to extract meaning from a given text. Detecting expansions of acronyms is no longer a major issue nowadays; the real problem is polysemous acronyms. In order to solve this problem, this paper proposes a method to recommend the most related expansion of an acronym by analyzing co-occurring words using Wikipedia. Our goal is not to find acronym definitions or expansions but to recommend the most appropriate expansion of a given acronym. 0 0
An automatic approach for ontology-based feature extraction from heterogeneous textual resources Vicient C.
Sanchez D.
Moreno A.
Engineering Applications of Artificial Intelligence English 2013 Data mining algorithms such as data classification or clustering methods exploit features of entities to characterise, group or classify them according to their resemblance. In the past, many feature extraction methods focused on the analysis of numerical or categorical properties. In recent years, motivated by the success of the Information Society and the WWW, which has made available enormous amounts of textual electronic resources, researchers have proposed semantic data classification and clustering methods that exploit textual data at a conceptual level. To do so, these methods rely on pre-annotated inputs in which text has been mapped to their formal semantics according to one or several knowledge structures (e.g. ontologies, taxonomies). Hence, they are hampered by the bottleneck introduced by the manual semantic mapping process. To tackle this problem, this paper presents a domain-independent, automatic and unsupervised method to detect relevant features from heterogeneous textual resources, associating them to concepts modelled in a background ontology. The method has been applied to raw text resources and also to semi-structured ones (Wikipedia articles). It has been tested in the Tourism domain, showing promising results. © 2012 Elsevier Ltd. All rights reserved. 0 0
Building, maintaining, and using knowledge bases: A report from the trenches Deshpande O.
Lamba D.S.
Tourn M.
Sanmay Das
Subramaniam S.
Rajaraman A.
Harinarayan V.
Doan A.
Proceedings of the ACM SIGMOD International Conference on Management of Data English 2013 A knowledge base (KB) contains a set of concepts, instances, and relationships. Over the past decade, numerous KBs have been built, and used to power a growing array of applications. Despite this flurry of activities, however, surprisingly little has been published about the end-to-end process of building, maintaining, and using such KBs in industry. In this paper we describe such a process. In particular, we describe how we build, update, and curate a large KB at Kosmix, a Bay Area startup, and later at WalmartLabs, a development and research lab of Walmart. We discuss how we use this KB to power a range of applications, including query understanding, Deep Web search, in-context advertising, event monitoring in social media, product search, social gifting, and social mining. Finally, we discuss how the KB team is organized, and the lessons learned. Our goal with this paper is to provide a real-world case study, and to contribute to the emerging direction of building, maintaining, and using knowledge bases for data management applications. 0 0
Designing a chat-bot that simulates an historical figure Haller E.
Rebedea T.
Proceedings - 19th International Conference on Control Systems and Computer Science, CSCS 2013 English 2013 There are many applications that incorporate a human appearance and intend to simulate human dialog, but in most cases the knowledge of the conversational bot is stored in a database created by human experts. However, very little research has investigated the idea of creating a chat-bot with an artificial character and personality starting from web pages or plain text about a certain person. This paper describes an approach to identifying the most important facts in texts describing the life (including the personality) of an historical figure for building a conversational agent that could be used in middle-school CSCL scenarios. 0 0
Distant supervision learning of DBPedia relations Zajac M.
Przepiorkowski A.
Lecture Notes in Computer Science English 2013 This paper presents DBPediaExtender, an information extraction system that aims at extending an existing ontology of geographical entities by extracting information from text. The system uses distant supervision learning - the training data is constructed on the basis of matches between values from infoboxes (taken from the Polish DBPedia) and Wikipedia articles. For every relevant relation, a sentence classifier and a value extractor are trained; the sentence classifier selects sentences expressing a given relation and the value extractor extracts values from selected sentences. The results of manual evaluation for several selected relations are reported. 0 0
Document analytics through entity resolution Santos J.
Martins B.
Batista D.S.
Lecture Notes in Computer Science English 2013 We present a prototype system for resolving named entities, mentioned in textual documents, into the corresponding Wikipedia entities. This prototype can aid in document analysis, by using the disambiguated references to provide useful information in context. 0 0
Evaluating entity linking with wikipedia Ben Hachey
Will Radford
Joel Nothman
Matthew Honnibal
Curran J.R.
Artificial Intelligence English 2013 Named Entity Linking (nel) grounds entity mentions to their corresponding node in a Knowledge Base (kb). Recently, a number of systems have been proposed for linking entity mentions in text to Wikipedia pages. Such systems typically search for candidate entities and then disambiguate them, returning either the best candidate or nil. However, comparison has focused on disambiguation accuracy, making it difficult to determine how search impacts performance. Furthermore, important approaches from the literature have not been systematically compared on standard data sets. We reimplement three seminal nel systems and present a detailed evaluation of search strategies. Our experiments find that coreference and acronym handling lead to substantial improvement, and search strategies account for much of the variation between systems. This is an interesting finding, because these aspects of the problem have often been neglected in the literature, which has focused largely on complex candidate ranking algorithms. © 2012 Elsevier B.V. All rights reserved. 0 0
Knowledge base population and visualization using an ontology based on semantic roles Siahbani M.
Vadlapudi R.
Whitney M.
Sarkar A.
AKBC 2013 - Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, Co-located with CIKM 2013 English 2013 This paper extracts facts using "micro-reading" of text in contrast to approaches that extract common-sense knowledge using "macro-reading" methods. Our goal is to extract detailed facts about events from natural language using a predicate-centered view of events (who did what to whom, when and how). We exploit semantic role labels in order to create a novel predicate-centric ontology for entities in our knowledge base. This allows users to find uncommon facts easily. To this end, we tightly couple our knowledge base and ontology to an information visualization system that can be used to explore and navigate events extracted from a large natural language text collection. We use our methodology to create a web-based visual browser of history events in Wikipedia. 0 0
Learning multilingual named entity recognition from Wikipedia Joel Nothman
Nicky Ringland
Will Radford
Tara Murphy
Curran J.R.
Artificial Intelligence English 2013 We automatically create enormous, free and multilingual silver-standard training annotations for named entity recognition (ner) by exploiting the text and structure of Wikipedia. Most ner systems rely on statistical models of annotated data to identify and classify names of people, locations and organisations in text. This dependence on expensive annotation is the knowledge bottleneck our work overcomes. We first classify each Wikipedia article into named entity (ne) types, training and evaluating on 7200 manually-labelled Wikipedia articles across nine languages. Our cross-lingual approach achieves up to 95% accuracy. We transform the links between articles into ne annotations by projecting the target articles' classifications onto the anchor text. This approach yields reasonable annotations, but does not immediately compete with existing gold-standard data. By inferring additional links and heuristically tweaking the Wikipedia corpora, we better align our automatic annotations to gold standards. We annotate millions of words in nine languages, evaluating English, German, Spanish, Dutch and Russian Wikipedia-trained models against conll shared task data and other gold-standard corpora. Our approach outperforms other approaches to automatic ne annotation (Richman and Schone, 2008 [61]; Mika et al., 2008 [46]), competes with gold-standard training when tested on an evaluation corpus from a different source, and performs 10% better than newswire-trained models on manually-annotated Wikipedia text. © 2012 Elsevier B.V. All rights reserved. 0 0
Social relation extraction based on Chinese Wikipedia articles Liu M.
Xiao Y.
Lei C.
Xiaofeng Zhou
Lecture Notes in Computer Science English 2013 Our work in this paper focuses on extracting information about social relations from Chinese Wikipedia articles and constructing a social relation network. After obtaining the Chinese Wikipedia articles for a provided person name, locating the relationship-description sentences in those articles, and extracting the social relation information with a sentence-level semantic parser, we can construct the social network centered on the provided person name using the extracted relation information. The relation set can also be iteratively expanded based on the person names associated with the provided person name in the related Chinese Wikipedia articles. 0 0
Ukrainian WordNet: Creation and filling Anisimov A.
Marchenko O.
Nikonenko A.
Porkhun E.
Taranukha V.
Lecture Notes in Computer Science English 2013 This paper deals with the process of developing a lexical semantic database for Ukrainian language - UkrWordNet. The architecture of the developed system is described in detail. The data storing structure and mechanisms of access to knowledge are reviewed along with the internal logic of the system and some key software modules. The article is also concerned with the research and development of automated techniques of UkrWordNet Semantic Network replenishment and extension. 0 0
Unsupervised gazette creation using information distance Patil S.
Pawar S.
Palshikar G.K.
Bhat S.
Srivastava R.
Lecture Notes in Computer Science English 2013 The Named Entity extraction (NEX) problem consists of automatically constructing a gazette containing instances of each NE of interest. NEX is important for domains which lack a corpus with tagged NEs. In this paper, we propose a new unsupervised (bootstrapping) NEX technique, based on a new variant of the Multiword Expression Distance (MED) [1] and information distance [2]. The efficacy of our method is shown through comparison with BASILISK and PMI in the agriculture domain. Our method discovered 8 new diseases which are not found in Wikipedia. 0 0
Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning Bing L.
Lam W.
Wong T.-L.
WSDM 2013 - Proceedings of the 6th ACM International Conference on Web Search and Data Mining English 2013 We develop a new framework to achieve the goal of Wikipedia entity expansion and attribute extraction from the Web. Our framework takes a few existing entities that are automatically collected from a particular Wikipedia category as seed input and explores their attribute infoboxes to obtain clues for the discovery of more entities for this category and the attribute content of the newly discovered entities. One characteristic of our framework is to conduct discovery and extraction from desirable semi-structured data record sets which are automatically collected from the Web. A semi-supervised learning model with Conditional Random Fields is developed to deal with the issues of extraction learning and limited number of labeled examples derived from the seed entities. We make use of a proximate record graph to guide the semi-supervised learning process. The graph captures alignment similarity among data records. Then the semi-supervised learning process can leverage the unlabeled data in the record set by controlling the label regularization under the guidance of the proximate record graph. Extensive experiments on different domains have been conducted to demonstrate its superiority for discovering new entities and extracting attribute content. 0 0
YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia Johannes Hoffart
Suchanek F.M.
Berberich K.
Gerhard Weikum
Artificial Intelligence English 2013 We present YAGO2, an extension of the YAGO knowledge base, in which entities, facts, and events are anchored in both time and space. YAGO2 is built automatically from Wikipedia, GeoNames, and WordNet. It contains 447 million facts about 9.8 million entities. Human evaluation confirmed an accuracy of 95% of the facts in YAGO2. In this paper, we present the extraction methodology, the integration of the spatio-temporal dimension, and our knowledge representation SPOTL, an extension of the original SPO-triple model to time and space. © 2012 Elsevier B.V. All rights reserved. 0 0
A hybrid QA system with focused IR and automatic summarization for INEX 2011 Bhaskar P.
Somnath Banerjee
Neogi S.
Bandyopadhyay S.
Lecture Notes in Computer Science English 2012 The article presents the experiments carried out as part of the participation in the QA track of INEX 2011. We submitted two runs. The INEX QA task has two main subtasks, Focused IR and Automatic Summarization. In the Focused IR system, we first preprocess the Wikipedia documents and then index them using Nutch. Stop words are removed from each query tweet and all the remaining tweet words are stemmed using the Porter stemmer. The stemmed tweet words form the query for retrieving the most relevant document using the index. The automatic summarization system takes as input the query tweet along with the tweet's text and the title from the most relevant text document. The most relevant sentences are retrieved from the associated document based on the TF-IDF of the matching query tweet, tweet text and title words. Each retrieved sentence is assigned a ranking score in the Automatic Summarization system. The answer passage includes the top-ranked retrieved sentences with a limit of 500 words. The two runs differ in the way in which the relevant sentences are retrieved from the associated document. Our first run obtained the highest score of 432.2 in the Relaxed metric of the Readability evaluation among all participants. 0 0
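The sentence-ranking step this abstract describes, scoring each document sentence by the TF-IDF weight of the query terms it contains and keeping the top-ranked ones, can be sketched as follows. This is a toy illustration (invented sentences, a naive tokenizer, sentences treated as mini-documents for IDF), not the authors' system.

```python
# Toy TF-IDF sentence ranker: each sentence is scored by the summed
# TF-IDF weight of the query terms it contains, then sentences are
# returned best-first. Data and tokenizer are illustrative only.
import math
from collections import Counter

def tokenize(text):
    return [w.strip(".,").lower() for w in text.split()]

def rank_sentences(query, sentences):
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    # document frequency / IDF over sentences treated as mini-documents
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / df[w]) for w in df}
    q = set(tokenize(query))
    scores = []
    for sent, toks in zip(sentences, docs):
        tf = Counter(toks)
        score = sum(tf[w] * idf.get(w, 0.0) for w in q)
        scores.append((score, sent))
    return [s for _, s in sorted(scores, key=lambda p: -p[0])]

sents = [
    "The treaty was signed in 1648.",
    "Negotiations over the treaty terms lasted years.",
    "Unrelated filler sentence about weather.",
]
print(rank_sentences("treaty signed", sents)[0])
```

A real system would compute IDF over a large corpus and cap the selected sentences at the task's 500-word limit, as the abstract notes.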
A knowledge-extraction approach to identify and present verbatim quotes in free text Paass G.
Bergholz A.
Pilz A.
ACM International Conference Proceeding Series English 2012 In news stories, verbatim quotes of persons play a very important role, as they carry reliable information about the opinion of that person concerning specific aspects. As thousands of new quotes are published every hour, it is very difficult to keep track of them. In this paper we describe a set of algorithms to solve the knowledge management problem of identifying, storing and accessing verbatim quotes. We handle the verbatim quote task as a relation extraction problem over unstructured text. Using a workflow of knowledge extraction algorithms, we provide the required features for the relation extraction algorithm. The central relation extraction procedure is trained using manually annotated documents. It turns out that structural grammatical information improves the F-value for verbatim quote detection to 84.1%, which is sufficient for many exploratory applications. We present the results in a smartphone app connected to a web server, which employs a number of algorithms such as linkage to Wikipedia, topic extraction and search engine indices to provide flexible access to the extracted verbatim quotes. 0 0
Are human-input seeds good enough for entity set expansion? Seeds rewriting by leveraging Wikipedia semantic knowledge Qi Z.
Kang Liu
Jun Zhao
Lecture Notes in Computer Science English 2012 Entity Set Expansion is an important task for open information extraction, which refers to expanding a given partial seed set to a more complete set that belongs to the same semantic class. Much previous research has shown that the quality of the seeds can strongly influence expansion performance, since human-input seeds may be ambiguous, sparse, etc. In this paper, we propose a novel method which can generate new, high-quality seeds and replace original, poor-quality ones. In our method, we leverage Wikipedia as a semantic knowledge source to measure the semantic relatedness and ambiguity of each seed. Moreover, to avoid seed sparseness, we use web resources to measure its population. Then new seeds are generated to replace the original, poor-quality seeds. Experimental results show that new seed sets generated by our method can improve entity expansion performance by an average of up to 9.1% over the original seed sets. 0 0
Chinese relation extraction using web features and HNC theory Wang J.
Cheng X.
Gu X.
Journal of Information and Computational Science English 2012 Chinese named-entity relation extraction is a key step in the task of Chinese information extraction, and feature-based methods are among the main approaches to it. In this new method, a Web co-occurrence feature and a bag-of-words (BoW) correlation feature are introduced, and word similarity is defined based on HNC theory. Experimental results showed that the F-score was improved by this method, and that both features are effective for Chinese relation extraction. 0 0
Choosing better seeds for entity set expansion by leveraging wikipedia semantic knowledge Qi Z.
Kang Liu
Jun Zhao
Communications in Computer and Information Science English 2012 Entity Set Expansion, which refers to expanding a human-input seed set to a more complete set belonging to the same semantic category, is an important task for open information extraction. Because human-input seeds may be ambiguous, sparse, etc., the quality of the seeds has a great influence on expansion performance, as much previous research has shown. To improve seed quality, this paper proposes a novel method which can choose better seeds from the original input ones. In our method, we leverage Wikipedia semantic knowledge to measure the semantic relatedness and ambiguity of each seed. Moreover, to avoid seed sparseness, we use a web corpus to measure its population. Lastly, we use a linear model to combine these factors to determine the final selection. Experimental results show that new seed sets chosen by our method can improve expansion performance by an average of up to 13.4% over randomly selected seed sets. 0 0
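The linear scoring model this abstract describes, combining each seed's semantic relatedness, ambiguity, and population into one selection score, can be sketched as below. The weights, feature values, and candidate names are all invented for illustration; the paper does not publish its coefficients.

```python
# Sketch of linear seed scoring: higher relatedness and population
# (web popularity) are rewarded, ambiguity is penalised. Weights and
# toy statistics are hypothetical, not from the paper.
def seed_score(relatedness, ambiguity, population,
               w_rel=0.5, w_amb=0.3, w_pop=0.2):
    return w_rel * relatedness - w_amb * ambiguity + w_pop * population

candidates = {
    "jaguar":  {"relatedness": 0.70, "ambiguity": 0.9, "population": 0.8},
    "leopard": {"relatedness": 0.80, "ambiguity": 0.2, "population": 0.6},
    "cheetah": {"relatedness": 0.85, "ambiguity": 0.1, "population": 0.5},
}

ranked = sorted(candidates, key=lambda c: -seed_score(**candidates[c]))
print(ranked)
```

Under these toy numbers the highly ambiguous seed "jaguar" (animal, car, OS) ranks last even though it is popular, which is exactly the behaviour the seed-replacement method aims for.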
Collaboratively constructed knowledge repositories as a resource for domain independent concept extraction Kerschbaumer J.
Reichhold M.
Winkler C.
Fliedl G.
Proceedings of the 10th Terminology and Knowledge Engineering Conference: New Frontiers in the Constructive Symbiosis of Terminology and Knowledge Engineering, TKE 2012 English 2012 To achieve domain-independent text management, a flexible and adaptive knowledge repository is indispensable and represents the key resource for solving many challenges in natural language processing. Especially for real-world applications, the needed resources are not available for technical disciplines such as engineering in the energy or automotive domains. We therefore propose in this paper a new approach to knowledge (concept) acquisition based on collaboratively constructed knowledge repositories such as Wikipedia and enterprise wikis. 0 0
Creating an extended named entity dictionary from wikipedia Ryuichiro Higashinaka
Tsu K.S.
Saito K.
Makino T.
Yutaka Matsuo
24th International Conference on Computational Linguistics - Proceedings of COLING 2012: Technical Papers English 2012 Automatic methods to create entity dictionaries or gazetteers have used only a small number of entity types (18 at maximum), which could pose a limitation for fine-grained information extraction. This paper aims to create a dictionary of 200 extended named entity (ENE) types. Using Wikipedia as a basic resource, we classify Wikipedia titles into ENE types to create an ENE dictionary. In our method, we derive a large number of features for Wikipedia titles and train a multiclass classifier by supervised learning. We devise an extensive list of features for the accurate classification into the ENE types, such as those related to the surface string of a title, the content of the article, and the meta data provided with Wikipedia. By experiments, we successfully show that it is possible to classify Wikipedia titles into ENE types with 79.63% accuracy. We applied our classifier to all Wikipedia titles and, by discarding low-confidence classification results, created an ENE dictionary of over one million entities covering 182 ENE types with an estimated accuracy of 89.48%. This is the first large scale ENE dictionary. 0 0
Exploiting web features in Chinese relation extraction Wang J.
Jilin Chen
Gu X.
CSAE 2012 - Proceedings, 2012 IEEE International Conference on Computer Science and Automation Engineering English 2012 Relation extraction is a form of information extraction, which finds predefined relations between pairs of entities in text. A Chinese relation extraction approach exploiting web features is proposed. Four web features are extracted from the web and the Wikipedia website. Experiments on the ACE 2005 Corpus show that the web features are effective, and high-quality websites generate more effective features. 0 0
Extraction of semantic relations between concepts with KNN algorithms on Wikipedia Panchenko A.
Adeykin S.
Romanov A.
Romanov P.
CEUR Workshop Proceedings English 2012 This paper presents methods for extraction of semantic relations between words. The methods rely on the k-nearest neighbor algorithms and two semantic similarity measures to extract relations from the abstracts of Wikipedia articles. We analyze the proposed methods and evaluate their performance. Precision of the extraction with the best method achieves 83%. We also present an open source system which effectively implements the described algorithms. 0 0
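The k-nearest-neighbour classification at the heart of this abstract can be sketched as follows: an unlabelled concept pair, represented as a feature vector, receives the majority relation label of its k most similar labelled pairs under cosine similarity. The vectors and labels below are invented; the paper builds its vectors from Wikipedia abstracts.

```python
# Toy kNN relation classifier with cosine similarity. Feature vectors
# and relation labels are illustrative, not derived from Wikipedia.
import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_relation(query_vec, labelled, k=3):
    """labelled: list of (feature_vector, relation_label) pairs."""
    ranked = sorted(labelled, key=lambda p: -cosine(query_vec, p[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

labelled_pairs = [
    ([0.9, 0.1, 0.0], "is-a"),
    ([0.8, 0.2, 0.1], "is-a"),
    ([0.1, 0.9, 0.2], "part-of"),
    ([0.0, 0.8, 0.3], "part-of"),
]
print(knn_relation([0.85, 0.15, 0.05], labelled_pairs))
```

The paper combines two different similarity measures rather than plain cosine; swapping the `cosine` function for another measure leaves the kNN skeleton unchanged.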
Extraction of temporal facts and events from Wikipedia Kuzey E.
Gerhard Weikum
ACM International Conference Proceeding Series English 2012 Recently, large-scale knowledge bases have been constructed by automatically extracting relational facts from text. Unfortunately, most of the current knowledge bases focus on static facts and ignore the temporal dimension. However, the vast majority of facts are evolving with time or are valid only during a particular time period. Thus, time is a significant dimension that should be included in knowledge bases. In this paper, we introduce a complete information extraction framework that harvests temporal facts and events from semi-structured data and free text of Wikipedia articles to create a temporal ontology. First, we extend a temporal data representation model by making it aware of events. Second, we develop an information extraction method which harvests temporal facts and events from Wikipedia infoboxes, categories, lists, and article titles in order to build a temporal knowledge base. Third, we show how the system can use its extracted knowledge for further growing the knowledge base. We demonstrate the effectiveness of our proposed methods through several experiments. We extracted more than one million temporal facts with precision over 90% for extraction from semi-structured data and almost 70% for extraction from text. 0 0
Measuring the quality of web content using factual information Lex E.
Voelske M.
Marcelo Errecalde
Edgardo Ferretti
Cagnina L.
Horn C.
Benno Stein
Michael Granitzer
ACM International Conference Proceeding Series English 2012 Nowadays, many decisions are based on information found in the Web. For the most part, the disseminating sources are not certified, and hence an assessment of the quality and credibility of Web content has become more important than ever. With factual density we present a simple statistical quality measure that is based on facts extracted from Web content using Open Information Extraction. In a first case study, we use this measure to identify featured/good articles in Wikipedia. We compare the factual density measure with word count, a measure that has successfully been applied to this task in the past. Our evaluation corroborates the good performance of word count in Wikipedia, since featured/good articles are often longer than non-featured ones. However, for articles of similar lengths the word count measure fails, while factual density can distinguish between them with an F-measure of 90.4%. We also investigate the use of relational features for categorizing Wikipedia articles into featured/good versus non-featured ones. If articles have similar lengths, we achieve an F-measure of 86.7%, and 84% otherwise. 0 0
Multilingual food and health ontology learning using semi-structured and structured web data sources Albukhitan S.
Helmy T.
Proceedings of the 2012 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops, WI-IAT 2012 English 2012 The availability of open Web-based information sources such as Wikipedia in semi-structured format has made it possible to build or extend multilingual domain ontologies. This work is part of a project in which we are developing a framework for semantic manipulation of health and nutrition information. In this paper, we present ongoing work aiming to build such an ontology automatically, utilizing Wikipedia and other multilingual online resources. The constructed ontology consists of monolingual ontologies and a language-agnostic ontology connecting them. It is built to capture the culture-relevant aspects of each language, based on the concepts available for that language. The ontology could then be used in many applications, such as cross-lingual information access and multilingual information extraction for the domains of Food and Health. An initial evaluation shows the effectiveness and correctness of the constructed ontology. 0 0
SIGA, a system to manage information retrieval evaluations Costa L.
Mota C.
Diana Santos
Lecture Notes in Computer Science English 2012 This paper provides an overview of the current version of SIGA, a system that supports the organization of information retrieval (IR) evaluations. SIGA was recently used in Págico, an evaluation contest where both automatic and human participants competed to find answers to 150 topics in the Portuguese Wikipedia, and we describe its new capabilities in this context as well as provide preliminary results from Págico. 0 0
The people's encyclopedia under the gaze of the sages: a systematic review of scholarly research on Wikipedia Chitu Okoli
Mohamad Mehdi
Mostafa Mesgari
Finn Årup Nielsen
Arto Lanamäki
English 2012 Wikipedia has become one of the ten most visited sites on the Web, and the world’s leading source of Web reference information. Its rapid success has inspired hundreds of scholars from various disciplines to study its content, communication and community dynamics from various perspectives. This article presents a systematic review of scholarly research on Wikipedia. We describe our detailed, rigorous methodology for identifying over 450 scholarly studies of Wikipedia. We present the WikiLit website (http://wikilit.referata.com), where most of the papers reviewed here are described in detail. In the major section of this article, we then categorize and summarize the studies. An appendix features an extensive list of resources useful for Wikipedia researchers. 15 1
Using information extraction to generate trigger questions for academic writing support Liu M.
Calvo R.A.
Lecture Notes in Computer Science English 2012 Automated question generation approaches have been proposed to support reading comprehension. However, these approaches are not suitable for supporting writing activities. We present a novel approach to generate different forms of trigger questions (directive and facilitative) aimed at supporting deep learning. Useful semantic information from Wikipedia articles is extracted and linked to the key phrases in a student's literature review, particularly focusing on extracting information containing 3 types of relations (Kind-of, Similar-to and Different-to) by using syntactic pattern matching rules. We collected literature reviews from 23 Engineering research students, and evaluated the quality of 306 computer-generated questions and 115 generic questions. Facilitative questions are more useful when it comes to deep learning about the topic, while directive questions are clearer and useful for improving the composition. 0 0
WiSeNet: Building a Wikipedia-based semantic network with ontologized relations Moro A.
Roberto Navigli
ACM International Conference Proceeding Series English 2012 In this paper we present an approach for building a Wikipedia-based semantic network by integrating Open Information Extraction with Knowledge Acquisition techniques. Our algorithm extracts relation instances from Wikipedia page bodies and ontologizes them by, first, creating sets of synonymous relational phrases, called relation synsets, second, assigning semantic classes to the arguments of these relation synsets and, third, disambiguating the initial relation instances with relation synsets. As a result we obtain WiSeNet, a Wikipedia-based Semantic Network with Wikipedia pages as concepts and labeled, ontologized relations between them. 0 0
WikiSent: Weakly supervised sentiment analysis through extractive summarization with Wikipedia Saswati Mukherjee
Prantik Bhattacharyya
Lecture Notes in Computer Science English 2012 This paper describes a weakly supervised system for sentiment analysis in the movie review domain. The objective is to classify a movie review into a polarity class, positive or negative, based only on those sentences bearing opinion on the movie itself, leaving out other irrelevant text. World knowledge of movie-specific features from Wikipedia is incorporated into the system and used to obtain an extractive summary of the review, consisting of the reviewer's opinions about specific aspects of the movie. This filters out the concepts which are irrelevant or objective with respect to the given movie. The proposed system, WikiSent, does not require any labeled data for training. It achieves accuracy better than or comparable to existing semi-supervised and unsupervised systems in the domain, on the same dataset. We also perform a general movie review trend analysis using WikiSent. 0 0
A methodology to discover semantic features from textual resources Vicient C.
Sanchez D.
Moreno A.
Proceedings - 2011 6th International Workshop on Semantic Media Adaptation and Personalization, SMAP 2011 English 2011 Data analysis algorithms focused on processing textual data rely on the extraction of relevant features from text and their appropriate association with formal semantics. In this paper, a method to assist this task, annotating extracted textual features with concepts from a background ontology, is presented. The method is automatic and unsupervised, and it has been designed in a generic way, so it can be applied to textual resources ranging from plain text to semi-structured resources (like Wikipedia articles). The system has been tested with tourist destinations and Wikipedia articles, showing promising results. 0 0
A statistical approach for automatic keyphrase extraction Abulaish M.
Dey L.
Proceedings of the 5th Indian International Conference on Artificial Intelligence, IICAI 2011 English 2011 Due to the availability of voluminous textual data, either on the World Wide Web or in textual databases, automatic keyphrase extraction has gained increasing popularity in the recent past as a means to summarize and characterize text documents. Consequently, a number of machine learning techniques, mostly supervised, have been proposed to mine keyphrases in an automatic way. However, the non-availability of annotated corpora for training such systems is the main hindrance to their success. In this paper, we propose the design of an automatic keyphrase extraction system which uses NLP and statistical approaches to mine keyphrases from unstructured text documents. The efficacy of the proposed system is established over texts crawled from the Wikipedia server. On evaluation, we found that the proposed method outperforms KEA, which uses a naïve Bayes classification technique for keyphrase extraction. 0 0
Creating and Exploiting a Hybrid Knowledge Base for Linked Data Zareen Syed
Tim Finin
Communications in Computer and Information Science English 2011 Twenty years ago Tim Berners-Lee proposed a distributed hypertext system based on standard Internet protocols. The Web that resulted fundamentally changed the ways we share information and services, both on the public Internet and within organizations. That original proposal contained the seeds of another effort that has not yet fully blossomed: a Semantic Web designed to enable computer programs to share and understand structured and semi-structured information easily. We will review the evolution of the idea and technologies to realize a Web of Data and describe how we are exploiting them to enhance information retrieval and information extraction. A key resource in our work is Wikitology, a hybrid knowledge base of structured and unstructured information extracted from Wikipedia. 0 0
Evaluating various linguistic features on semantic relation extraction Garcia M.
Gamallo P.
International Conference Recent Advances in Natural Language Processing, RANLP English 2011 Machine learning approaches for Information Extraction use different types of features to acquire semantically related terms from free text. These features may contain several kinds of linguistic knowledge: from orthographic or lexical features to more complex ones, like PoS tags or syntactic dependencies. In this paper we select four main types of linguistic features and evaluate their performance in a systematic way. Although combining some types of features allows us to improve the F-score of the extraction, we observed that by adjusting the positive and negative ratio of the training examples, we can build high-quality classifiers with just a single type of linguistic feature, based on generic lexico-syntactic patterns. Experiments were performed on the Portuguese version of Wikipedia. 0 0
Extracting information about security vulnerabilities from Web text Mulwad V.
Li W.
Joshi A.
Tim Finin
Viswanathan K.
Proceedings - 2011 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Workshops, WI-IAT 2011 English 2011 The Web is an important source of information about computer security threats, vulnerabilities and cyberattacks. We present initial work on developing a framework to detect and extract information about vulnerabilities and attacks from Web text. Our prototype system uses Wikitology, a general purpose knowledge base derived from Wikipedia, to extract concepts that describe specific vulnerabilities and attacks, map them to related concepts from DBpedia and generate machine understandable assertions. Such a framework will be useful in adding structure to already existing vulnerability descriptions as well as detecting new ones. We evaluate our approach against vulnerability descriptions from the National Vulnerability Database. Our results suggest that it can be useful in monitoring streams of text from social media or chat rooms to identify potential new attacks and vulnerabilities or to collect data on the spread and volume of existing ones. 0 0
Knowledge Base Population: Successful approaches and challenges Ji H.
Grishman R.
ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies English 2011 In this paper we give an overview of the Knowledge Base Population (KBP) track at the 2010 Text Analysis Conference. The main goal of KBP is to promote research in discovering facts about entities and augmenting a knowledge base (KB) with these facts. This is done through two tasks, Entity Linking - linking names in context to entities in the KB -and Slot Filling - adding information about an entity to the KB. A large source collection of newswire and web documents is provided from which systems are to discover information. Attributes ("slots") derived from Wikipedia infoboxes are used to create the reference KB. In this paper we provide an overview of the techniques which can serve as a basis for a good KBP system, lay out the remaining challenges by comparison with traditional Information Extraction (IE) and Question Answering (QA) tasks, and provide some suggestions to address these challenges. 0 0
Linked open data: For NLP or by NLP? Choi K.-S. PACLIC 25 - Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation English 2011 If we call Wikipedia or Wiktionary a "web knowledge resource", the question is whether such resources can contribute to NLP itself and, furthermore, to the knowledge resources for knowledge-leveraged computational thinking. Compared with the structure inside WordNet, with its human-encoded, precise classification scheme, such web knowledge resources have a category structure based on collectively generated tags and structures like infoboxes. They are also called "Collectively Generated Content", structuralized content based on collective intelligence. They rely heavily on linking among terms, and we can say they are members of linked data. The problem is whether such collectively generated knowledge resources can contribute to NLP, and how effective they can be. Cleaner primitives of linked terms in web knowledge resources will be assumed, based on the essential property of Guarino (2000) or the intrinsic property of Mizoguchi (2004). The number of entries in web knowledge resources grows very fast, but their inter-relationships are computed indirectly from their link structure. We can imagine that their entries could be mapped to instances under some structure of primitive concepts, like the synsets of WordNet. Let us name such primitives "intrinsic tokens": primitives derived from collectively generated knowledge resources under the principles of intrinsic properties. The procedure could be approximately proven, and it would be a kind of statistical logic. We then turn to the question of which areas of NLP can be addressed by these intrinsic tokens and their relations, the resulting approximately generated primitives. Can NLP contribute to the user generation process of content? Consider the structure of the infobox in Wikipedia more closely. We discuss how NLP can help populate relevant entries, for example through social network mechanisms for multilingual environments and for information extraction purposes. Traditional NLP starts from words in text, but work is now also under way on web corpora with hyperlinks and HTML markup. In web knowledge resources, words and chunks have underlying URIs, a kind of annotation. This signals a new paradigm of NLP. 0 0
Measuring comparability of multilingual corpora extracted from wikipedia Otero P.G.
Lopez I.G.
CEUR Workshop Proceedings English 2011 Comparable corpora can be used for many linguistic tasks such as bilingual lexicon extraction. By improving the quality of comparable corpora, we improve the quality of the extraction. This article describes some strategies to build comparable corpora from Wikipedia and proposes a measure of comparability. Experiments were performed on Portuguese, Spanish, and English Wikipedia. 0 0
Modelling provenance of DBpedia resources using Wikipedia contributions Fabrizio Orlandi
Alexandre Passant
Journal of Web Semantics English 2011 DBpedia is one of the largest datasets in the Linked Open Data cloud. Its centrality and cross-domain nature make it one of the most important and most referred-to knowledge bases on the Web of Data, generally used as a reference for data interlinking. Yet, in spite of its authoritative aspect, there is no work so far tackling the provenance aspect of DBpedia statements. Since DBpedia is extracted from Wikipedia, an open and collaborative encyclopedia, delivering provenance information about it would help to ensure the trustworthiness of its data, a major need for people using DBpedia data for building applications. To overcome this problem, we propose an approach for modelling and managing provenance on DBpedia using Wikipedia edits, and making this information available on the Web of Data. In this paper, we describe the framework that we implemented to do so, consisting of (1) a lightweight modelling solution to semantically represent provenance of both DBpedia resources and Wikipedia content, along with mappings to popular ontologies such as the W7 (what, when, where, how, who, which, and why) and OPM (open provenance model) models, (2) an information extraction process and a provenance-computation system combining Wikipedia articles' history with DBpedia information, and (3) a set of scripts to make provenance information about DBpedia statements directly available when browsing this source, as well as publicly exposing it in RDF so that software agents can consume it. © 2011 Elsevier B.V. 0 0
Ontology-based feature extraction Vicient C.
Sanchez D.
Moreno A.
Proceedings - 2011 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Workshops, WI-IAT 2011 English 2011 Knowledge-based data mining and classification algorithms require systems that are able to extract textual attributes contained in raw text documents, and map them to structured knowledge sources (e.g. ontologies) so that they can be semantically analyzed. The system presented in this paper performs these tasks in an automatic way, relying on a predefined ontology which states the concepts on which the posterior data analysis will be focused. As features, our system focuses on extracting relevant Named Entities from textual resources describing a particular entity. These are evaluated by means of linguistic and Web-based co-occurrence analyses to map them to ontological concepts, thereby discovering relevant features of the object. The system has been preliminarily tested with tourist destinations and Wikipedia textual resources, showing promising results. 0 0
Query relaxation for entity-relationship search Elbassuoni S.
Maya Ramanath
Gerhard Weikum
Lecture Notes in Computer Science English 2011 Entity-relationship-structured data is becoming more important on the Web. For example, large knowledge bases have been automatically constructed by information extraction from Wikipedia and other Web sources. Entities and relationships can be represented by subject-property-object triples in the RDF model, and can then be precisely searched by structured query languages like SPARQL. Because of their Boolean-match semantics, such queries often return too few or even no results. To improve recall, it is thus desirable to support users by automatically relaxing or reformulating queries in such a way that the intention of the original user query is preserved while returning a sufficient number of ranked results. In this paper we describe comprehensive methods to relax SPARQL-like triple-pattern queries in a fully automated manner. Our framework produces a set of relaxations by means of statistical language models for structured RDF data and queries. The query processing algorithms merge the results of different relaxations into a unified result list, with ranking based on any ranking function for structured queries over RDF-data. Our experimental evaluation, with two different datasets about movies and books, shows the effectiveness of the automatically generated relaxations and the improved quality of query results based on assessments collected on the Amazon Mechanical Turk platform. 0 0
Sequential supervised learning for hypernym discovery from Wikipedia Litz B.
Langer H.
Malaka R.
Communications in Computer and Information Science English 2011 Hypernym discovery is an essential task for building and extending ontologies automatically. In comparison to the whole Web as a source for information extraction, online encyclopedias provide far more structuredness and reliability. In this paper we propose a novel approach that combines syntactic and lexical-semantic information to identify hypernymic relationships. We compiled semi-automatically and manually created training data and a gold standard for evaluation from the first sentences of articles in the German version of Wikipedia. We trained a sequential supervised learner with a semantically enhanced tagset. The experiments showed that the cleanliness of the data is far more important than its quantity. Furthermore, it was shown that bootstrapping is a viable approach to ameliorate the results. Our approach outperformed the competitive lexico-syntactic patterns by 7%, leading to an F1-measure of over .91. 0 0
Temporal knowledge for timely intelligence Gerhard Weikum
Bedathur S.
Ralf Schenkel
Lecture Notes in Business Information Processing English 2011 Knowledge bases about entities and their relationships are a great asset for business intelligence. Major advances in information extraction and the proliferation of knowledge-sharing communities like Wikipedia have enabled ways for the largely automated construction of rich knowledge bases. Such knowledge about entity-oriented facts can greatly improve the output quality and possibly also efficiency of processing business-relevant documents and event logs. This holds for information within the enterprise as well as in Web communities such as blogs. However, no knowledge base will ever be fully complete and real-world knowledge is continuously changing: new facts supersede old facts, knowledge grows in various dimensions, and completely new classes, relation types, or knowledge structures will arise. This leads to a number of difficult research questions regarding temporal knowledge and the life-cycle of knowledge bases. This short paper outlines challenging issues and research opportunities, and provides references to technical literature. 0 0
Towards a top-down and bottom-up bidirectional approach to joint information extraction Yu X.
King I.
Lyu M.R.
International Conference on Information and Knowledge Management, Proceedings English 2011 Most high-level information extraction (IE) consists of compound and aggregated subtasks. Such IE problems are generally challenging and they have generated increasing interest recently. We investigate two representative IE tasks: (1) entity identification and relation extraction from Wikipedia, and (2) citation matching, and we formally define joint optimization of information extraction. We propose a joint paradigm integrating three factors - segmentation, relation, and segmentation-relation joint factors, to solve all relevant subtasks simultaneously. This modeling offers a natural formalism for exploiting bidirectional rich dependencies and interactions between relevant subtasks to capture mutual benefits. Since exact parameter estimation is prohibitively intractable, we present a general, highly-coupled learning algorithm based on variational expectation maximization (VEM) to perform parameter estimation approximately in a top-down and bottom-up manner, such that information can flow bidirectionally and mutual benefits from different subtasks can be well exploited. In this algorithm, both segmentation and relation are optimized iteratively and collaboratively using hypotheses from each other. We conducted extensive experiments using two real-world datasets to demonstrate the promise of our approach. 0 0
Acquiring semantic context for events from online resources Oliveirinha J.
Pereira F.
Alves A.
Proceedings of the 3rd International Workshop on Location and the Web, LocWeb 2010 English 2010 During the last few years, the amount of online descriptive information about places and their dynamics has reached a reasonable size for many cities in the world. Such enriched information can now support semantic analysis of space, particularly with respect to what exists there and what happens there. We present a methodology to automatically label places according to events that happen there. To achieve this we use Information Extraction techniques applied to online Web 2.0 resources such as Zvents and Boston Calendar. Wikipedia is also used as a resource to semantically enrich the tag vectors initially extracted. We describe the process by which these semantic vectors are obtained, present results of an experimental analysis, and validate these with Amazon Mechanical Turk and a set of algorithms. To conclude, we discuss the strengths and weaknesses of the methodology. Copyright 2010 ACM. 0 0
An efficient web-based wrapper and annotator for tabular data Amin M.S.
Jamil H.
International Journal of Software Engineering and Knowledge Engineering English 2010 In the last few years, several works in the literature have addressed the problem of data extraction from web pages. The importance of this problem derives from the fact that, once extracted, data can be handled in a way similar to instances of a traditional database, which in turn can facilitate web data integration and various other domain-specific applications. In this paper, we propose a novel table extraction technique that works on web pages generated dynamically from a back-end database. The proposed system can automatically discover table structure by relevant pattern mining from web pages in an efficient way, and can generate regular expressions for the extraction process. Moreover, the proposed system can assign intuitive column names to the columns of the extracted table by leveraging the Wikipedia knowledge base for the purpose of table annotation. To improve the accuracy of the assignment, we exploit the structural homogeneity of the column values and their co-location information to weed out less likely candidates. This approach requires no human intervention, and experimental results have shown its accuracy to be promising. Moreover, the wrapper generation algorithm works in linear time. 0 0
Creating and exploiting a Web of semantic data Tim Finin
Zareen Syed
ICAART 2010 - 2nd International Conference on Agents and Artificial Intelligence, Proceedings English 2010 Twenty years ago Tim Berners-Lee proposed a distributed hypertext system based on standard Internet protocols. The Web that resulted fundamentally changed the ways we share information and services, both on the public Internet and within organizations. That original proposal contained the seeds of another effort that has not yet fully blossomed: a Semantic Web designed to enable computer programs to share and understand structured and semi-structured information easily. We will review the evolution of the idea and technologies to realize a Web of Data and describe how we are exploiting them to enhance information retrieval and information extraction. A key resource in our work is Wikitology, a hybrid knowledge base of structured and unstructured information extracted from Wikipedia. 0 0
Extracting conceptual relations from Persian resources Fadaei H.
Shamsfard M.
ITNG2010 - 7th International Conference on Information Technology: New Generations English 2010 In this paper we present a relation extraction system which uses a combination of pattern-based, structure-based and statistical approaches. This system uses raw texts and Wikipedia articles to learn conceptual relations. Wikipedia structures are a rich source of information in relation extraction and are used extensively in this system. A set of patterns is extracted for the Persian language and used to learn both taxonomic and non-taxonomic relations. This system is one of the few relation extraction systems designed for the Persian language, and is the first among them which uses Wikipedia structures in the process of relation learning. 0 0
Extracting structured information from wikipedia articles to populate infoboxes Lange D.
Bohm C.
Naumann F.
International Conference on Information and Knowledge Management, Proceedings English 2010 Roughly every third Wikipedia article contains an infobox - a table that displays important facts about the subject in attribute-value form. The schema of an infobox, i.e., the attributes that can be expressed for a concept, is defined by an infobox template. Often, authors do not specify all template attributes, resulting in incomplete infoboxes. With iPopulator, we introduce a system that automatically populates infoboxes of Wikipedia articles by extracting attribute values from the article's text. In contrast to prior work, iPopulator detects and exploits the structure of attribute values to independently extract value parts. We have tested iPopulator on the entire set of infobox templates and provide a detailed analysis of its effectiveness. For instance, we achieve an average extraction precision of 91% for 1,727 distinct infobox template attributes. 0 0
From information to knowledge: Harvesting entities and relationships from web sources Gerhard Weikum
Martin Theobald
Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems English 2010 There are major trends to advance the functionality of search engines to a more expressive semantic level. This is enabled by the advent of knowledge-sharing communities such as Wikipedia and the progress in automatically extracting entities and relationships from semistructured as well as natural-language Web sources. Recent endeavors of this kind include DBpedia, EntityCube, KnowItAll, ReadTheWeb, and our own YAGO-NAGA project (and others). The goal is to automatically construct and maintain a comprehensive knowledge base of facts about named entities, their semantic classes, and their mutual relations as well as temporal contexts, with high precision and high recall. This tutorial discusses state-of-the-art methods, research opportunities, and open challenges along this avenue of knowledge harvesting. 0 0
Information extraction from Wikipedia using pattern learning Mihaltz M. Acta Cybernetica English 2010 In this paper we present solutions for the crucial task of extracting structured information from massive free-text resources, such as Wikipedia, for the sake of semantic databases serving upcoming Semantic Web technologies. We demonstrate both a verb frame-based approach using deep natural language processing techniques with extraction patterns developed by human knowledge experts and machine learning methods using shallow linguistic processing. We also propose a method for learning verb frame-based extraction patterns automatically from labeled data. We show that labeled training data can be produced with only minimal human effort by utilizing existing semantic resources and the special characteristics of Wikipedia. Custom solutions for named entity recognition are also possible in this scenario. We present evaluation and comparison of the different approaches for several different relations. 0 0
Proposal of Spatiotemporal Data Extraction and Visualization System Based on Wikipedia for Application to Earth Science Akihiro Okamoto
Shohei Yokoyama
Naoki Fukuta
Hiroshi Ishikawa
ICIS English 2010 0 0
Rich ontology extraction and wikipedia expansion using language resources Schonberg C.
Pree H.
Freitag B.
Lecture Notes in Computer Science English 2010 Existing social collaboration projects contain a host of conceptual knowledge, but are often only sparsely structured and hardly machine-accessible. Using the well known Wikipedia as a showcase, we propose new and improved techniques for extracting ontology data from the wiki category structure. Applications like information extraction, data classification, or consistency checking require ontologies of very high quality and with a high number of relationships. We improve upon existing approaches by finding a host of additional relevant relationships between ontology classes, leveraging multi-lingual relations between categories and semantic relations between terms. 0 0
Timely YAGO: Harvesting, querying, and visualizing temporal knowledge from Wikipedia Yafang Wang
Mingjie Zhu
Qu L.
Marc Spaniol
Gerhard Weikum
Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings English 2010 Recent progress in information extraction has shown how to automatically build large ontologies from high-quality sources like Wikipedia. But knowledge evolves over time; facts have associated validity intervals. Therefore, ontologies should include time as a first-class dimension. In this paper, we introduce Timely YAGO, which extends our previously built knowledge base YAGO with temporal aspects. This prototype system extracts temporal facts from Wikipedia infoboxes, categories, and lists in articles, and integrates these into the Timely YAGO knowledge base. We also support querying temporal facts via temporal predicates in a SPARQL-style language. Visualization of query results is provided in order to better understand the dynamic nature of knowledge. Copyright 2010 ACM. 0 0
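The core idea of facts with validity intervals can be shown with a minimal sketch; the fact representation, helper function, and example facts below are assumptions for illustration, not Timely YAGO's actual SPARQL-style interface:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TemporalFact:
    subject: str
    predicate: str
    obj: str
    begin: int  # first year of the validity interval
    end: int    # last year of the validity interval

def holds_at(facts, subject, predicate, year):
    """Return the objects for which (subject, predicate, obj) is valid in `year`."""
    return [f.obj for f in facts
            if f.subject == subject and f.predicate == predicate
            and f.begin <= year <= f.end]

facts = [
    TemporalFact("Angela_Merkel", "holdsOffice", "Chancellor", 2005, 2021),
    TemporalFact("Gerhard_Schroeder", "holdsOffice", "Chancellor", 1998, 2005),
]
print(holds_at(facts, "Angela_Merkel", "holdsOffice", 2010))  # ['Chancellor']
```

A temporal predicate in the paper's query language plays the role of the interval check in `holds_at`.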
Top-down and bottom-up: A combined approach to slot filling Zheng Chen
Tamang S.
Lee A.
Li X.
Passantino M.
Ji H.
Lecture Notes in Computer Science English 2010 The Slot Filling task requires a system to automatically distill information from a large document collection and return answers for a query entity with specified attributes ('slots'), and use them to expand the Wikipedia infoboxes. We describe two bottom-up Information Extraction style pipelines and a top-down Question Answering style pipeline to address this task. We propose several novel approaches to enhance these pipelines, including statistical answer re-ranking and Markov Logic Networks based cross-slot reasoning. We demonstrate that our system achieves state-of-the-art performance, with 3.1% higher precision and 2.6% higher recall compared with the best system in the KBP2009 evaluation. 0 0
Amplifying community content creation with mixed-initiative information extraction Raphael Hoffmann
Saleema Amershi
Kayur Patel
Fei Wu
James Fogarty
Weld D.S.
Conference on Human Factors in Computing Systems - Proceedings English 2009 Although existing work has explored both information extraction and community content creation, most research has focused on them in isolation. In contrast, we see the greatest leverage in the synergistic pairing of these methods as two interlocking feedback cycles. This paper explores the potential synergy promised if these cycles can be made to accelerate each other by exploiting the same edits to advance both community content creation and learning-based information extraction. We examine our proposed synergy in the context of Wikipedia infoboxes and the Kylin information extraction system. After developing and refining a set of interfaces to present the verification of Kylin extractions as a non-primary task in the context of Wikipedia articles, we develop an innovative use of Web search advertising services to study people engaged in some other primary task. We demonstrate our proposed synergy by analyzing our deployment from two complementary perspectives: (1) we show we accelerate community content creation by using Kylin's information extraction to significantly increase the likelihood that a person visiting a Wikipedia article as a part of some other primary task will spontaneously choose to help improve the article's infobox, and (2) we show we accelerate information extraction by using contributions collected from people interacting with our designs to significantly improve Kylin's extraction performance. Copyright 2009 ACM. 0 0
Building a semantic virtual museum: From wiki to semantic wiki using named entity recognition Alain Plantec
Vincent Ribaud
Vasudeva Varma
Proceedings of the Conference on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA English 2009 In this paper, we describe an approach for creating semantic wiki pages from regular wiki pages, in the domain of scientific museums, using information extraction methods in general and named entity recognition in particular. We make use of a domain-specific ontology called CIDOC-CRM as a base structure for representing and processing knowledge. We describe the major components of the proposed approach and a three-step process involving named entity recognition, identifying domain classes using the ontology, and establishing the properties for the entities in order to generate semantic wiki pages. Our initial evaluation of the prototype shows promising results in terms of enhanced efficiency and time and cost benefits. 0 0
Exploiting Wikipedia as a knowledge base: Towards an ontology of movies Alarcon R.
Sanchez O.
Mijangos V.
CEUR Workshop Proceedings English 2009 Wikipedia is a huge knowledge base growing every day due to the contribution of people all around the world. Part of the information in each article is kept in a special, consistently formatted table called an infobox. In this article, we analyze the Wikipedia infoboxes of movie articles and describe some of the problems that can make extracting information from these tables a difficult task. We also present a methodology to automatically extract information that could be useful for building an ontology of movies from the Spanish Wikipedia. 0 0
Harvesting, searching, and ranking knowledge on the web Gerhard Weikum Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, WSDM'09 English 2009 There are major trends to advance the functionality of search engines to a more expressive semantic level (e.g., [2, 4, 6, 7, 8, 9, 13, 14, 18]). This is enabled by employing large-scale information extraction [1, 11, 20] of entities and relationships from semistructured as well as natural-language Web sources. In addition, harnessing Semantic-Web-style ontologies [22] and reaching into Deep-Web sources [16] can contribute towards a grand vision of turning the Web into a comprehensive knowledge base that can be efficiently searched with high precision. This talk presents ongoing research towards this objective, with emphasis on our work on the YAGO knowledge base [23, 24] and the NAGA search engine [14] but also covering related projects. YAGO is a large collection of entities and relational facts that are harvested from Wikipedia and WordNet with high accuracy and reconciled into a consistent RDF-style "semantic" graph. For further growing YAGO from Web sources while retaining its high quality, pattern-based extraction is combined with logic-based consistency checking in a unified framework [25]. NAGA provides graph-template-based search over this data, with powerful ranking capabilities based on a statistical language model for graphs. Advanced queries and the need for ranking approximate matches pose efficiency and scalability challenges that are addressed by algorithmic and indexing techniques [15, 17]. YAGO is publicly available and has been imported into various other knowledge-management projects including DBpedia. YAGO shares many of its goals and methodologies with parallel projects along related lines. These include Avatar [19], Cimple/DBlife [10, 21], DBpedia [3], KnowItAll/TextRunner [12, 5], Kylin/KOG [26, 27], and the Libra technology [18, 28] (and more).
Together they form an exciting trend towards providing comprehensive knowledge bases with semantic search capabilities. Copyright 2009 ACM. 0 0
Inducing gazetteer for Chinese named entity recognition based on local high-frequent strings Pang W.
Fan X.
2009 2nd International Conference on Future Information Technology and Management Engineering, FITME 2009 English 2009 Gazetteers, or entity dictionaries, are important for named entity recognition (NER). Although the dictionaries automatically extracted by previous methods from corpora, the Web, or Wikipedia are very large, they still miss some entities, especially domain-specific entities. We present a novel method of automatic entity dictionary induction, which is able to construct a dictionary more specific to the text being processed at a much lower computational cost than previous methods. It extracts the local high-frequent strings in a document as candidate entities, and filters out the invalid candidates using accessor variety (AV) as the entity criterion. The experiments show that the obtained dictionary can effectively improve the performance of a high-precision NER baseline. 0 0
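The accessor variety criterion mentioned above can be sketched in a few lines; the function name and toy corpus are illustrative assumptions, not the authors' implementation:

```python
def accessor_variety(corpus, candidate):
    """Accessor variety of `candidate`: the minimum of the number of distinct
    characters seen immediately to its left and immediately to its right
    across all occurrences in the corpus. Strings that can be preceded and
    followed by many different characters are more likely to be valid units."""
    left, right = set(), set()
    for text in corpus:
        start = text.find(candidate)
        while start != -1:
            if start > 0:
                left.add(text[start - 1])
            end = start + len(candidate)
            if end < len(text):
                right.add(text[end])
            start = text.find(candidate, start + 1)
    return min(len(left), len(right))

# "X" occurs in three contexts with distinct neighbors on both sides.
print(accessor_variety(["abXc", "dXe", "fXg"], "X"))  # 3
```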
Information extraction in semantic wikis Smrz P.
Schmidt M.
CEUR Workshop Proceedings English 2009 This paper deals with information extraction technologies supporting semantic annotation and logical organization of textual content in semantic wikis. We describe our work in the context of the KiWi project which aims at developing a new knowledge management system motivated by the wiki way of collaborative content creation that is enhanced by the semantic web technology. The specific characteristics of semantic wikis as advanced community knowledge-sharing platforms are discussed from the perspective of the functionality providing automatic suggestions of semantic tags. We focus on the innovative aspects of the implemented methods. The interfaces of the user-interaction tools as well as the back-end web services are also tackled. We conclude that though there are many challenges related to the integration of information extraction into semantic wikis, this fusion brings valuable results. 0 0
Mining meaning from Wikipedia Olena Medelyan
David N. Milne
Catherine Legg
Ian H. Witten
International Journal of Human Computer Studies
English 2009 Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks. This article provides a comprehensive description of this work. It focuses on research that extracts and makes use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval; using it for information extraction; and using it as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced. 2009 Elsevier Ltd. All rights reserved. 0 4
Named entity network based on wikipedia Maskey S.
Dakka W.
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH English 2009 Named Entities (NEs) play an important role in many natural language and speech processing tasks. A resource that identifies relations between NEs could potentially be very useful. We present one such automatically generated knowledge resource built from Wikipedia, the Named Entity Network (NE-NET), which provides a list of related Named Entities (NEs) and the degree of relation for any given NE. Unlike some manually built knowledge resources, NE-NET has a wide coverage, consisting of 1.5 million NEs represented as nodes of a graph with 6.5 million arcs relating them. NE-NET also provides the ranks of the related NEs using a simple ranking function that we propose. In this paper, we present NE-NET and our experiments showing how NE-NET can be used to improve the retrieval of spoken (Broadcast News) and text documents. 0 0
Perspectives on semantics of the place from online resources Pereira F.
Alves A.
Oliveirinha J.
Biderman A.
ICSC 2009 - 2009 IEEE International Conference on Semantic Computing English 2009 We present a methodology for extraction of semantic indexes related to a given geo-referenced place. These lists of words correspond to the concepts that should be semantically related to that place, according to a number of perspectives. Each perspective is provided by a different online resource, namely, Flickr, Wikipedia or open web search (using Yahoo! search engine). We describe the process by which those lists are obtained, present experimental results and discuss the strengths and weaknesses of the methodology and of each perspective. 0 0
An Empirical Research on Extracting Relations from Wikipedia Text Jin-Xia Huang
Pum-Mo Ryu
Key-Sun Choi
IDEAL English 2008 A feature-based relation classification approach is presented, in which probabilistic and semantic relatedness features between patterns and relation types are employed along with other linguistic information. The importance of each feature set is evaluated with a Chi-square estimator, and the experiments show that the relatedness features have a big impact on relation classification performance. A series of experiments is also performed to evaluate different machine learning approaches to relation classification, among which the Bayesian approach outperformed the others, including Support Vector Machine (SVM). 0 0
An empirical research on extracting relations from Wikipedia text Huang J.-X.
Ryu P.-M.
Choi K.-S.
Lecture Notes in Computer Science English 2008 A feature-based relation classification approach is presented, in which probabilistic and semantic relatedness features between patterns and relation types are employed along with other linguistic information. The importance of each feature set is evaluated with a Chi-square estimator, and the experiments show that the relatedness features have a big impact on relation classification performance. A series of experiments is also performed to evaluate different machine learning approaches to relation classification, among which the Bayesian approach outperformed the others, including Support Vector Machine (SVM). 0 0
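A Chi-square estimator of the kind used for feature evaluation can be computed from a 2x2 contingency table of feature occurrence versus relation label; a minimal sketch (variable names and counts are hypothetical):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 contingency table:
    n11 = feature present & relation holds, n10 = feature present & relation absent,
    n01 = feature absent & relation holds,  n00 = feature absent & relation absent.
    Higher values indicate stronger dependence between feature and relation."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den if den else 0.0

# An independent feature scores 0; a perfectly predictive one scores n.
print(chi_square(10, 10, 10, 10))  # 0.0
print(chi_square(20, 0, 0, 20))    # 40.0
```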
Augmenting wikipedia-extraction with results from the web Fei Wu
Raphael Hoffmann
Weld D.S.
AAAI Workshop - Technical Report English 2008 Not only is Wikipedia a comprehensive source of quality information, it has several kinds of internal structure (e.g., relational summaries known as infoboxes), which enable self-supervised information extraction. While previous efforts at extraction from Wikipedia achieve high precision and recall on well-populated classes of articles, they fail in a larger number of cases, largely because incomplete articles and infrequent use of infoboxes lead to insufficient training data. This paper explains and evaluates a method for improving recall by extracting from the broader Web. There are two key advances necessary to make Web supplementation effective: 1) a method to filter promising sentences from Web pages, and 2) a novel retraining technique to broaden extractor recall. Experiments show that, used in concert with shrinkage, our techniques increase recall by a factor of up to 8 while maintaining or increasing precision. 0 0
Employing a domain specific ontology to perform semantic search Morneau M.
Mineau G.W.
Lecture Notes in Computer Science English 2008 Increasing the relevancy of Web search results has been a major concern in research over the last few years. Boolean search, metadata, natural language processing and various other techniques have been applied to improve the quality of search results sent to a user. Ontology-based methods were proposed to refine the information extraction process, but they have not yet achieved wide adoption by search engines. This is mainly due to the fact that the ontology building process is time consuming. An all-inclusive ontology for the entire World Wide Web might be difficult if not impossible to construct, but a specific domain ontology can be automatically built using statistical and machine learning techniques, as done with our tool, SeseiOnto. In this paper, we describe how we adapted the SeseiOnto software to perform Web search on the Wikipedia page on climate change. SeseiOnto, by using conceptual graphs to represent natural language and an ontology to extract links between concepts, manages to properly answer natural language queries about climate change. Our tests show that SeseiOnto has the potential to be used in domain-specific Web search as well as in corporate intranets. 0 0
Gazetiki: Automatic creation of a geographical gazetteer Adrian Popescu
Gregory Grefenstette
Moellic P.-A.
Proceedings of the ACM International Conference on Digital Libraries English 2008 Geolocalized databases are becoming necessary in a wide variety of application domains. Thus far, the creation of such databases has been a costly, manual process. This drawback has stimulated interest in automating their construction, for example, by mining geographical information from the Web. Here we present and evaluate a new automated technique for creating and enriching a geographical gazetteer, called Gazetiki. Our technique merges disparate information from Wikipedia, Panoramio, and web search engines in order to identify geographical names, categorize these names, find their geographical coordinates and rank them. The information produced in Gazetiki enhances and complements the Geonames database, using a similar domain model. We show that our method provides a richer structure and an improved coverage compared to another known attempt at automatically building a geographic database and, where possible, we compare our Gazetiki to Geonames. Copyright 2008 ACM. 0 0
Information extraction from Wikipedia: Moving down the long tail Fei Wu
Raphael Hoffmann
Weld D.S.
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining English 2008 Not only is Wikipedia a comprehensive source of quality information, it has several kinds of internal structure (e.g., relational summaries known as infoboxes), which enable self-supervised information extraction. While previous efforts at extraction from Wikipedia achieve high precision and recall on well-populated classes of articles, they fail in a larger number of cases, largely because incomplete articles and infrequent use of infoboxes lead to insufficient training data. This paper presents three novel techniques for increasing recall from Wikipedia's long tail of sparse classes: (1) shrinkage over an automatically-learned subsumption taxonomy, (2) a retraining technique for improving the training data, and (3) supplementing results by extracting from the broader Web. Our experiments compare design variations and show that, used in concert, these techniques increase recall by a factor of 1.76 to 8.71 while maintaining or increasing precision. 0 0
KiWi - Knowledge in a wiki Sebastian Schaffert
Eder J.
Samwald M.
Blumauer A.
CEUR Workshop Proceedings English 2008 The objective of the project KiWi is to develop an advanced knowledge management system (the "KiWi system") based on a semantic wiki. This poster describes the KiWi project, its technical approach, goals and the two use-cases which will be covered by the KiWi-System. 0 0
YAGO: A Large Ontology from Wikipedia and WordNet F. Suchanek
G. Kasneci
G. Weikum
Web Semantics: Science, Services and Agents on the World Wide Web English 2008 This article presents YAGO, a large ontology with high coverage and precision. YAGO has been automatically derived from Wikipedia and WordNet. It comprises entities and relations, and currently contains more than 1.7 million entities and 15 million facts. These include the taxonomic Is-A hierarchy as well as semantic relations between entities. The facts for YAGO have been extracted from the category system and the infoboxes of Wikipedia and have been combined with taxonomic relations from WordNet. Type checking techniques help us keep YAGO’s precision at 95%—as proven by an extensive evaluation study. YAGO is based on a clean logical model with a decidable consistency. Furthermore, it allows representing n-ary relations in a natural way while maintaining compatibility with RDFS. A powerful query model facilitates access to YAGO’s data. 0 1
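The type checking that keeps YAGO's precision high (rejecting candidate facts whose arguments violate a relation's domain and range) can be sketched minimally; the relations, types, and entities below are illustrative assumptions, not YAGO's actual schema:

```python
# Hypothetical domain/range constraints: predicate -> (subject type, object type).
DOMAIN_RANGE = {
    "bornIn": ("person", "location"),
    "locatedIn": ("location", "location"),
}

# Hypothetical entity-type assignments (in YAGO these come from the
# Wikipedia category system reconciled with WordNet).
TYPES = {
    "Albert_Einstein": "person",
    "Ulm": "location",
    "Germany": "location",
}

def type_checks(subject, predicate, obj):
    """Accept a candidate fact only if both arguments match the
    predicate's declared domain and range."""
    dom, rng = DOMAIN_RANGE[predicate]
    return TYPES.get(subject) == dom and TYPES.get(obj) == rng

assert type_checks("Albert_Einstein", "bornIn", "Ulm")
assert not type_checks("Ulm", "bornIn", "Germany")  # a location cannot be born
```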
Automatising the Learning of Lexical Patterns: an Application to the Enrichment of WordNet by Extracting Semantic Relationships from Wikipedia Maria Ruiz-Casado
Enrique Alfonseca and Pablo Castells
Data & Knowledge Engineering, Issue 3 (June 2007) 2007 This paper describes Koru, a new search interface that offers effective domain-independent knowledge-based information retrieval. Koru exhibits an understanding of the topics of both queries and documents. This allows it to (a) expand queries automatically and (b) help guide the user as they evolve their queries interactively. Its understanding is mined from the vast investment of manual effort and judgment that is Wikipedia. We show how this open, constantly evolving encyclopedia can yield inexpensive knowledge structures that are specifically tailored to expose the topics, terminology and semantics of individual document collections. We conducted a detailed user study with 12 participants and 10 topics from the 2005 TREC HARD track, and found that Koru and its underlying knowledge base offer significant advantages over traditional keyword search. It was capable of lending assistance to almost every query issued to it, making query entry more efficient, improving the relevance of the documents returned, and narrowing the gap between expert and novice seekers. 0 0
Automatising the learning of lexical patterns: An application to the enrichment of WordNet by extracting semantic relationships from Wikipedia Maria Ruiz-Casado
Enrique Alfonseca
Pablo Castells
Data Knowl. Eng. English 2007 0 0
Extracting Named Entities and Relating Them over Time Based on Wikipedia A Bhole
B Fortuna
M Grobelnik
D Mladenic
Informatica, 2007 2007 This paper presents an approach to mining information relating people, places, organizations and events extracted from Wikipedia and linking them on a time scale. The approach consists of two phases: (1) identifying relevant articles and categorizing them as describing people, places or organizations; (2) generating a timeline by linking named entities and extracting events and their time frames. We illustrate the proposed approach on 1.7 million Wikipedia articles. 0 0