From WikiPapers

WordNet is included as a keyword or extra keyword in 0 datasets, 0 tools and 134 publications.


There are no datasets for this keyword.


There are no tools for this keyword.


Title Author(s) Published in Language Date Abstract R C
Automatic extraction of property norm-like data from large text corpora Kelly C.
Devereux B.
Korhonen A.
Cognitive Science English 2014 Traditional methods for deriving property-based representations of concepts from text have focused on either extracting only a subset of possible relation types, such as hyponymy/hypernymy (e.g., car is-a vehicle) or meronymy/metonymy (e.g., car has wheels), or unspecified relations (e.g., car-petrol). We propose a system for the challenging task of automatic, large-scale acquisition of unconstrained, human-like property norms from large text corpora, and discuss the theoretical implications of such a system. We employ syntactic, semantic, and encyclopedic information to guide our extraction, yielding concept-relation-feature triples (e.g., car be fast, car require petrol, car cause pollution), which approximate property-based conceptual representations. Our novel method extracts candidate triples from parsed corpora (Wikipedia and the British National Corpus) using syntactically and grammatically motivated rules, then reweights triples with a linear combination of their frequency and four statistical metrics. We assess our system output in three ways: lexical comparison with norms derived from human-generated property norm data, direct evaluation by four human judges, and a semantic distance comparison with both WordNet similarity data and human-judged concept similarity ratings. Our system offers a viable and performant method of plausible triple extraction: Our lexical comparison shows comparable performance to the current state-of-the-art, while subsequent evaluations exhibit the human-like character of our generated properties. 0 0
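The reweighting step this abstract describes (a linear combination of frequency and statistical metrics) can be illustrated with one such metric, pointwise mutual information, over toy triple counts. The triples and counts below are invented for illustration; the paper combines several metrics, not PMI alone:

```python
import math
from collections import Counter

# Invented candidate (concept, relation, feature) triples with extraction counts.
triples = Counter({
    ("car", "require", "petrol"): 8,
    ("car", "be", "fast"): 5,
    ("bike", "be", "fast"): 4,
    ("bike", "require", "pedals"): 6,
})
concept_counts = Counter()
feature_counts = Counter()
for (c, _r, f), n in triples.items():
    concept_counts[c] += n
    feature_counts[f] += n
total = sum(triples.values())

def pmi(concept, feature, n_cf):
    """Pointwise mutual information log2(p(c,f) / (p(c) * p(f))):
    positive when the feature occurs with the concept more often
    than chance, negative when it is shared across concepts."""
    p_cf = n_cf / total
    p_c = concept_counts[concept] / total
    p_f = feature_counts[feature] / total
    return math.log2(p_cf / (p_c * p_f))

strong = pmi("car", "petrol", 8)  # feature specific to "car"
weak = pmi("car", "fast", 5)      # feature shared with "bike"
```

A reweighting scheme of the kind described would rank the concept-specific triple above the generic one.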
Development of a semantic and syntactic model of natural language by means of non-negative matrix and tensor factorization Anisimov A.
Marchenko O.
Taranukha V.
Vozniuk T.
Lecture Notes in Computer Science English 2014 A method for developing a structural model of natural language syntax and semantics is proposed. Syntactic and semantic relations between parts of a sentence are presented in the form of a recursive structure called a control space. Numerical characteristics of these data are stored in multidimensional arrays. After factorization, the arrays serve as the basis for the development of procedures for analyses of natural language semantics and syntax. 0 0
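The factorization of the numerical arrays mentioned above can be sketched with classic non-negative matrix factorization via Lee-Seung multiplicative updates. This is a generic NMF sketch on an invented word-context co-occurrence matrix, not the paper's tensor-factorization pipeline:

```python
import numpy as np

def nmf(V, rank, iters=200, eps=1e-9):
    """Factor a non-negative matrix V ~ W @ H using multiplicative updates."""
    rng = np.random.default_rng(0)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update right factor
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update left factor
    return W, H

# Toy word-context co-occurrence counts (4 words x 3 contexts).
V = np.array([[3., 1., 0.],
              [2., 1., 0.],
              [0., 0., 4.],
              [0., 1., 3.]])
W, H = nmf(V, rank=2)
err = np.linalg.norm(V - W @ H)  # reconstruction error of the rank-2 model
```

The rows of W then act as low-dimensional, non-negative descriptions of the words, which is the kind of representation a semantic model can build analyses on.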
Ontology construction using multiple concept lattices Wang W.C.
Lu J.
Advanced Materials Research English 2014 The paper proposes an ontology construction approach that combines Fuzzy Formal Concept Analysis, Wikipedia and WordNet in a process that constructs multiple concept lattices for sub-domains, which are divided from the target domain. The multiple concept lattices approach can mine concepts, determine relations between concepts automatically, and construct the domain ontology accordingly. This approach is suitable for large or complex domains that contain obvious sub-domains. 0 0
Semi-automatic construction of plane geometry ontology based-on WordNet and Wikipedia Fu H.-G.
LeBo Liu
Zhong X.-Q.
Jiang Y.
Sun Y.-Y.
Dianzi Keji Daxue Xuebao/Journal of the University of Electronic Science and Technology of China Chinese 2014 Ontology occupies the central position in the Semantic Web's hierarchical structure. In current research on ontology construction, manual construction makes it difficult to ensure efficiency and scalability, while automatic construction makes it hard to guarantee interoperability. This paper presents a semi-automatic domain ontology construction method based on WordNet and Wikipedia. First, we construct the top-level ontology and then reuse the WordNet structure to expand the terminology and terminology levels at the depth of the ontology. Furthermore, we expand the relationships and supplement the terminology at the width of the ontology by referring to page information from Wikipedia. Finally, this ontology construction method is applied to the elementary geometry domain. The experiments show that this method can greatly improve the efficiency of ontology construction and ensure the quality of the ontology to some degree. 0 0
Shrinking digital gap through automatic generation of WordNet for Indian languages Jain A.
Tayal D.K.
Rai S.
AI & SOCIETY English 2014 Hindi ranks fourth in the world by number of speakers. In spite of that, it has less than 0.1% presence on the web due to a lack of competent lexical resources, a key reason behind the digital gap caused by the language barrier among Indian masses. In the footsteps of the renowned lexical resource English WordNet, 18 Indian languages initiated building WordNets under the project Indo WordNet. India is a multilingual country with around 122 languages and 234 mother tongues. Many Indian languages still do not have any reliable lexical resource, and the coverage of the numerous WordNets in progress is still far from the average value of 25,792. The tedious manual process and high cost are major reasons behind the unsatisfactory coverage and slow progress. In this paper, we discuss the socio-cultural and economic impact of providing Internet accessibility and present an approach for the automatic generation of WordNets to tackle the lack of competent lexical resources. Problems such as accuracy, association of language-specific glosses/examples and incorrect back-translations, which arise when deviating from the traditional approach of compilation by lexicographers, are resolved by utilising the Wikipedia editions available for Indian languages. © 2014 Springer-Verlag London. 0 0
Topic ontology-based efficient tag recommendation approach for blogs Subramaniyaswamy V.
Pandian S.C.
International Journal of Computational Science and Engineering English 2014 Efficient tag recommendation systems are required to help users in the task of searching, indexing and browsing appropriate blog content. Tag generation has become popular for annotating web content, other blogs, photos, videos and music. Tag recommendation is the act of suggesting valuable and informative tags for a new item based on its content. We propose a novel approach based on topic ontology for tag recommendation. The proposed approach intelligently generates tag suggestions for blogs. In this paper, we construct the topic ontology based on Wikipedia categories and WordNet semantic relationships to make the ontology more meaningful and reliable. A spreading activation algorithm is applied to assign interest scores to existing blog content and tags. High-quality tags are suggested based on the significance of the interest score. Evaluation proves that the applicability of topic ontology with the spreading activation algorithm makes tag recommendation more effective when compared to collaborative tag recommendations. Our proposed approach offers several solutions to tag spamming, sentiment analysis and popularity. Finally, we report the results of an experiment which improves the performance of the tag recommendation approach. 0 0
Validating and extending semantic knowledge bases using video games with a purpose Vannella D.
Jurgens D.
Scarfini D.
Toscani D.
Roberto Navigli
52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference English 2014 Large-scale knowledge bases are important assets in NLP. Frequently, such resources are constructed through automatic mergers of complementary resources, such as WordNet and Wikipedia. However, manually validating these resources is prohibitively expensive, even when using methods such as crowdsourcing. We propose a cost-effective method of validating and extending knowledge bases using video games with a purpose. Two video games were created to validate concept-concept and concept-image relations. In experiments comparing with crowdsourcing, we show that video game-based validation consistently leads to higher-quality annotations, even when players are not compensated. 0 0
A cloud of FAQ: A highly-precise FAQ retrieval system for the Web 2.0 Romero M.
Moreo A.
Castro J.L.
Knowledge-Based Systems English 2013 FAQ (Frequently Asked Questions) lists have attracted increasing attention from companies and organizations. There is thus a need for high-precision and fast methods able to manage large FAQ collections. In this context, we present a FAQ retrieval system as part of a FAQ exploiting project. Following the growing trend towards Web 2.0, we aim to provide users with mechanisms to navigate through the domain of knowledge and to facilitate both learning and searching, beyond classic FAQ retrieval algorithms. To this purpose, our system involves two different modules: an efficient and precise FAQ retrieval module and a tag cloud generation module designed to help users complete their comprehension of the retrieved information. Empirical results evidence the validity of our approach with respect to a number of state-of-the-art algorithms in terms of the most popular metrics in the field. © 2013 Elsevier B.V. All rights reserved. 0 0
A quick tour of BabelNet 1.1 Roberto Navigli Lecture Notes in Computer Science English 2013 In this paper we present BabelNet 1.1, a brand-new release of the largest "encyclopedic dictionary", obtained from the automatic integration of the most popular computational lexicon of English, i.e. WordNet, and the largest multilingual Web encyclopedia, i.e. Wikipedia. BabelNet 1.1 covers 6 languages and comes with a renewed Web interface, graph explorer and programmatic API. BabelNet is available online at http://www.babelnet.org. 0 0
Automatic topic ontology construction using semantic relations from wordnet and wikipedia Subramaniyaswamy V. International Journal of Intelligent Information Technologies English 2013 Due to the explosive growth of web technology, a huge amount of information is available as web resources over the Internet. Therefore, in order to access the relevant content from the web resources effectively, considerable attention is paid to the semantic web for efficient knowledge sharing and interoperability. A topic ontology is a hierarchy of topics interconnected by semantic relations, and is increasingly used in web mining techniques. Reviews of past research reveal that semi-automatic ontology construction is not capable of handling high usage. This shortcoming prompted the authors to develop an automatic topic ontology construction process. However, many attempts have been made in the past by other researchers to use automatic construction of ontologies, which turned out to be challenging in terms of time, cost and maintenance. In this paper, the authors propose a novel corpus-based approach to enrich the set of categories in the ODP by automatically identifying concepts and their associated semantic relationships using corpus-based external knowledge resources, such as Wikipedia and WordNet. This topic ontology construction approach relies on concept acquisition and semantic relation extraction. A Jena API framework has been developed to organize the set of extracted semantic concepts, while Protégé provides the platform to visualize the automatically constructed topic ontology. To evaluate the performance, web documents were classified using an SVM classifier based on the ODP and on the topic ontology. The topic ontology-based classification produced better accuracy than the ODP. Copyright 0 0
Semantic Web service discovery based on FIPA multi agents Song W. Lecture Notes in Electrical Engineering English 2013 In this paper we propose a framework for semantic Web service discovery that communicates between a multi-agent system and Web services, without changing their existing specifications and implementations, by providing a broker. We explain how the ontology management in the broker creates the user ontology, merges it with a general ontology (e.g. WordNet, Yago, Wikipedia ...), and recommends the created WSDL, based on the generalized ontology, to the selected Web service provider to increase its retrieval probability in related queries. In future work, we will resolve inconsistencies during the merge, improve the matching process and implement the recommendation component. 0 0
Semantic smoothing for text clustering Nasir J.A.
Varlamis I.
Karim A.
Tsatsaronis G.
Knowledge-Based Systems English 2013 In this paper we present a new semantic smoothing vector space kernel (S-VSM) for text documents clustering. In the suggested approach semantic relatedness between words is used to smooth the similarity and the representation of text documents. The basic hypothesis examined is that considering semantic relatedness between two text documents may improve the performance of the text document clustering task. For our experimental evaluation we analyze the performance of several semantic relatedness measures when embedded in the proposed (S-VSM) and present results with respect to different experimental conditions, such as: (i) the datasets used, (ii) the underlying knowledge sources of the utilized measures, and (iii) the clustering algorithms employed. To the best of our knowledge, the current study is the first to systematically compare, analyze and evaluate the impact of semantic smoothing in text clustering based on 'wisdom of linguists', e.g., WordNets, 'wisdom of crowds', e.g., Wikipedia, and 'wisdom of corpora', e.g., large text corpora represented with the traditional Bag of Words (BoW) model. Three semantic relatedness measures for text are considered; two knowledge-based (Omiotis [1] that uses WordNet, and WLM [2] that uses Wikipedia), and one corpus-based (PMI [3] trained on a semantically tagged SemCor version). For the comparison of different experimental conditions we use the BCubed F-Measure evaluation metric which satisfies all formal constraints of good quality cluster. The experimental results show that the clustering performance based on the S-VSM is better compared to the traditional VSM model and compares favorably against the standard GVSM kernel which uses word co-occurrences to compute the latent similarities between document terms. © 2013 Elsevier B.V. All rights reserved. 0 0
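The smoothing idea described here, letting word-to-word relatedness contribute to document similarity instead of requiring exact term overlap, can be sketched as a kernel d1ᵀ·S·d2 over bag-of-words vectors. The vocabulary and the relatedness matrix S below are invented; in the paper S would come from measures such as Omiotis (WordNet) or WLM (Wikipedia):

```python
import numpy as np

vocab = ["car", "automobile", "petrol", "banana"]
# Hypothetical word-to-word relatedness matrix S: 1.0 on the diagonal,
# scores in (0, 1) for related pairs, 0 for unrelated ones.
S = np.array([[1.0, 0.9, 0.3, 0.0],
              [0.9, 1.0, 0.3, 0.0],
              [0.3, 0.3, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

def smoothed_sim(d1, d2, S):
    """Semantic smoothing kernel: cosine of the S-weighted inner product."""
    num = d1 @ S @ d2
    den = np.sqrt(d1 @ S @ d1) * np.sqrt(d2 @ S @ d2)
    return num / den

doc_a = np.array([1.0, 0.0, 1.0, 0.0])  # "car petrol"
doc_b = np.array([0.0, 1.0, 0.0, 0.0])  # "automobile"
# Plain VSM cosine: zero, because the documents share no terms.
plain = (doc_a @ doc_b) / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
# Smoothed kernel: positive, because "car" and "automobile" are related.
smooth = smoothed_sim(doc_a, doc_b, S)
```

This is exactly the failure mode smoothing addresses in clustering: two documents about the same topic with disjoint vocabularies get zero similarity under the plain VSM but a meaningful score under the smoothed kernel.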
Tìpalo: A tool for automatic typing of DBpedia entities Nuzzolese A.G.
Aldo Gangemi
Valentina Presutti
Draicchio F.
Alberto Musetti
Paolo Ciancarini
Lecture Notes in Computer Science English 2013 In this paper we demonstrate the potentiality of Tìpalo, a tool for automatically typing DBpedia entities. Tìpalo identifies the most appropriate types for an entity in DBpedia by interpreting its definition extracted from its corresponding Wikipedia abstract. Tìpalo relies on FRED, a tool for ontology learning from natural language text, and on a set of graph-pattern-based heuristics which work on the output returned by FRED in order to select the most appropriate types for a DBpedia entity. The tool returns a RDF graph composed of rdf:type, rdfs:subClassOf, owl:sameAs, and owl:equivalentTo statements providing typing information about the entity. Additionally the types are aligned to two lists of top-level concepts, i.e., Wordnet supersenses and a subset of DOLCE Ultra Lite classes. Tìpalo is available as a Web-based tool and exposes its API as HTTP REST services. 0 0
Ukrainian WordNet: Creation and filling Anisimov A.
Marchenko O.
Nikonenko A.
Porkhun E.
Taranukha V.
Lecture Notes in Computer Science English 2013 This paper deals with the process of developing a lexical semantic database for Ukrainian language - UkrWordNet. The architecture of the developed system is described in detail. The data storing structure and mechanisms of access to knowledge are reviewed along with the internal logic of the system and some key software modules. The article is also concerned with the research and development of automated techniques of UkrWordNet Semantic Network replenishment and extension. 0 0
Wikipedia-based WSD for multilingual frame annotation Tonelli S.
Claudio Giuliano
Kateryna Tymoshenko
Artificial Intelligence English 2013 Many applications in the context of natural language processing have been proven to achieve a significant performance when exploiting semantic information extracted from high-quality annotated resources. However, the practical use of such resources is often biased by their limited coverage. Furthermore, they are generally available only for English and few other languages. We propose a novel methodology that, starting from the mapping between FrameNet lexical units and Wikipedia pages, automatically leverages from Wikipedia new lexical units and example sentences. The goal is to build a reference data set for the semi-automatic development of new FrameNets. In addition, this methodology can be adapted to perform frame identification in any language available in Wikipedia. Our approach relies on a state-of-the-art word sense disambiguation system that is first trained on English Wikipedia to assign a page to the lexical units in a frame. Then, this mapping is further exploited to perform frame identification in English or in any other language available in Wikipedia. Our approach shows a high potential in multilingual settings, because it can be applied to languages for which other lexical resources such as WordNet or thesauri are not available. © 2012 Elsevier B.V. All rights reserved. 0 0
YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia Johannes Hoffart
Suchanek F.M.
Berberich K.
Gerhard Weikum
Artificial Intelligence English 2013 We present YAGO2, an extension of the YAGO knowledge base, in which entities, facts, and events are anchored in both time and space. YAGO2 is built automatically from Wikipedia, GeoNames, and WordNet. It contains 447 million facts about 9.8 million entities. Human evaluation confirmed an accuracy of 95% of the facts in YAGO2. In this paper, we present the extraction methodology, the integration of the spatio-temporal dimension, and our knowledge representation SPOTL, an extension of the original SPO-triple model to time and space. © 2012 Elsevier B.V. All rights reserved. 0 0
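The SPOTL idea, extending subject-predicate-object triples with time and location, can be sketched as a small record type. The field names and the example fact below are illustrative, not YAGO2's actual schema:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class SpotlFact:
    """An SPO triple extended with (T)ime and (L)ocation, after the
    SPOTL model described above. Field names are our own choice."""
    subject: str
    predicate: str
    obj: str
    time: Optional[Tuple[int, int]] = None  # (start_year, end_year)
    location: Optional[str] = None

# An anchored fact: the plain SPO part plus its temporal and spatial anchors.
fact = SpotlFact("Albert_Einstein", "wasBornIn", "Ulm",
                 time=(1879, 1879), location="Ulm")
```

The gain over plain SPO is that queries can now filter facts by when and where they hold, not just by who did what.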
A hybrid method based on WordNet and Wikipedia for computing semantic relatedness between texts Malekzadeh R.
Bagherzadeh J.
Noroozi A.
AISP 2012 - 16th CSI International Symposium on Artificial Intelligence and Signal Processing English 2012 In this article we present a new method for computing semantic relatedness between texts. For this purpose we use a two-phase approach. The first phase involves modeling document sentences as a matrix to compute semantic relatedness between sentences. In the second phase, we compare text relatedness by using the relations of their sentences. Since semantic relations between words must be searched for in a lexical semantic knowledge source, selecting a suitable source is very important: only a correct selection produces accurate results. In this work, we attempt to capture the semantic relatedness between texts with more accuracy. For this purpose, we use a collection of two well-known knowledge bases, namely WordNet and Wikipedia, so as to provide a more complete data source for calculating the semantic relatedness with more accuracy. We evaluate our approach by comparison with other existing techniques (on Lee datasets). 0 0
A novel Framenet-based resource for the semantic web Bryl V.
Tonelli S.
Claudio Giuliano
Luciano Serafini
Proceedings of the ACM Symposium on Applied Computing English 2012 FrameNet is a large-scale lexical resource encoding information about semantic frames (situations) and semantic roles. The aim of the paper is to enrich FrameNet by mapping the lexical fillers of semantic roles to WordNet using a Wikipedia-based detour. The applied methodology relies on a word sense disambiguation step, in which a Wikipedia page is assigned to a role filler, and then BabelNet and YAGO are used to acquire WordNet synsets for a filler. We show how to represent the acquired resource in OWL, linking it to the existing RDF/OWL representations of FrameNet and WordNet. Part of the resource is evaluated by matching it with the WordNet synsets manually assigned by FrameNet lexicographers to a subset of semantic roles. 0 0
Annotating words using wordnet semantic glosses Szymanski J.
Duch W.
Lecture Notes in Computer Science English 2012 An approach to word sense disambiguation (WSD) relying on WordNet synsets is proposed. The method uses semantically tagged glosses to perform a process similar to spreading activation in a semantic network, creating a ranking of the most probable meanings for word annotation. Preliminary evaluation shows quite promising results. Comparison with state-of-the-art WSD methods indicates that the use of WordNet relations and semantically tagged glosses should enhance the accuracy of word disambiguation methods. 0 0
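The spreading-activation idea this abstract relies on can be sketched over a toy semantic network: context words fire, energy propagates along edges with decay, and candidate senses are ranked by the activation their gloss concepts receive. The graph, senses and decay value below are all invented:

```python
# Toy network: two senses of "bank" linked to concepts from their glosses.
graph = {
    "bank#finance": ["money", "deposit"],
    "bank#river":   ["water", "shore"],
    "money":   ["deposit"],
    "water":   ["shore"],
    "deposit": [],
    "shore":   [],
}

def spread(graph, seeds, decay=0.5, steps=2):
    """Spreading activation: seeds start at 1.0, each step every node
    passes a decayed share of its energy to its neighbors."""
    act = {n: 0.0 for n in graph}
    for s in seeds:
        act[s] = 1.0
    for _ in range(steps):
        new = dict(act)
        for node, energy in act.items():
            for nb in graph.get(node, []):
                new[nb] = new.get(nb, 0.0) + decay * energy
        act = new
    return act

# Context words of the target sentence act as activation seeds.
act = spread(graph, seeds=["money", "deposit"])
# Rank each sense by the total activation of its gloss concepts.
scores = {s: sum(act[c] for c in graph[s]) for s in ("bank#finance", "bank#river")}
best = max(scores, key=scores.get)
```

With financial context words as seeds, the finance sense accumulates activation while the river sense stays at zero, which is the ranking behavior the method uses for annotation.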
Automatic taxonomy extraction in different languages using wikipedia and minimal language-specific information Dominguez Garcia R.
Schmidt S.
Rensing C.
Steinmetz R.
Lecture Notes in Computer Science English 2012 Knowledge bases extracted from Wikipedia are particularly useful for various NLP and Semantic Web applications due to their coverage, actuality and multilingualism. This has led to many approaches for automatic knowledge base extraction from Wikipedia. Most of these approaches rely on the English Wikipedia as it is the largest Wikipedia version. However, each Wikipedia version contains socio-cultural knowledge, i.e. knowledge with relevance for a specific culture or language. In this work, we describe a method for extracting a large set of hyponymy relations from the Wikipedia category system that can be used to acquire taxonomies in multiple languages. More specifically, we describe a set of 20 features that can be used for hyponymy detection without using additional language-specific corpora. Finally, we evaluate our approach on Wikipedia in five different languages and compare the results with the WordNet taxonomy and a multilingual approach based on interwiki links of Wikipedia. 0 0
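One language-light signal of the kind such feature sets typically include is lexical head matching between a category and its parent. The paper describes 20 features; this single heuristic and the example pairs are our illustration, not the paper's feature list:

```python
def head_match(category, parent):
    """Heuristic hyponymy feature: a category is likely a hyponym of its
    parent category if the parent's lexical head (here approximated as
    its last token) is also the head of the category name."""
    return category.lower().split()[-1] == parent.lower().split()[-1]

pairs = [
    ("French novelists", "Novelists"),       # hyponymy: a French novelist IS a novelist
    ("Novels by Victor Hugo", "Novelists"),  # not hyponymy: a novel is not a novelist
]
results = [head_match(c, p) for c, p in pairs]
```

Taking the last token as the head is an English-specific simplification; a multilingual system would need a per-language notion of the head position, which is exactly why combining many shallow features matters.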
Automatic typing of DBpedia entities Aldo Gangemi
Nuzzolese A.G.
Valentina Presutti
Draicchio F.
Alberto Musetti
Paolo Ciancarini
Lecture Notes in Computer Science English 2012 We present Tìpalo, an algorithm and tool for automatically typing DBpedia entities. Tìpalo identifies the most appropriate types for an entity by interpreting its natural language definition, which is extracted from its corresponding Wikipedia page abstract. Types are identified by means of a set of heuristics based on graph patterns, disambiguated to WordNet, and aligned to two top-level ontologies: WordNet supersenses and a subset of DOLCE+DnS Ultra Lite classes. The algorithm has been tuned against a golden standard that has been built online by a group of selected users, and further evaluated in a user study. 0 0
BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network Roberto Navigli
Ponzetto S.P.
Artificial Intelligence English 2012 We present an automatic approach to the construction of BabelNet, a very large, wide-coverage multilingual semantic network. Key to our approach is the integration of lexicographic and encyclopedic knowledge from WordNet and Wikipedia. In addition, Machine Translation is applied to enrich the resource with lexical information for all languages. We first conduct in vitro experiments on new and existing gold-standard datasets to show the high quality and coverage of BabelNet. We then show that our lexical resource can be used successfully to perform both monolingual and cross-lingual Word Sense Disambiguation: thanks to its wide lexical coverage and novel semantic relations, we are able to achieve state-of-the-art results on three different SemEval evaluation tasks. © 2012 Elsevier B.V. 0 0
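The core integration step, deciding which WordNet sense a Wikipedia page corresponds to, can be sketched with a crude gloss-overlap (Lesk-style) heuristic. BabelNet's actual mapping algorithm is considerably more sophisticated; the miniature sense inventory and page texts below are invented:

```python
# Invented miniature inventories: WordNet-style senses with glosses,
# and Wikipedia-style pages with short descriptions.
senses = {
    "java#1": "an island in indonesia south of borneo",
    "java#2": "a high level object oriented programming language",
}
pages = {
    "Java (programming language)": "java is a programming language and computing platform",
    "Java (island)": "java is an island of indonesia bordered by the indian ocean",
}

def overlap(a, b):
    """Bag-of-words overlap between two texts: a crude Lesk-style signal."""
    return len(set(a.split()) & set(b.split()))

def map_page_to_sense(page_text, senses):
    """Pick the sense whose gloss shares the most words with the page text."""
    return max(senses, key=lambda s: overlap(senses[s], page_text))

mapping = {p: map_page_to_sense(t, senses) for p, t in pages.items()}
```

Once pages and senses are linked, the encyclopedic content of each page (translations, related pages) can be attached to the lexicographic sense, which is the merge that produces the multilingual network.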
Categorizing search results using WordNet and Wikipedia Hemayati R.T.
Meng W.
Yu C.
Lecture Notes in Computer Science English 2012 Terms used in search queries often have multiple meanings and usages. Consequently, search results corresponding to different meanings or usages may be retrieved, making identifying relevant results inconvenient and time-consuming. In this paper, we study the problem of grouping the search results based on the different meanings and usages of a query. We build on a previous work that identifies and ranks possible categories of any user query based on the meanings and common usages of the terms and phrases within the query. We use these categories to group search results. In this paper, we study different methods, including several new methods, to assign search result record (SRRs) to the categories. Our SRR grouping framework supports a combination of categorization, clustering and query rewriting techniques. Our experimental results show that some of our grouping methods can achieve high accuracy. 0 0
Coarse lexical semantic annotation with supersenses: An Arabic case study Schneider N.
Mohit B.
Oflazer K.
Smith N.A.
50th Annual Meeting of the Association for Computational Linguistics, ACL 2012 - Proceedings of the Conference English 2012 "Lightweight" semantic annotation of text calls for a simple representation, ideally without requiring a semantic lexicon to achieve good coverage in the language and domain. In this paper, we repurpose WordNet's supersense tags for annotation, developing specific guidelines for nominal expressions and applying them to Arabic Wikipedia articles in four topical domains. The resulting corpus has high coverage and was completed quickly with reasonable inter-annotator agreement. 0 0
Comparing taxonomies for organising collections of documents Fernando S.
Mary Hall
Eneko Agirre
Aitor Soroa
Clough P.
Stevenson M.
24th International Conference on Computational Linguistics - Proceedings of COLING 2012: Technical Papers English 2012 There is a demand for taxonomies to organise large collections of documents into categories for browsing and exploration. This paper examines four existing taxonomies that have been manually created, along with two methods for deriving taxonomies automatically from data items. We use these taxonomies to organise items from a large online cultural heritage collection. We then present two human evaluations of the taxonomies. The first measures the cohesion of the taxonomies to determine how well they group together similar items under the same concept node. The second analyses the concept relations in the taxonomies. The results show that the manual taxonomies have high quality well defined relations. However the novel automatic method is found to generate very high cohesion. 0 0
Computing text-to-text semantic relatedness based on building and analyzing enriched concept graph Jahanbakhsh Nagadeh Z.
Mahmoudi F.
Jadidinejad A.H.
Lecture Notes in Electrical Engineering English 2012 This paper discusses the effective usage of key concepts in computing the semantic relatedness of texts. We present a novel method for computing text semantic relatedness using key concepts. The problem of selecting an appropriate semantic resource is very important in semantic relatedness algorithms. For this purpose, we propose to use a collection of two semantic resources, namely WordNet and Wikipedia, so as to provide a more complete data source and more accuracy for calculating the semantic relatedness. As a result, the proposed method can compute semantic relatedness between almost any pair of concepts. In the proposed method, a text is modeled as a graph of semantic relatedness between the concepts of the text, which are extracted from WordNet and Wikipedia. This graph is named the Enriched Concept Graph (ECG). Key concepts are then extracted by analyzing the ECG. Finally, the semantic relatedness of texts is obtained by comparing their key concepts. We evaluated our approach and obtained a high correlation coefficient of 0.782, which outperformed all other existing state-of-the-art approaches. © 2012 Springer Science+Business Media B.V. 0 0
Effective tag recommendation system based on topic ontology using Wikipedia and WordNet Subramaniyaswamy V.
Chenthur Pandian S.
International Journal of Intelligent Systems English 2012 In this paper, we propose a novel approach based on topic ontology for tag recommendation. The proposed approach intelligently generates tag suggestions for blogs. In this approach, we construct a topic ontology by enriching the set of categories in an existing small ontology, the Open Directory Project. To construct the topic ontology, a set of topics and their associated semantic relationships is identified automatically from corpus-based external knowledge resources such as Wikipedia and WordNet. The construction relies on two steps: concept acquisition and semantic relation extraction. In the first step, a topic-mapping algorithm is developed to acquire the concepts from the semantics of Wikipedia. A semantic similarity-clustering algorithm is used to compute the semantic similarity measure and group the set of similar concepts. The second is the semantic relation extraction algorithm, which derives associated semantic relations between the set of extracted topics from the lexical patterns between synsets in WordNet. A suitable software prototype is created to implement the topic ontology construction process. A Jena API framework is used to organize the set of extracted semantic concepts and their corresponding relationships as a knowledge representation in the Web Ontology Language. The Protégé tool provides the platform to visualize the automatically constructed topic ontology. Using the constructed topic ontology, we can generate and suggest the most suitable tags for a new resource to users. The applicability of topic ontology with a spreading activation algorithm supports efficient recommendation in practice and can recommend the most popular tags for a specific resource. The spreading activation algorithm assigns interest scores to the existing extracted blog content and tags. The weight of the tags is computed based on the activation score determined from the similarity between the topics in the constructed topic ontology and the content of the existing blogs. High-quality tags with the highest activation scores are recommended to the users. Finally, we conducted an experimental evaluation of our tag recommendation approach using a large set of real-world data sets. Our experimental results explore and compare the capabilities of our proposed topic ontology with the spreading activation tag recommendation approach against the existing AutoTag mechanism, and also discuss the improvement in precision and recall of recommended tags on the Delicious and BibSonomy data sets. The experiment shows that tag recommendation using topic ontology results in folksonomy enrichment. Thus, we report the results of an experiment meant to improve the performance and quality of the tag recommendation approach. 0 0
Expanding approach to information retrieval using semantic similarity analysis based on wordnet and wikipedia Fei Zhao
Fang F.
Yan F.
Jin H.
Zhang Q.
International Journal of Software Engineering and Knowledge Engineering English 2012 The performance of information retrieval (IR) systems relies greatly on textual keywords and retrieved documents. Inaccurate and incomplete retrieval results are often induced by query drift and by ignoring the semantic relationships among terms. Query expansion approaches attempt to incorporate expansion terms into the original query, such as unexplored words coming from pseudo-relevance feedback (PRF) or relevance feedback documents, or semantic words extracted from an external corpus. In this paper a semantic analysis-based query expansion method for information retrieval, using WordNet and Wikipedia as corpora, is proposed. We derive semantically related words from human knowledge repositories such as WordNet and Wikipedia, and combine them with words filtered by semantic mining from PRF documents. Our approach automatically generates a new semantics-based query from the original IR query. Experimental results on TREC datasets and the Google search engine show that the performance of information retrieval can be significantly improved using the proposed method over previous results. 0 0
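The expansion mechanism this abstract describes can be sketched in its simplest form: append related words for each original query term. The synonym lexicon below is a hard-coded stand-in for the WordNet/Wikipedia lookups the paper uses:

```python
# Hypothetical synonym lexicon standing in for WordNet/Wikipedia lookups.
lexicon = {
    "car": ["automobile", "auto"],
    "cheap": ["inexpensive"],
}

def expand_query(query, lexicon, max_per_term=2):
    """Append up to max_per_term related words for each original query term,
    keeping the original terms first so they retain the most weight."""
    terms = query.lower().split()
    expanded = list(terms)
    for t in terms:
        expanded.extend(lexicon.get(t, [])[:max_per_term])
    return expanded

q = expand_query("cheap car", lexicon)
```

A real system would additionally weight the expansion terms (and filter them against PRF documents, as described) so that noisy synonyms cannot drift the query away from its original intent.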
Exploring lexicographic ontologies for hierarchically organizing the greek wikipedia articles Niarou M.
Stamou S.
Journal of Digital Information Management English 2012 To effectively manage the proliferating online content, it is imperative that we come up with efficient data structuring and organization methods. Based on the findings of previous research [6] [7] that the most flexible and useful way to organize the online content is via the use of taxonomies and/or ontologies, we carried out the present study, which aims at structuring the content of the Greek Wikipedia via the use of the Greek WordNet. In particular, our study objective is to design a model that can automatically organize the Greek Wikipedia categories into a thematic taxonomy and based on the derived organization, to implicitly assign hierarchical structure to the encyclopedia articles that have been classified to the respective categories. To this end, we relied on the data encoded in Greek WordNet out of which we harvested the hierarchical relations that hold between the terms used to verbalize the Wikipedia categories. The effectiveness of our model is verified by the findings of several experimental evaluations conducted, which demonstrate that semantic networks are powerful resources for hierarchically organizing large volumes of dynamic data. 0 0
LexOnt: A semi-automatic ontology creation tool for programmable web Arabshian K.
Danielsen P.
Afroz S.
AAAI Spring Symposium - Technical Report English 2012 We propose LexOnt, a semi-automatic ontology creation tool for a high-level service classification ontology. LexOnt uses the Programmable Web directory as its corpus, although it can evolve to use other corpora as well. The main contribution of LexOnt is its novel algorithm, which generates and ranks frequent terms and significant phrases within a PW category by comparing them to external domain knowledge such as Wikipedia, WordNet and the current state of the ontology. First, it matches terms to the Wikipedia page description of the category and ranks them higher, since these indicate domain-descriptive words. Synonymous words from WordNet are then matched and ranked. In a semi-automated process, the user chooses the terms they want to add to the ontology and indicates the properties to assign these values to, and the ontology is automatically generated. In the next iteration, terms within the current state of the ontology are compared to terms in the other categories, and automatic property assignments are made for these API instances as well. Copyright © 2012, Association for the Advancement of Artificial Intelligence. All rights reserved. 0 0
Pattern for python De Smedt T.
Daelemans W.
Journal of Machine Learning Research English 2012 Pattern is a package for Python 2.4+ with functionality for web mining (Google + Twitter + Wikipedia, web spider, HTML DOM parser), natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM classifiers) and network analysis (graph centrality and visualization). It is well documented and bundled with 30+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern. 0 0
Predicting user tags using semantic expansion Chandramouli K.
Piatrik T.
Izquierdo E.
Communications in Computer and Information Science English 2012 Manually annotating content such as Internet videos is an intellectually expensive and time-consuming process. Furthermore, keywords and community-provided tags lack consistency and present numerous irregularities. Addressing the challenge of simplifying and improving the process of tagging online videos, which is potentially not bounded to any particular domain, in this paper we present an algorithm for predicting user tags from the associated textual metadata. Our approach is centred around extracting named entities by exploiting complementary textual resources such as Wikipedia and WordNet. More specifically, to facilitate the extraction of semantically meaningful tags from a largely unstructured textual corpus, we developed a natural language processing framework based on the GATE architecture. Extending the functionalities of the built-in GATE named entities, the framework integrates a bag-of-articles algorithm for effectively searching through Wikipedia for relevant articles. The proposed framework has been evaluated against the MediaEval 2010 Wild Wild Web dataset, which consists of a large collection of Internet videos. 0 0
SemaFor: Semantic document indexing using semantic forests Tsatsaronis G.
Varlamis I.
Norvag K.
ACM International Conference Proceeding Series English 2012 Traditional document indexing techniques store documents using easily accessible representations, such as inverted indices, which can efficiently scale for large document sets. These structures offer scalable and efficient solutions in text document management tasks; however, they omit the cornerstone of the documents' purpose: meaning. They also neglect the semantic relations that bind terms into coherent fragments of text that convey messages. When semantic representations are employed, the documents are mapped to the space of concepts and the similarity measures are adapted appropriately to better fit the retrieval tasks. However, these methods can be slow both at indexing and retrieval time. In this paper we propose SemaFor, an indexing algorithm for text documents, which uses semantic spanning forests constructed from lexical resources like Wikipedia and WordNet, together with spectral graph theory, in order to represent documents for further processing. 0 0
Study of ontology or thesaurus based document clustering and information retrieval Bharathi G.
Venkatesan D.
Journal of Theoretical and Applied Information Technology
Journal of Engineering and Applied Sciences
English 2012 Document clustering generates clusters from the whole document collection automatically and is used in many fields, including data mining and information retrieval. Clustering text data faces a number of new challenges. Among others, the volume of text data, dimensionality, sparsity and complex semantics are the most important ones. These characteristics of text data require clustering techniques to be scalable to large and high dimensional data, and able to handle sparsity and semantics. In the traditional vector space model, the unique words occurring in the document set are used as the features. But because of the synonym problem and the polysemous problem, such a bag of original words cannot represent the content of a document precisely. Most of the existing text clustering methods use clustering techniques which depend only on term strength and document frequency where single terms are used as features for representing the documents and they are treated independently which can be easily applied to non-ontological clustering. To overcome the above issues, this paper makes a survey of recent research done on ontology or thesaurus based document clustering.
0 0
Tapping into knowledge base for concept feedback: Leveraging ConceptNet to improve search results for difficult queries Kotov A.
Zhai C.X.
WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining English 2012 Query expansion is an important and commonly used technique for improving Web search results. Existing methods for query expansion have mostly relied on global or local analysis of document collection, click-through data, or simple ontologies such as WordNet. In this paper, we present the results of a systematic study of the methods leveraging the ConceptNet knowledge base, an emerging new Web resource, for query expansion. Specifically, we focus on the methods leveraging ConceptNet to improve the search results for poorly performing (or difficult) queries. Unlike other lexico-semantic resources, such as WordNet and Wikipedia, which have been extensively studied in the past, ConceptNet features a graph-based representation model of commonsense knowledge, in which the terms are conceptually related through rich relational ontology. Such representation structure enables complex, multi-step inferences between the concepts, which can be applied to query expansion. We first demonstrate through simulation experiments that expanding queries with the related concepts from ConceptNet has great potential for improving the search results for difficult queries. We then propose and study several supervised and unsupervised methods for selecting the concepts from ConceptNet for automatic query expansion. The experimental results on multiple data sets indicate that the proposed methods can effectively leverage ConceptNet to improve the retrieval performance of difficult queries both when used in isolation as well as in combination with pseudo-relevance feedback. Copyright 2012 ACM. 0 0
Using a bilingual resource to add synonyms to awordnet: Finnwordnet and wikipedia as an example Niemi J.
Linden K.
Hyvarinen M.
GWC 2012: 6th International Global Wordnet Conference, Proceedings English 2012 This paper presents a simple method for finding new synonym candidates for a bilingual wordnet by using another bilingual resource. Our goal is to add new synonyms to the existing synsets of the Finnish WordNet, which has direct word sense translation correspondences to the Princeton WordNet. For this task, we use Wikipedia and its links between the articles of the same topic in Finnish and English. One of the automatically extracted groups of synonyms yielded ca. 2,000 synonyms with 89 % accuracy. 0 0
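The extraction idea in the entry above can be sketched simply: interlanguage links pair article titles across two Wikipedia language editions, and distinct source-language titles that point to the same target-language title become synonym candidates. The title pairs below are hypothetical toy data, not taken from the paper.

```python
from collections import defaultdict

def synonym_candidates(title_pairs):
    """Group source-language titles by their shared target-language title.

    Only groups with two or more source titles yield new synonym candidates.
    """
    groups = defaultdict(set)
    for source_title, target_title in title_pairs:
        groups[target_title].add(source_title)
    return {tgt: sorted(srcs) for tgt, srcs in groups.items() if len(srcs) > 1}

# Hypothetical Finnish -> English interlanguage link pairs.
pairs = [("auto", "car"), ("henkilöauto", "car"), ("polkupyörä", "bicycle")]
print(synonym_candidates(pairs))
# prints {'car': ['auto', 'henkilöauto']}
```

In the paper's setting, candidate groups like this would then be checked against the existing synsets of the Finnish WordNet before any synonym is added.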
Web image retrieval re-ranking with Wikipedia semantics Seungwoo Lee
Cho S.
International Journal of Multimedia and Ubiquitous Engineering English 2012 Nowadays, taking advantage of tags is a general tendency when users need to store or retrieve images on the Web. In this article, we introduce approaches to calculate the semantic importance of tags attached to Web images and to re-rank the retrieved images accordingly. We have compared the results of image re-ranking with two semantic providers, WordNet and Wikipedia. Our experimental results show that the method using the semantic importance of image tags calculated with Wikipedia is superior in precision and recall. 0 0
YouCat : Weakly supervised youtube video categorization system from meta data & user comments using wordnet & wikipedia Saswati Mukherjee
Prantik Bhattacharyya
24th International Conference on Computational Linguistics - Proceedings of COLING 2012: Technical Papers English 2012 In this paper, we propose a weakly supervised system, YouCat, for categorizing Youtube videos into different genres like Comedy, Horror, Romance, Sports and Technology. The system takes a Youtube video url as input and gives it a belongingness score for each genre. The key aspects of this work can be summarized as: (1) Unlike other genre identification works, which are mostly supervised, this system is mostly unsupervised, requiring no labeled data for training. (2) The system can easily incorporate new genres without requiring labeled data for the genres. (3) YouCat extracts information from the video title, meta description and user comments (which together form the video descriptor). (4) It uses Wikipedia and WordNet for concept expansion. (5) The proposed algorithm with a time complexity of O( 0 0
Automatic acquisition of taxonomies in different languages from multiple Wikipedia versions Garcia R.D.
Rensing C.
Steinmetz R.
ACM International Conference Proceeding Series English 2011 In the last years, the vision of the Semantic Web has led to many approaches that aim to automatically derive knowledge bases from Wikipedia. These approaches rely mostly on the English Wikipedia, as it is the largest Wikipedia version, and have led to valuable knowledge bases. However, each Wikipedia version contains socio-cultural knowledge, i.e. knowledge with specific relevance for a culture or language. One difficulty in applying existing approaches to multiple Wikipedia versions is the use of additional corpora. In this paper, we describe the adaptation of existing heuristics that makes the extraction of large sets of hyponymy relations from multiple Wikipedia versions possible with little information about each language. Further, we evaluate our approach with Wikipedia versions in four different languages and compare results with GermaNet for German and WordNet for English. 0 0
Automatic document tagging using online knowledge base Choi C.
Myunggwon Hwang
Choi D.
Choi J.
Kim P.
Information English 2011 Online knowledge bases such as WordNet are utilized for semantic information processing. However, research indicates that existing knowledge bases cannot cover all concepts used in talking and writing in the real world. It is necessary to use an online knowledge base such as Wikipedia to resolve this limitation. Web document tagging generally chooses core words from a document itself. However, the core words are not standardized tags. Thus, users must make an effort to grasp the tagged words first during retrieval. This paper proposes methods to utilize the titles (Wiki concepts) of Wikipedia documents and to find the best Wiki concept that describes a Web document (target document). In addition to these methods, the research tries to classify target documents into Wikipedia categories (Wiki categories) for semantic document interconnections. 0 0
Beyond the bag-of-words paradigm to enhance information retrieval applications Paolo Ferragina Proceedings - 4th International Conference on SImilarity Search and APplications, SISAP 2011 English 2011 The typical IR approach to indexing, clustering, classification and retrieval, just to name a few, is the one based on the bag-of-words paradigm. It eventually transforms a text into an array of terms, possibly weighted (with tf-idf scores or derivatives), and then represents that array via points in a highly-dimensional space. It is therefore syntactical and unstructured, in the sense that different terms lead to different dimensions. Co-occurrence detection and other processing steps have thus been proposed (see e.g. LSI, Spectral analysis [7]) to identify the existence of those relations, but everyone is aware of the limitations of this approach, especially in the expanding context of short (and thus poorly composed) texts, such as the snippets of search-engine results, the tweets of a Twitter channel, the items of a news feed, the posts of a blog, or advertisement messages. A good deal of recent work is attempting to go beyond this paradigm by enriching the input text with additional structured annotations. This general idea has been declined in the literature in two distinct ways. One consists of extending the classic term-based vector-space model with additional dimensions corresponding to features (concepts) extracted from an external knowledge base, such as DMOZ, Wikipedia, or even the whole Web (see e.g. [4, 5, 12]). The pro of this approach is to extend the bag-of-words scheme with more concepts, thus possibly allowing the identification of related texts which are syntactically far apart. The con resides in the contamination of these vectors by unrelated (but common) concepts retrieved via the syntactic queries.
The second way consists of identifying in the input text short-and-meaningful sequences of terms (aka spots) which are then connected to unambiguous concepts drawn from a catalog. The catalog can be formed by either a small set of specifically recognized types, most often People and Locations (aka Named Entities, see e.g. [13, 14]), or it can consist of millions of concepts drawn from a large knowledge base, such as Wikipedia. This latter catalog is ever-expanding and currently offers the best trade-off between a catalog with a rigorous structure but low coverage (like WordNet, CYC, TAP), and a large text collection with wide coverage but unstructured and noisy content (like the whole Web). To understand how this annotation works, let us consider the following short news item: "Diego Maradona won against Mexico". The goal of the annotation is to detect "Diego Maradona" and "Mexico" as spots, and then hyper-link them with the Wikipedia pages which deal with the ex Argentina coach and the football team of Mexico. The annotator uses as spots the anchor texts which occur in Wikipedia pages, and as possible concepts for each spot the (possibly many) pages pointed to in Wikipedia by that spot/anchor. 0 0
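As a minimal illustration of the bag-of-words representation criticized in the entry above, the sketch below maps each document to a term → tf-idf weight dictionary, so that different terms occupy different dimensions. The toy corpus echoes the abstract's Maradona example, and the weighting (raw term frequency times log inverse document frequency) is one common variant, not a specific system's formula.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document to a {term: tf-idf weight} dictionary."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

vectors = tfidf_vectors(["maradona won against mexico", "mexico city travel"])
# A term shared by every document ("mexico") gets weight 0 here, while
# document-specific terms ("maradona") get a positive weight.
```

This is exactly the syntactic representation that concept-based annotation tries to move beyond.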
Building ontology for mashup services based on Wikipedia Xiao K.
Li B.
Lecture Notes in Electrical Engineering English 2011 Tagging, as a useful way to organize online resources, has attracted much attention in the last few years, and many ontology-building approaches have been proposed using such tags. Tags are usually associated with concepts in databases such as WordNet and online ontologies. However, these databases are static and lack consistency. In this paper, we build an ontology for a collection of mashup services using their affiliated tags by referring to the entries of Wikipedia. Core tags are filtered out and mapped to the corresponding Wikipedia entries (i.e., URIs). An experiment is given as an illustration. © 2011 Springer Science+Business Media B.V. 0 0
Capability modeling of knowledge-based agents for commonsense knowledge integration Kuo Y.-L.
Hsu J.Y.-J.
Lecture Notes in Computer Science English 2011 Robust intelligent systems require commonsense knowledge. While significant progress has been made in building large commonsense knowledge bases, they are intrinsically incomplete. It is difficult to combine multiple knowledge bases due to their different choices of representation and inference mechanisms, thereby limiting users to one knowledge base and its reasoning methods for any specific task. This paper presents a multi-agent framework for commonsense knowledge integration, and proposes an approach to capability modeling of knowledge bases without a common ontology. The proposed capability model provides a general description of large heterogeneous knowledge bases, such that contents accessible by the knowledge-based agents may be matched up against specific requests. The concept correlation matrix of a knowledge base is transformed into a k-dimensional vector space using low-rank approximation for dimensionality reduction. Experiments are performed with the matchmaking mechanism for the commonsense knowledge integration framework using the capability models of ConceptNet, WordNet, and Wikipedia. In the user study, the matchmaking results are compared with ranked lists produced by online users to show that over 85% of them are accurate and have positive correlation with the user-produced ranked lists. 0 0
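The matchmaking step described above can be sketched as cosine-similarity ranking over k-dimensional capability vectors (which the paper derives via low-rank approximation of concept correlation matrices). The knowledge-base names are real, but the 2-dimensional vectors below are hypothetical toy data.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def rank_kbs(request, kb_vectors):
    """Order knowledge bases by similarity of their capability vector to the request."""
    return sorted(kb_vectors, key=lambda kb: cosine(request, kb_vectors[kb]), reverse=True)

# Hypothetical 2-dimensional capability vectors for each knowledge base.
kbs = {"ConceptNet": [0.9, 0.1], "WordNet": [0.2, 0.8], "Wikipedia": [0.6, 0.6]}
print(rank_kbs([1.0, 0.0], kbs))
# prints ['ConceptNet', 'Wikipedia', 'WordNet']
```

A request would thus be routed first to the knowledge-based agent whose capability vector it most resembles.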
Categorising social tags to improve folksonomy-based recommendations Ivan Cantador
Ioannis Konstas
Jose J.M.
Journal of Web Semantics English 2011 In social tagging systems, users have different purposes when they annotate items. Tags not only depict the content of the annotated items, for example by listing the objects that appear in a photo, or express contextual information about the items, for example by providing the location or the time in which a photo was taken, but also describe subjective qualities and opinions about the items, or can be related to organisational aspects, such as self-references and personal tasks. Current folksonomy-based search and recommendation models exploit the social tag space as a whole to retrieve those items relevant to a tag-based query or user profile, and do not take into consideration the purposes of tags. We hypothesise that a significant percentage of tags are noisy for content retrieval, and believe that the distinction of the personal intentions underlying the tags may be beneficial to improve the accuracy of search and recommendation processes. We present a mechanism to automatically filter and classify raw tags in a set of purpose-oriented categories. Our approach finds the underlying meanings (concepts) of the tags, mapping them to semantic entities belonging to external knowledge bases, namely WordNet and Wikipedia, through the exploitation of ontologies created within the W3C Linking Open Data initiative. The obtained concepts are then transformed into semantic classes that can be uniquely assigned to content- and context-based categories. The identification of subjective and organisational tags is based on natural language processing heuristics. We collected a representative dataset from Flickr social tagging system, and conducted an empirical study to categorise real tagging data, and evaluate whether the resultant tags categories really benefit a recommendation model using the Random Walk with Restarts method. 
The results show that content- and context-based tags are considered superior to subjective and organisational tags, achieving equivalent performance to using the whole tag space. © 2010 Elsevier B.V. All rights reserved. 0 0
Cooperative WordNet editor for lexical semantic acquisition Szymanski J. Communications in Computer and Information Science English 2011 The article describes an approach for building the WordNet semantic dictionary in a collaborative paradigm. The presented system enables gathering lexical data in a Wikipedia-like style. The core of the system is a user-friendly interface based on a component for interactive graph navigation. The component has been used for WordNet semantic network presentation on a web page, and it allows a distributed group of people to modify its content. 0 0
Enhancing accessibility of microblogging messages using semantic knowledge Hu X.
Tang L.
Hongyan Liu
International Conference on Information and Knowledge Management, Proceedings English 2011 The volume of microblogging messages is increasing exponentially with the popularity of microblogging services. With a large number of messages appearing in user interfaces, it hinders user accessibility to useful information buried in disorganized, incomplete, and unstructured text messages. In order to enhance user accessibility, we propose to aggregate related microblogging messages into clusters and automatically assign them semantically meaningful labels. However, a distinctive feature of microblogging messages is that they are much shorter than conventional text documents. These messages provide inadequate term co-occurrence information for capturing semantic associations. To address this problem, we propose a novel framework for organizing unstructured microblogging messages by transforming them to a semantically structured representation. The proposed framework first captures informative tree fragments by analyzing a parse tree of the message, and then exploits external knowledge bases (Wikipedia and WordNet) to enhance their semantic information. Empirical evaluation on a Twitter dataset shows that our framework significantly outperforms existing state-of-the-art methods. 0 0
Extracting and modeling user interests based on social media Wasim M.
Shahzadi I.
Ahmad Q.
Mahmood W.
Proceedings of the 14th IEEE International Multitopic Conference 2011, INMIC 2011 English 2011 With the increasing demand for personalized applications, user interest mining is gaining more and more importance. Various sources of information have been used for extracting and modeling parameters that portray users' interests. Social media has become one of the most popular and significant platforms for information sharing and dissemination. These social platforms provide users not only a medium to share content of their interest but also an insight into their day-to-day activities. Mining this content to define user interests can be used for the customization and personalization of a variety of commercial and non-commercial applications like product marketing and recommendation. In this paper, we propose a user interest model using a popular social community called Twitter. The proposed user model represents interests in the form of ontological concepts interlinked with a predefined source ontology by using concepts from Wikipedia and WordNet. 0 0
Extracting events from Wikipedia as RDF triples linked to widespread semantic web datasets Carlo Aliprandi
Francesco Ronzano
Andrea Marchetti
Maurizio Tesconi
Salvatore Minutoli
Lecture Notes in Computer Science English 2011 Many attempts have been made to extract structured data from Web resources, exposing them as RDF triples and interlinking them with other RDF datasets: in this way it is possible to create clouds of highly integrated Semantic Web data collections. In this paper we describe an approach to enhance the extraction of semantic contents from unstructured textual documents, in particular considering Wikipedia articles and focusing on event mining. Starting from the deep parsing of a set of English Wikipedia articles, we produce a semantic annotation compliant with the Knowledge Annotation Format (KAF). We extract events from the KAF semantic annotation and then we structure each event as a set of RDF triples linked to both DBpedia and WordNet. We point out examples of automatically mined events, providing some general evaluation of how our approach may discover new events and link them to existing contents. 0 0
Graph-based named entity linking with Wikipedia Ben Hachey
Will Radford
Curran J.R.
Lecture Notes in Computer Science English 2011 Named entity linking (NEL) grounds entity mentions to their corresponding Wikipedia article. State-of-the-art supervised NEL systems use features over the rich Wikipedia document and link-graph structure. Graph-based measures have been effective over WordNet for word sense disambiguation (WSD). We draw parallels between NEL and WSD, motivating our unsupervised NEL approach that exploits the Wikipedia article and category link graphs. Our system achieves 85.5% accuracy on the TAC 2010 shared task - competitive with the best supervised and unsupervised systems. 0 0
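A toy sketch of unsupervised, graph-based entity linking in the spirit of the entry above: for each mention, the candidate Wikipedia article best connected (here, by raw link count) to the other mentions' candidates is selected. The mention candidates and link graph below are hypothetical.

```python
def link_entities(candidates, links):
    """Pick, per mention, the candidate article best connected to other candidates.

    candidates: mention -> list of candidate article ids
    links: set of (article, article) pairs from an undirected link graph
    """
    all_candidates = [a for cands in candidates.values() for a in cands]
    chosen = {}
    for mention, cands in candidates.items():
        def score(article):
            # Count links to candidates of *other* mentions.
            return sum(1 for other in all_candidates
                       if other not in cands
                       and ((article, other) in links or (other, article) in links))
        chosen[mention] = max(cands, key=score)
    return chosen

cands = {"Maradona": ["Diego_Maradona", "Maradona_(film)"],
         "Mexico": ["Mexico_national_football_team", "Mexico"]}
link_graph = {("Diego_Maradona", "Mexico_national_football_team")}
print(link_entities(cands, link_graph))
# prints {'Maradona': 'Diego_Maradona', 'Mexico': 'Mexico_national_football_team'}
```

Real systems replace this raw link count with richer graph measures over the full Wikipedia article and category graphs.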
Harnessing different knowledge sources to measure semantic relatedness under a uniform model Zhang Z.
Gentile A.L.
Ciravegna F.
EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference English 2011 Measuring semantic relatedness between words or concepts is a crucial process for many Natural Language Processing tasks. Existing methods exploit semantic evidence from a single knowledge source, and are predominantly evaluated only in the general domain. This paper introduces a method of harnessing different knowledge sources under a uniform model for measuring semantic relatedness between words or concepts. Using Wikipedia and WordNet as examples, and evaluated in both the general and biomedical domains, it successfully combines strengths from both knowledge sources and outperforms the state-of-the-art on many datasets. 0 0
ITEM: Extract and integrate entities from tabular data to RDF knowledge base Guo X.
Yirong Chen
Jilin Chen
Du X.
Lecture Notes in Computer Science English 2011 Many RDF knowledge bases are created and enlarged by mining and extracting web data. Hence their data sources are limited to social tagging networks, such as Wikipedia, WordNet, IMDB, etc., and their precision is not guaranteed. In this paper, we propose a new system, ITEM, for extracting and integrating entities from tabular data into an RDF knowledge base. ITEM can efficiently compute the schema mapping between a table and a KB, and inject novel entities into the KB. Therefore, ITEM can enlarge and improve an RDF KB by employing tabular data, which is assumed to be of high quality. ITEM detects the schema mapping between a table and the RDF KB using only tuples, rather than the table's schema information. Experimental results show that our system has high precision and good performance. 0 0
Linguistically informed mining lexical semantic relations from Wikipedia structure Maciej Piasecki
Agnieszka Indyka-Piasecka
Roman Kurc
Lecture Notes in Computer Science English 2011 A method for the extraction of wordnet lexico-semantic relations from Polish Wikipedia articles is proposed. The method is based on a set of hand-written lexico-morphosyntactic extraction patterns that were developed in less than one man-week of workload. Two kinds of patterns were proposed: patterns processing encyclopaedia articles as text documents, and patterns utilising the information about the structure of a Wikipedia article (including links). Two types of evaluation were applied: manual assessment of the extracted data, and evaluation on the basis of the application of the extracted data as an additional knowledge source in automatic plWordNet expansion. 0 0
Linked open data: For NLP or by NLP? Choi K.-S. PACLIC 25 - Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation English 2011 If we call Wikipedia or Wiktionary as "web knowledge resource", the question is about whether they can contribute to NLP itself and furthermore to the knowledge resource for knowledge-leveraged computational thinking. Comparing with the structure insideWordNet from the view of its human- encoded precise classification scheme, such web knowledge resource has category structure based on collectively generated tags and structures like infobox. They are called also as "Collectively Generated Content" and its structuralized content based on collective intelligence. It is heavily based on linking among terms and we also say that it is one member of linked data. The problem is in whether such collectively generated knowledge resource can contribute to NLP and how much it can be effective. The more clean primitives of linked terms in web knowledge resources will be assumed, based on the essential property of Guarino (2000) or intrinsic property of Mizoguchi (2004). The number of entries in web knowledge resources increases very fast but their inter-relationships are indirectly calculated by their link structure. We can imagine that their entries could be mapped to one of instances under some structure of primitive concepts, like synsets of WordNet. Let's name such primitives to be "intrinsic tokens" that are derived from collectively generated knowledge resource under the principles of intrinsic properties. The procedure could be approximately proven and it will be a kind of statistical logic. We then go to the issues about what area of NLP can be solved by the so-called intrinsic tokens and their relations, a resultant approximately generated primitives. Can NLP contribute to the user generation process of content? Consider the structure of infobox in Wikipedia more closely. 
We discuss how NLP can help populate relevant entries, for instance through social-network mechanisms for multi-lingual environments and for information-extraction purposes. Traditional NLP starts from words in text, but work is now also under way on web corpora with hyperlinks and HTML markup. In web knowledge resources, words and chunks have underlying URIs, a kind of annotation. This signals a new paradigm for NLP. 0 0
Model-portability experiments for textual temporal analysis Kolomiyets O.
Bethard S.
Moens M.-F.
ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies English 2011 We explore a semi-supervised approach for improving the portability of time expression recognition to non-newswire domains: we generate additional training examples by substituting temporal expression words with potential synonyms. We explore using synonyms both from WordNet and from the Latent Words Language Model (LWLM), which predicts synonyms in context using an unsupervised approach. We evaluate a state-of-the-art time expression recognition system trained both with and without the additional training examples using data from TempEval 2010, Reuters and Wikipedia. We find that the LWLM provides substantial improvements on the Reuters corpus, and smaller improvements on the Wikipedia corpus. We find that WordNet alone never improves performance, though intersecting the examples from the LWLM and WordNet provides more stable results for Wikipedia. 0 0
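The synonym-substitution idea above — generating extra training examples by swapping temporal-expression words for synonyms — can be sketched in a few lines. The synonym table below is an invented stand-in for the WordNet or LWLM lookups the paper describes:

```python
# Toy sketch of training-data augmentation by synonym substitution.
# SYNONYMS is a fabricated miniature; a real system would query
# WordNet or a Latent Words Language Model here.
SYNONYMS = {
    "year": ["yr", "twelvemonth"],
    "day": ["date"],
}

def augment(tokens):
    """Yield new token sequences with one word replaced by a synonym."""
    examples = []
    for i, tok in enumerate(tokens):
        for syn in SYNONYMS.get(tok, []):
            # keep the rest of the sentence intact, swap one token
            examples.append(tokens[:i] + [syn] + tokens[i + 1:])
    return examples

print(augment(["last", "year"]))
# → [['last', 'yr'], ['last', 'twelvemonth']]
```

Intersecting the variants produced from two synonym sources, as the paper does with LWLM and WordNet, would simply mean keeping only substitutions proposed by both tables.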
NULEX: An open-license broad coverage lexicon McFate C.J.
Forbus K.D.
ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies English 2011 Broad coverage lexicons for the English language have traditionally been handmade. This approach, while accurate, requires too much human labor. Furthermore, resources contain gaps in coverage, contain specific types of information, or are incompatible with other resources. We believe that the state of open-license technology is such that a comprehensive syntactic lexicon can be automatically compiled. This paper describes the creation of such a lexicon, NU-LEX, an open-license feature-based lexicon for general purpose parsing that combines WordNet, VerbNet, and Wiktionary and contains over 100,000 words. NU-LEX was integrated into a bottom up chart parser. We ran the parser through three sets of sentences, 50 sentences total, from the Simple English Wikipedia and compared its performance to the same parser using Comlex. Both parsers performed almost equally with NU-LEX finding all lex-items for 50% of the sentences and Comlex succeeding for 52%. Furthermore, NULEX's shortcomings primarily fell into two categories, suggesting future research directions. 0 0
Ontology enhancement and concept granularity learning: Keeping yourself current and adaptive Jiang S.
Bing L.
Sun B.
YanChun Zhang
Lam W.
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining English 2011 As a well-known semantic repository, WordNet is widely used in many applications. However, due to costly editing and maintenance, WordNet's capability of keeping up with the emergence of new concepts is poor compared with on-line encyclopedias such as Wikipedia. To keep WordNet current with folk wisdom, we propose a method to enhance WordNet automatically by merging Wikipedia entities into WordNet, constructing an enriched ontology named WorkiNet. WorkiNet keeps the desirable structure of WordNet while capturing abundant information from Wikipedia. We also propose a learning approach that generates a tailor-made semantic concept collection for a given document collection. The learning process takes the characteristics of the document collection into consideration, and the semantic concepts in the tailor-made collection can be used as new features for document representation. The experimental results show that the adaptively generated feature space can significantly outperform a static one in text mining tasks, and that WorkiNet dominates WordNet most of the time due to its higher coverage. Copyright 2011 ACM. 1 0
Relational similarity measure: An approach combining Wikipedia and wordnet Cao Y.J.
Lu Z.
Cai S.M.
Applied Mechanics and Materials English 2011 Relational similarities between two pairs of words are the degrees of their semantic relations. The Vector Space Model (VSM) can be used to measure the relational similarity between two pairs of words, but it requires manually created patterns, and these patterns are limited. Recently, Latent Relational Analysis (LRA) was proposed and achieves state-of-the-art results; however, it is time-consuming and cannot express implicit semantic relations. In this study, we propose a new approach to measuring relational similarities between two pairs of words by combining WordNet 3.0 and Wikipedia on the Web, so that implicit semantic relations can be mined from a very large corpus. The proposed approach has two main characteristics: (1) a new method is proposed for the pattern-extraction step, which considers the various parts of speech of words; (2) WordNet 3.0 is applied to calculate the semantic relatedness between a pair of words, so that their implicit semantic relation can be expressed. In an experimental evaluation on the 374 SAT multiple-choice word-analogy questions, the precision of the proposed approach is 43.9%, which is lower than that of LRA as proposed by Turney in 2005, but the suggested approach focuses mainly on mining the semantic relations among words. 0 0
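The WordNet-based relatedness step in the entry above is typically a path-length measure over the is-a hierarchy. A minimal sketch, using a hand-made toy taxonomy rather than real WordNet data, and the common scoring rule sim = 1 / (1 + shortest-path length):

```python
# Toy path-based semantic relatedness over a hand-made is-a tree.
# PARENT is illustrative only; real systems walk WordNet hypernyms.
PARENT = {
    "car": "vehicle",
    "bus": "vehicle",
    "vehicle": "artifact",
    "hammer": "tool",
    "tool": "artifact",
}

def ancestors(word):
    """Return the chain [word, parent, grandparent, ...] up to the root."""
    chain = [word]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def path_similarity(a, b):
    """1 / (1 + distance) via the nearest common ancestor; None if unrelated."""
    ca, cb = ancestors(a), ancestors(b)
    best = None
    for i, x in enumerate(ca):
        if x in cb:
            dist = i + cb.index(x)
            best = dist if best is None else min(best, dist)
    return None if best is None else 1.0 / (1.0 + best)
```

Here "car" and "bus" meet at "vehicle" (distance 2, similarity 1/3), while "car" and "hammer" meet only at "artifact" (distance 4, similarity 0.2).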
Searching the Web for Peculiar Images based on hand-made concept hierarchies Hattori S. Proceedings of the 2011 7th International Conference on Next Generation Web Services Practices, NWeSP 2011 English 2011 Most research on Image Retrieval (IR) has aimed at clearing away noisy images and allowing users to retrieve only acceptable images of a target object specified by its object-name. We can now obtain enough acceptable images of a target object just by submitting its object-name to a conventional keyword-based Web image search engine. However, because the search results rarely include uncommon images of the object, we often obtain only its common images and cannot easily gain exhaustive knowledge about its appearance (look and feel). As a next step for IR, it is very important to discriminate between "Typical Images" and "Peculiar Images" among the acceptable images, and moreover, to collect many different kinds of peculiar images exhaustively. This paper proposes a method to search the Web for peculiar images by expanding or modifying a target object-name (as an original query) with its hyponyms, based on hand-made concept hierarchies such as WordNet and Wikipedia. 0 0
Semantic processing of database textual attributes using Wikipedia Campana J.R.
Medina J.M.
Vila M.A.
Lecture Notes in Computer Science English 2011 Text attributes in databases contain rich semantic information that is seldom processed or used. This paper proposes a method to extract and semantically represent concepts from texts stored in databases. This process relies on tools such as WordNet and Wikipedia to identify concepts extracted from texts and represent them as a basic ontology whose concepts are annotated with search terms. This ontology can play diverse roles. It can be seen as a conceptual summary of the content of an attribute, which can be used as a means to navigate through the textual content of an attribute. It can also be used as a profile for text search using the terms associated to the ontology concepts. The ontology is built as a subset of Wikipedia category graph, selected using diverse metrics. Category selection using these metrics is discussed and an example application is presented and evaluated. 0 0
Simultaneous joint and conditional modeling of documents tagged from two perspectives Das P.
Srihari R.
Fu Y.
International Conference on Information and Knowledge Management, Proceedings English 2011 This paper explores correspondence and mixture topic modeling of documents tagged from two different perspectives. There has been ongoing work in topic modeling of documents with tags (tag-topic models) where words and tags typically reflect a single perspective, namely document content. However, words in documents can also be tagged from different perspectives, for example, a syntactic perspective as in part-of-speech tagging or an opinion perspective as in sentiment tagging. The models proposed in this paper are novel in: (i) the consideration of two different tag perspectives - a document-level tag perspective that is relevant to the document as a whole, and a word-level tag perspective pertaining to each word in the document; (ii) the attribution of latent topics with word-level tags and the labeling of latent topics with images in the case of multimedia documents; and (iii) discovering the possible correspondence of the words to document-level tags. The proposed correspondence tag-topic model shows better predictive power, i.e. higher likelihood on held-out test data, than all existing tag-topic models and even a supervised topic model. To evaluate the models in practical scenarios, quantitative measures between the outputs of the proposed models and the ground-truth domain knowledge have been explored. Manually assigned (gold standard) document category labels in Wikipedia pages are used to validate model-generated tag suggestions using a measure of pairwise concept similarity within an ontological hierarchy like WordNet. Using a news corpus, automatic relationship discovery between person names was performed and compared to a robust baseline. 0 0
Two birds with one stone: Learning semantic models for text categorization and word sense disambiguation Roberto Navigli
Stefano Faralli
Aitor Soroa
Oier De Lacalle
Eneko Agirre
International Conference on Information and Knowledge Management, Proceedings English 2011 In this paper we present a novel approach to learning semantic models for multiple domains, which we use to categorize Wikipedia pages and to perform domain Word Sense Disambiguation (WSD). In order to learn a semantic model for each domain we first extract relevant terms from the texts in the domain and then use these terms to initialize a random walk over the WordNet graph. Given an input text, we check the semantic models, choose the appropriate domain for that text and use the best-matching model to perform WSD. Our results show considerable improvements on text categorization and domain WSD tasks. 0 0
Using a lexical dictionary and a folksonomy to automatically construct domain ontologies Macias-Galindo D.
Wong W.
Cavedon L.
Thangarajah J.
Lecture Notes in Computer Science English 2011 We present and evaluate MKBUILD, a tool for creating domain-specific ontologies. These ontologies, which we call Modular Knowledge Bases (MKBs), contain concepts and associations imported from existing large-scale knowledge resources, in particular WordNet and Wikipedia. The combination of WordNet's human-crafted taxonomy and Wikipedia's semantic associations between articles produces a highly connected resource. Our MKBs are used by a conversational agent operating in a small computational environment. We constructed several domains with our technique, and then conducted an evaluation by asking human subjects to rate the domain-relevance of the concepts included in each MKB on a 3-point scale. The proposed methodology achieved precision values between 71% and 88% and recall between 37% and 95% in the evaluation, depending on how the middle-score judgements are interpreted. The results are encouraging considering the cross-domain nature of the construction process and the difficulty of representing concepts as opposed to terms. 0 0
Using ontological and document similarity to estimate museum exhibit relatedness Grieser K.
Baldwin T.
Bohnert F.
Sonenberg L.
Journal of Computing and Cultural Heritage English 2011 Exhibits within cultural heritage collections such as museums and art galleries are arranged by experts with intimate knowledge of the domain, but there may exist connections between individual exhibits that are not evident in this representation. For example, the visitors to such a space may have their own opinions on how exhibits relate to one another. In this article, we explore the possibility of estimating the perceived relatedness of exhibits by museum visitors through a variety of ontological and document similarity-based methods. Specifically, we combine the Wikipedia category hierarchy with lexical similarity measures, and evaluate the correlation with the relatedness judgements of visitors. We compare our measure with simple document similarity calculations, based on either Wikipedia documents or Web pages taken from the Web site for the museum of interest. We also investigate the hypothesis that physical distance in the museum space is a direct representation of the conceptual distance between exhibits. We demonstrate that ontological similarity measures are highly effective at capturing perceived relatedness, and that the proposed RACO (Related Article Conceptual Overlap) method is able to achieve results closest to relatedness judgements provided by human annotators compared to existing state-of-the-art measures of semantic relatedness. 0 0
A recursive approach to entity ranking and list completion using entity determining terms, qualifiers and prominent n-grams Ramanathan M.
Rajagopal S.
Karthik V.
Murugeshan M.S.
Saswati Mukherjee
Lecture Notes in Computer Science English 2010 This paper presents our approach for the INEX 2009 Entity Ranking track, which consists of two subtasks, viz. Entity Ranking and List Completion. Retrieving the correct entities according to the user query is a three-step process: extracting the required information from the query and the provided categories; extracting the relevant documents, which may be either prospective entities or intermediate pointers to prospective entities, by making use of the structure available in the Wikipedia corpus; and finally ranking the resultant set of documents. We extract the Entity Determining Terms (EDTs), Qualifiers and prominent n-grams from the query, strategically exploit the relation between the extracted terms and the structure and connectedness of the corpus to retrieve links that are highly probable of being entities, and then use a recursive mechanism for retrieving relevant documents through the Lucene search. Our ranking mechanism combines various approaches that make use of category information, links, titles and WordNet information, initial description and the text of the document. 0 0
Aligning WordNet synsets and wikipedia articles Fernando S.
Stevenson M.
AAAI Workshop - Technical Report English 2010 This paper examines the problem of finding articles in Wikipedia to match noun synsets in WordNet. The motivation is that these articles enrich the synsets with much more information than is already present in WordNet. Two methods are used. The first is title matching, following redirects and disambiguation links. The second is information retrieval over the set of articles. The methods are evaluated over a random sample set of 200 noun synsets which were manually annotated. With 10 candidate articles retrieved for each noun synset, the methods achieve recall of 93%. The manually annotated data set and the automatically generated candidate article sets are available online for research purposes. Copyright © 2010, Association for the Advancement of Artificial Intelligence. All rights reserved. 0 0
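The title-matching method in the entry above — look up a synset lemma as an article title and follow redirects — can be sketched as below. The redirect table and article set are made-up miniatures, not real Wikipedia data:

```python
# Sketch of synset-to-article title matching with redirect following.
# REDIRECTS and ARTICLES are fabricated for illustration; a real system
# would query a Wikipedia dump or the MediaWiki API.
REDIRECTS = {"Automobile": "Car"}
ARTICLES = {"Car", "Dog"}

def match_title(lemma):
    """Normalise a lemma, follow redirects, return the article title or None."""
    title = lemma.replace("_", " ").capitalize()
    seen = set()  # guard against redirect cycles
    while title in REDIRECTS and title not in seen:
        seen.add(title)
        title = REDIRECTS[title]
    return title if title in ARTICLES else None

print(match_title("automobile"))
# → Car
```

Disambiguation pages would be handled similarly, by expanding one title into several candidates instead of a single redirect target.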
Automatic evaluation of topic coherence Newman D.
Lau J.H.
Grieser K.
Baldwin T.
NAACL HLT 2010 - Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference English 2010 This paper introduces the novel task of topic coherence evaluation, whereby a set of words, as generated by a topic model, is rated for coherence or interpretability. We apply a range of topic scoring models to the evaluation task, drawing on WordNet, Wikipedia and the Google search engine, and existing research on lexical similarity/relatedness. In comparison with human scores for a set of learned topics over two distinct datasets, we show a simple co-occurrence measure based on point-wise mutual information over Wikipedia data is able to achieve results for the task at or nearing the level of inter-annotator correlation, and that other Wikipedia-based lexical relatedness methods also achieve strong results. Google produces strong, if less consistent, results, while our results over WordNet are patchy at best. 0 0
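The co-occurrence measure in the entry above scores a topic by the average pairwise pointwise mutual information (PMI) of its words, with probabilities estimated from document co-occurrence. A minimal sketch over a toy corpus (the real paper estimates the counts from Wikipedia):

```python
import math
from itertools import combinations

# Sketch of PMI-based topic coherence: mean pairwise PMI of the topic's
# words, with co-occurrence counted per document. The documents passed
# in below are invented; the paper uses sliding windows over Wikipedia.
def pmi_coherence(topic_words, docs):
    n = len(docs)

    def p(*words):
        """Fraction of documents containing all the given words."""
        return sum(all(w in d for w in words) for d in docs) / n

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        joint = p(w1, w2)
        if joint > 0:  # PMI undefined for zero co-occurrence
            scores.append(math.log(joint / (p(w1) * p(w2))))
    return sum(scores) / len(scores) if scores else 0.0
```

Words that co-occur more often than chance get positive PMI, so a coherent topic ("cat", "dog") scores above an incoherent one ("cat", "car").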
Automatic generation of semantic fields for annotating web images Gang Wang
Chua T.S.
Ngo C.-W.
Wang Y.C.
Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference English 2010 The overwhelming amounts of multimedia contents have triggered the need for automatically detecting the semantic concepts within the media contents. With the development of photo sharing websites such as Flickr, we are able to obtain millions of images with user-supplied tags. However, user tags tend to be noisy, ambiguous and incomplete. In order to improve the quality of tags to annotate web images, we propose an approach to build Semantic Fields for annotating the web images. The main idea is that the images are more likely to be relevant to a given concept, if several tags to the image belong to the same Semantic Field as the target concept. Semantic Fields are determined by a set of highly semantically associated terms with high tag co-occurrences in the image corpus and in different corpora and lexica such as WordNet and Wikipedia. We conduct experiments on the NUS-WIDE web image corpus and demonstrate superior performance on image annotation as compared to the state-of-the-art approaches. 0 0
Automatically acquiring a semantic network of related concepts Szumlanski S.
Gomez F.
International Conference on Information and Knowledge Management, Proceedings English 2010 We describe the automatic construction of a semantic network, in which over 3000 of the most frequently occurring monosemous nouns in Wikipedia (each appearing between 1,500 and 100,000 times) are linked to their semantically related concepts in the WordNet noun ontology. Relatedness between nouns is discovered automatically from co-occurrence in Wikipedia texts using an information-theoretic inspired measure. Our algorithm then capitalizes on salient sense clustering among related nouns to automatically disambiguate them to their appropriate senses (i.e., concepts). Through the act of disambiguation, we begin to accumulate relatedness data for concepts denoted by polysemous nouns, as well. The resultant concept-to-concept associations, covering 17,543 nouns, and 27,312 distinct senses among them, constitute a large-scale semantic network of related concepts that can be conceived of as augmenting the WordNet noun ontology with related-to links. 0 0
BabelNet: Building a very large multilingual semantic network Roberto Navigli
Ponzetto S.P.
ACL 2010 - 48th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference English 2010 In this paper we present BabelNet - a very large, wide-coverage multilingual semantic network. The resource is automatically constructed by means of a methodology that integrates lexicographic and encyclopedic knowledge from WordNet and Wikipedia. In addition Machine Translation is also applied to enrich the resource with lexical information for all languages. We conduct experiments on new and existing gold-standard datasets to show the high quality and coverage of the resource. 0 0
Concept neighbourhoods in knowledge organisation systems Priss U.
Old L.J.
Advances in Knowledge Organization English 2010 This paper discusses the application of concept neighbourhoods (in the sense of formal concept analysis) to knowledge organisation systems. Examples are provided using Roget's Thesaurus, WordNet and Wikipedia categories. 0 0
Exploring the semantics behind a collection to improve automated image annotation Llorente A.
Motta E.
Stefan Ruger
Lecture Notes in Computer Science English 2010 The goal of this research is to explore several semantic relatedness measures that help to refine annotations generated by a baseline non-parametric density estimation algorithm. Thus, we analyse the benefits of performing a statistical correlation using the training set or using the World Wide Web versus approaches based on a thesaurus like WordNet or Wikipedia (considered as a hyperlink structure). Experiments are carried out using the dataset provided by the 2009 edition of the ImageCLEF competition, a subset of the MIR-Flickr 25k collection. Best results correspond to approaches based on statistical correlation as they do not depend on a prior disambiguation phase like WordNet and Wikipedia. Further work needs to be done to assess whether proper disambiguation schemas might improve their performance. 0 0
Extraction and approximation of numerical attributes from the Web Davidov D.
Rappoport A.
ACL 2010 - 48th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference English 2010 We present a novel framework for automated extraction and approximation of numerical object attributes such as height and weight from the Web. Given an object-attribute pair, we discover and analyze attribute information for a set of comparable objects in order to infer the desired value. This allows us to approximate the desired numerical values even when no exact values can be found in the text. Our framework makes use of relation defining patterns and WordNet similarity information. First, we obtain from the Web and WordNet a list of terms similar to the given object. Then we retrieve attribute values for each term in this list, and information that allows us to compare different objects in the list and to infer the attribute value range. Finally, we combine the retrieved data for all terms from the list to select or approximate the requested value. We evaluate our method using automated question answering, WordNet enrichment, and comparison with answers given in Wikipedia and by leading search engines. In all of these, our framework provides a significant improvement. 0 0
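The final combination step described above — approximating a missing value from the values retrieved for similar objects — can be sketched with a simple median fallback. The attribute table and species names below are fabricated for illustration:

```python
from statistics import median

# Sketch of numerical-attribute approximation: if no exact value for the
# object is known, fall back to the median of values retrieved for
# WordNet-similar terms. KNOWN_HEIGHT_CM is invented example data.
KNOWN_HEIGHT_CM = {"wolf": 80, "coyote": 60, "jackal": 45}

def approximate(obj, similar_terms, known=KNOWN_HEIGHT_CM):
    """Return the known value, a median over similar objects, or None."""
    if obj in known:
        return known[obj]
    values = [known[t] for t in similar_terms if t in known]
    return median(values) if values else None

print(approximate("dingo", ["wolf", "coyote", "jackal"]))
# → 60
```

The median is a reasonable robust choice here, since retrieved attribute values for comparable objects can contain outliers from extraction noise.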
Folksonomy expansion process using soft techniques Martinez-Cruz C.
Angeletou S.
Proceedings - 2010 7th International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2010 English 2010 The use of folksonomies involves several problems due to the lack of semantics associated with them. The nature of these structures makes it difficult to enrich them semantically by associating them with meaningful terms of the Semantic Web. This task implies a phase of disambiguation and another of expansion of the initial tagset, returning an enlarged, contextualised set that includes synonyms, hyperonyms, gloss terms, etc. In this novel proposal, a technique based on confidence and similarity degrees is applied to weight this extended tagset, in order to allow the user to obtain a customised resulting tagset. Moreover, a comparison between the two main thesauri, WordNet and Wikipedia, is presented, owing to their great influence on the disambiguation and expansion process. 0 0
Identifying and ranking possible semantic and common usage categories of search engine queries Hemayati R.T.
Meng W.
Yu C.
Lecture Notes in Computer Science English 2010 In this paper, we propose a method for identifying and ranking possible categories of any user query based on the meanings and common usages of the terms and phrases within the query. Our solution utilizes WordNet and Wikipedia to recognize phrases and to determine the basic meanings and usages of each term or phrase in a query. The categories are ranked based on their likelihood in capturing the query's intention. Experimental results show that our method can achieve high accuracy. 0 0
Identifying animals with dynamic location-aware and semantic hierarchy-based image browsing for different cognitive style learners Wen D.
Liu M.-C.
Huang Y.-M.
Hung P.-H.
Proceedings - 10th IEEE International Conference on Advanced Learning Technologies, ICALT 2010 English 2010 Lack of an overall ecological knowledge structure is a critical reason for learners' failure in keyword-based search. To address this issue, this paper first presents the dynamic location-aware and semantic hierarchy (DLASH) designed for learners to browse images, which aims to identify learners' current sights of interest and provide adaptive assistance accordingly in ecological learning. The main idea is based on the observation that the species of plants and animals are discontinuously distributed around the planet, and hence their semantic hierarchy, besides its structural similarity with WordNet, is related to location information. This study then investigates how different cognitive styles of the learners influence the use of DLASH in their image browsing. The preliminary results show that the learners perform better when using DLASH-based image browsing than when using the Flickr one. In addition, cognitive styles have more effect on image browsing in the DLASH version than in the Flickr one. 0 0
Knowledge-rich Word Sense Disambiguation rivaling supervised systems Ponzetto S.P.
Roberto Navigli
ACL 2010 - 48th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference English 2010 One of the main obstacles to high-performance Word Sense Disambiguation (WSD) is the knowledge acquisition bottleneck. In this paper, we present a methodology to automatically extend WordNet with large amounts of semantic relations from an encyclopedic resource, namely Wikipedia. We show that, when provided with a vast amount of high-quality semantic relations, simple knowledge-lean disambiguation algorithms compete with state-of-the-art supervised WSD systems in a coarse-grained all-words setting and outperform them on gold-standard domain-specific datasets. 0 0
Learning a large scale of ontology from Japanese Wikipedia Susumu Tamagawa
Shinya Sakurai
Takuya Tejima
Takeshi Morita
Noriaki Izumi
Takahira Yamaguchi
Transactions of the Japanese Society for Artificial Intelligence Japanese 2010 This paper discusses how to learn a large-scale ontology from the Japanese Wikipedia. The learned ontology includes the following properties: rdfs:subClassOf (IS-A relationship), rdf:type (class-instance relationship), owl:Object/DatatypeProperty (Infobox triple), rdfs:domain (property domain), and skos:altLabel (synonym). Experimental case studies show that the learned Japanese Wikipedia Ontology compares favourably with existing general linguistic ontologies, such as EDR and Japanese WordNet, in terms of building cost and richness of structural information. 0 0
Learning animal concepts with semantic hierarchy-based location-aware image browsing and ecology task generator Liu M.-C.
Wen D.
Huang Y.-M.
6th IEEE International Conference on Wireless, Mobile and Ubiquitous Technologies in Education, WMUTE 2010: Mobile Social Media for Learning and Education in Formal and Informal Settings English 2010 This study first observes that the lack of an overall ecological knowledge structure is one critical reason for learners' failure in keyword search. Therefore, in order to identify their current sights of interest, the dynamic location-aware and semantic hierarchy (DLASH) is presented for learners to browse images. This hierarchy mainly considers that plant and animal species are discontinuously distributed around the planet; hence it combines location information with a semantic hierarchy constructed through WordNet. After learners confirm their intended information needs, this study also provides them with three kinds of image-based learning tasks: similar-images comparison, concept-map fill-out and placement-map fill-out. These tasks are designed on the basis of Ausubel's advance organizers, improved by integrating three new properties: displaying the nodes of the concepts with authentic images, automatically generating the knowledge structure by computer, and interactively integrating new and old knowledge. 0 0
MENTA: Inducing multilingual taxonomies from Wikipedia De Melo G.
Gerhard Weikum
International Conference on Information and Knowledge Management, Proceedings English 2010 In recent years, a number of projects have turned to Wikipedia to establish large-scale taxonomies that describe orders of magnitude more entities than traditional manually built knowledge bases. So far, however, the multilingual nature of Wikipedia has largely been neglected. This paper investigates how entities from all editions of Wikipedia as well as WordNet can be integrated into a single coherent taxonomic class hierarchy. We rely on linking heuristics to discover potential taxonomic relationships, graph partitioning to form consistent equivalence classes of entities, and a Markov chain-based ranking approach to construct the final taxonomy. This results in MENTA (Multilingual Entity Taxonomy), a resource that describes 5.4 million entities and is presumably the largest multilingual lexical knowledge base currently available. 0 0
Morpheus: A deep web question answering system Grant C.
George C.P.
Gumbs J.-D.
Wilson J.N.
Dobbins P.J.
IiWAS2010 - 12th International Conference on Information Integration and Web-Based Applications and Services English 2010 When users search the deep web, the essence of their search is often found in a previously answered query. The Morpheus question answering system reuses prior searches to answer similar user queries. Queries are represented in a semistructured format that contains query terms and referenced classes within a specific ontology. Morpheus answers questions by using methods from prior successful searches. The system ranks stored methods based on a similarity quasimetric defined on assigned classes of queries. Similarity depends on the class heterarchy in an ontology and its associated text corpora. Morpheus revisits the prior search pathways of the stored searches to construct possible answers. Realm-based ontologies are created using Wikipedia pages, associated categories, and the synset heterarchy of WordNet. This paper describes the entire process with emphasis on the matching of user queries to stored answering methods. Copyright 2010 ACM. 0 0
Named entity disambiguation for german news articles Lommatzsch A.
Ploch D.
De Luca E.W.
Albayrak S.
LWA 2010 - Lernen, Wissen und Adaptivitat - Learning, Knowledge, and Adaptivity, Workshop Proceedings English 2010 Named entity disambiguation has become an important research area, providing the basis for improving search engine precision and for enabling semantic search. Current approaches to named entity disambiguation are usually based on exploiting structured semantic and lingual resources (e.g. WordNet, DBpedia). Unfortunately, each of these resources on its own covers insufficient information for the task of named entity disambiguation: on the one hand, WordNet comprises a relatively small number of named entities, while on the other hand DBpedia provides only little context for named entities. Our approach is based on the use of multi-lingual Wikipedia data. We show how the combination of multi-lingual resources can be used for named entity disambiguation. Based on a German and an English document corpus, we evaluate various similarity measures and algorithms for extracting data for named entity disambiguation. We show that intelligent filtering of context data and the combination of multilingual information provide high-quality named entity disambiguation results. 0 0
On the sampling of web images for learning visual concept classifiers Zhu S.
Gang Wang
Ngo C.-W.
Jiang Y.-G.
CIVR 2010 - 2010 ACM International Conference on Image and Video Retrieval English 2010 Visual concept learning often requires a large set of training images. In practice, nevertheless, acquiring noise-free training labels with sufficient positive examples is always expensive. A plausible solution for training data collection is by sampling the largely available user-tagged images from social media websites. With the general belief that the probability of correct tagging is higher than that of incorrect tagging, such a solution often sounds feasible, though it is not without challenges. First, user-tags can be subjective and, to a certain extent, ambiguous. For instance, an image tagged with "whales" may simply be a picture about an ocean museum. Learning the concept "whales" with such training samples will not be effective. Second, user-tags can be overly abbreviated. For instance, an image about the concept "wedding" may be tagged with "love" or simply the couple's names. As a result, crawling sufficient positive training examples is difficult. This paper empirically studies the impact of exploiting tagged images for concept learning, investigating how the quality of pseudo training images affects concept detection performance. In addition, we propose a simple approach, named semantic field, for predicting the relevance between a target concept and the tag list associated with the images. Specifically, the relevance is determined through concept-tag co-occurrence by exploring external sources such as WordNet and Wikipedia. The proposed approach is shown to be effective in selecting pseudo training examples, exhibiting better performance in concept learning than other approaches such as those based on keyword sampling and tag voting. 0 0
On the saturation of YAGO Suda M.
Weidenbach C.
Wischnewski P.
Lecture Notes in Computer Science English 2010 YAGO is an ontology generated automatically from Wikipedia and WordNet. It is represented in a proprietary flat text file format, and its core comprises 10 million facts and formulas. We present a translation of YAGO into the Bernays-Schönfinkel Horn class with equality. A new variant of the superposition calculus is sound, complete and terminating for this class. Together with extended term indexing data structures, the new calculus is implemented in Spass-YAGO. YAGO can be finitely saturated by Spass-YAGO in about 1 hour. We have found 49 inconsistencies in the originally generated ontology, which we have fixed. Spass-YAGO can then prove non-trivial conjectures with respect to the resulting saturated and consistent clause set of about 1.4 GB in less than one second. 0 0
QMUL @ MediaEval 2010 Tagging Task: Semantic query expansion for predicting user tags Chandramouli K.
Kliegr T.
Piatrik T.
Izquierdo E.
MediaEval Benchmarking Initiative for Multimedia Evaluation - The "Multi" in Multimedia: Speech, Audio, Visual Content, Tags, Users, Context, MediaEval 2010 Working Notes Proceedings 2010 This paper describes our participation in "The Wild Wild Web Tagging Task @ MediaEval 2010", which aims to predict user tags based on features derived from video such as speech, audio, visual content or associated textual or social information. Two tasks were pursued: (i) closed-set annotations and (ii) open-set annotations. We have attempted to evaluate whether using only a limited number of features (video title, filename and description) can be compensated for by semantic expansion with NLP tools, Wikipedia, and WordNet. This technique proved successful on the open-set task, with approximately 20% of generated tags being considered relevant by all manual annotators. On the closed-set task, the best result (MAP 0.3) was achieved on tokenized filenames combined with video descriptions, indicating that filenames are a valuable tag predictor. 0 0
Real anaphora resolution is hard: The case of German Klenner M.
Angela Fahrni
Sennrich R.
Lecture Notes in Computer Science English 2010 We introduce a system for anaphora resolution for German that uses various resources in order to develop a real system as opposed to systems based on idealized assumptions, e.g. the use of true mentions only or perfect parse trees and perfect morphology. The components that we use to replace such idealizations comprise a full-fledged morphology, a Wikipedia-based named entity recognition, a rule-based dependency parser and a German wordnet. We show that under these conditions coreference resolution is (at least for German) still far from being perfect. 0 0
SINAI at Tagging Task professional in MediaEval 2010 Perea-Ortega J.M.
Montejo-Raez A.
Diaz-Galiano M.C.
Martin-Valdivia M.T.
MediaEval Benchmarking Initiative for Multimedia Evaluation - The "Multi" in Multimedia: Speech, Audio, Visual Content, Tags, Users, Context, MediaEval 2010 Working Notes Proceedings 2010 In this paper we present some experiments on video categorization using dual-language ASR transcriptions. We applied two different approaches depending on the language of the video transcriptions. For the original Dutch ASR transcriptions we applied an information retrieval approach. One document was generated per subject label, manually selecting the main related article from Dutch Wikipedia. Then, the Terrier information retrieval system was used for retrieving each video transcription against each generated document, selecting the two best results as labels or possible categories for that video. For the translated English ASR transcriptions, our approach is based on calculating the semantic similarity between each subject label, along with its synonyms extracted from WordNet, and the nouns recognized in each video transcription. The two labels with the highest similarity are selected as categories for that video. Overall, the low results obtained show that this is a difficult task that requires not only working with the text of the transcriptions (speech-based retrieval) but also using the visual content of videos (e.g. visual-concept detection). Nevertheless, the proposed approach based on calculating semantic similarity seems promising for this task. 0 0
Scalable semantic annotation of text using lexical and Web resources Zavitsanos E.
Tsatsaronis G.
Varlamis I.
Paliouras G.
Lecture Notes in Computer Science English 2010 In this paper we are dealing with the task of adding domain-specific semantic tags to a document, based solely on the domain ontology and generic lexical and Web resources. In this manner, we avoid the need for trained domain-specific lexical resources, which hinder the scalability of semantic annotation. More specifically, the proposed method maps the content of the document to concepts of the ontology, using the WordNet lexicon and Wikipedia. The method comprises a novel combination of measures of semantic relatedness and word sense disambiguation techniques to identify the most related ontology concepts for the document. We test the method on two case studies: (a) a set of summaries, accompanying environmental news videos, (b) a set of medical abstracts. The results in both cases show that the proposed method achieves reasonable performance, thus pointing to a promising path for scalable semantic annotation of documents. 0 0
Structural Semantic Relatedness: A knowledge-based method to named entity disambiguation Xianpei Han
Jun Zhao
ACL 2010 - 48th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference English 2010 The name ambiguity problem has raised urgent demands for efficient, high-quality named entity disambiguation methods. In recent years, the increasing availability of large-scale, rich semantic knowledge sources (such as Wikipedia and WordNet) has created new opportunities to enhance named entity disambiguation by developing algorithms which can exploit these knowledge sources to the fullest. The problem is that these knowledge sources are heterogeneous, and most of the semantic knowledge within them is embedded in complex structures, such as graphs and networks. This paper proposes a knowledge-based method, called Structural Semantic Relatedness (SSR), which can enhance named entity disambiguation by capturing and leveraging the structural semantic knowledge in multiple knowledge sources. Empirical results show that, in comparison with the classical BOW-based methods and social network-based methods, our method can significantly improve the disambiguation performance by 8.7% and 14.7%, respectively. 0 0
Syntactic/semantic structures for textual entailment recognition Yashar Mehdad
Moschitti A.
Zanzotto F.M.
NAACL HLT 2010 - Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference English 2010 In this paper, we describe an approach based on off-the-shelf parsers and semantic resources for the Recognizing Textual Entailment (RTE) challenge that can be generally applied to any domain. Syntax is exploited by means of tree kernels whereas lexical semantics is derived from heterogeneous resources, e.g. WordNet or distributional semantics through Wikipedia. The joint syntactic/semantic model is realized by means of tree kernels, which can exploit lexical relatedness to match syntactically similar structures, i.e. whose lexical compounds are related. The comparative experiments across different RTE challenges and traditional systems show that our approach consistently and meaningfully achieves high accuracy, without requiring any adaptation or tuning. 0 0
UNIpedia: A unified ontological knowledge platform for semantic content tagging and search Kalender M.
Dang J.
Uskudarli S.
Proceedings - 2010 IEEE 4th International Conference on Semantic Computing, ICSC 2010 English 2010 The emergence of an ever increasing number of documents makes it more and more difficult to locate them when desired. An approach for improving search results is to make use of user-generated tags. This approach has led to improvements. However, they are limited because tags are (1) free from context and form, (2) user generated, (3) used for purposes other than description, and (4) often ambiguous. As formal, declarative knowledge representation models, ontologies provide a foundation upon which machine-understandable knowledge can be obtained and tagged, making semantic tagging and search possible. With an ontology, semantic web technologies can be utilized to automatically generate semantic tags. WordNet has been used for this purpose. However, this approach falls short in tagging documents that refer to new concepts and instances. To address this challenge, we present UNIpedia - a platform for unifying different ontological knowledge bases by reconciling their instances as WordNet concepts. Our mapping algorithms use rule-based heuristics extracted from ontological and statistical features of concepts and instances. UNIpedia is used to semantically tag contemporary documents. For this purpose, the Wikipedia and OpenCyc knowledge bases, which are known to contain up-to-date instances and reliable metadata about them, are selected. Experiments show that the accuracy of the mapping between WordNet and Wikipedia is 84% for the most relevant concept name and 90% for the appropriate sense. 0 0
Wisdom of crowds versus wisdom of linguists - Measuring the semantic relatedness of words Torsten Zesch
Iryna Gurevych
Natural Language Engineering English 2010 In this article, we present a comprehensive study aimed at computing semantic relatedness of word pairs. We analyze the performance of a large number of semantic relatedness measures proposed in the literature with respect to different experimental conditions, such as (i) the datasets employed, (ii) the language (English or German), (iii) the underlying knowledge source, and (iv) the evaluation task (computing scores of semantic relatedness, ranking word pairs, solving word choice problems). To our knowledge, this study is the first to systematically analyze semantic relatedness on a large number of datasets with different properties, while emphasizing the role of the knowledge source compiled either by the wisdom of linguists (i.e., classical wordnets) or by the wisdom of crowds (i.e., collaboratively constructed knowledge sources like Wikipedia). The article discusses benefits and drawbacks of different approaches to evaluating semantic relatedness. We show that results should be interpreted carefully to evaluate particular aspects of semantic relatedness. For the first time, we apply a vector-based measure of semantic relatedness, relying on a concept space built from documents, to the first paragraph of Wikipedia articles, to English WordNet glosses, and to GermaNet-based pseudo glosses. Contrary to previous research (Strube and Ponzetto 2006; Gabrilovich and Markovitch 2007; Zesch et al. 2007), we find that wisdom of crowds based resources are not superior to wisdom of linguists based resources. We also find that using the first paragraph of a Wikipedia article as opposed to the whole article leads to better precision, but decreases recall.
Finally, we present two systems that were developed to aid the experiments presented herein and are freely available for research purposes: (i) DEXTRACT, software for semi-automatically constructing corpus-driven semantic relatedness datasets, and (ii) JWPL, a Java-based high-performance Wikipedia Application Programming Interface (API) for building natural language processing (NLP) applications. 0 0
A large margin approach to anaphora resolution for neuroscience knowledge discovery Burak Ozyurt I. Proceedings of the 22nd International Florida Artificial Intelligence Research Society Conference, FLAIRS-22 English 2009 A discriminative large margin classifier based approach to anaphora resolution for neuroscience abstracts is presented. The system employs both syntactic and semantic features. A support vector machine based word sense disambiguation method combining evidence from three methods, that use WordNet and Wikipedia, is also introduced and used for semantic features. The support vector machine anaphora resolution classifier with probabilistic outputs achieved almost four-fold improvement in accuracy over the baseline method. Copyright © 2009, Assocation for the Advancement of ArtdicaI Intelligence (www.aaai.org). All rights reserved. 0 0
A new approach for semantic web service discovery and propagation based on agents Neiat A.G.
Shavalady S.H.
Mohsenzadeh M.
Rahmani A.M.
Proceedings of the 5th International Conference on Networking and Services, ICNS 2009 English 2009 Service discovery for Web-based systems integration has become a challenging task. To improve the automation of Web services interoperation, a lot of technologies are recommended, such as semantic Web services and agents. In this paper, an approach for semantic Web service discovery and propagation based on semantic Web services and FIPA multi-agents is proposed. A broker that exposes semantic interoperability between semantic Web service providers and agents by translating WSDL to DF descriptions for semantic Web services, and vice versa, is proposed. We describe how the proposed architecture analyzes the request and, after analysis, matches or publishes it. The ontology management in the broker creates the user ontology and merges it with general ontology (i.e. WordNet, Yago, Wikipedia ...). We also describe the recommender, which analyzes the created WSDL based on the functional and non-functional requirements and then recommends it to the Web service provider to increase its retrieval probability in related queries. 0 0
A study on Linking Wikipedia categories to WordNet synsets using text similarity Antonio Toral
Oscar Ferrandez
Eneko Agirre
Munoz R.
International Conference Recent Advances in Natural Language Processing, RANLP English 2009 This paper studies the application of text similarity methods to disambiguate ambiguous links between WordNet nouns and Wikipedia categories. The methods include word overlap between glosses, random projections, WordNet-based similarity, and a full-fledged textual entailment system. Both unsupervised and supervised combinations have been tried. The gold standard with disambiguated links is publicly available. The results range from 64.7% for the first-sense heuristic to 68% for an unsupervised combination and up to 77.74% for a supervised combination. 0 0
A study on the semantic relatedness of query and document terms in information retrieval Muller C.
Iryna Gurevych
EMNLP 2009 - Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: A Meeting of SIGDAT, a Special Interest Group of ACL, Held in Conjunction with ACL-IJCNLP 2009 English 2009 The use of lexical semantic knowledge in information retrieval has been a field of active study for a long time. Collaborative knowledge bases like Wikipedia and Wiktionary, which have been applied in computational methods only recently, offer new possibilities to enhance information retrieval. In order to find the most beneficial way to employ these resources, we analyze the lexical semantic relations that hold among query and document terms and compare how these relations are represented by a measure for semantic relatedness. We explore the potential of different indicators of document relevance that are based on semantic relatedness and compare the characteristics and performance of the knowledge bases Wikipedia, Wiktionary and WordNet. 0 0
Accurate semantic class classifier for coreference resolution Huang Z.
Zeng G.
Xu W.
Celikyilmaz A.
EMNLP 2009 - Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: A Meeting of SIGDAT, a Special Interest Group of ACL, Held in Conjunction with ACL-IJCNLP 2009 English 2009 There have been considerable attempts to incorporate semantic knowledge into coreference resolution systems: different knowledge sources such as WordNet and Wikipedia have been used to boost the performance. In this paper, we propose new ways to extract WordNet feature. This feature, along with other features such as named entity feature, can be used to build an accurate semantic class (SC) classifier. In addition, we analyze the SC classification errors and propose to use relaxed SC agreement features. The proposed accurate SC classifier and the relaxation of SC agreement features on ACE2 coreference evaluation can boost our baseline system by 10.4% and 9.7% using MUC score and anaphor accuracy respectively. 0 0
An agent- based semantic web service discovery framework Neiat A.G.
Mohsenzadeh M.
Forsati R.
Rahmani A.M.
Proceedings - 2009 International Conference on Computer Modeling and Simulation, ICCMS 2009 English 2009 Web services have changed the Web from a database of static documents to a service provider. To improve the automation of Web services interoperation, a lot of technologies are recommended, such as semantic Web services and agents. In this paper we propose a framework for semantic Web service discovery based on semantic Web services and FIPA multi-agents. This paper provides a broker which provides semantic interoperability between semantic Web service providers and agents by translating WSDL to DF descriptions for semantic Web services and DF descriptions to WSDL for FIPA multi-agents. We describe how the proposed architecture analyzes the request and matches the search query. The ontology management in the broker creates the user ontology and merges it with general ontology (i.e. WordNet, Yago, Wikipedia ⋯). We also describe the recommendation component that recommends the WSDL to the Web service provider to increase its retrieval probability in related queries. 0 0
An unsupervised model of exploiting the web to answer definitional questions Wu Y.
Kashioka H.
Proceedings - 2009 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2009 English 2009 In order to build accurate target profiles, most definition question answering (QA) systems primarily involve utilizing various external resources, such as WordNet, Wikipedia, Biography.com, etc. However, these external resources are not always available or helpful when answering definition questions. In contrast, this paper proposes an unsupervised classification model, called the U-Model, which can liberate definitional QA systems from heavily depending on a variety of external resources by applying sentence expansion (SE) and an SVM classifier. Experimental results from testing on English TREC test sets reveal that the proposed U-Model not only significantly outperforms the baseline system but also requires no specific external resources. 0 0
Classifying web pages by using knowledge bases for entity retrieval Kiritani Y.
Ma Q.
Masatoshi Yoshikawa
Lecture Notes in Computer Science English 2009 In this paper, we propose a novel method to classify Web pages by using knowledge bases for entity search, which is a kind of typical Web search for information related to a person, location or organization. First, we map a Web page to entities according to the similarities between the page and the entities. Various methods for computing such similarity are applied. For example, we can compute the similarity between a given page and a Wikipedia article describing a certain entity. The frequency of an entity appearing in the page is another factor used in computing the similarity. Second, we construct a directed acyclic graph, named PEC graph, based on the relations among Web pages, entities, and categories, by referring to YAGO, a knowledge base built on Wikipedia and WordNet. Finally, by analyzing the PEC graph, we classify Web pages into categories. The results of some preliminary experiments validate the methods proposed in this paper. 0 0
Conceptual image retrieval over a large scale database Adrian Popescu
Le Borgne H.
Moellic P.-A.
Lecture Notes in Computer Science English 2009 Image retrieval in large-scale databases is currently based on a textual chains matching procedure. However, this approach requires an accurate annotation of images, which is not the case on the Web. To tackle this issue, we propose a reformulation method that reduces the influence of noisy image annotations. We extract a ranked list of related concepts for terms in the query from WordNet and Wikipedia, and use them to expand the initial query. Then some visual concepts are used to re-rank the results for queries containing, explicitly or implicitly, visual cues. First evaluations on a diversified corpus of 150,000 images were convincing since the proposed system was ranked 4th and 2nd at the WikipediaMM task of the ImageCLEF 2008 campaign [1]. 0 0
Cross-lingual Dutch to english alignment using EuroWordNet and Dutch Wikipedia Gosse Bouma CEUR Workshop Proceedings English 2009 This paper describes a system for linking the thesaurus of the Netherlands Institute for Sound and Vision to English WordNet and dbpedia. We used EuroWordNet, a multilingual wordnet, and Dutch Wikipedia as intermediaries for the two alignments. EuroWordNet covers most of the subject terms in the thesaurus, but the organization of the cross-lingual links makes selection of the most appropriate English target term almost impossible. Using page titles, redirects, disambiguation pages, and anchor text harvested from Dutch Wikipedia gives reasonable performance on subject terms and geographical terms. Many person and organization names in the thesaurus could not be located in (Dutch or English) Wikipedia. 0 0
Domain specific ontology on computer science Salahli M.A.
Gasimzade T.M.
Guliyev A.I.
ICSCCW 2009 - 5th International Conference on Soft Computing, Computing with Words and Perceptions in System Analysis, Decision and Control English 2009 In this paper, we introduce an application system based on a domain-specific ontology. Some design problems of the ontology are discussed. The ontology is based on WordNet's database and consists of Turkish and English terms on computer science and informatics. Second, we present a method for determining a set of words related to a given concept and computing the degree of semantic relatedness between them. The presented method has been used for the semantic search process carried out by our application. 0 0
Explicit versus latent concept models for cross-language information retrieval Philipp Cimiano
Schultz A.
Sizov S.
Sorg P.
Staab S.
IJCAI International Joint Conference on Artificial Intelligence English 2009 The field of information retrieval and text manipulation (classification, clustering) still strives for models allowing semantic information to be folded in to improve performance with respect to standard bag-of-words based models. Many approaches aim at concept-based retrieval, but differ in the nature of the concepts, which range from linguistic concepts as defined in lexical resources such as WordNet, through latent topics derived from the data itself - as in Latent Semantic Indexing (LSI) or Latent Dirichlet Allocation (LDA) - to Wikipedia articles as proxies for concepts, as in the recently proposed Explicit Semantic Analysis (ESA) model. A crucial question which has not been answered so far is whether models based on explicitly given concepts (as in the ESA model for instance) perform inherently better than retrieval models based on "latent" concepts (as in LSI and/or LDA). In this paper we investigate this question more closely in the context of a cross-language setting, which inherently requires concept-based retrieval bridging between different languages. In particular, we compare the recently proposed ESA model with two latent models (LSI and LDA), showing that the former is clearly superior to both. From a general perspective, our results contribute to clarifying the role of explicit vs. implicitly derived or latent concepts in (cross-language) information retrieval research. 0 0
Exploiting internal and external semantics for the clustering of short texts using world knowledge Hu X.
Sun N.
Zhang C.
Chua T.-S.
International Conference on Information and Knowledge Management, Proceedings English 2009 Clustering of short texts, such as snippets, presents great challenges in existing aggregated search techniques due to the problem of data sparseness and the complex semantics of natural language. As short texts do not provide sufficient term occurrence information, traditional text representation methods, such as the "bag of words" model, have several limitations when directly applied to short text tasks. In this paper, we propose a novel framework to improve the performance of short text clustering by exploiting the internal semantics from the original text and external concepts from world knowledge. The proposed method employs a hierarchical three-level structure to tackle the data sparsity problem of original short texts and reconstruct the corresponding feature space with the integration of multiple semantic knowledge bases - Wikipedia and WordNet. Empirical evaluation with Reuters and a real web dataset demonstrates that our approach is able to achieve significant improvement as compared to the state-of-the-art methods. Copyright 2009 ACM. 0 0
Harvesting, searching, and ranking knowledge on the web Gerhard Weikum Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, WSDM'09 English 2009 There are major trends to advance the functionality of search engines to a more expressive semantic level (e.g., [2, 4, 6, 7, 8, 9, 13, 14, 18]). This is enabled by employing large-scale information extraction [1, 11, 20] of entities and relationships from semistructured as well as natural-language Web sources. In addition, harnessing Semantic-Web-style ontologies [22] and reaching into Deep-Web sources [16] can contribute towards a grand vision of turning the Web into a comprehensive knowledge base that can be efficiently searched with high precision. This talk presents ongoing research towards this objective, with emphasis on our work on the YAGO knowledge base [23, 24] and the NAGA search engine [14] but also covering related projects. YAGO is a large collection of entities and relational facts that are harvested from Wikipedia and WordNet with high accuracy and reconciled into a consistent RDF-style "semantic" graph. For further growing YAGO from Web sources while retaining its high quality, pattern-based extraction is combined with logic-based consistency checking in a unified framework [25]. NAGA provides graph-template-based search over this data, with powerful ranking capabilities based on a statistical language model for graphs. Advanced queries and the need for ranking approximate matches pose efficiency and scalability challenges that are addressed by algorithmic and indexing techniques [15, 17]. YAGO is publicly available and has been imported into various other knowledge-management projects including DB-pedia. YAGO shares many of its goals and methodologies with parallel projects along related lines. These include Avatar [19], Cimple/DBlife [10, 21], DBpedia [3], Know-ItAll/TextRunner [12, 5], Kylin/KOG [26, 27], and the Libra technology [18, 28] (and more). 
Together they form an exciting trend towards providing comprehensive knowledge bases with semantic search capabilities. copyright 2009 ACM. 0 0
Improving website user model automatically using a comprehensive lexical semantic resource Safarkhani B.
Mohsenzadeh M.
Rahmani A.M.
2009 International Conference on E-Business and Information System Security, EBISS 2009 English 2009 A major component in any web personalization system is its user model. Recently a number of researches have been done to incorporate semantics of a web site in representation of its users. All of these efforts use either a specific manually constructed taxonomy or ontology or a general purpose one like WordNet to map page views into semantic elements. However, building a hierarchy of concepts manually is time consuming and expensive. On the other hand, general purpose resources suffer from low coverage of domain specific terms. In this paper we intend to address both these shortcomings. Our contribution is that we introduce a mechanism to automatically improve the representation of the user in the website using a comprehensive lexical semantic resource. We utilize Wikipedia, the largest encyclopedia to date, as a rich lexical resource to enhance the automatic construction of vector model representation of user interests. We evaluate the effectiveness of the resulting model using concepts extracted from this promising resource. 0 0
Large-scale taxonomy mapping for restructuring and integrating Wikipedia Ponzetto S.P.
Roberto Navigli
IJCAI International Joint Conference on Artificial Intelligence English 2009 We present a knowledge-rich methodology for disambiguating Wikipedia categories with WordNet synsets and using this semantic information to restructure a taxonomy automatically generated from the Wikipedia system of categories. We evaluate against a manual gold standard and show that both category disambiguation and taxonomy restructuring perform with high accuracy. Besides, we assess these methods on automatically generated datasets and show that we are able to effectively enrich WordNet with a large number of instances from Wikipedia. Our approach produces an integrated resource, thus bringing together the fine-grained classification of instances in Wikipedia and a well-structured top-level taxonomy from WordNet. 0 0
Linking Dutch Wikipedia categories to EuroWordNet Gosse Bouma Computational Linguistics in the Netherlands 2009 - Selected Papers from the 19th CLIN Meeting, CLIN 2009 English 2009 Wikipedia provides category information for a large number of named entities, but the category structure of Wikipedia is associative, and not always suitable for linguistic applications. For this reason, a merger of Wikipedia and WordNet has been proposed. In this paper, we address the word sense disambiguation problem that needs to be solved when linking Dutch Wikipedia categories to polysemous Dutch EuroWordNet literals. We show that a method based on automatically acquired predominant word senses outperforms a method based on word overlap between Wikipedia supercategories and WordNet hypernyms. We compare the coverage of the resulting categorization with that of a corpus-based system that uses automatically acquired category labels. 0 0
Minimally supervised question classification and answering based on WordNet and Wikipedia Jian Chang
Yen T.-H.
Tsai R.T.-H.
Proceedings of the 21st Conference on Computational Linguistics and Speech Processing, ROCLING 2009 English 2009 In this paper, we introduce an automatic method for classifying a given question using broad semantic categories in an existing lexical database (i.e., WordNet) as the class tagset. For this, we also constructed a large-scale entity supersense database that maps over 1.5 million entities to the 25 WordNet lexicographer's files (supersenses) from the titles of Wikipedia entries. To show the usefulness of our work, we implement a simple redundancy-based system that takes advantage of the large-scale semantic database to perform question classification and named entity classification for open-domain question answering. Experimental results show that the proposed method outperforms the baseline of not using question classification. 0 0
Question classification - A semantic approach using wordnet and wikipedia Ray S.K.
Sandesh Singh
Joshi B.P.
Proceedings of the 4th Indian International Conference on Artificial Intelligence, IICAI 2009 English 2009 Question Answering Systems provide answers to users' questions in succinct form, and the question classification module of a Question Answering System plays a very important role in pinpointing the exact answer to a question. In the literature, incorrect question classification has been cited as one of the major causes of poor performance of Question Answering Systems, which underlines the importance of question classification module design. In this paper, we propose a question classification method that combines the powerful semantic features of WordNet and the vast knowledge repository of Wikipedia to describe informative terms explicitly. We trained our method on a standard set of 5500 questions (by UIUC), then tested it on 5 TREC question collections and compared our results. Our system's average question classification accuracy is 89.55%, in comparison with 80.2% by Zhang and Lee [17], 84.2% by Li and Roth [7], and 89.2% by Huang [6]. This accuracy suggests the effectiveness of the method, which is promising in the field of open-domain question classification. 0 0
Reference resolution challenges for intelligent agents: The need for knowledge McShane M. IEEE Intelligent Systems English 2009 The difficult cases of reference in natural language processing require intelligent agents that can reason about language using machine-tractable knowledge. The knowledge-lean model relies on various statistical techniques that are trained over a manually defined collection, typically using a small number of features such as morphological agreement, the text distance between the entity and the potential coreferent, and various other features that do not require text understanding. The incorporation of some semantic features drawn from Wikipedia and WordNet improves reference resolution for some referring expressions. One promoter of knowledge-lean corpus-based methods was the Message Understanding Conference (MUC) reference resolution task, for which sponsors provided annotated corpora for the training and evaluation of the competing systems. The two requirements for the reference annotation strategy were the need for greater than 95 percent interannotator agreement and the ability to annotate quickly and cheaply. 0 0
Sentence-level opinion-topic association for opinion detection in blogs Muhammad Saad Missen M.
Boughanem M.
Proceedings - International Conference on Advanced Information Networking and Applications, AINA English 2009 Opinion detection in blogs has always been a challenge for researchers. One of the challenges is to find documents that specifically contain opinions on a user's information need, which requires text processing at the sentence level rather than the document level. In this paper, we propose an opinion detection approach that tackles the problem by using some document-level heuristics and by processing documents at the sentence level using different semantic similarity relations of WordNet between sentence words and a list of weighted query terms expanded through the encyclopedia Wikipedia. According to initial results, our approach performs well, with a MAP of 0.2177, an improvement of 28.89% over baseline results obtained through the BM25 matching formula. TREC Blog 2006 data is used as the test data collection. 0 0
The design of semantic web services discovery model based on multi proxy Linna L. Proceedings - 2009 IEEE International Conference on Intelligent Computing and Intelligent Systems, ICIS 2009 English 2009 Web services have changed the Web from a database of static documents to a service provider. To improve the automation of Web service interoperation, several technologies have been proposed, such as semantic Web services and proxies. In this paper we propose a model for semantic Web service discovery based on semantic Web services and FIPA multi-proxies. The paper provides a broker that enables semantic interoperability between semantic Web service providers and proxies by translating WSDL to DF descriptions for semantic Web services, and DF descriptions to WSDL for FIPA multi-proxies. We describe how the proposed architecture analyzes a request and matches it against the search query. The ontology management component in the broker creates the user ontology and merges it with a general ontology (e.g., WordNet, Yago, Wikipedia). We also describe the recommendation component that recommends the WSDL to Web service providers to increase their retrieval probability in related queries. 0 0
Towards semantic tagging in collaborative environments Chandramouli K.
Kliegr T.
Svatek V.
Izquierdo E.
DSP 2009: 16th International Conference on Digital Signal Processing, Proceedings English 2009 Tags provide an efficient and effective way of organizing resources, but they are not always available. A technique called SCM/THD, investigated in this paper, extracts entities from free-text annotations and, using the Lin similarity measure over the WordNet thesaurus, classifies them into a controlled vocabulary of tags. Hypernyms extracted from Wikipedia are used to map uncommon entities to WordNet synsets. In collaborative environments, users can assign multiple annotations to the same object, increasing the amount of information available. Assuming that the semantics of the annotations overlap, this redundancy can be exploited to generate higher-quality tags. A preliminary experiment presented in the paper evaluates the consistency and quality of tags generated from multiple annotations of the same image. The results obtained on an experimental dataset comprising 62 annotations from four annotators show that the accuracy of a simple majority vote surpasses the average accuracy obtained by assessing the annotations individually by 18%. A moderate-strength correlation has been found between the quality of the generated tags and the consistency of the annotations. 0 0
Unsupervised knowledge extraction for taxonomies of concepts from Wikipedia Barbu E.
Poesio M.
International Conference Recent Advances in Natural Language Processing, RANLP English 2009 A novel method for unsupervised acquisition of knowledge for taxonomies of concepts from raw Wikipedia text is presented. We assume that the concepts classified under the same node in a taxonomy are described in a comparable way in Wikipedia. The concepts in 6 taxonomies extracted from WordNet are mapped onto Wikipedia pages and the lexico-syntactic patterns describing semantic structures expressing relevant knowledge for the concepts are automatically learnt. 0 0
Using wikipedia for hierarchical finer categorization of named entities Pappu A. PACLIC 23 - Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation English 2009 Wikipedia is one of the largest growing structured resources on the Web and can be used as a training corpus in natural language processing applications. In this work, we present a method to categorize named entities under the hierarchical fine-grained categories provided by the Wikipedia taxonomy. Such a categorization can be further used to extract semantic relations among these named entities. More specifically, we examine instances of different kinds of named entities picked from Wikipedia articles categorized under 55 categories. We employ a Maximum Entropy based method to perform supervised learning that learns from the local context of a named entity as well as higher-level context such as hypernyms/hyponyms from Wikipedia and WordNet. 0 0
Using wordnet's semantic relations for opinion detection in blogs Missen M.M.S.
Boughanem M.
Lecture Notes in Computer Science English 2009 Opinion detection in blogs has always been a challenge for researchers. One of the challenges is to find documents that specifically contain opinions on a user's information need, which requires text processing at the sentence level rather than the document level. In this paper, we propose an opinion detection approach that addresses this problem by processing documents at the sentence level using different semantic similarity relations of WordNet between sentence words and a list of weighted query words expanded through the encyclopedia Wikipedia. According to initial results, our approach performs well, with a MAP of 0.28 and P@10 of 0.64, an improvement of 27% over baseline results. TREC Blog 2006 data is used as the test data collection. 0 0
WikiSense: Supersense tagging of Wikipedia named entities based on WordNet Jian Chang
Tsai R.T.-H.
Chang J.S.
PACLIC 23 - Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation English 2009 In this paper, we introduce a minimally supervised method for learning to classify named-entity titles in a given encyclopedia into broad semantic categories in an existing ontology. Our main idea involves using overlapping entries in the encyclopedia and ontology, together with a small set of 30 hand-tagged parenthetic explanations, to automatically generate the training data. The proposed method involves automatically recognizing whether a title is a named entity, automatically generating two sets of training data, and automatically building a classification model based on textual and non-textual features. We present WikiSense, an implementation of the proposed method for extending the named entity coverage of WordNet by sense-tagging Wikipedia titles. Experimental results show that WikiSense achieves accuracy of over 95% and applicability of nearly 80% for all NE titles in Wikipedia. WikiSense produces over 1.2 million NEs tagged with broad categories based on the lexicographers' files of WordNet, effectively extending WordNet into a very large scale semantic resource, potentially useful for many natural language related tasks. © 2009 by Joseph Chang, Richard Tzong-Han Tsai, and Jason S. Chang. 0 0
WordVenture - Cooperative WordNet editor: Architecture for lexical semantic acquisition Szymanski J. KEOD 2009 - 1st International Conference on Knowledge Engineering and Ontology Development, Proceedings English 2009 This article presents an architecture for acquiring lexical semantics in a collaborative-approach paradigm. The system provides functionality for editing semantic networks in a wikipedia-like style. The core of the system is a user-friendly interface based on interactive graph navigation, which is used for semantic network presentation and simultaneously provides modification functionality. 0 0
Automated object shape modelling by clustering of web images Scardino G.
Infantino I.
Gaglio S.
VISAPP 2008 - 3rd International Conference on Computer Vision Theory and Applications, Proceedings English 2008 The paper describes a framework for creating shape models of an object using images from the web. Results obtained from different image search engines using simple keywords are filtered, making it possible to select images showing a single object with a well-defined contour. In order to obtain a large set of valid images, the implemented system uses lexical web databases (e.g., WordNet) or free web encyclopedias (e.g., Wikipedia) to get more keywords correlated with the given object. The shapes extracted from the selected images are represented by Fourier descriptors and grouped by the K-means algorithm. Finally, the most representative shapes of the main clusters are taken as prototypical contours of the object. Preliminary experimental results illustrate the effectiveness of the proposed approach. 0 0
Building a textual entailment system for the RTE3 competition. Application to a QA system Adrian Iftene Proceedings of the 2008 10th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2008 English 2008 Textual entailment recognition (RTE) is the task of deciding, given two text fragments, whether the meaning of one text is entailed (can be inferred) from the other. Last year, we built our first Textual Entailment (TE) system, with which we participated in the RTE3 competition. The main idea of this system is to transform the hypothesis making use of extensive semantic knowledge from sources like DIRT, WordNet, Wikipedia, and an acronyms database. Additionally, the system applies complex grammar rules for rephrasing in English and uses the results of a module we built to acquire the extra background knowledge needed. In the first part, we present the system architecture and the results, whose best run ranked 3rd in RTE3 among 45 participating runs from 26 groups. The second part of the paper presents the manner in which we adapted the TE system in order to include it in a Question Answering (QA) system. The aim of using the TE system as a module in the general architecture of a QA system is to improve the ranking of possible answers for questions whose answer type is Measure, Person, Location, Date, or Organization. 0 0
Extracting concept hierarchy knowledge from the Web based on Property Inheritance and Aggregation Hattori S.
Katsumi Tanaka
Proceedings - 2008 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2008 English 2008 Concept hierarchy knowledge, such as hyponymy and meronymy, is very important for various natural language processing systems. While WordNet and Wikipedia are manually constructed and maintained as lexical ontologies, many researchers have tackled how to extract concept hierarchies automatically, rather than manually, from very large corpora of text documents such as the Web. However, their methods are mostly based on lexico-syntactic patterns as sufficient but not necessary conditions of hyponymy and meronymy, so they can achieve high precision but low recall when using stricter patterns, or high recall but low precision when using looser patterns. Therefore, we need necessary conditions of hyponymy and meronymy to achieve high recall without low precision. In this paper, not only "Property Inheritance" from a target concept to its hyponyms but also "Property Aggregation" from its hyponyms to the target concept is assumed to form necessary and sufficient conditions of hyponymy, and we propose a method to extract concept hierarchy knowledge from the Web based on property inheritance and property aggregation. 0 0
Lexical and semantic resources for NLP: From words to meanings Gentile A.L.
Pierpaolo Basile
Iaquinta L.
Giovanni Semeraro
Lecture Notes in Computer Science English 2008 A user expresses her information need through words with a precise meaning, but from the machine's point of view this meaning does not come with the word; a further step is needed to associate it with the words automatically. Techniques that process human language are required, along with linguistic and semantic knowledge stored in distinct and heterogeneous resources, which play an important role during all Natural Language Processing (NLP) steps. Resource management is a challenging problem, together with the correct association between URIs coming from the resources and the meanings of the words. This work presents a service that, given a lexeme (an abstract unit of morphological analysis in linguistics, which roughly corresponds to a set of words that are different forms of the same word), returns all syntactic and semantic information collected from a list of lexical and semantic resources. The proposed strategy consists of merging data originating from stable resources, such as WordNet, with data collected dynamically from evolving sources, such as the Web or Wikipedia. The strategy is implemented in a wrapper to a set of popular linguistic resources that provides a single point of access to them, transparent to the user, to accomplish the computational-linguistic problem of getting a rich set of linguistic and semantic annotations in a compact way. 0 0
Old wine or warm beer: Target-specific sentiment analysis of adjectives Angela Fahrni
Klenner M.
AISB 2008 Convention: Communication, Interaction and Social Intelligence - Proceedings of the AISB 2008 Symposium on Affective Language in Human and Machine English 2008 In this paper, we focus on the target-specific polarity determination of adjectives. A domain-specific noun, the target noun, is modified by a qualifying adjective. Rather than having a prior polarity, adjectives often bear a target-specific polarity; in some cases, a single adjective even switches polarity depending on the accompanying noun. In order to realise such a 'sentiment disambiguation', a two-stage model is proposed: identification of domain-specific targets and the construction of a target-specific polarity adjective lexicon. We use Wikipedia for automatic target detection and a bootstrapping approach to determine the target-specific adjective polarity. We show that our approach outperforms a baseline system based on a prior adjective lexicon derived from SentiWordNet. 0 0
Unsupervised entity classification with Wikipedia and Wordnet Kliegr T. CEUR Workshop Proceedings English 2008 The task of classifying entities appearing in textual annotations into an arbitrary set of classes has not been extensively researched, yet it is useful in multimedia retrieval. We propose an unsupervised algorithm which expresses entities and classes as WordNet synsets and uses the Lin measure to classify them. Real-time hypernym discovery from Wikipedia is used to map uncommon entities to WordNet. Further, this paper investigates the possibility of improving performance by utilizing the global context with simulated annealing. 0 0
Using wiktionary for computing semantic relatedness Torsten Zesch
Muller C.
Iryna Gurevych
Proceedings of the National Conference on Artificial Intelligence English 2008 We introduce Wiktionary as an emerging lexical semantic resource that can be used as a substitute for expert-made resources in AI applications. We evaluate Wiktionary on the pervasive task of computing semantic relatedness for English and German by means of correlation with human rankings and solving word choice problems. For the first time, we apply a concept vector based measure to a set of different concept representations like Wiktionary pseudo glosses, the first paragraph of Wikipedia articles, English WordNet glosses, and GermaNet pseudo glosses. We show that: (i) Wiktionary is the best lexical semantic resource in the ranking task and performs comparably to other resources in the word choice task, and (ii) the concept vector based approach yields the best results on all datasets in both evaluations. Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. 0 1
ZLinks: Semantic framework for invoking contextual linked data Bergman M.K.
Giasson F.
CEUR Workshop Proceedings English 2008 This first-ever demonstration of the new zLinks plug-in shows how any existing Web document link can be automatically transformed into a portal to relevant Linked Data. Each existing link disambiguates to its contextual and relevant subject concept (SC) or named entity (NE). The SCs are grounded in the OpenCyc knowledge base, supplemented by aliases and WordNet synsets to aid disambiguation. The NEs are drawn from Wikipedia as processed via YAGO, and other online fact-based repositories. The UMBEL ontology basis to this framework offers significant further advantages. The zLinks popup is invoked only as desired via unobtrusive user interface cues. 0 0
Comparing Wikipedia and German Wordnet by Evaluating Semantic Relatedness on Multiple Datasets. Torsten Zesch
Iryna Gurevych
Max Muhlhauser
Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT) 2007 We evaluate semantic relatedness measures on different German datasets showing that their performance depends on: (i) the definition of relatedness that was underlying the construction of the evaluation dataset, and (ii) the knowledge source used for computing semantic relatedness. We analyze how the underlying knowledge source influences the performance of a measure. Finally, we investigate the combination of wordnets and Wikipedia to improve the performance of semantic relatedness measures. 0 0
Efficient interactive query expansion with CompleteSearch Holger Bast
Debapriyo Majumdar
Ingmar Weber
International Conference on Information and Knowledge Management, Proceedings English 2007 We present an efficient realization of the following interactive search engine feature: as the user is typing the query, words that are related to the last query word and that would lead to good hits are suggested, as well as selected such hits. The realization has three parts: (i) building clusters of related terms, (ii) adding this information as artificial words to the index such that (iii) the described feature reduces to an instance of prefix search and completion. An efficient solution for the latter is provided by the CompleteSearch engine, with which we have integrated the proposed feature. For building the clusters of related terms we propose a variant of latent semantic indexing that, unlike standard approaches, is completely transparent to the user. By experiments on two large test-collections, we demonstrate that the feature is provided at only a slight increase in query processing time and index size. Copyright 2007 ACM. 0 0
Finding experts using Wikipedia Gianluca Demartini CEUR Workshop Proceedings English 2007 When we want to find experts on the Web, we might search where the knowledge is created by the users. One such knowledge repository is Wikipedia: people's expertise is described in Wikipedia pages, and Wikipedia users themselves can be considered experts on the topics they produce content about. In this paper we propose algorithms to find experts in Wikipedia, following two different approaches: finding experts in the Wikipedia content or among the Wikipedia users. We also use semantics from WordNet and Yago in order to disambiguate expertise topics and to improve retrieval effectiveness. Finally, we show how our methodology can be implemented in a system to improve expert retrieval effectiveness. 0 0
What to be? - Electronic Career Guidance based on semantic relatedness Iryna Gurevych
Muller C.
Torsten Zesch
ACL 2007 - Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics English 2007 We present a study aimed at investigating the use of semantic information in a novel NLP application, Electronic Career Guidance (ECG), in German. ECG is formulated as an information retrieval (IR) task, whereby textual descriptions of professions (documents) are ranked for their relevance to natural language descriptions of a person's professional interests (the topic). We compare the performance of two semantic IR models: (IR-1) utilizing semantic relatedness (SR) measures based on either wordnet or Wikipedia and a set of heuristics, and (IR-2) measuring the similarity between the topic and documents based on Explicit Semantic Analysis (ESA) (Gabrilovich and Markovitch, 2007). We evaluate the performance of SR measures intrinsically on the tasks of (T-1) computing SR, and (T-2) solving Reader's Digest Word Power (RDWP) questions. 0 0
YAWN: A semantically annotated Wikipedia XML corpus Ralf Schenkel
Fabian Suchanek
Gjergji Kasneci
Datenbanksysteme in Business, Technologie und Web, BTW 2007 - 12th Fachtagung des GI-Fachbereichs "Datenbanken und Informationssysteme" (DBIS), Proceedings 2007 The paper presents YAWN, a system to convert the well-known and widely used Wikipedia collection into an XML corpus with semantically rich, self-explaining tags. We introduce algorithms to annotate pages and links with concepts from the WordNet thesaurus. This annotation process exploits categorical information in Wikipedia, which is a high-quality, manually assigned source of information, extracts additional information from lists, and utilizes the invocations of templates with named parameters. We give examples how such annotations can be exploited for high-precision queries. 0 0
Yago: A core of semantic knowledge Suchanek F.M.
Gjergji Kasneci
Gerhard Weikum
16th International World Wide Web Conference, WWW2007 English 2007 We present YAGO, a light-weight and extensible ontology with high coverage and quality. YAGO builds on entities and relations and currently contains more than 1 million entities and 5 million facts. This includes the Is-A hierarchy as well as non-taxonomic relations between entities (such as HASWONPRIZE). The facts have been automatically extracted from Wikipedia and unified with WordNet, using a carefully designed combination of rule-based and heuristic methods described in this paper. The resulting knowledge base is a major step beyond WordNet: in quality, by adding knowledge about individuals like persons, organizations, and products with their semantic relationships, and in quantity, by increasing the number of facts by more than an order of magnitude. Our empirical evaluation of fact correctness shows an accuracy of about 95%. YAGO is based on a logically clean model, which is decidable, extensible, and compatible with RDFS. Finally, we show how YAGO can be further extended by state-of-the-art information extraction techniques. 0 0