Similarity measure

From WikiPapers

The keyword "similarity measure" appears as a keyword or extra keyword in 0 datasets, 0 tools, and 25 publications.

Datasets

There are no datasets for this keyword.

Tools

There are no tools for this keyword.

Publications

Each publication below is listed with its title, author(s), publication venue with language and date, its abstract where available, and the R and C counts from the original table.
Title: Improving contextual advertising matching by using Wikipedia thesaurus knowledge
Author(s): GuanDong Xu, ZongDa Wu, Li G., Chen E.
Published in: Knowledge and Information Systems (English, 2014)
Abstract: As a prevalent type of Web advertising, contextual advertising refers to the placement of the most relevant commercial ads within the content of a Web page, to provide a better user experience and as a result increase the user's ad-click rate. However, due to the intrinsic problems of homonymy and polysemy, the low intersection of keywords, and a lack of sufficient semantics, traditional keyword matching techniques are not able to effectively handle contextual matching and retrieve relevant ads for the user, resulting in an unsatisfactory performance in ad selection. In this paper, we introduce a new contextual advertising approach to overcome these problems, which uses Wikipedia thesaurus knowledge to enrich the semantic expression of a target page (or an ad). First, we map each page into a keyword vector, upon which two additional feature vectors, the Wikipedia concept and category vectors derived from the Wikipedia thesaurus structure, are then constructed. Second, to determine the relevant ads for a given page, we propose a linear similarity fusion mechanism, which combines the above three feature vectors in a unified manner. Last, we validate our approach using a set of real ads and real pages along with the external Wikipedia thesaurus. The experimental results show that our approach outperforms conventional contextual advertising matching approaches and can substantially improve the performance of ad selection.
R: 0, C: 0

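The linear similarity fusion step described in this abstract lends itself to a compact illustration: compute a cosine similarity per feature space (keywords, Wikipedia concepts, Wikipedia categories) and combine the three scores with linear weights. This is a minimal sketch of that idea, not the paper's implementation; the weights and toy vectors are assumptions.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two dense vectors; 0.0 if either is all-zero."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(u @ v / (nu * nv)) if nu and nv else 0.0

def fused_similarity(page, ad, weights=(0.5, 0.3, 0.2)):
    """Linear fusion of keyword-, concept-, and category-space similarities.

    `page` and `ad` are dicts holding one vector per feature space; the
    weights are illustrative and would be tuned on held-out data.
    """
    spaces = ("keywords", "concepts", "categories")
    return sum(w * cosine(page[s], ad[s]) for w, s in zip(weights, spaces))

# Toy vectors standing in for real TF-IDF / Wikipedia-derived features.
page = {"keywords": np.array([1.0, 0.0, 2.0]),
        "concepts": np.array([0.5, 1.0]),
        "categories": np.array([1.0, 1.0, 0.0])}
ad = {"keywords": np.array([1.0, 1.0, 1.0]),
      "concepts": np.array([1.0, 0.0]),
      "categories": np.array([0.0, 1.0, 1.0])}
print(f"fused similarity: {fused_similarity(page, ad):.3f}")
```
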
Title: Sentence similarity by combining explicit semantic analysis and overlapping n-grams
Author(s): Vu H.H., Villaneau J., Said F., Marteau P.-F.
Published in: Lecture Notes in Computer Science (English, 2014)
Abstract: We propose a similarity measure between sentences which combines a knowledge-based measure, that is a lighter version of ESA (Explicit Semantic Analysis), and a distributional measure, Rouge. We used this hybrid measure with two French domain-oriented corpora collected from the Web and compared its similarity scores to those of human judges. In both domains, ESA and Rouge perform better when they are mixed than they do individually. Besides, using the whole Wikipedia base in ESA did not prove necessary, since the best results were obtained with a low number of well-selected concepts.
R: 0, C: 0

Title: Approximate semantic matching of heterogeneous events
Author(s): Hasan S., O'Riain S., Curry E.
Published in: Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems, DEBS'12 (English, 2012)
Abstract: Event-based systems have loose coupling within space, time and synchronization, providing a scalable infrastructure for information exchange and distributed workflows. However, event-based systems are tightly coupled, via event subscriptions and patterns, to the semantics of the underlying event schema and values. The high degree of semantic heterogeneity of events in large and open deployments such as smart cities and the sensor web makes it difficult to develop and maintain event-based systems. In order to address semantic coupling within event-based systems, we propose vocabulary-free subscriptions together with the use of approximate semantic matching of events. This paper examines the requirement of event semantic decoupling and discusses approximate semantic event matching and the consequences it implies for event processing systems. We introduce a semantic event matcher and evaluate the suitability of an approximate hybrid matcher based on both thesauri-based and distributional semantics-based similarity and relatedness measures. The matcher is evaluated over a structured representation of Wikipedia and Freebase events. Initial evaluations show that the approach matches events with a maximal combined precision-recall F1 score of 75.89% on average in all experiments with a subscription set of 7 subscriptions. The evaluation shows how a hybrid approach to semantic event matching outperforms a single similarity measure approach.
R: 0, C: 0

Title: Catching the drift - Indexing implicit knowledge in chemical digital libraries
Author(s): Kohncke B., Tonnies S., Balke W.-T.
Published in: Lecture Notes in Computer Science (English, 2012)
Abstract: In the domain of chemistry, the information gathering process is highly focused on chemical entities. But due to synonyms and different entity representations, the indexing of chemical documents is a challenging process. Considering the field of drug design, the task is even more complex. Domain experts from this field are usually not interested in any chemical entity itself, but in representatives of some chemical class showing a specific reaction behavior. For describing such a reaction behavior of chemical entities, the most interesting parts are their functional groups. The restriction of each chemical class is also related to the entities' reaction behavior, but is further based on the chemist's implicit knowledge. In this paper we present an approach dealing with this implicit knowledge by clustering chemical entities based on their functional groups. However, since such clusters are generally too unspecific, containing chemical entities from different chemical classes, we further divide them into sub-clusters using fingerprint-based similarity measures. We analyze several uncorrelated fingerprint/similarity measure combinations and show that the most similar entities with respect to a query entity can be found in the respective sub-cluster. Furthermore, we use our approach for document retrieval, introducing a new similarity measure based on Wikipedia categories. Our evaluation shows that the sub-clustering leads to suitable results, enabling sophisticated document retrieval in chemical digital libraries.
R: 0, C: 0

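The abstract mentions fingerprint-based similarity measures without naming them; the Tanimoto coefficient over binary fingerprints is the classic example in cheminformatics. A minimal sketch, assuming fingerprints are represented as sets of "on" bit positions (the molecules and bit sets below are invented):

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) coefficient over sets of 'on' fingerprint bits."""
    if not fp_a and not fp_b:
        return 0.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

# Hypothetical bit positions set by structural keys of two molecules.
aspirin_like = {3, 17, 42, 99, 512}
salicylate_like = {3, 17, 42, 256}
print(tanimoto(aspirin_like, salicylate_like))  # 3 shared of 6 total bits -> 0.5
```
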
Title: Horizontal search method for Wikipedia category grouping
Author(s): Myunggwon Hwang, Song S.K., Kim D.J., Hanmin Jung, Jeong D.H., Ko H.
Published in: Proceedings - 2012 IEEE Int. Conf. on Green Computing and Communications, GreenCom 2012, Conf. on Internet of Things, iThings 2012 and Conf. on Cyber, Physical and Social Computing, CPSCom 2012 (English, 2012)
Abstract: Category hierarchies, which show the basic relationship between concepts, are utilized as fundamental clues for semantic information processing in diverse research fields. These research works have employed Wikipedia due to its high coverage of real-world concepts and data reliability. Wikipedia also constructs a category hierarchy, and defines various categories according to the common characteristics of a concept. However, some limitations have been uncovered in the use of a vertical search (especially top-down) to form a set of domain categories. In order to overcome these limitations, this paper proposes a horizontal search method, and uses Wikipedia components to measure the similarity between categories. In an experimental evaluation, we confirm that our method shows wide coverage and high precision for similar (domain) category grouping.
R: 0, C: 0

Title: SemaFor: Semantic document indexing using semantic forests
Author(s): Tsatsaronis G., Varlamis I., Norvag K.
Published in: ACM International Conference Proceeding Series (English, 2012)
Abstract: Traditional document indexing techniques store documents using easily accessible representations, such as inverted indices, which can efficiently scale for large document sets. These structures offer scalable and efficient solutions in text document management tasks, though they omit the cornerstone of the documents' purpose: meaning. They also neglect semantic relations that bind terms into coherent fragments of text that convey messages. When semantic representations are employed, the documents are mapped to the space of concepts and the similarity measures are adapted appropriately to better fit the retrieval tasks. However, these methods can be slow both at indexing and retrieval time. In this paper we propose SemaFor, an indexing algorithm for text documents, which uses semantic spanning forests constructed from lexical resources, like Wikipedia and WordNet, and spectral graph theory in order to represent documents for further processing.
R: 0, C: 0

Title: Automatic semantic web annotation of named entities
Author(s): Charton E., Marie-Pierre Gagnon, Ozell B.
Published in: Lecture Notes in Computer Science (English, 2011)
Abstract: This paper describes a method to perform automated semantic annotation of named entities contained in large corpora. The semantic annotation is made in the context of the Semantic Web. The method is based on an algorithm that compares the set of words that appear before and after the named entity with the content of Wikipedia articles, and identifies the most relevant one by means of a similarity measure. It then uses the link that exists between the selected Wikipedia entry and the corresponding RDF description in the Linked Data project to establish a connection between the named entity and some URI in the Semantic Web. We present our system, discuss its architecture, and describe an algorithm dedicated to ontological disambiguation of named entities contained in large-scale corpora. We evaluate the algorithm and present our results.
R: 0, C: 0

Title: Clustering blogs using document context similarity and spectral graph partitioning
Author(s): Ayyasamy R.K., Alhashmi S.M., Eu-Gene S., Tahayna B.
Published in: Advances in Intelligent and Soft Computing (English, 2011)
Abstract: Semantic-based document clustering has been a challenging problem over the past few years, and its execution depends on modeling the underlying content and its similarity metrics. Existing metrics evaluate pairwise text similarity based on text content, which is referred to as content similarity. The performance of these measures is based on co-occurrences, ignoring the semantics among words. Although several research works have been carried out to solve this problem, we propose a novel similarity measure that exploits an external knowledge base, Wikipedia, to enhance the document clustering task. Wikipedia articles and the main categories were used to predict documents' semantic concepts and affiliate documents to them. In this measure, we incorporate context similarity by constructing a vector with each dimension representing the content similarity between a document and the other documents in the collection. Experimental results on the TREC blog dataset confirm that the use of the context similarity measure can significantly improve the precision of document clustering.
R: 0, C: 0

Title: Concept disambiguation exploiting semantic databases
Author(s): Hossucu A.G., Ayyildiz H., Gokturk Z.O.
Published in: Proceedings of the International Workshop on Semantic Web Information Management, SWIM 2011 (English, 2011)
Abstract: This paper presents a novel approach for resolving ambiguities in concepts that already reside in semantic databases such as Freebase and DBpedia. Unlike standard dictionaries and lexical databases, semantic databases provide a rich hierarchy of semantic relations in ontological structures. Our disambiguation approach decides on the implied sense by computing concept similarity measures as a function of the semantic relations defined in the ontological graph representation of concepts. Our similarity measures also utilize Wikipedia descriptions of concepts. We performed a preliminary experimental evaluation, measuring the disambiguation success rate and its correlation with input text content. The results show that our method outperforms well-known disambiguation methods.
R: 0, C: 0

Title: Geodesic distances for web document clustering
Author(s): Tekir S., Mansmann F., Keim D.
Published in: IEEE SSCI 2011: Symposium Series on Computational Intelligence - CIDM 2011: 2011 IEEE Symposium on Computational Intelligence and Data Mining (English, 2011)
Abstract: While traditional distance measures are often capable of properly describing similarity between objects, in some application areas there is still potential to fine-tune these measures with additional information provided in the data sets. In this work we combine such traditional distance measures for document analysis with link information between documents to improve clustering results. In particular, we test the effectiveness of geodesic distances as similarity measures under the space assumption of spherical geometry in a 0-sphere. Our proposed distance measure is thus a combination of the cosine distance of the term-document matrix and some curvature values in the geodesic distance formula. To estimate these curvature values, we calculate clustering coefficient values for every document from the link graph of the data set and increase their distinctiveness by means of a heuristic, as these clustering coefficient values are rough estimates of the curvatures. To evaluate our work, we perform clustering tests with the k-means algorithm on the English Wikipedia hyperlinked data set with both the traditional cosine distance and our proposed geodesic distance. The effectiveness of our approach is measured by computing micro-precision values of the clusters based on the provided categorical information of each article.
R: 0, C: 0

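One way to read the combination described above: on a sphere of curvature κ, the geodesic between two points separated by angle θ has length θ/√κ, and θ can be taken as the arccosine of the cosine similarity. A hedged sketch of that reading, with a single placeholder curvature standing in for the paper's per-document clustering-coefficient estimates:

```python
import numpy as np

def geodesic_distance(u, v, curvature=1.0):
    """Arc-length distance between two documents on a sphere of curvature k.

    theta = arccos(cosine similarity); radius R = 1/sqrt(k), so d = R * theta.
    The curvature argument is a placeholder for the paper's per-document,
    clustering-coefficient-derived estimates.
    """
    cos_sim = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    theta = np.arccos(np.clip(cos_sim, -1.0, 1.0))
    return theta / np.sqrt(curvature)

# Toy term-weight vectors for two documents.
doc_a = np.array([0.2, 0.7, 0.1])
doc_b = np.array([0.1, 0.6, 0.4])
print(geodesic_distance(doc_a, doc_b))        # curvature 1: plain arc length
print(geodesic_distance(doc_a, doc_b, 4.0))   # higher curvature shrinks distances
```
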
Title: Identifying word translations from comparable corpora using latent topic models
Author(s): Vulic I., De Smet W., Moens M.-F.
Published in: ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (English, 2011)
Abstract: A topic model outputs a set of multinomial distributions over words for each topic. In this paper, we investigate the value of bilingual topic models, i.e., a bilingual Latent Dirichlet Allocation model, for finding translations of terms in comparable corpora without using any linguistic resources. Experiments on a document-aligned English-Italian Wikipedia corpus confirm that the developed methods which only use knowledge from word-topic distributions outperform methods based on similarity measures in the original word-document space. The best results, obtained by combining knowledge from word-topic distributions with similarity measures in the original space, are also reported.
R: 0, C: 0

Title: Leveraging Wikipedia concept and category information to enhance contextual advertising
Author(s): Zongda Wu, Guandong Xu, Rong Pan, Yanchun Zhang, Zhiwen Hu, Jianfeng Lu
Published in: CIKM (English, 2011)
R: 0, C: 0

Title: Measuring similarities between technical terms based on Wikipedia
Author(s): Myunggwon Hwang, Jeong D.-H., Seungwoo Lee, Hanmin Jung
Published in: Proceedings - 2011 IEEE International Conferences on Internet of Things and Cyber, Physical and Social Computing, iThings/CPSCom 2011 (English, 2011)
Abstract: Measuring similarities between terms is useful for semantic information processing such as query expansion and WSD (Word Sense Disambiguation). This study aims at identifying technologies closely related to emerging technologies. Thus, we propose a hybrid method using both category and internal link information in Wikipedia, which is the largest database whose contents everyone can share and edit. Comparative experimental results with a state-of-the-art WLM (Wikipedia Link-based Measure) show that the proposed method works better than each single method.
R: 0, C: 0

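WLM, the baseline named in this abstract, is Milne and Witten's adaptation of the Normalized Google Distance to Wikipedia in-links: relatedness falls as two concepts share fewer of their incoming links. A sketch of that measure with invented link sets and article count:

```python
import math

def wlm_relatedness(links_a: set, links_b: set, n_articles: int) -> float:
    """Wikipedia Link-based Measure (Milne & Witten).

    links_a / links_b: sets of articles that link *to* concepts a and b;
    n_articles: total number of Wikipedia articles. Returns a similarity
    in [0, 1]; 0 if the concepts share no in-links.
    """
    common = links_a & links_b
    if not common:
        return 0.0
    distance = (math.log(max(len(links_a), len(links_b))) - math.log(len(common))) \
               / (math.log(n_articles) - math.log(min(len(links_a), len(links_b))))
    return max(0.0, 1.0 - distance)

# Toy in-link sets; the article IDs and corpus size are made up.
print(wlm_relatedness({1, 2, 3, 4}, {3, 4, 5}, n_articles=1_000_000))
```
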
Title: Ontology-based data instantiation using web service
Author(s): Rezazadeh R., Shadgar B., Osareh A., Rezazadeh A.
Published in: Proceedings - UKSim 5th European Modelling Symposium on Computer Modelling and Simulation, EMS 2011 (English, 2011)
Abstract: The Semantic Web aims at creating a platform where information has its semantics and can be understood and processed by computers themselves with minimum human interference. Ontology theory and its related technology have been developed to help construct such a platform, because ontology promises to encode certain levels of semantics for information and offers a set of common vocabulary for people or computers to communicate with. In this article, we introduce an open-source piece of software called "ontology instantiate". This software has been created for book ontology construction and instantiation using web services. It helps users instantiate an ontology of book information on the Amazon web site. It also allows the user to merge another book ontology into the produced ontology, integrating them into a unified ontology. For the integration of these ontologies, the software uses a wide range of similarity measures, including semantic similarity, string-based similarity and structural similarity. A tree structure is used to investigate structural similarity. Resources such as Wikipedia, WordNet, Google and Yahoo are used to investigate semantic similarity and string-based similarity.
R: 0, C: 0

Title: A negative category based approach for Wikipedia document classification
Author(s): Meenakshi Sundaram Murugeshan, K. Lakshmi, Saswati Mukherjee
Published in: Int. J. Knowl. Eng. Data Min. (English, 2010)
R: 0, C: 0

Title: Exploiting n-gram importance and Wikipedia-based additional knowledge for improvements in GAAC based document clustering
Author(s): Kumar N., Vemula V.V.B., Srinathan K., Vasudeva Varma
Published in: KDIR 2010 - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (English, 2010)
Abstract: This paper provides a solution to the issue: "How can we use Wikipedia based concepts in document clustering with lesser human involvement, accompanied by effective improvements in result?" In the devised system, we propose a method to exploit the importance of N-grams in a document and use Wikipedia-based additional knowledge for GAAC-based document clustering. The importance of N-grams in a document depends on many features including, but not limited to: frequency, position of their occurrence in a sentence, and the position of the sentence in which they occur in the document. First, we introduce a new similarity measure that takes the weighted N-gram importance into account when calculating similarity during document clustering. As a result, the chances of topical similarity in clustering are improved. Second, we use Wikipedia as an additional knowledge base, both to remove noisy entries from the extracted N-grams and to reduce the information gap between N-grams that are conceptually related but do not match owing to differences in writing schemes or strategies. Our experimental results on a publicly available text dataset clearly show that the devised system yields a significant improvement in performance over bag-of-words based state-of-the-art systems in this area.
R: 0, C: 0

Title: Exploiting relation extraction for ontology alignment
Author(s): Beisswanger E.
Published in: Lecture Notes in Computer Science (English, 2010)
Abstract: When multiple ontologies are used within one application system, aligning the ontologies is a prerequisite for interoperability and unhampered semantic navigation and search. Various methods have been proposed to compute mappings between elements from different ontologies, the majority of which are based on various kinds of similarity measures. A major shortcoming of these methods is that it is difficult to decode the semantics of the results achieved. In addition, in many cases they miss important mappings due to poorly developed ontology structures or dissimilar ontology designs. I propose a complementary approach making massive use of relation extraction techniques applied to broad-coverage text corpora. This approach is able to detect different types of semantic relations, depending on the extraction techniques used. Furthermore, exploiting external background knowledge, it can detect relations even without clear evidence in the input ontologies themselves.
R: 0, C: 0

Title: Named entity disambiguation for German news articles
Author(s): Lommatzsch A., Ploch D., De Luca E.W., Albayrak S.
Published in: LWA 2010 - Lernen, Wissen und Adaptivitat - Learning, Knowledge, and Adaptivity, Workshop Proceedings (English, 2010)
Abstract: Named entity disambiguation has become an important research area, providing the basis for improving search engine precision and for enabling semantic search. Current approaches to named entity disambiguation are usually based on exploiting structured semantic and lingual resources (e.g., WordNet, DBpedia). Unfortunately, each of these resources on its own covers insufficient information for the task of named entity disambiguation. On the one hand, WordNet comprises a relatively small number of named entities, while on the other hand DBpedia provides only little context for named entities. Our approach is based on the use of multi-lingual Wikipedia data. We show how the combination of multi-lingual resources can be used for named entity disambiguation. Based on a German and an English document corpus, we evaluate various similarity measures and algorithms for extracting data for named entity disambiguation. We show that the intelligent filtering of context data and the combination of multilingual information provide high-quality named entity disambiguation results.
R: 0, C: 0

Title: Exploiting Negative Categories and Wikipedia Structures for Document Classification
Author(s): Meenakshi Sundaram Murugeshan, K. Lakshmi, Saswati Mukherjee
Published in: ARTCOM (English, 2009)
R: 0, C: 0

Title: Towards semantic tagging in collaborative environments
Author(s): Chandramouli K., Kliegr T., Svatek V., Izquierdo E.
Published in: DSP 2009: 16th International Conference on Digital Signal Processing, Proceedings (English, 2009)
Abstract: Tags offer an efficient and effective way of organizing resources, but they are not always available. A technique called SCM/THD, investigated in this paper, extracts entities from free-text annotations and, using the Lin similarity measure over the WordNet thesaurus, classifies them into a controlled vocabulary of tags. Hypernyms extracted from Wikipedia are used to map uncommon entities to WordNet synsets. In collaborative environments, users can assign multiple annotations to the same object, hence increasing the amount of information available. Assuming that the semantics of the annotations overlap, this redundancy can be exploited to generate higher-quality tags. A preliminary experiment presented in the paper evaluates the consistency and quality of tags generated from multiple annotations of the same image. The results obtained on an experimental dataset comprising 62 annotations from four annotators show that the accuracy of a simple majority vote surpasses the average accuracy obtained through assessing the annotations individually by 18%. A moderate-strength correlation has been found between the quality of generated tags and the consistency of annotations.
R: 0, C: 0

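The Lin measure that SCM/THD applies over WordNet is available off the shelf in NLTK. A minimal sketch; using the Brown corpus for the information-content statistics is an assumption:

```python
# Requires: pip install nltk
# then: nltk.download('wordnet'); nltk.download('wordnet_ic')
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# Information-content statistics; the Brown corpus is one common choice.
brown_ic = wordnet_ic.ic('ic-brown.dat')

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

# Lin similarity: 2 * IC(least common subsumer) / (IC(a) + IC(b)), in [0, 1].
print(dog.lin_similarity(cat, brown_ic))
```
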
Title: Instance-based mapping between thesauri and folksonomies
Author(s): Wartena C., Brussee R.
Published in: Lecture Notes in Computer Science (English, 2008)
Abstract: The emergence of web-based systems in which users can annotate items raises the question of the semantic interoperability between vocabularies originating from collaborative annotation processes, often called folksonomies, and keywords assigned in a more traditional way. If collections are annotated according to two systems, e.g. with tags and keywords, the annotated data can be used for instance-based mapping between the vocabularies. The basis for this kind of matching is an appropriate similarity measure between concepts, based on their distribution as annotations. In this paper we propose a new similarity measure that can take advantage of some special properties of user-generated metadata. We have evaluated this measure with a set of articles from Wikipedia which are both classified according to the topic structure of Wikipedia and annotated by users of the bookmarking service del.icio.us. The results using the new measure are significantly better than those obtained using standard similarity measures proposed for this task in the literature, i.e., it correlates better with human judgments. We argue that the measure also has benefits for instance-based mapping of more traditionally developed vocabularies.
R: 0, C: 0

Title: Semantic relatedness metric for Wikipedia concepts based on link analysis and its application to word sense disambiguation
Author(s): Denis Turdakov, Pavel Velikhov
Published in: CEUR Workshop Proceedings (English, 2008)
Abstract: Wikipedia has grown into a high-quality, up-to-date knowledge base and can enable many knowledge-based applications, which rely on semantic information. One of the most general and quite powerful semantic tools is a measure of semantic relatedness between concepts. Moreover, the ability to efficiently produce a list of ranked similar concepts for a given concept is very important for a wide range of applications. We propose to use a simple measure of similarity between Wikipedia concepts, based on Dice's measure, and provide very efficient heuristic methods to compute top-k ranking results. Furthermore, since our heuristics are based on statistical properties of scale-free networks, we show that these heuristics are applicable to other complex ontologies. Finally, in order to evaluate the measure, we have used it to solve the problem of word sense disambiguation. Our approach to word sense disambiguation is based solely on the similarity measure and produces results with high accuracy.
R: 0, C: 1

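Dice's measure, which this paper applies to Wikipedia concepts, reduces for two link sets A and B to 2|A ∩ B| / (|A| + |B|). A sketch with invented link sets; the paper's top-k ranking heuristics are not reproduced here:

```python
def dice_similarity(links_a: set, links_b: set) -> float:
    """Dice coefficient over the link sets of two Wikipedia concepts."""
    if not links_a and not links_b:
        return 0.0
    return 2 * len(links_a & links_b) / (len(links_a) + len(links_b))

# Hypothetical sets of linked article IDs for two concepts.
print(dice_similarity({1, 2, 3, 4}, {2, 3, 5}))  # 2*2 / (4+3) ≈ 0.571
```
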
Title: Topic detection by clustering keywords
Author(s): Wartena C., Brussee R.
Published in: Proceedings - International Workshop on Database and Expert Systems Applications, DEXA (English, 2008)
Abstract: We consider topic detection without any prior knowledge of category structure or possible categories. Keywords are extracted and clustered based on different similarity measures using the induced k-bisecting clustering algorithm. Evaluation on Wikipedia articles shows that clusters of keywords correlate strongly with the Wikipedia categories of the articles. In addition, we find that a distance measure based on the Jensen-Shannon divergence of probability distributions outperforms the cosine similarity. In particular, a newly proposed term distribution taking co-occurrence of terms into account gives the best results.
R: 0, C: 0

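The Jensen-Shannon divergence favored in this paper is the average of two KL divergences against the midpoint distribution M = (P + Q)/2. A minimal sketch over toy term distributions; the paper's co-occurrence-based term distributions are not reconstructed here:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence, skipping zero-probability terms in p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jensen_shannon(p, q):
    """JSD(P, Q) = 0.5*KL(P||M) + 0.5*KL(Q||M), M = (P+Q)/2; in [0,1] with log2."""
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Term distributions of two keywords over a shared vocabulary (toy numbers).
p = np.array([0.5, 0.3, 0.2, 0.0])
q = np.array([0.4, 0.2, 0.3, 0.1])
print(jensen_shannon(p, q))
```
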
Title: Semantic extensions of the Ephyra QA system for TREC 2007
Author(s): Schlaefer N., Ko J., Betteridge J., Guido Sautter, Pathak M., Nyberg E.
Published in: NIST Special Publication (English, 2007)
Abstract: We describe recent extensions to the Ephyra question answering (QA) system and their evaluation in the TREC 2007 QA track. Existing syntactic answer extraction approaches for factoid and list questions have been complemented with a high-accuracy semantic approach that generates a semantic representation of the question and extracts answer candidates from similar semantic structures in the corpus. Candidates found by different answer extractors are combined and ranked by a statistical framework that integrates a variety of answer validation techniques and similarity measures to estimate a probability for each candidate. A novel answer type classifier combines a statistical model and hand-coded rules to predict the answer type based on syntactic and semantic features of the question. Our approach for the 'other' questions uses Wikipedia and Google to judge the relevance of answer candidates found in the corpora.
R: 0, C: 0

Title: Featureless similarities for terms clustering using tree-traversing ants
Author(s): Wong W., Wei Liu, Bennamoun M.
Published in: ACM International Conference Proceeding Series (English, 2006)
Abstract: Existing ontology engineering systems are difficult to scale across domains and to adapt to knowledge fluctuations, and the term-clustering results they present are far from desirable. In this paper, we propose a new version of an ant-based method for clustering terms, known as Tree-Traversing Ants (TTA). With the help of the Normalized Google Distance (NGD) and n° of Wikipedia (n°W) as measures of similarity and distance between terms, we attempt to achieve an adaptable clustering method that is highly scalable across domains. Initial experiments with two datasets show promising results and demonstrate several advantages that are not simultaneously present in standard ant-based and other conventional clustering methods.
R: 0, C: 0

ACM International Conference Proceeding Series English 2006 Besides being difficult to scale between different domains and to handle knowledge fluctuations, the results of terms clustering presented by existing ontology engineering systems are far from desirable. In this paper, we propose a new version of ant-based method for clustering terms known as Tree-Traversing Ants (TTA). With the help of the Normalized Google Distance (NGD) and n° of Wikipedia (n°W) as measures for similarity and distance between terms, we attempt to achieve an adaptable clustering method that is highly scalable across domains. Initial experiments with two datasets show promising results and demonstrated several advantages that are not simultaneously present in standard ant-based and other conventional clustering methods. Copyright 0 0