Document similarity


Document similarity is included as a keyword or extra keyword in 0 datasets, 0 tools and 8 publications.

Datasets

There are no datasets for this keyword.

Tools

There are no tools for this keyword.


Publications

Title: Tagging Scientific Publications Using Wikipedia and Natural Language Processing Tools
Author(s): Lopuszynski M., Bolikowski L.
Published in: Communications in Computer and Information Science
Language: English
Date: 2014
Abstract: In this work, we compare two simple methods of tagging scientific publications with labels reflecting their content. Wikipedia is employed as the first source of labels; the second label set is constructed from the noun phrases occurring in the analyzed corpus. We examine the statistical properties and the effectiveness of both approaches on a dataset consisting of abstracts from 0.7 million scientific documents deposited in the ArXiv preprint collection. We believe that the obtained tags can later be applied as useful document features in various machine learning tasks (document similarity, clustering, topic modelling, etc.).
R: 0  C: 0
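
The abstract above contrasts Wikipedia article titles with corpus noun phrases as label inventories. A minimal sketch of the Wikipedia-title side, assuming a tiny illustrative title list and simple verbatim matching (the paper's actual matching procedure and its noun-phrase extraction are not specified in the abstract):

# Hypothetical sketch: tag an abstract by matching Wikipedia article titles.
# The title list is illustrative and far smaller than a real label inventory.
import re

WIKIPEDIA_TITLES = {
    "machine learning",
    "topic model",
    "document clustering",
}

def wikipedia_tags(text: str) -> set[str]:
    """Return every known Wikipedia title that appears verbatim in the text."""
    lowered = text.lower()
    return {t for t in WIKIPEDIA_TITLES
            if re.search(r"\b" + re.escape(t) + r"\b", lowered)}

abstract = ("We study document clustering and topic model selection "
            "for large collections of scientific abstracts.")
print(wikipedia_tags(abstract))   # {'document clustering', 'topic model'}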

Title: Calculating Wikipedia article similarity using machine translation evaluation metrics
Author(s): Maike Erdmann, Andrew Finch, Kotaro Nakayama, Eiichiro Sumita, Takahiro Hara, Shojiro Nishio
Published in: Proceedings - 25th IEEE International Conference on Advanced Information Networking and Applications Workshops, WAINA 2011
Language: English
Date: 2011
Abstract: Calculating the similarity of Wikipedia articles in different languages is helpful for bilingual dictionary construction and various other research areas. However, standard methods for document similarity calculation are usually very simple. Therefore, we describe an approach of translating one Wikipedia article into the language of the other article, and then calculating article similarity with standard machine translation evaluation metrics. An experiment revealed that our approach is effective for identifying Wikipedia articles in different languages that cover the same concept.
R: 0  C: 0
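
As a rough illustration of the approach above: assuming one article has already been machine-translated into the other article's language, a standard MT evaluation metric such as BLEU can score the pair. BLEU is only one such metric; the abstract does not name which metrics the authors used.

# Minimal sketch: score a machine-translated article against the article in
# the target language with BLEU. The translation step is assumed to have
# happened already; the two "articles" here are toy sentences.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

translated_article = "the city lies on the river and is known for its bridges".split()
target_language_article = "the city is on the river and famous for its bridges".split()

similarity = sentence_bleu(
    [target_language_article],          # reference(s)
    translated_article,                 # hypothesis
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-based article similarity: {similarity:.3f}")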

Title: Measuring Hyperlink Distances: Wikipedia Case Study
Author(s): Rodrigo Rodrigues Paim, Daniel Ratton Figueiredo
Published in: WebSci Conference
Language: English
Date: 2011
Abstract: Hyperlinks are a fundamental aspect of the Web, as they play a major role in accomplishing important functions such as document clustering and document ranking. Despite various facets of hyperlink analysis, in this work we consider a novel aspect of hyperlinks, namely their distance. How far in terms of contextual similarity will a hyperlink take you? We consider classical distance functions that capture the similarity between documents, as well as propose a new distance function, an IDF-based generalization of the Jaccard distance. We characterize the distance distribution of hyperlinks considering Wikipedia as a case study. Our results indicate that hyperlink distances are strongly skewed, with the majority of hyperlinks exhibiting very long distances.
R: 0  C: 0
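
The abstract mentions an IDF-based generalization of the Jaccard distance. A sketch of one common IDF-weighted formulation, which may differ from the paper's exact definition:

# A sketch of one IDF-weighted Jaccard distance between two term sets:
# the weight of a term is its inverse document frequency over a small corpus.
import math
from collections import Counter

def idf(corpus: list[set[str]]) -> dict[str, float]:
    """Inverse document frequency over a list of document term sets."""
    df = Counter(t for doc in corpus for t in doc)
    n = len(corpus)
    return {t: math.log(n / df[t]) for t in df}

def idf_jaccard_distance(a: set[str], b: set[str], idf_w: dict[str, float]) -> float:
    inter = sum(idf_w.get(t, 0.0) for t in a & b)
    union = sum(idf_w.get(t, 0.0) for t in a | b)
    return 1.0 - (inter / union if union else 0.0)

docs = [{"bridge", "river", "city"}, {"river", "boat"}, {"painting", "museum"}]
weights = idf(docs)
print(idf_jaccard_distance(docs[0], docs[1], weights))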

Title: Ranking multilingual documents using minimal language dependent resources
Author(s): Santosh G.S.K., Kiran Kumar N., Vasudeva Varma
Published in: Lecture Notes in Computer Science
Language: English
Date: 2011
Abstract: This paper proposes an approach of extracting simple and effective features that enhance multilingual document ranking (MLDR). There is limited prior research on capturing the concept of multilingual document similarity in determining the ranking of documents. However, the available literature relies heavily on language-specific tools, making those approaches hard to reimplement for other languages. Our approach extracts various multilingual and monolingual similarity features using a basic language resource (a bilingual dictionary). No language-specific tools are used, hence making this approach extensible to other languages. We used the datasets provided by the Forum for Information Retrieval Evaluation (FIRE) for their 2010 Adhoc Cross-Lingual document retrieval task on Indian languages. Experiments have been performed with different ranking algorithms and their results are compared. The results obtained showcase the effectiveness of the considered features in enhancing multilingual document ranking.
R: 0  C: 0
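
A hypothetical sketch of the kind of dictionary-based feature the abstract describes: the fraction of source-language terms that have at least one dictionary translation occurring in the target document. The tiny dictionary and the feature definition are illustrative, not taken from the paper.

# Illustrative dictionary-based cross-lingual overlap feature.
BILINGUAL_DICT = {            # source-language term -> target-language terms
    "haus": {"house", "home"},
    "fluss": {"river"},
    "brücke": {"bridge"},
}

def dictionary_overlap(source_terms: set[str], target_terms: set[str]) -> float:
    """Fraction of source terms with at least one translation in the target document."""
    if not source_terms:
        return 0.0
    hits = sum(1 for t in source_terms
               if BILINGUAL_DICT.get(t, set()) & target_terms)
    return hits / len(source_terms)

print(dictionary_overlap({"haus", "fluss", "berg"}, {"river", "house", "bridge"}))
# 2 of the 3 source terms have a dictionary translation in the target document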

Title: Using ontological and document similarity to estimate museum exhibit relatedness
Author(s): Grieser K., Baldwin T., Bohnert F., Sonenberg L.
Published in: Journal of Computing and Cultural Heritage
Language: English
Date: 2011
Abstract: Exhibits within cultural heritage collections such as museums and art galleries are arranged by experts with intimate knowledge of the domain, but there may exist connections between individual exhibits that are not evident in this representation. For example, the visitors to such a space may have their own opinions on how exhibits relate to one another. In this article, we explore the possibility of estimating the perceived relatedness of exhibits by museum visitors through a variety of ontological and document similarity-based methods. Specifically, we combine the Wikipedia category hierarchy with lexical similarity measures, and evaluate the correlation with the relatedness judgements of visitors. We compare our measure with simple document similarity calculations, based on either Wikipedia documents or Web pages taken from the Web site for the museum of interest. We also investigate the hypothesis that physical distance in the museum space is a direct representation of the conceptual distance between exhibits. We demonstrate that ontological similarity measures are highly effective at capturing perceived relatedness, and that the proposed RACO (Related Article Conceptual Overlap) method achieves results closest to the relatedness judgements provided by human annotators, compared to existing state-of-the-art measures of semantic relatedness.
R: 0  C: 0
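
As a minimal illustration of the category-based ingredient mentioned above, the sketch below scores two exhibits' Wikipedia articles by the Dice coefficient of their category sets. This shows only the basic idea, not the full RACO measure, and the category sets are invented.

# Relatedness of two exhibits approximated by category overlap (Dice coefficient)
# between the Wikipedia categories of their matching articles.
def category_dice(cats_a: set[str], cats_b: set[str]) -> float:
    """Dice coefficient over two Wikipedia category sets."""
    if not cats_a and not cats_b:
        return 0.0
    return 2 * len(cats_a & cats_b) / (len(cats_a) + len(cats_b))

steam_engine = {"Steam engines", "19th-century inventions", "Industrial Revolution"}
locomotive   = {"Locomotives", "Steam engines", "Rail transport"}
print(category_dice(steam_engine, locomotive))   # 2*1 / (3+3) ≈ 0.33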

Title: Wikipedia-based smoothing for enhancing text clustering
Author(s): Rahimtoroghi E., Shakery A.
Published in: Lecture Notes in Computer Science
Language: English
Date: 2011
Abstract: Conventional algorithms for text clustering that are based on the bag-of-words model fail to fully capture the semantic relations between words. As a result, documents describing an identical topic may not be grouped into the same clusters if they use different sets of words. A generic solution to this issue is to utilize background knowledge to enrich the document contents. In this research, we adopt a language modeling approach to text clustering and propose to smooth the document language models using Wikipedia articles in order to enhance clustering performance. The contents of Wikipedia articles, as well as their assigned categories, are used in three different ways to smooth the document language models with the goal of enriching the document contents. Clustering is then performed on a document similarity graph constructed over the enhanced document collection. Experimental results confirm the effectiveness of the proposed methods.
R: 0  C: 0
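
A sketch of the general idea of smoothing a document language model with Wikipedia text, using a Jelinek-Mercer style interpolation. The paper proposes three specific smoothing methods that are not reproduced here; the mixing weight and toy texts below are illustrative.

# Interpolate a document's maximum-likelihood language model with a language
# model built from related Wikipedia text (generic Jelinek-Mercer form).
from collections import Counter

def lm(tokens: list[str]) -> dict[str, float]:
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def smoothed_prob(word: str, doc_lm: dict[str, float],
                  wiki_lm: dict[str, float], lam: float = 0.7) -> float:
    """P(w | d) = lam * P_ml(w | d) + (1 - lam) * P(w | related Wikipedia text)."""
    return lam * doc_lm.get(word, 0.0) + (1 - lam) * wiki_lm.get(word, 0.0)

doc_model = lm("the museum opened a new exhibit".split())
wiki_model = lm("a museum is an institution that preserves artifacts".split())
print(smoothed_prob("artifacts", doc_model, wiki_model))  # nonzero despite absence from the document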

Title: A random walk framework to compute textual semantic similarity: A unified model for three benchmark tasks
Author(s): Majid Yazdani, Andrei Popescu-Belis
Published in: Proceedings - 2010 IEEE 4th International Conference on Semantic Computing, ICSC 2010
Language: English
Date: 2010
Abstract: A network of concepts is built from Wikipedia documents using a random walk approach to compute distances between documents. Three algorithms for distance computation are considered: hitting/commute time, personalized PageRank, and truncated visiting probability. In parallel, four types of weighted links in the document network are considered: actual hyperlinks, lexical similarity, common category membership, and common template use. The resulting network is used to solve three benchmark semantic tasks - word similarity, paraphrase detection between sentences, and document similarity - by mapping pairs of data to the network, and then computing a distance between these representations. The model reaches state-of-the-art performance on each task, showing that the constructed network is a general, valuable resource for semantic similarity judgments.
R: 0  C: 0
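
One of the three distance algorithms named above, personalized PageRank, can be sketched on a toy weighted concept network with networkx. The graph, the edge types, and the symmetrized score below are illustrative choices, not the paper's configuration.

# Personalized PageRank on a small weighted concept network; the similarity of
# two nodes is read off from each other's personalized scores.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("cat", "dog", 1.0),        # e.g. lexical similarity link
    ("cat", "feline", 2.0),     # e.g. hyperlink
    ("dog", "canine", 2.0),
    ("feline", "canine", 0.5),  # e.g. shared category link
])

def ppr_similarity(graph: nx.Graph, a: str, b: str) -> float:
    """Symmetrized personalized-PageRank score between nodes a and b."""
    pers_a = {n: 1.0 if n == a else 0.0 for n in graph}
    pers_b = {n: 1.0 if n == b else 0.0 for n in graph}
    ppr_a = nx.pagerank(graph, alpha=0.85, personalization=pers_a, weight="weight")
    ppr_b = nx.pagerank(graph, alpha=0.85, personalization=pers_b, weight="weight")
    return 0.5 * (ppr_a[b] + ppr_b[a])

print(ppr_similarity(G, "cat", "canine"))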

Title: Using Wikipedia-Based Conceptual Contexts to Calculate Document Similarity
Author(s): Fabian Kaiser, Holger Schwarz
Published in: ICDS
Language: English
Date: 2009
R: 0  C: 0