Document indexing

From WikiPapers
Jump to: navigation, search

Document indexing is included as keyword or extra keyword in 0 datasets, 0 tools and 4 publications.


There is no datasets for this keyword.


There is no tools for this keyword.


Title Author(s) Published in Language DateThis property is a special property in this wiki. Abstract R C
SemaFor: Semantic document indexing using semantic forests Tsatsaronis G.
Varlamis I.
Norvag K.
ACM International Conference Proceeding Series English 2012 Traditional document indexing techniques store documents using easily accessible representations, such as inverted indices, which can efficiently scale for large document sets. These structures offer scalable and efficient solutions in text document management tasks, though, they omit the cornerstone of the documents' purpose: meaning. They also neglect semantic relations that bind terms into coherent fragments of text that convey messages. When semantic representations are employed, the documents are mapped to the space of concepts and the similarity measures are adapted appropriately to better fit the retrieval tasks. However, these methods can be slow both at indexing and retrieval time. In this paper we propose SemaFor, an indexing algorithm for text documents, which uses semantic spanning forests constructed from lexical resources, like Wikipedia, and WordNet, and spectral graph theory in order to represent documents for further processing. 0 0
Conceptual indexing of documents using Wikipedia Carlo Abi Chahine
Nathalie Chaignaud
Kotowicz J.-P.
Pecuchet J.-P.
Proceedings - 2011 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2011 English 2011 This paper presents an indexing support system that suggests for librarians a set of topics and keywords relevant to a pedagogical document. Our method of document indexing uses the Wikipedia category network as a conceptual taxonomy. A directed acyclic graph is built for each document by mapping terms (one or more words) to a concept in the Wikipedia category network. Properties of the graph are used to weight these concepts. This allows the system to extract socalled important concepts from the graph and to disambiguate terms of the document. According to these concepts, topics and keywords are proposed. This method has been evaluated by the librarians on a corpus of french pedagogical documents. 0 0
A novel weighting scheme for efficient document indexing and classification Tahayna B.
Ayyasamy R.K.
Alhashmi S.
Eu-Gene S.
Proceedings 2010 International Symposium on Information Technology - Engineering Technology, ITSim'10 English 2010 In this paper we propose and illustrate the effectiveness of a new topic-based document classification method. The proposed method utilizes the Wikipedia, a large scale Web encyclopaedia that has high-quality and huge-scale articles and a category system. Wikipedia is used using an Ngram technique to transform the document from being a "bag of words" to become a "bag of concepts". Based on this transformation, a novel concept-based weighting scheme (denoted as Conf.idf) is proposed to index the text with the flavor of the traditional tf.idf indexing scheme. Moreover, a genetic algorithm-based support vector machine optimization method is used for the purpose of feature subset and instance selection. Experimental results showed that proposed weighting scheme outperform the traditional indexing and weighting scheme. 0 0
Mining wikipedia knowledge to improve document indexing and classification Ayyasamy R.K.
Tahayna B.
Alhashmi S.
Eu-Gene S.
Egerton S.
10th International Conference on Information Sciences, Signal Processing and their Applications, ISSPA 2010 English 2010 Web logs are an important source of information that requires automatic techniques to categorize them into "topic-based" content, to facilitate their future browsing and retrieval. In this paper we propose and illustrate the effectiveness of a new tf.idf measure. The proposed Conf.idf, Catf.idf measures are solely based on the mapping of terms-to-concepts-to- categories (TCONCAT) method that utilizes Wikipedia. The Knowledge base-Wikipedia is considered as a large scale Web encyclopaedia, that has high-quality and huge number of articles and categorical indexes. Using this system, our proposed framework consists of two stages to solve weblog classification problem. The first stage is to find out the terms belonging to a unique concept (article), as well as to disambiguate the terms belonging to more than one concept. The second stage is the determination of the categories to which these found concepts belong to. Experimental result confirms that, proposed system can distinguish the web logs that belongs to more than one category efficiently and has a better performance and success than the traditional statistical Natural Language Processing-NLP approaches. 0 0