Exploiting n-gram importance and wikipedia based additional knowledge for improvements in GAAC based document clustering
|Author(s)||Kumar N., Vemula V.V.B., Srinathan K., Varma V.|
|Published in||KDIR 2010 - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval|
|Keyword(s)||Community detection, Document clustering, Group-average agglomerative clustering, N-gram, Similarity measure, Wikipedia based additional knowledge (Extra: Agglomerative clustering, Wikipedia, Cluster analysis, Knowledge based systems, Population dynamics, Information retrieval)|
Exploiting n-gram importance and wikipedia based additional knowledge for improvements in GAAC based document clustering is a 2010 conference paper written in English by Kumar N., Vemula V.V.B., Srinathan K., and Varma V., and published in KDIR 2010 - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval.
This paper addresses the question: "How can we use Wikipedia based concepts in document clustering with lesser human involvement, accompanied by effective improvements in result?" We propose a method that exploits the importance of N-grams in a document and uses Wikipedia-based additional knowledge for document clustering based on group-average agglomerative clustering (GAAC). The importance of an N-gram in a document depends on many features, including but not limited to its frequency, the position of its occurrence within a sentence, and the position of that sentence within the document. First, we introduce a new similarity measure that takes the weighted N-gram importance into account when computing document similarity during clustering. As a result, documents grouped together are more likely to be topically similar. Second, we use Wikipedia as an additional knowledge base, both to remove noisy entries from the extracted N-grams and to reduce the information gap between conceptually related N-grams that fail to match because of differences in writing scheme or strategy. Our experimental results on a publicly available text dataset show that the devised system significantly improves performance over state-of-the-art bag-of-words based systems in this area.
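The general idea of combining position-aware N-gram weights with group-average agglomerative clustering can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the weighting formula or the similarity measure, so the reciprocal position weights, the cosine similarity, and the function names `ngram_weights`, `similarity`, and `gaac` are all assumptions made for illustration. The Wikipedia-based filtering and concept-matching step is omitted entirely.

```python
from collections import defaultdict
from itertools import combinations
import math


def ngram_weights(doc, max_n=2):
    """Weight each n-gram by frequency and position (assumed heuristic, not the paper's formula)."""
    sentences = [s.split() for s in doc.lower().split(".") if s.strip()]
    weights = defaultdict(float)
    for s_idx, tokens in enumerate(sentences):
        sent_w = 1.0 / (1 + s_idx)  # assumption: earlier sentences matter more
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                pos_w = 1.0 / (1 + i)  # assumption: earlier positions in a sentence matter more
                # Summing over occurrences makes frequent n-grams heavier.
                weights[" ".join(tokens[i:i + n])] += sent_w * pos_w
    return dict(weights)


def similarity(w1, w2):
    """Cosine similarity between two weighted n-gram vectors (a stand-in measure)."""
    dot = sum(w1[g] * w2[g] for g in w1.keys() & w2.keys())
    norm = (math.sqrt(sum(v * v for v in w1.values()))
            * math.sqrt(sum(v * v for v in w2.values())))
    return dot / norm if norm else 0.0


def gaac(docs, k):
    """Group-average agglomerative clustering of docs into k clusters of document indices."""
    vecs = [ngram_weights(d) for d in docs]
    sim = {(i, j): similarity(vecs[i], vecs[j])
           for i, j in combinations(range(len(docs)), 2)}
    clusters = [[i] for i in range(len(docs))]

    def group_avg(a, b):
        # Average similarity over all document pairs in the merged cluster.
        pairs = list(combinations(a + b, 2))
        return sum(sim[tuple(sorted(p))] for p in pairs) / len(pairs)

    while len(clusters) > k:
        # Merge the pair of clusters with the highest group-average similarity.
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: group_avg(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


docs = [
    "Document clustering groups similar documents.",
    "Document clustering merges similar documents.",
    "Wikipedia is an online encyclopedia.",
    "Wikipedia is a free encyclopedia.",
]
clusters = gaac(docs, 2)
print(clusters)  # [[0, 1], [2, 3]]
```

The sketch recomputes group averages from scratch at every merge, which is fine for a handful of documents but would be too slow for a real collection.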