Enhancing Short Text Clustering with Small External Repositories
Abstract The automatic clustering of textual data according to their semantic concepts is a challenging, yet important task. Choosing an appropriate method to apply when clustering text depends on the nature of the documents being analysed. For example, traditional clustering algorithms can struggle to correctly model collections of very short text due to their extremely sparse nature. In recent times, much attention has been directed to finding methods for adequately clustering short text. Many popular approaches employ large, external document repositories, such as Wikipedia or the Open Directory Project, to incorporate additional world knowledge into the clustering process. However, the sheer size of many of these external collections can make these techniques difficult or time-consuming to apply. This paper also employs external document collections to aid short text clustering performance. The external collections are referred to in this paper as Background Knowledge. In contrast to most previous literature, a separate collection of Background Knowledge is obtained for each short text dataset. However, this Background Knowledge contains several orders of magnitude fewer documents than commonly used repositories like Wikipedia. A simple approach is described where the Background Knowledge is used to re-express short text in terms of a much richer feature space. A discussion of how best to cluster documents in this feature space is presented. A solution is proposed, and an experimental evaluation is performed that demonstrates significant improvement over clustering based on standard metrics with several publicly available datasets represented in the richer feature space.
Bibtextype article  +
Has author Petersen H. + , Poon J. +
Has extra keyword Automatic clustering + , Background knowledge + , Cluster documents + , Clustering + , Clustering process + , Dataset + , Document collection + , Document repositories + , Experimental evaluation + , Feature space + , Open directory projects + , Orders of magnitude + , Semantic concept + , Sheer size + , Short text + , Simple approach + , Standard metrics + , Text Clustering + , Text mining + , Textual data + , Traditional clustering + , Wikipedia + , World knowledge + , Cluster analysis + , Data mining + , Information technology + , Semantics + , Websites + , Clustering algorithms +
Has keyword Background knowledge + , Clustering + , Short text + , Text mining +
Isbn 9781921770029  +
Language English +
Number of citations by publication 0  +
Number of references by publication 0  +
Pages 79–90  +
Published in Conferences in Research and Practice in Information Technology Series +
Title Enhancing Short Text Clustering with Small External Repositories +
Type journal article  +
Volume 121  +
Year 2010 +
Creation date 7 November 2014 21:22:18  +
Categories Publications without license parameter  + , Publications without DOI parameter  + , Publications without remote mirror parameter  + , Publications without archive mirror parameter  + , Publications without paywall mirror parameter  + , Journal articles  + , Publications without references parameter  + , Publications  +
Modification date 7 November 2014 21:22:18  +
Date 2010  +