Learning to classify short and sparse text & web with hidden topics from large-scale data collections
Abstract This paper presents a general framework for building classifiers that deal with short and sparse text & Web segments by making the most of hidden topics discovered from large-scale data collections. The main motivation of this work is that many classification tasks working with short segments of text & Web, such as search snippets, forum & chat messages, blog & news feeds, product reviews, and book & movie summaries, fail to achieve high accuracy due to the data sparseness. We, therefore, come up with an idea of gaining external knowledge to make the data more related as well as expand the coverage of classifiers to handle future data better. The underlying idea of the framework is that for each classification task, we collect a large-scale external data collection called "universal dataset", and then build a classifier on both a (small) set of labeled training data and a rich set of hidden topics discovered from that data collection. The framework is general enough to be applied to different data domains and genres ranging from Web search results to medical text. We did a careful evaluation on several hundred megabytes of Wikipedia (30M words) and MEDLINE (18M words) with two tasks: "Web search domain disambiguation" and "disease categorization for medical text", and achieved significant quality enhancement.
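The idea of building a classifier on both the labeled snippets and hidden topics can be illustrated with a minimal sketch. It is not the authors' implementation: the word-to-topic table `WORD_TOPIC`, the `augment` and `overlap` helpers, and the `min_share` threshold are all hypothetical, and the table stands in for a topic model that would be estimated on the universal dataset. The sketch only shows how topic features can make two short texts with no shared words look related.

```python
from collections import Counter

# Hypothetical word-to-topic table. In practice this would come from a topic
# model estimated on the large external collection; the entries here are
# made up purely for illustration.
WORD_TOPIC = {
    "python": "topic:programming",
    "compiler": "topic:programming",
    "java": "topic:programming",
    "tumor": "topic:medicine",
    "insulin": "topic:medicine",
    "diabetes": "topic:medicine",
}


def augment(snippet, min_share=0.2):
    """Return the snippet's tokens plus labels of sufficiently frequent topics.

    A topic label is appended when at least `min_share` of the tokens map
    to that topic in WORD_TOPIC (an assumed, simplistic inference rule).
    """
    tokens = snippet.lower().split()
    topic_counts = Counter(WORD_TOPIC[t] for t in tokens if t in WORD_TOPIC)
    topics = sorted(
        topic for topic, n in topic_counts.items() if n / len(tokens) >= min_share
    )
    return tokens + topics


def overlap(a, b):
    """Number of distinct features two feature lists have in common."""
    return len(set(a) & set(b))


s1 = "insulin dosage guide"
s2 = "diabetes risk factors"

raw = overlap(s1.split(), s2.split())          # word features only
enriched = overlap(augment(s1), augment(s2))   # words plus topic features
print(raw, enriched)
```

The two snippets share no words, so a purely word-based classifier sees them as unrelated; after augmentation both carry the same topic feature, which any standard classifier trained on the enriched features could exploit.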
Bibtextype inproceedings  +
Doi 10.1145/1367497.1367510  +
Has author Phan X.-H. + , Nguyen L.-M. + , Horiguchi S. +
Has extra keyword Classifiers + , Data acquisition + , Information retrieval + , Internet + , Knowledge based systems + , Learning systems + , Text processing + , Classification tasks + , Data collections + , Data domains + , Data sparsenesses + , Domains + , External knowledges + , External- + , General frameworks + , High accuracies + , Labeled training datums + , Medical texts + , Medline + , News feeds + , Quality enhancements + , Short segments + , Sparse text + , Topic analysis + , Web data analysis/classification + , Web searches + , Wikipedia + , World Wide Web +
Has keyword Sparse text + , Topic analysis + , Web data analysis/classification +
Isbn 9781605580852  +
Language English +
Number of citations by publication 0  +
Number of references by publication 0  +
Pages 91–99  +
Published in Proceeding of the 17th International Conference on World Wide Web 2008, WWW'08 +
Title Learning to classify short and sparse text & web with hidden topics from large-scale data collections +
Type conference paper  +
Year 2008 +
Creation date 8 November 2014 05:43:19  +
Categories Publications without license parameter  + , Publications without remote mirror parameter  + , Publications without archive mirror parameter  + , Publications without paywall mirror parameter  + , Conference papers  + , Publications without references parameter  + , Publications  +
Modification date 8 November 2014 05:43:19  +
Date 2008  +