Improving Farsi multiclass text classification using a thesaurus and two-stage feature selection
|Author(s)||Maghsoodi N., Homayounpour M.M.|
|Published in||Journal of the American Society for Information Science and Technology|
|Keyword(s)||Unknown (Extra: Automatic classification, Average numbers, Classification efficiency, Feature vectors, Information contents, Multi-class, Test data, Text classification, Training dataset, Training example, Two stage, Wikipedia, Classification (of information), Information retrieval systems, Statistical tests, Text processing, Thesauri, Websites, Feature extraction)|
Improving Farsi multiclass text classification using a thesaurus and two-stage feature selection is a 2011 journal article written in English by Maghsoodi N. and Homayounpour M.M., published in the Journal of the American Society for Information Science and Technology.
The steady growth of information content has made systems for the automatic classification of documents necessary. This article presents a system for categorizing multiclass Farsi documents that requires fewer training examples and can help to compensate for the shortcomings of the standard training dataset. The proposed idea is to extend the feature vector with words extracted from a thesaurus and then to filter the extended vector with a secondary feature selection step that discards inappropriate features. This secondary step chooses the more appropriate features among those added from the thesaurus, so that using the thesaurus has a greater effect on the efficiency of the classifier. To evaluate the proposed system, a corpus was gathered from the Farsi Wikipedia website and from articles in the Hamshahri newspaper, the Roshd periodical, and the Soroush magazine. In addition to the role of the thesaurus and of secondary feature selection, the effects of the number of categories, the size of the training dataset, and the average number of words in the test data are also examined. The results indicate that classification efficiency improves with this approach, especially when the available data is not sufficient for some text categories.
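The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the selection criterion, so the chi-square score used here is an assumption, and the function names (`chi2_scores`, `top_k`, `two_stage_select`), the parameters `k_primary` and `k_secondary`, the toy thesaurus, and the English placeholder tokens are all hypothetical.

```python
from collections import Counter, defaultdict


def chi2_scores(docs, labels):
    """Score each term by its highest one-vs-rest chi-square over all classes.

    Assumption: chi-square stands in for whatever criterion the article uses.
    """
    n = len(docs)
    class_count = Counter(labels)
    df = Counter()                   # number of documents containing the term
    df_class = defaultdict(Counter)  # the same count, split by class
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_class[term][label] += 1
    scores = {}
    for term, n_t in df.items():
        best = 0.0
        for cls, n_c in class_count.items():
            a = df_class[term][cls]  # in class, contains term
            b = n_t - a              # other classes, contains term
            c = n_c - a              # in class, lacks term
            d = n - n_t - c          # other classes, lacks term
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def top_k(scores, candidates, k):
    """Return the k best candidates; ties are broken alphabetically."""
    return sorted(candidates, key=lambda t: (-scores.get(t, 0.0), t))[:k]


def two_stage_select(docs, labels, thesaurus, k_primary, k_secondary):
    """Primary selection, thesaurus expansion, then secondary selection."""
    scores = chi2_scores(docs, labels)
    primary = top_k(scores, scores.keys(), k_primary)
    chosen = set(primary)
    # Expansion: thesaurus entries of the primary features, minus duplicates.
    added = {w for t in primary for w in thesaurus.get(t, []) if w not in chosen}
    # Secondary selection: only the added words compete with each other.
    secondary = top_k(scores, added, k_secondary)
    return primary, secondary


# Toy data; English placeholder tokens stand in for Farsi words.
docs = [
    ["football", "match", "goal"],
    ["football", "team", "win"],
    ["election", "vote", "party"],
    ["election", "minister", "vote"],
]
labels = ["sport", "sport", "politics", "politics"]
thesaurus = {"football": ["soccer", "match"], "election": ["vote", "poll"]}

primary, secondary = two_stage_select(docs, labels, thesaurus, 2, 2)
print("primary features:", primary)
print("kept thesaurus features:", secondary)
```

In this sketch a thesaurus word that never occurs in the training documents scores zero, so it is ranked last in the secondary stage. Whether the article treats unseen thesaurus words this way is not stated in the abstract.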
This publication has been cited 1 time; no citing articles are available in WikiPapers.