Text classification

From WikiPapers

Text classification is included as keyword or extra keyword in 0 datasets, 0 tools and 38 publications.


There are no datasets for this keyword.


There are no tools for this keyword.


Title Author(s) Published in Language Date Abstract R C
Comparative analysis of text representation methods using classification Szymanski J. Cybernetics and Systems English 2014 In our work, we review and empirically evaluate five different raw methods of text representation that allow automatic processing of Wikipedia articles. The main contribution of the article - evaluation of approaches to text representation for machine learning tasks - indicates that the text representation is fundamental for achieving good categorization results. The analysis of the representation methods creates a baseline that cannot be compensated for even by sophisticated machine learning algorithms. It confirms the thesis that proper data representation is a prerequisite for achieving high-quality results of data analysis. Evaluation of the text representations was performed within the Wikipedia repository by examination of classification parameters observed during automatic reconstruction of human-made categories. For that purpose, we use a classifier based on a support vector machines method, extended with multilabel and multiclass functionalities. During classifier construction we observed parameters such as learning time, representation size, and classification quality that allow us to draw conclusions about text representations. For the experiments presented in the article, we use data sets created from Wikipedia dumps. We describe our software, called Matrixu, which allows a user to build computational representations of Wikipedia articles. The software is the second contribution of our research, because it is a universal tool for converting Wikipedia from a human-readable form to a form that can be processed by a machine. Results generated using Matrixu can be used in a wide range of applications that involve usage of Wikipedia data. 0 0
A multilingual and multiplatform application for medicinal plants prescription from medical symptoms Ruiz-Rico F.
Rubio-Sanchez M.-C.
Tomas D.
Vicedo J.-L.
SIGIR 2013 - Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval English 2013 This paper presents an application for medicinal plants prescription based on text classification techniques. The system receives as an input a free text describing the symptoms of a user, and retrieves a ranked list of medicinal plants related to those symptoms. In addition, a set of links to Wikipedia are also provided, enriching the information about every medicinal plant presented to the user. In order to improve the accessibility to the application, the input can be written in six different languages, adapting the results accordingly. The application interface can be accessed from different devices and platforms. 0 0
A new text representation scheme combining Bag-of-Words and Bag-of-Concepts approaches for automatic text classification Alahmadi A.
Joorabchi A.
Mahdi A.E.
2013 7th IEEE GCC Conference and Exhibition, GCC 2013 English 2013 This paper introduces a new approach to creating text representations and applies it to standard text classification collections. The approach is based on supplementing the well-known Bag-of-Words (BOW) representational scheme with a concept-based representation that utilises Wikipedia as a knowledge base. The proposed representations are used to generate a Vector Space Model, which in turn is fed into a Support Vector Machine classifier to categorise a collection of textual documents from two publicly available datasets. Experimental results are reported for evaluating the performance of our model in comparison to using a standard BOW scheme and a concept-based scheme, as well as recently reported similar text representations that are based on augmenting the standard BOW approach with concept-based representations. 0 0
A portable multilingual medical directory by automatic categorization of wikipedia articles Ruiz-Rico F.
Rubio-Sanchez M.-C.
Tomas D.
Vicedo J.-L.
SIGIR 2013 - Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval English 2013 Wikipedia has become one of the most important sources of information available all over the world. However, the categorization of Wikipedia articles is not standardized and the searches are mainly performed on keywords rather than concepts. In this paper we present an application that builds a hierarchical structure to organize all Wikipedia entries, so that medical articles can be reached from general to particular, using the well known Medical Subject Headings (MeSH) thesaurus. Moreover, the language links between articles will allow using the directory created in different languages. The final system can be packed and ported to mobile devices as a standalone offline application. 0 0
Improving semi-supervised text classification by using wikipedia knowledge Zhang Z.
Hong Lin
Li P.
Haofen Wang
Lu D.
Lecture Notes in Computer Science English 2013 Semi-supervised text classification uses both labeled and unlabeled data to construct classifiers. The key issue is how to utilize the unlabeled data. Clustering based classification method outperforms other semi-supervised text classification algorithms. However, its achievements are still limited because the vector space model representation largely ignores the semantic relationships between words. In this paper, we propose a new approach to address this problem by using Wikipedia knowledge. We enrich document representation with Wikipedia semantic features (concepts and categories), propose a new similarity measure based on the semantic relevance between Wikipedia features, and apply this similarity measure to clustering based classification. Experiment results on several corpora show that our proposed method can effectively improve semi-supervised text classification performance. 0 0
Improving text categorization with semantic knowledge in wikipedia Xiaolong Wang
Jia Y.
Chen K.
Fan H.
Zhou B.
IEICE Transactions on Information and Systems English 2013 Text categorization, especially short text categorization, is a difficult and challenging task since the text data is sparse and multidimensional. In traditional text classification methods, document texts are represented with Bag of Words (BOW) text representation schema, which is based on word co-occurrence and has many limitations. In this paper, we mapped document texts to Wikipedia concepts and used the Wikipedia-concept-based document representation method to take the place of traditional BOW model for text classification. In order to overcome the weakness of ignoring the semantic relationships among terms in document representation model and utilize rich semantic knowledge in Wikipedia, we constructed a semantic matrix to enrich Wikipedia-concept-based document representation. Experimental evaluation on five real datasets of long and short text shows that our approach outperforms the traditional BOW method. 0 0
Using wikipedia with associative networks for document classification Bloom N.
Theune M.
De Jong F.M.G.
ESANN 2013 proceedings, 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning English 2013 We demonstrate a new technique for building associative networks based on Wikipedia, comparing them to WordNet-based associative networks that we used previously, finding the Wikipedia-based networks to perform better at document classification. Additionally, we compare the performance of associative networks to various other text classification techniques using the Reuters-21578 dataset, establishing that associative networks can achieve comparable results. 0 0
Wikipedia based semantic smoothing for twitter sentiment classification Torunoglu D.
Telseren G.
Sagturk O.
Ganiz M.C.
2013 IEEE International Symposium on Innovations in Intelligent Systems and Applications, IEEE INISTA 2013 English 2013 Sentiment classification is one of the important and popular application areas for text classification, in which texts are labeled as positive or negative. Naïve Bayes (NB) is one of the most widely used algorithms in this area. Although NB has several advantages, such as lower complexity and a simpler training procedure, it suffers from sparsity. Smoothing can be a solution to this problem, and Laplace smoothing is the most common choice; in this paper, however, we propose a Wikipedia-based semantic smoothing approach. In our study we extend the semantic approach by using Wikipedia article titles that exist in training documents, together with the categories and redirects of these articles, as topic signatures. Results of extensive experiments show that our approach improves the performance of NB and can even exceed the accuracy of SVM on the Twitter Sentiment 140 dataset. 0 0
A multi-layer text classification framework based on two-level representation model Jiali Yun
Liping Jing
Jian Yu
Houkuan Huang
Expert Systems with Applications English 2012 Text categorization is one of the most common themes in data mining and machine learning fields. Unlike structured data, unstructured text data is more difficult to analyze because it contains complicated syntactic and semantic information. In this paper, we propose a two-level representation model (2RM) to represent text data: one level represents syntactic information and the other represents semantic information. Each document, at the syntactic level, is represented as a term vector where the value of each component is the term frequency and inverse document frequency. The Wikipedia concepts related to terms at the syntactic level are used to represent the document at the semantic level. Meanwhile, we designed a multi-layer classification framework (MLCLA) to make use of the semantic and syntactic information represented in the 2RM model. The MLCLA framework contains three classifiers. Two of them are applied to the syntactic level and the semantic level in parallel. The outputs of these two classifiers are combined and input to the third classifier, which produces the final results. Experimental results on benchmark data sets (20Newsgroups, Reuters-21578 and Classic3) have shown that the proposed 2RM model plus MLCLA framework improves text classification performance compared with the existing flat text representation models (Term-based VSM, Term Semantic Kernel Model, Concept-based VSM, Concept Semantic Kernel Model and Term + Concept VSM) plus existing classification methods. 0 0
Document classification by computing an echo in a very simple neural network Brouard C. Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI English 2012 In this paper we present a new classification system called ECHO. This system is based on a principle of echo and applied to document classification. It computes the score of a document for a class by combining a bottom-up and a top-down propagation of activation in a very simple neural network. This system bridges a gap between Machine Learning methods and Information Retrieval since the bottom-up and the top-down propagations can be seen as the measures of the specificity and exhaustivity which underlie the models of relevance used in Information Retrieval. The system has been tested on the Reuters 21578 collection and in the context of an international challenge on large scale hierarchical text classification with corpus extracted from Dmoz and Wikipedia. Its comparison with other classification systems has shown its efficiency. 0 0
Exploiting Turkish Wikipedia as a semantic resource for text classification Poyraz M.
Ganiz M.C.
Akyokus S.
Gorener B.
Kilimci Z.H.
INISTA 2012 - International Symposium on INnovations in Intelligent SysTems and Applications English 2012 Majority of the existing text classification algorithms are based on the "bag of words" (BOW) approach, in which the documents are represented as weighted occurrence frequencies of individual terms. However, semantic relations between terms are ignored in this representation. There are several studies which address this problem by integrating background knowledge such as WordNet, ODP or Wikipedia as a semantic source. However, vast majority of these studies are applied to English texts and to the date there are no similar studies on classification of Turkish documents. We empirically analyze the effect of using Turkish Wikipedia (Vikipedi) as a semantic resource in classification of Turkish documents. Our results demonstrate that performance of classification algorithms can be improved by exploiting Vikipedi concepts. Additionally, we show that Vikipedi concepts have surprisingly large coverage in our datasets which mostly consist of Turkish newspaper articles. 0 0
Feature selection in text categorization based on cloud model Yuanyuan Liu
Hao L.
Dongjie Zhao
Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, ICUIMC'12 English 2012 In text domains, effective feature selection uses a small amount of core information to represent the text itself. This paper presents a new feature selection method based on the cloud model for classifying text gathered from Wikipedia featured articles. The results reveal that the new feature selection metric based on the cloud model outperformed the others. It can use a few features to represent text and achieve good classification results. 0 0
Infobox suggestion for Wikipedia entities Sultana A.
Hasan Q.M.
Biswas A.K.
Sanmay Das
Rahman H.
Ding C.
Chenliang Li
ACM International Conference Proceeding Series English 2012 Given the sheer amount of work and expertise required in authoring Wikipedia articles, automatic tools that help Wikipedia contributors in generating and improving content are valuable. This paper presents our initial step towards building a full-fledged author assistant, particularly for suggesting infobox templates for articles. We build SVM classifiers to suggest infobox template types, among a large number of possible types, to Wikipedia articles without infoboxes. Different from prior works on Wikipedia article classification which deal with only a few label classes for named entity recognition, the much larger 337-class setup in our study is geared towards realistic deployment of infobox suggestion tool. We also emphasize testing on articles without infoboxes, due to that labeled and unlabeled data exhibit different distributions of features, which departs from the typical assumption that they are drawn from the same underlying population. 0 0
On empirical tradeoffs in large scale hierarchical classification Babbar R.
Partalas I.
Gaussier E.
Amblard C.
ACM International Conference Proceeding Series English 2012 While multi-class categorization of documents has been of research interest for over a decade, relatively few approaches have been proposed for large scale taxonomies in which the number of classes ranges from hundreds of thousands, as in Directory Mozilla, to over a million in Wikipedia. As a result of the ever increasing number of text documents and images from various sources, there is an immense need for automatic classification of documents in such large hierarchies. In this paper, we analyze the tradeoffs between the important characteristics of different classifiers employed in the top down fashion. The properties for relative comparison of these classifiers include (i) accuracy on test instances, (ii) training time, (iii) size of the model and (iv) test time required for prediction. Our analysis is motivated by the well known error bounds from learning theory, and is further reinforced by empirical observations on the publicly available data from the Large Scale Hierarchical Text Classification Challenge. We show that by exploiting the data heterogeneity across the large scale hierarchies, one can build an overall classification system which is approximately 4 times faster for prediction and 3 times faster to train, while sacrificing only 1 percentage point in accuracy. 0 0
Semantic based category-keywords list enrichment for document classification Pandey U.
Chakraverty S.
Mihani R.
Arya R.
Rathee S.
Sharma R.K.
Advances in Intelligent and Soft Computing English 2012 In this paper we present a text categorization technique that extracts semantic features of documents to generate a compact set of keywords and uses the information obtained from those keywords to perform text classification. The algorithm reduces the dimensionality of the document representation using overlapping semantics. Later, a keyword-category relationship matrix computes the extent of membership of the documents for various input predefined categories. The category of the document is then derived from membership metrics. Also, Wikipedia is used for the purpose of category lists enrichment. The proposed work has shown a new direction towards document classification for web applications. 1 0
Text classification using Wikipedia knowledge Su C.
Yanne P.
YanChun Zhang
ICIC Express Letters, Part B: Applications English 2012 In the real world, there are large amounts of unlabeled text documents, but traditional approaches usually require a lot of labeled documents, which are expensive to obtain. In this paper we propose an approach using the Wikipedia for text classification. We firstly extract the related wiki documents with the given keywords, then label the documents with the representative features selected from the related wiki documents, and finally build an SVM text classifier. Experimental results on 20-Newsgroup dataset show that the proposed method performs well and stably. 0 0
A new approach for Arabic text classification using Arabic field-association terms Atlam E.-S.
Kazuhiro Morita
Masao Fuketa
Aoe J.-I.
Journal of the American Society for Information Science and Technology English 2011 Field-association (FA) terms give us the knowledge to identify document fields using a limited set of discriminating terms. Although many earlier methods tried to extract automatically relevant FA terms to build a comprehensive dictionary, the problem lies in the lack of an effective method to extract automatically relevant FA terms to build a comprehensive dictionary. Moreover, all previous studies are based on FA terms in English and Japanese, and the extension of FA terms to other languages such as Arabic could benefit future research in the field. We present a new method to build a comprehensive Arabic dictionary using part-of-speech, pattern rules, and corpora in Arabic language. Experimental evaluation is carried out for various fields using 251 MB of domain-specific corpora obtained from Arabic Wikipedia dumps and Alhayah news selected average of 2,825 FA terms (single and compound) per field. From the experimental results, recall and precision are 84% and 79%, respectively. We propose amended text classification methodology based on field association terms. Our approach is compared with Naïve Bayes (NB) and kNN classifiers on 5,959 documents from Wikipedia dumps and Alhayah news. The new approach achieved a precision of 80.65% followed by NB (72.79%) and kNN (36.15%). 0 0
Cross lingual text classification by mining multilingual topics from Wikipedia Xiaochuan Ni
Sun J.-T.
Jian Hu
Zheng Chen
Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011 English 2011 This paper investigates how to effectively do cross lingual text classification by leveraging a large scale and multilingual knowledge base, Wikipedia. Based on the observation that each Wikipedia concept is described by documents of different languages, we adapt existing topic modeling algorithms for mining multilingual topics from this knowledge base. The extracted topics have multiple types of representations, with each type corresponding to one language. In this work, we regard such topics extracted from Wikipedia documents as universal-topics, since each topic corresponds with same semantic information of different languages. Thus new documents of different languages can be represented in a space using a group of universal-topics. We use these universal-topics to do cross lingual text classification. Given the training data labeled for one language, we can train a text classifier to classify the documents of another language by mapping all documents of both languages into the universal-topic space. This approach does not require any additional linguistic resources, like bilingual dictionaries, machine translation tools, or labeling data for the target language. The evaluation results indicate that our topic modeling approach is effective for building cross lingual text classifier. 0 0
Discovering context: Classifying tweets through a semantic transform based on wikipedia Yegin Genc
Yasuaki Sakamoto
Nickerson J.V.
Lecture Notes in Computer Science English 2011 By mapping messages into a large context, we can compute the distances between them, and then classify them. We test this conjecture on Twitter messages: Messages are mapped onto their most similar Wikipedia pages, and the distances between pages are used as a proxy for the distances between messages. This technique yields more accurate classification of a set of Twitter messages than alternative techniques using string edit distance and latent semantic analysis. 0 0
Discovering context: classifying tweets through a semantic transform based on wikipedia Yegin Genc
Yasuaki Sakamoto
Jeffrey V. Nickerson
FAC English 2011 0 0
Enhancing concept based modeling approach for blog classification Ayyasamy R.K.
Alhashmi S.M.
Eu-Gene S.
Tahayna B.
Advances in Intelligent and Soft Computing English 2011 Blogs are user-generated content that discusses various topics. For the past 10 years, social web content has been growing at a fast pace, and research projects are finding ways to channel this information using text classification techniques. Existing classification techniques follow only Boolean (or crisp) logic. This paper extends our previous work with a framework where fuzzy clustering is optimized with fuzzy similarity to perform blog classification. The knowledge base Wikipedia, widely accepted by the research community, was used for our feature selection and classification. Our experimental results prove that the proposed framework significantly improves precision and recall in classifying blogs. 0 0
Improving Farsi multiclass text classification using a thesaurus and two-stage feature selection Maghsoodi N.
Homayounpour M.M.
Journal of the American Society for Information Science and Technology English 2011 The progressive increase of information content has recently made it necessary to create a system for automatic classification of documents. In this article, a system is presented for the categorization of multiclass Farsi documents that requires fewer training examples and can help to compensate the shortcoming of the standard training dataset. The new idea proposed in the present article is based on extending the feature vector by adding some words extracted from a thesaurus and then filtering the new feature vector by applying secondary feature selection to discard inappropriate features. In fact, a phase of secondary feature selection is applied to choose more appropriate features among the features added from a thesaurus to enhance the effect of using a thesaurus on the efficiency of the classifier. To evaluate the proposed system, a corpus is gathered from the Farsi Wikipedia website and some articles in the Hamshahri newspaper, the Roshd periodical, and the Soroush magazine. In addition to studying the role of a thesaurus and applying secondary feature selection, the effect of a various number of categories, size of the training dataset, and average number of words in the test data also are examined. As the results indicate, classification efficiency improves by applying this approach, especially when available data is not sufficient for some text categories. 0 0
On using crowdsourcing and active learning to improve classification performance Costa J.
Silva C.
Antunes M.
Ribeiro B.
International Conference on Intelligent Systems Design and Applications, ISDA English 2011 Crowdsourcing is an emergent trend for general-purpose classification problem solving. Over the past decade, this notion has been embodied by enlisting a crowd of humans to help solve problems. There are a growing number of real-world problems that take advantage of this technique, such as Wikipedia, Linux or Amazon Mechanical Turk. In this paper, we evaluate its suitability for classification, namely if it can outperform state-of-the-art models by combining it with active learning techniques. We propose two approaches based on crowdsourcing and active learning and empirically evaluate the performance of a baseline Support Vector Machine when active learning examples are chosen and made available for classification to a crowd in a web-based scenario. The proposed crowdsourcing active learning approach was tested with Jester data set, a text humour classification benchmark, resulting in promising improvements over baseline results. 0 0
Selective integration of background knowledge in TCBR systems Patelia A.
Chakraborti S.
Wiratunga N.
Lecture Notes in Computer Science English 2011 This paper explores how background knowledge from freely available web resources can be utilised for Textual Case Based Reasoning. The work reported here extends the existing Explicit Semantic Analysis approach to representation, where textual content is represented using concepts with correspondence to Wikipedia articles. We present approaches to identify Wikipedia pages that are likely to contribute to the effectiveness of text classification tasks. We also study the effect of modelling semantic similarity between concepts (amounting to Wikipedia articles) empirically. We conclude with the observation that integrating background knowledge from resources like Wikipedia into TCBR tasks holds a lot of promise as it can improve system effectiveness even without elaborate manual knowledge engineering. Significant performance gains are obtained using a very small number of features that have very strong correspondence to how humans describe the domain. 0 0
Two birds with one stone: Learning semantic models for text categorization and word sense disambiguation Roberto Navigli
Stefano Faralli
Aitor Soroa
Oier De Lacalle
Eneko Agirre
International Conference on Information and Knowledge Management, Proceedings English 2011 In this paper we present a novel approach to learning semantic models for multiple domains, which we use to categorize Wikipedia pages and to perform domain Word Sense Disambiguation (WSD). In order to learn a semantic model for each domain we first extract relevant terms from the texts in the domain and then use these terms to initialize a random walk over the WordNet graph. Given an input text, we check the semantic models, choose the appropriate domain for that text and use the best-matching model to perform WSD. Our results show considerable improvements on text categorization and domain WSD tasks. 0 0
Using thesaurus to improve multiclass text classification Maghsoodi N.
Homayounpour M.M.
Lecture Notes in Computer Science English 2011 With the growing amount of textual information available on the Internet, the importance of automatic text classification has been increasing in the last decade. In this paper, a system is presented for the classification of multi-class Farsi documents which uses a Support Vector Machine (SVM) classifier. The new idea proposed in the present paper is based on extending the feature vector by adding some words extracted from a thesaurus. The goal is to assist the classifier when the training dataset is not comprehensive for some categories. For corpus preparation, the Farsi Wikipedia website and articles of some archived newspapers and magazines are used. As the results indicate, classification efficiency improves by applying this approach. A micro F-measure of 0.89 was achieved for classification of 10 categories of Farsi texts. 0 0
Geographical classification of documents using evidence from Wikipedia Odon De Alencar R.
Davis Jr. C.A.
Goncalves M.A.
Proceedings of the 6th Workshop on Geographic Information Retrieval, GIR'10 English 2010 Obtaining or approximating a geographic location for search results often motivates users to include place names and other geography-related terms in their queries. Previous work shows that queries that include geography-related terms correspond to a significant share of the users' demand. Therefore, it is important to recognize the association of documents to places in order to adequately respond to such queries. This paper describes strategies for text classification into geography-related categories, using evidence extracted from Wikipedia. We use terms that correspond to entry titles and the connections between entries in Wikipedia's graph to establish a semantic network from which classification features are generated. Results of experiments using a news data-set, classified over Brazilian states, show that such terms constitute valid evidence for the geographical classification of documents, and demonstrate the potential of this technique for text classification. 0 0
Mining wikipedia knowledge to improve document indexing and classification Ayyasamy R.K.
Tahayna B.
Alhashmi S.
Eu-Gene S.
Egerton S.
10th International Conference on Information Sciences, Signal Processing and their Applications, ISSPA 2010 English 2010 Web logs are an important source of information that requires automatic techniques to categorize them into "topic-based" content, to facilitate their future browsing and retrieval. In this paper we propose and illustrate the effectiveness of a new tf.idf measure. The proposed Conf.idf, Catf.idf measures are solely based on the mapping of terms-to-concepts-to- categories (TCONCAT) method that utilizes Wikipedia. The Knowledge base-Wikipedia is considered as a large scale Web encyclopaedia, that has high-quality and huge number of articles and categorical indexes. Using this system, our proposed framework consists of two stages to solve weblog classification problem. The first stage is to find out the terms belonging to a unique concept (article), as well as to disambiguate the terms belonging to more than one concept. The second stage is the determination of the categories to which these found concepts belong to. Experimental result confirms that, proposed system can distinguish the web logs that belongs to more than one category efficiently and has a better performance and success than the traditional statistical Natural Language Processing-NLP approaches. 0 0
Semantic enrichment of text representation with wikipedia for text classification Yamakawa H.
Peng J.
Feldman A.
Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics English 2010 Text classification is a widely studied topic in the area of machine learning. A number of techniques have been developed to represent and classify text documents. Most of the techniques try to achieve good classification performance while taking a document only by its words (e.g. statistical analysis on word frequency and distribution patterns). One of the recent trends in text classification research is to incorporate more semantic interpretation in text classification, especially by using Wikipedia. This paper introduces a technique for incorporating the vast amount of human knowledge accumulated in Wikipedia into text representation and classification. The aim is to improve classification performance by transforming general terms into a set of related concepts grouped around semantic themes. In order to achieve this goal, this paper proposes a unique method for breaking the enormous amount of extracted Wikipedia knowledge (concepts) into smaller pieces (subsets of concepts). The subsets of concepts are separately used to represent the same set of documents in a number of different ways, from which an ensemble of classifiers is built. Experimental results show that an ensemble of classifiers individually trained on a different representation of the document set performs better with increased accuracy and stability than that of a classifier trained only on the original document set. 0 0
Symbolic representation of text documents Guru D.S.
Harish B.S.
Manjunath S.
COMPUTE 2010 - The 3rd Annual ACM Bangalore Conference English 2010 This paper presents a novel method of representing a text document by the use of interval-valued symbolic features. A method of classification of text documents based on the proposed representation is also presented. The newly proposed model significantly reduces the dimension of feature vectors and also the time taken to classify a given document. Further, extensive experiments are conducted on the vehicles-wikipedia dataset to evaluate the performance of the proposed model. The experimental results reveal that the obtained results are on par with the existing results for the vehicles-wikipedia dataset. However, the advantage of the proposed model is that it takes relatively less time for classification, as it is based on a simple matching strategy. 0 0
Translingual document representations from discriminative projections Platt J.C.
Yih W.-T.
Toutanova K.
EMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference English 2010 Representing documents by vectors that are independent of language enhances machine translation and multilingual text categorization. We use discriminative training to create a projection of documents from multiple languages into a single translingual vector space. We explore two variants to create these projections: Oriented Principal Component Analysis (OPCA) and Coupled Probabilistic Latent Semantic Analysis (CPLSA). Both of these variants start with a basic model of documents (PCA and PLSA). Each model is then made discriminative by encouraging comparable document pairs to have similar vector representations. We evaluate these algorithms on two tasks: parallel document retrieval for Wikipedia and Europarl documents, and cross-lingual text classification on Reuters. The two discriminative variants, OPCA and CPLSA, significantly outperform their corresponding baselines. The largest differences in performance are observed on the task of retrieval when the documents are only comparable and not parallel. The OPCA method is shown to perform best. 0 0
Building a text classifier by a keyword and Wikipedia knowledge Qiang Qiu
YanChun Zhang
Junping Zhu
Qu W.
Lecture Notes in Computer Science English 2009 Traditional approaches for building text classifiers usually require a lot of labeled documents, which are expensive to obtain. In this paper, we propose a new text classification approach based on a keyword and Wikipedia knowledge, so as to avoid labeling documents manually. First, we retrieve a set of related documents about the keyword from Wikipedia. Then, with the help of related Wikipedia pages, more positive documents are extracted from the unlabeled documents. Finally, we train a text classifier with these positive documents and unlabeled documents. The experimental results on the 20Newsgroup dataset show that the proposed approach performs very competitively compared with NB-SVM, a PU learner, and NB, a supervised learner. 0 0
Ontology evaluation through text classification Netzer Y.
Gabay D.
Adler M.
Goldberg Y.
Elhadad M.
Lecture Notes in Computer Science English 2009 We present a new method to evaluate a search ontology, which relies on mapping ontology instances to textual documents. On the basis of this mapping, we evaluate the adequacy of ontology relations by measuring their classification potential over the textual documents. This data-driven method provides concrete feedback to ontology maintainers and a quantitative estimation of the functional adequacy of the ontology relations towards search experience improvement. We specifically evaluate whether an ontology relation can help a semantic search engine support exploratory search. We test this ontology evaluation method on an ontology in the Movies domain, that has been acquired semi-automatically from the integration of multiple semi-structured and textual data sources (e.g., IMDb and Wikipedia). We automatically construct a domain corpus from a set of movie instances by crawling the Web for movie reviews (both professional and user reviews). The 1-1 relation between textual documents (reviews) and movie instances in the ontology enables us to translate ontology relations into text classes. We verify that the text classifiers induced by key ontology relations (genre, keywords, actors) achieve high performance and exploit the properties of the learned text classifiers to provide concrete feedback on the ontology. The proposed ontology evaluation method is general and relies on the possibility to automatically align textual documents to ontology instances. 0 0
Towards a universal text classifier: Transfer learning using encyclopedic knowledge Pu Wang
Carlotta Domeniconi
ICDM Workshops 2009 - IEEE International Conference on Data Mining English 2009 Document classification is a key task for many text mining applications. However, traditional text classification requires labeled data to construct reliable and accurate classifiers. Unfortunately, labeled data are seldom available. In this work, we propose a universal text classifier, which does not require any labeled document. Our approach simulates the capability of people to classify documents based on background knowledge. As such, we build a classifier that can effectively group documents based on their content, under the guidance of few words describing the classes of interest. Background knowledge is modeled using encyclopedic knowledge, namely Wikipedia. The universal text classifier can also be used to perform document retrieval. In our experiments with real data we test the feasibility of our approach for both the classification and retrieval tasks. 0 0
Using Wikipedia knowledge to improve text classification Pu Wang
Jian Hu
Hua-Jun Zeng
Zheng Chen
Knowl. Inf. Syst. English 2009 Text classification has been widely used to assist users with the discovery of useful information from the Internet. However, traditional classification methods are based on the "Bag of Words" (BOW) representation, which only accounts for term frequency in the documents, and ignores important semantic relationships between key terms. To overcome this problem, previous work attempted to enrich text representation by means of manual intervention or automatic document expansion. The achieved improvement is unfortunately very limited, due to the poor coverage capability of the dictionary, and to the ineffectiveness of term expansion. In this paper, we automatically construct a thesaurus of concepts from Wikipedia. We then introduce a unified framework to expand the BOW representation with semantic relations (synonymy, hyponymy, and associative relations), and demonstrate its efficacy in enhancing previous approaches for text classification. Experimental results on several data sets show that the proposed approach, integrated with the thesaurus built from Wikipedia, can achieve significant improvements with respect to the baseline algorithm. 0 0
VideoCLEF 2008: ASR classification with wikipedia categories Kürsten J.
Richter D.
Eibl M.
Lecture Notes in Computer Science English 2009 This article describes our participation in the VideoCLEF track. We designed and implemented a prototype for the classification of the Video ASR data. Our approach was to regard the task as a text classification problem. We used terms from Wikipedia categories as training data for our text classifiers. For the text classification, the Naive-Bayes and kNN classifiers from the WEKA toolkit were used. We submitted experiments for classification tasks 1 and 2. For the translation of the feeds to English (translation task), Google's AJAX language API was used. Although our experiments achieved only low precision of 10 to 15 percent, we assume those results will be useful in a combined setting with the retrieval approach that was widely used. Interestingly, we could not improve the quality of the classification by using the provided metadata. 0 0
Wikipedia as an Ontology for Describing Documents Zareen Syed
Tim Finin
Anupam Joshi
Proceedings of the Second International Conference on Weblogs and Social Media, AAAI, March 31, 2008 2008 Identifying topics and concepts associated with a set of documents is a task common to many applications. It can help in the annotation and categorization of documents and be used to model a person's current interests for improving search results, business intelligence or selecting appropriate advertisements. One approach is to associate a document with a set of topics selected from a fixed ontology or vocabulary of terms. We have investigated using Wikipedia's articles and associated pages as a topic ontology for this purpose. The benefits are that the ontology terms are developed through a social process, maintained and kept current by the Wikipedia community, represent a consensus view, and have meaning that can be understood simply by reading the associated Wikipedia page. We use Wikipedia articles and the category and article link graphs to predict concepts common to a set of documents. We describe several algorithms to aggregate and refine results, including the use of spreading activation to select the most appropriate terms. While the Wikipedia category graph can be used to predict generalized concepts, the article links graph helps by predicting more specific concepts and concepts not in the category hierarchy. Our experiments demonstrate the feasibility of extending the category system with new concepts identified as a union of pages from the page link graph. 0 0
Improving text classification by using encyclopedia knowledge Pu Wang
Jian Hu
Zeng H.-J.
Long Chen
Zheng Chen
Proceedings - IEEE International Conference on Data Mining, ICDM English 2007 The exponential growth of text documents available on the Internet has created an urgent need for accurate, fast, and general-purpose text classification algorithms. However, the "bag of words" representation used for these classification methods is often unsatisfactory, as it ignores relationships between important terms that do not co-occur literally. To deal with this problem, we integrate background knowledge (in our application, Wikipedia) into the process of classifying text documents. The experimental evaluation on Reuters newsfeeds and several other corpora shows that our classification results with encyclopedia knowledge are much better than the baseline "bag of words" methods. 0 0