Houkuan Huang

Houkuan Huang is an author.

Publications

Only those publications related to wikis are shown here.

Title: A multi-layer text classification framework based on two-level representation model
Keywords: Multi-layer classification; Semantics; Text classification; Text representation; Wikipedia
Published in: Expert Systems with Applications
Language: English
Date: 2012
Abstract: Text categorization is one of the most common themes in data mining and machine learning. Unlike structured data, unstructured text is harder to analyze because it carries both syntactic and semantic information. In this paper, we propose a two-level representation model (2RM) for text data: one level represents syntactic information and the other represents semantic information. At the syntactic level, each document is represented as a term vector whose components are term frequency-inverse document frequency (tf-idf) values. At the semantic level, each document is represented by the Wikipedia concepts related to its syntactic-level terms. We also design a multi-layer classification framework (MLCLA) that exploits the syntactic and semantic information captured by the 2RM model. MLCLA contains three classifiers: two are applied to the syntactic and semantic levels in parallel, and their outputs are combined and fed to a third classifier, which produces the final result. Experimental results on benchmark data sets (20Newsgroups, Reuters-21578 and Classic3) show that the proposed 2RM model plus the MLCLA framework improves text classification performance compared with existing flat text representation models (Term-based VSM, Term Semantic Kernel Model, Concept-based VSM, Concept Semantic Kernel Model and Term + Concept VSM) combined with existing classification methods.
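
The abstract describes the pipeline concretely enough to sketch. Below is a minimal, illustrative stand-in for the 2RM plus MLCLA idea, not the authors' implementation: the toy corpus, the term-to-Wikipedia-concept lookup, and the choice of logistic regression for all three classifiers are assumptions the paper does not prescribe.

 # Minimal sketch of the 2RM + MLCLA idea (illustrative, not the paper's code).
 import numpy as np
 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.linear_model import LogisticRegression

 docs = ["cheap laptop deal posted", "parliament passed the budget law",
         "new gpu benchmark results", "senate votes on the tax bill"]
 labels = np.array([0, 1, 0, 1])  # 0 = technology, 1 = politics

 # Syntactic level: tf-idf term vectors.
 X_syn = TfidfVectorizer().fit_transform(docs).toarray()

 # Semantic level: map terms to related Wikipedia concepts
 # (hypothetical lookup table; a real system would query Wikipedia).
 term2concept = {"laptop": "Computer", "gpu": "Computer",
                 "parliament": "Government", "senate": "Government",
                 "budget": "Government", "tax": "Government"}
 concepts = sorted(set(term2concept.values()))

 def concept_vector(doc):
     vec = np.zeros(len(concepts))
     for term in doc.split():
         if term in term2concept:
             vec[concepts.index(term2concept[term])] += 1.0
     return vec

 X_sem = np.array([concept_vector(d) for d in docs])

 # MLCLA: two base classifiers run on the two levels in parallel...
 clf_syn = LogisticRegression().fit(X_syn, labels)
 clf_sem = LogisticRegression().fit(X_sem, labels)

 # ...and their probability outputs are combined and fed to a third
 # classifier (trained here on the same data for brevity; a real run
 # would train it on held-out base-classifier outputs).
 meta = np.hstack([clf_syn.predict_proba(X_syn), clf_sem.predict_proba(X_sem)])
 clf_final = LogisticRegression().fit(meta, labels)
 print(clf_final.predict(meta))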

Title: Analysis and enhancement of wikification for microblogs with context expansion
Keywords: Disambiguation context; Disambiguation to Wikipedia (D2W); Twitter
Published in: 24th International Conference on Computational Linguistics - Proceedings of COLING 2012: Technical Papers
Language: English
Date: 2012
Abstract: Disambiguation to Wikipedia (D2W) is the task of linking mentions of concepts in text to their corresponding Wikipedia entries. Most previous work has focused on linking terms in formal texts (e.g. newswire) to Wikipedia. Linking terms in short informal texts (e.g. tweets) is difficult for systems and humans alike because such texts lack a rich disambiguation context. We first evaluate an existing Twitter dataset as well as the D2W task in general. We then test the effects of two tweet context expansion methods, based on tweet authorship and topic-based clustering, on a state-of-the-art D2W system and evaluate the results.
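
The authorship-based expansion can be sketched in a few lines. The tweets, field names, and pooling strategy below are illustrative assumptions, not the paper's dataset or system:

 # Toy sketch of authorship-based context expansion for D2W.
 from collections import defaultdict

 tweets = [
     {"author": "alice", "text": "Watching the Giants game tonight"},
     {"author": "alice", "text": "Great pitching by the Giants bullpen"},
     {"author": "bob",   "text": "Giants of the Renaissance: Michelangelo"},
 ]

 # Pool each author's tweets into one larger disambiguation context.
 by_author = defaultdict(list)
 for t in tweets:
     by_author[t["author"]].append(t["text"])

 def expanded_context(tweet):
     """Return the tweet's text plus the author's other tweets."""
     return " ".join(by_author[tweet["author"]])

 # A D2W system would disambiguate the mention "Giants" against this
 # richer context (a baseball team for alice, but not for bob).
 print(expanded_context(tweets[0]))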

Title: Document Topic Extraction Based on Wikipedia Category
Keywords: Topic Extraction; Document Representation; Wikipedia Category; Semantic relatedness
Published in: CSO
Language: English
Date: 2011

Title: Multi-view LDA for semantics-based document representation
Keywords: Latent dirichlet allocation; Semantics; Topic model; Wikipedia category
Published in: Journal of Computational Information Systems
Language: English
Date: 2011
Abstract: Latent Dirichlet Allocation (LDA) models each document and word as a mixture of topics, but it incorporates no external semantic information. In this paper, we represent documents in two feature spaces, consisting of words and Wikipedia categories respectively, and propose a new method called Multi-View LDA (M-LDA), which combines LDA with explicit human-defined concepts in Wikipedia. M-LDA improves the document topic model by taking advantage of both feature spaces and the mapping relationship between them. Experimental results on classification and clustering tasks show that M-LDA outperforms traditional LDA.
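
As a rough illustration of the two views M-LDA combines, the sketch below builds a word space and a Wikipedia-category space and runs standard LDA on their concatenation. The word-to-category mapping is hypothetical, and the paper's M-LDA uses its own multi-view inference rather than this concatenation shortcut:

 # Build the two document views; run plain LDA on them as a crude baseline.
 import numpy as np
 from sklearn.feature_extraction.text import CountVectorizer
 from sklearn.decomposition import LatentDirichletAllocation

 docs = ["stock market prices fell", "team wins championship final",
         "investors fear market crash", "coach praises team defense"]

 # Hypothetical word -> Wikipedia-category mapping.
 word2cat = {"stock": "Finance", "market": "Finance", "investors": "Finance",
             "team": "Sports", "championship": "Sports", "coach": "Sports"}

 word_counts = CountVectorizer().fit_transform(docs).toarray()
 cats = sorted(set(word2cat.values()))
 cat_counts = np.array([[sum(word2cat.get(w) == c for w in d.split())
                         for c in cats] for d in docs])

 X = np.hstack([word_counts, cat_counts])  # the two views, side by side
 lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
 print(lda.transform(X))  # per-document topic mixtures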

Title: Unsupervised feature weighting based on local feature relatedness
Keywords: Feature Relatedness; Feature Weighting; Semantics; Text Clustering
Published in: Lecture Notes in Computer Science
Language: English
Date: 2011
Abstract: Feature weighting plays an important role in text clustering. Traditional feature weighting is determined by the syntactic relationship between a feature and a document (e.g. TF-IDF). In this paper, a semantically enriched feature weighting approach is proposed that introduces the semantic relationship between a feature and a document, implemented by taking into account local feature relatedness: the relatedness between a feature and its contextual features within each individual document. Feature relatedness is measured by two methods, a document collection-based implicit relatedness measure and a Wikipedia link-based explicit relatedness measure. Experimental results on benchmark data sets show that the new feature weighting approach surpasses traditional syntactic feature weighting. Moreover, clustering quality can be further improved by linearly combining the syntactic and semantic factors. The new approach is also compared with two existing feature relatedness-based approaches, which consider global feature relatedness (relatedness in the entire feature space) and inter-document feature relatedness (relatedness between different documents) respectively. In the experiments, the new approach outperforms both related approaches in clustering quality at much lower computational cost.
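
A small sketch of the weighting idea, assuming a hand-made relatedness table in place of the paper's collection-based or Wikipedia link-based measures, and a simple linear mix of the syntactic and semantic factors:

 # Weight each term by mixing tf-idf with its average relatedness to the
 # other terms in the same document (local feature relatedness).
 import numpy as np
 from sklearn.feature_extraction.text import TfidfVectorizer

 docs = ["apple banana fruit", "apple iphone device", "banana fruit smoothie"]

 # Hypothetical relatedness scores; the paper derives these from the
 # document collection or from Wikipedia links.
 relatedness = {("apple", "banana"): 0.6, ("apple", "fruit"): 0.5,
                ("banana", "fruit"): 0.8, ("apple", "iphone"): 0.9,
                ("apple", "device"): 0.4, ("iphone", "device"): 0.7,
                ("banana", "smoothie"): 0.5, ("fruit", "smoothie"): 0.6}

 def rel(a, b):
     return relatedness.get((a, b), relatedness.get((b, a), 0.0))

 vec = TfidfVectorizer()
 tfidf = vec.fit_transform(docs).toarray()
 terms = vec.get_feature_names_out()

 alpha = 0.5  # linear mix between the syntactic and semantic factors
 weights = tfidf.copy()
 for i, doc in enumerate(docs):
     toks = doc.split()
     for j, t in enumerate(terms):
         if tfidf[i, j] > 0:
             ctx = [w for w in toks if w != t]  # contextual features
             sem = np.mean([rel(t, w) for w in ctx]) if ctx else 0.0
             weights[i, j] = alpha * tfidf[i, j] + (1 - alpha) * sem
 print(np.round(weights, 2))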

Title: Semantics-based representation model for multi-layer text classification
Keywords: Multi-layer Classification; Representation Model; Semantics; Text Classification; Wikipedia
Published in: Lecture Notes in Computer Science
Language: English
Date: 2010
Abstract: Text categorization is one of the most common themes in data mining and machine learning. Unlike structured data, unstructured text is harder to analyze because it carries both syntactic and semantic information. In this paper, we propose a semantics-based model that represents text data at two levels: one for syntactic information and the other for semantic information. The syntactic level represents each document as a term vector whose components record the tf-idf value of each term. The semantic level represents each document with the Wikipedia concepts related to its syntactic-level terms. The syntactic and semantic information are combined by our proposed multi-layer classification framework. Experimental results on a benchmark dataset (Reuters-21578) show that the proposed representation model plus the proposed classification framework improves text classification performance compared with flat text representation models (term VSM, concept VSM, term+concept VSM) combined with existing classification methods.

Title: Text clustering via term semantic units
Keywords: Compact representation; Term semantic units; Text clustering
Published in: Proceedings - 2010 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2010
Language: English
Date: 2010
Abstract: How best to represent text data is an important problem in text mining tasks, including information retrieval, clustering and classification. In this paper, we propose a compact document representation based on term semantic units, which are identified from implicit and explicit semantic information. The implicit semantic information is extracted from syntactic content via statistical methods such as latent semantic indexing and the information bottleneck; the explicit semantic information is mined from an external semantic resource (Wikipedia). The proposed compact representation model can map a document collection into a low-dimensional space, since the number of term semantic units is much smaller than the number of unique terms. Experimental results on real data sets show that the compact representation improves the performance of text clustering.
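
The implicit half of this pipeline can be approximated with latent semantic indexing, as sketched below. The corpus is toy data, the explicit (Wikipedia-mined) units are omitted, and truncated SVD stands in for the paper's unit-identification methods:

 # LSI-style compression: cluster documents in a space far smaller than
 # the full term vocabulary.
 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.decomposition import TruncatedSVD
 from sklearn.cluster import KMeans

 docs = ["dog barks loudly", "cat purrs softly", "loud dog and cat",
         "stocks rise today", "markets fall fast", "stocks and markets move"]
 tfidf = TfidfVectorizer().fit_transform(docs)

 # Compress the term space into a handful of latent semantic dimensions.
 compact = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

 # Cluster the documents in the compact representation.
 print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(compact))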