Long Chen


Long Chen is an author.

Publications

Only those publications related to wikis are shown here.
Wiki3C: Exploiting Wikipedia for context-aware concept categorization
Keywords: Context-aware concept categorization; Text mining; Wikipedia
Published in: WSDM 2013 - Proceedings of the 6th ACM International Conference on Web Search and Data Mining (English, 2013)
Abstract: Wikipedia is an important human-generated knowledge base containing over 21 million articles organized by millions of categories. In this paper, we exploit Wikipedia for a new text mining task: context-aware concept categorization, in which concepts are categorized according to their context. We exploit Wikipedia's article links and category structure and introduce Wiki3C, an unsupervised, domain-independent approach to context-aware concept categorization. Within this approach, we investigate two strategies for selecting and filtering Wikipedia articles to represent categories, and employ a probabilistic model to compute the semantic relatedness between two concepts in Wikipedia. Experimental evaluation using manually labeled ground truth shows that Wiki3C achieves a noticeable improvement over baselines that do not consider contextual information.

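The abstract does not give the exact form of the paper's probabilistic relatedness model, so the following is only a minimal sketch of link-based semantic relatedness between two Wikipedia concepts, in the spirit of inlink-overlap measures; the function name, the toy inlink sets, and the article count are assumptions for illustration, not Wiki3C itself.

    import math

    def link_relatedness(inlinks_a, inlinks_b, total_articles):
        """Normalized inlink-overlap relatedness between two Wikipedia concepts.

        inlinks_a, inlinks_b: sets of article ids linking to concepts A and B.
        total_articles: number of articles in the Wikipedia snapshot.
        Returns a value in [0, 1]; higher means more related.
        """
        overlap = inlinks_a & inlinks_b
        if not overlap:
            return 0.0
        big = max(len(inlinks_a), len(inlinks_b))
        small = min(len(inlinks_a), len(inlinks_b))
        distance = (math.log(big) - math.log(len(overlap))) / \
                   (math.log(total_articles) - math.log(small))
        return max(0.0, 1.0 - distance)

    # Hypothetical toy example: two concepts sharing most of their inlinks.
    print(link_relatedness({1, 2, 3, 4}, {2, 3, 4, 5}, total_articles=21_000_000))
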
TCSST: Transfer classification of short & sparse text using external data
Keywords: Classification; External data; Short & sparse text mining; Transfer learning; Wikipedia
Published in: ACM International Conference Proceeding Series (English, 2012)
Abstract: Short & sparse text, such as search snippets, micro-blogs and product reviews, is becoming more prevalent on the web, and classifying it accurately has emerged as an important yet challenging task. Existing work alleviates data sparseness by utilizing external data (e.g. Wikipedia) and appending topics detected from it as new features. However, training a classifier on features concatenated from different spaces is not easy, because the features have different physical meanings and different significance to the classification task, and it exacerbates the "curse of dimensionality" problem. In this study, we propose a transfer classification method, TCSST, that exploits external data to tackle the data sparsity issue while learning the classifier in the original feature space. Since labels for the external data may not be readily available or sufficient, TCSST further exploits unlabeled external data to aid the transfer classification, using novel strategies to iteratively select high-quality unlabeled external data. We evaluate the performance of TCSST on both benchmark and real-world data sets. Experimental results demonstrate that the proposed method is effective in classifying very short & sparse text, consistently outperforming existing and baseline methods.

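TCSST's actual learning algorithm is not spelled out in the abstract; the sketch below only illustrates the general pattern it describes, i.e. iteratively adopting high-confidence unlabeled external texts into the training pool. It is a plain self-training loop over a toy nearest-centroid classifier; the data, threshold, and round count are hypothetical.

    from collections import Counter

    def centroid(vectors):
        """Average a list of bag-of-words Counters into one centroid vector."""
        total = Counter()
        for v in vectors:
            total.update(v)
        return {term: count / len(vectors) for term, count in total.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = sum(w * w for w in a.values()) ** 0.5
        nb = sum(w * w for w in b.values()) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    def self_train(labeled, unlabeled, rounds=3, threshold=0.6):
        """labeled: {label: [Counter, ...]}; unlabeled: [Counter, ...] from external data.
        Each round, confidently classified external texts join the training pool."""
        pool = {label: list(vecs) for label, vecs in labeled.items()}
        remaining = list(unlabeled)
        for _ in range(rounds):
            centroids = {label: centroid(vecs) for label, vecs in pool.items()}
            kept = []
            for vec in remaining:
                scores = {label: cosine(vec, c) for label, c in centroids.items()}
                best = max(scores, key=scores.get)
                if scores[best] >= threshold:   # high confidence: adopt as pseudo-labeled
                    pool[best].append(vec)
                else:
                    kept.append(vec)
            remaining = kept
        return pool

    # Toy usage: two labeled snippets plus two unlabeled external texts.
    snippets = {"sports": [Counter("match goal team".split())],
                "finance": [Counter("stock market price".split())]}
    external = [Counter("goal scored in the match".split()),
                Counter("price of the stock".split())]
    print({k: len(v) for k, v in self_train(snippets, external, threshold=0.5).items()})
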
STAIRS: Towards efficient full-text filtering and dissemination in DHT environments
Keywords: Content dissemination; Content filtering; DHT
Published in: VLDB Journal (English, 2011)
Abstract: Nowadays, "live" content such as weblogs, Wikipedia, and news is ubiquitous on the Internet, and providing users with relevant content in a timely manner becomes a challenging problem. Differing from Web search technologies and RSS feed/reader applications, this paper envisions a personalized full-text content filtering and dissemination system in a highly distributed environment such as a Distributed Hash Table (DHT) based Peer-to-Peer (P2P) network. Users subscribe to content of interest by specifying keywords and thresholds as filters; content is then disseminated to the users interested in it. In the literature, full-text document publishing in DHTs has long suffered from the high cost of forwarding a document to the home nodes of all its distinct terms, aggravated by the fact that a document contains a large number of distinct terms (typically tens to thousands per document). In this paper, we propose a set of novel techniques that overcome this high forwarding cost by carefully selecting a very small number of meaningful terms (or key features) among the candidate terms inside each document. To reduce the average hop count per forwarding, we further prune irrelevant documents along the forwarding path. Experiments based on two real query logs and two real data sets demonstrate the effectiveness of our solution.

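The paper's actual key-feature selection criteria are not described in the abstract. As a rough illustration of the idea of publishing a document only to the home nodes of a few selected terms rather than of every distinct term, here is a toy sketch that picks the top-k terms by TF-IDF and hashes them to node identifiers; the scoring function, the hash-to-node mapping, and all numbers are assumptions, not the STAIRS algorithm.

    import hashlib
    import math
    from collections import Counter

    def key_features(doc_terms, doc_freq, num_docs, k=5):
        """Select the k highest TF-IDF terms of a document as its key features."""
        tf = Counter(doc_terms)
        scores = {t: (1 + math.log(c)) * math.log(num_docs / (1 + doc_freq.get(t, 0)))
                  for t, c in tf.items()}
        return sorted(scores, key=scores.get, reverse=True)[:k]

    def home_node(term, num_nodes=1024):
        """Map a term to a DHT node id by hashing its bytes."""
        digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_nodes

    # Hypothetical document and document-frequency table.
    doc = "wiki based full text filtering and dissemination in a dht network".split()
    df = {"wiki": 120, "based": 2000, "full": 400, "text": 800, "filtering": 60,
          "and": 9000, "dissemination": 25, "in": 9500, "a": 9700, "dht": 15, "network": 900}
    for term in key_features(doc, df, num_docs=10_000, k=3):
        print(term, "-> node", home_node(term))
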
A Lucene and maximum entropy model based hedge detection system
Published in: CoNLL-2010: Shared Task - Fourteenth Conference on Computational Natural Language Learning, Proceedings of the Shared Task (English, 2010)
Abstract: This paper describes the approach to hedge detection that we developed in order to participate in the CoNLL-2010 shared task. A supervised learning approach is employed: hedge cue annotations in the training data are used as the seed to build a reliable hedge cue set, and a Maximum Entropy (MaxEnt) model is used as the learning technique to determine uncertainty. By making use of Apache Lucene, we are able to perform fuzzy string matching to extract hedge cues and to incorporate part-of-speech (POS) tags in hedge cues. Our system not only determines the certainty of a sentence but also finds all the hedges it contains. It was ranked third on the Wikipedia dataset. In later experiments with different parameters, we further improved our results to a 0.612 F-score on the Wikipedia dataset and a 0.802 F-score on the biological dataset.

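The system itself relies on Lucene and a MaxEnt model, neither of which is reproduced here; the snippet below only sketches the fuzzy cue-matching step the abstract mentions, using Python's standard difflib as a stand-in for Lucene. The cue list and cutoff are hypothetical.

    import difflib

    # Hypothetical seed set of single-word hedge cues (in the paper this set is
    # built from the hedge cue annotations in the training data).
    HEDGE_CUES = ["may", "might", "suggest", "possible", "probably", "appear", "likely"]

    def find_hedge_cues(sentence, cutoff=0.85):
        """Return tokens of the sentence that fuzzily match a known hedge cue."""
        hits = []
        for token in sentence.lower().split():
            if difflib.get_close_matches(token, HEDGE_CUES, n=1, cutoff=cutoff):
                hits.append(token)
        return hits

    def is_uncertain(sentence):
        """A sentence is flagged as uncertain if it contains at least one hedge cue."""
        return bool(find_hedge_cues(sentence))

    print(find_hedge_cues("These results suggests the drug may be effective."))
    print(is_uncertain("The protein binds DNA."))
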
Automatically weighting tags in XML collection
Keywords: Tag weighting model; Topic generalization; XML retrieval
Published in: International Conference on Information and Knowledge Management, Proceedings (English, 2010)
Abstract: In XML retrieval, nodes with different tags play different roles in XML documents, and thus tags should be reflected in relevance ranking. This paper proposes an automatic method to infer the weights of tags. We first investigate 15 tag-related features and then select five of them based on the correlations between these features and manually assigned tag weights. Using these features, a tag weight assignment model, ATG, is designed. We evaluate ATG on two real data sets, IEEECS and Wikipedia, from two perspectives: the quality of the model, measured by the correlation between the weights it generates and those given by experts, and its effectiveness in improving retrieval performance. Experimental results show that the tag weights generated by ATG are highly correlated with the manually assigned weights and that the ATG model improves retrieval effectiveness significantly.

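The 15 candidate features are not listed in the abstract; the sketch below only illustrates the selection step it describes, ranking candidate tag features by the strength of their correlation with manually assigned tag weights. Pearson correlation, the feature names, and the numbers are all assumptions made for the example.

    import math

    def pearson(xs, ys):
        """Pearson correlation coefficient between two equal-length sequences."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy) if sx and sy else 0.0

    # Rows: tags; columns: hypothetical features measured per tag.
    features = {
        "avg_text_length": [120, 45, 300, 15, 80],
        "depth_in_document": [2, 4, 1, 5, 3],
        "document_frequency": [0.9, 0.4, 0.95, 0.1, 0.6],
    }
    manual_weights = [0.8, 0.4, 0.9, 0.1, 0.5]   # expert-assigned weights for the same tags

    # Keep the features most strongly correlated with the manual weights.
    ranked = sorted(features, key=lambda f: abs(pearson(features[f], manual_weights)),
                    reverse=True)
    print(ranked)
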
Dynamic topic detection and tracking based on knowledge base
Keywords: Knowledge base; Topic detection; Topic tracking; Topic update
Published in: Proceedings - 2010 3rd IEEE International Conference on Broadband Network and Multimedia Technology, IC-BNMT2010 (English, 2010)
Abstract: To address the problem of sparse initial information when a topic model is first established, this paper builds a Wikipedia-based news event knowledge base. Referring to this knowledge base, we calculate the weights of the news model, measure similarity based on time distance, cluster along the timeline, and apply a dynamic threshold strategy to detect and track topics automatically in news material. Experimental results verify the validity of this method.

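The abstract gives no formulas, so the snippet below is only a generic illustration of combining content similarity with time distance when comparing a news story to a topic; the exponential decay and the decay constant are assumptions, not the paper's measure.

    import math

    def cosine(a, b):
        """Cosine similarity between two sparse term-weight dicts."""
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def time_aware_similarity(story_vec, topic_vec, story_time, topic_time, tau_hours=72.0):
        """Content similarity damped by the time distance between story and topic."""
        hours_apart = abs(story_time - topic_time) / 3600.0
        return cosine(story_vec, topic_vec) * math.exp(-hours_apart / tau_hours)

    # Toy term vectors and Unix timestamps 24 hours apart.
    story = {"earthquake": 0.7, "rescue": 0.5}
    topic = {"earthquake": 0.8, "aftershock": 0.4, "rescue": 0.3}
    print(time_aware_similarity(story, topic, 1_700_086_400, 1_700_000_000))
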
Wikipedia-based semantic smoothing for the language modeling approach to information retrieval
Keywords: Information retrieval; Language model; Wikipedia
Published in: ECIR (English, 2010)

Stairs: Towards efficient full-text filtering and dissemination in a DHT environment
Published in: Proceedings - International Conference on Data Engineering (English, 2009)
Abstract: Nowadays, content on the Internet such as weblogs, Wikipedia, and news sites is becoming "live", and notifying users of relevant content becomes a challenge. Unlike conventional Web search technology or RSS feeds, this paper envisions a personalized full-text content filtering and dissemination system in a highly distributed environment such as a Distributed Hash Table (DHT). Users subscribe to content of interest by specifying terms and threshold values for filtering; published content is then disseminated to the associated subscribers. We propose STAIRS, a novel and simple framework for filter registration and content publication. Within this framework, we propose three forwarding algorithms (default, dynamic, and adaptive) that reduce the forwarding cost and false dismissal rate while ensuring that subscribers receive the desired content without duplicates. In particular, adaptive forwarding utilizes filter information to significantly reduce the forwarding cost. Experiments based on two real query logs and two real datasets show the effectiveness of the proposed framework.

Improving text classification by using encyclopedia knowledge
Published in: Proceedings - IEEE International Conference on Data Mining, ICDM (English, 2007)
Abstract: The exponential growth of text documents available on the Internet has created an urgent need for accurate, fast, and general-purpose text classification algorithms. However, the "bag of words" representation used by these classification methods is often unsatisfactory, as it ignores relationships between important terms that do not co-occur literally. To deal with this problem, we integrate background knowledge (in our application, Wikipedia) into the process of classifying text documents. Experimental evaluation on Reuters newsfeeds and several other corpora shows that our classification results with encyclopedia knowledge are much better than those of the baseline "bag of words" methods.

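How the background knowledge is integrated is not detailed in the abstract. As a rough illustration of the general idea of enriching a bag-of-words representation with encyclopedia concepts, the toy sketch below maps surface phrases to Wikipedia article titles through a hand-made anchor dictionary and appends them as extra features; the dictionary contents and the feature naming are purely hypothetical.

    from collections import Counter

    # Hypothetical anchor-text dictionary: surface phrase -> Wikipedia concept.
    ANCHORS = {
        "crude oil": "Petroleum",
        "opec": "OPEC",
        "interest rate": "Interest_rate",
    }

    def enrich_with_concepts(text):
        """Bag of words plus CONCEPT: features for phrases found in the anchor dictionary."""
        lowered = text.lower()
        features = Counter(lowered.split())
        for phrase, concept in ANCHORS.items():
            if phrase in lowered:
                features["CONCEPT:" + concept] += 1
        return features

    print(enrich_with_concepts("OPEC agreed to cut crude oil output."))
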