Zheng Chen

From WikiPapers

Zheng Chen is an author.

Publications

Only those publications related to wikis are shown here.

Title: Heterogeneous graph-based intent learning with queries, web pages and Wikipedia concepts
Keywords: Heterogeneous graph clustering, Search intent, Wikipedia
Published in: WSDM 2014 - Proceedings of the 7th ACM International Conference on Web Search and Data Mining
Language: English
Date: 2014
Abstract: The problem of learning user search intents has attracted intensive attention from both industry and academia. However, state-of-the-art intent learning algorithms suffer from different drawbacks when only a single type of data source is used. For example, query text has difficulty distinguishing ambiguous queries, and search logs are biased by the order of search results and users' noisy click behaviors. In this work, we leverage, for the first time, three types of objects, namely queries, web pages and Wikipedia concepts, collaboratively for learning generic search intents, and construct a heterogeneous graph to represent multiple types of relationships between them. A novel unsupervised method called heterogeneous graph-based soft-clustering is developed to derive an intent indicator for each object based on the constructed heterogeneous graph. With the proposed co-clustering method, one can enhance the quality of intent understanding by taking advantage of different types of data, which complement each other, and make the implicit intents easier to interpret with explicit knowledge from Wikipedia concepts. Experiments on two real-world datasets demonstrate the power of the proposed method, which achieves a 9.25% improvement in terms of NDCG on the search ranking task and a 4.67% enhancement in terms of Rand index on the object co-clustering task compared to the best state-of-the-art method.
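
The abstract leaves the clustering machinery abstract; the sketch below illustrates the general flavour of the idea rather than the paper's method: a toy heterogeneous graph over queries, web pages and Wikipedia concepts is embedded spectrally, and a Gaussian mixture yields soft memberships as per-node intent indicators. The edges, the embedding choice and the mixture model are all assumptions made here for illustration.

```python
# Minimal sketch (not the paper's algorithm): build a heterogeneous graph over
# queries, web pages and Wikipedia concepts, embed it spectrally, and derive
# soft cluster memberships as a rough "intent indicator" for every node.
import networkx as nx
import numpy as np
from sklearn.manifold import spectral_embedding
from sklearn.mixture import GaussianMixture

G = nx.Graph()
# Toy relationships: query-page clicks and page-concept links (illustrative only).
G.add_edges_from([
    ("q:jaguar speed", "p:jaguar-cars.com"),
    ("q:jaguar speed", "p:wildlife.org/jaguar"),
    ("q:big cats", "p:wildlife.org/jaguar"),
    ("p:wildlife.org/jaguar", "c:Jaguar (animal)"),
    ("p:jaguar-cars.com", "c:Jaguar Cars"),
])

nodes = list(G.nodes())
A = nx.to_numpy_array(G, nodelist=nodes)
# Low-dimensional spectral embedding of the heterogeneous adjacency matrix.
X = spectral_embedding(A, n_components=2, random_state=0)
# Soft clustering: posterior probabilities act as per-node intent indicators.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
for node, probs in zip(nodes, gmm.predict_proba(X)):
    print(node, np.round(probs, 2))
```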

Title: Extracting PROV provenance traces from Wikipedia history pages
Keywords: Design
Published in: ACM International Conference Proceeding Series
Language: English
Date: 2013
Abstract: Wikipedia History pages contain provenance metadata that describes the history of revisions of each Wikipedia article. We have developed a simple extractor which, starting from a user-specified article page, crawls through the graph of its associated history pages, and encodes the essential elements of those pages according to the PROV data model. The crawling is performed on the live pages using the Wikipedia REST interface. The resulting PROV provenance graphs are stored in a graph database (Neo4J), where they can be queried using the Cypher graph query language (proprietary to Neo4J), or traversed programmatically using the Neo4J Java Traversal API.
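
The extractor itself is not reproduced here; the following is a rough sketch of the same idea, assuming the public MediaWiki API (action=query, prop=revisions) in place of whatever REST interface the authors used, and emitting PROV-DM-style statements as plain tuples instead of writing them to Neo4J.

```python
# Rough sketch (not the authors' extractor): pull an article's revision history
# from the public MediaWiki API and express it as PROV-DM-style statements
# (entities = revisions, agents = editors, derivations = the revision chain).
import requests

API = "https://en.wikipedia.org/w/api.php"

def revision_provenance(title, limit=10):
    params = {
        "action": "query", "prop": "revisions", "titles": title,
        "rvprop": "ids|user|timestamp", "rvlimit": limit,
        "rvdir": "newer", "format": "json",
    }
    data = requests.get(API, params=params, timeout=30).json()
    page = next(iter(data["query"]["pages"].values()))
    statements, previous = [], None
    for rev in page.get("revisions", []):
        entity = f"revision/{rev['revid']}"
        statements.append(("wasAttributedTo", entity, f"user/{rev.get('user', 'unknown')}"))
        statements.append(("generatedAtTime", entity, rev["timestamp"]))
        if previous is not None:
            statements.append(("wasDerivedFrom", entity, previous))
        previous = entity
    return statements

for s in revision_provenance("Provenance"):
    print(s)
```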

Title: The category structure in Wikipedia: To analyze and know how it grows
Keywords: Category structure, Complex network, Growth, Wikipedia
Published in: Lecture Notes in Computer Science
Language: English
Date: 2013
Abstract: Wikipedia is a well-known encyclopedia that has been applied in many fields for years, such as natural language processing. In this paper we use and analyze its category structure. We take important topological properties into account, such as the connectivity distribution. Most importantly, we analyze the growth of the structure from 2004 to 2012 in detail. To characterize this growth, basic properties and small-worldness are examined. Several edge attachment models based on the properties of nodes are tested in order to study how node properties influence the creation of edges. We also examine the anomalous data observed in 2011 and 2012 and investigate its causes. Our results offer useful insights into the structure and the growth of the category structure.
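
Snapshots of the real category network are not included here; the sketch below shows, on a synthetic stand-in graph, the kind of measurements the abstract refers to (connectivity distribution and small-world indicators). The Barabási-Albert stand-in and its parameters are assumptions for illustration only.

```python
# Illustrative only: the sort of topological measurements mentioned in the
# abstract (degree distribution, small-world indicators), run on a toy graph
# standing in for a yearly snapshot of the Wikipedia category network.
import collections
import networkx as nx

snapshot = nx.barabasi_albert_graph(n=2000, m=3, seed=42)  # stand-in for one year's snapshot

# Connectivity (degree) distribution.
degree_counts = collections.Counter(d for _, d in snapshot.degree())

# Small-world indicators, computed on the giant component.
giant = snapshot.subgraph(max(nx.connected_components(snapshot), key=len))
clustering = nx.average_clustering(giant)
path_length = nx.average_shortest_path_length(giant)

print("most common degrees:", degree_counts.most_common(5))
print(f"avg clustering = {clustering:.3f}, avg shortest path = {path_length:.2f}")
```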

Title: The category structure in Wikipedia: To analyze and know its quality using k-core decomposition
Keywords: Complex network, K-core, Overall topology, Quality, Wikipedia
Published in: Lecture Notes in Computer Science
Language: English
Date: 2013
Abstract: Wikipedia is a famous and free encyclopedia. A network based on its category structure is built and then analyzed from various aspects, such as the connectivity distribution and the evolution of the overall topology. As a novel contribution, a model based on k-core decomposition is used to analyze the evolution of the overall topology and to test the quality (that is, the error and attack tolerance) of the structure when nodes are removed. A model based on the removal of edges is used for comparison. Our results offer useful insights into the growth and the quality of the category structure, and suggest methods for organizing it better.
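
A minimal sketch of the two analyses named in the abstract, again on a synthetic stand-in for the category network: k-core decomposition, and error versus attack tolerance measured as the size of the giant component after random versus targeted high-degree node removal.

```python
# Sketch of k-core decomposition plus a simple error/attack tolerance test,
# both run on a toy graph standing in for the category network.
import random
import networkx as nx

G = nx.barabasi_albert_graph(n=1000, m=3, seed=1)  # stand-in graph

core = nx.core_number(G)                 # k-core index of every node
print("maximum core index:", max(core.values()))

def giant_size_after_removal(graph, nodes_to_drop):
    H = graph.copy()
    H.remove_nodes_from(nodes_to_drop)
    return len(max(nx.connected_components(H), key=len))

frac = 0.05
k = int(frac * G.number_of_nodes())
random.seed(0)
errors = random.sample(list(G.nodes()), k)                                # random failures
attacks = sorted(G.nodes(), key=lambda n: G.degree(n), reverse=True)[:k]  # targeted attack

print("giant component after random removal:", giant_size_after_removal(G, errors))
print("giant component after targeted removal:", giant_size_after_removal(G, attacks))
```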

Title: Towards accurate distant supervision for relational facts extraction
Published in: ACL 2013 - 51st Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
Language: English
Date: 2013
Abstract: Distant supervision (DS) is an appealing learning method which learns from existing relational facts to extract more from a text corpus. However, its accuracy is still not satisfactory. In this paper, we point out and analyze several critical factors in DS which have a great impact on accuracy, including valid entity type detection, negative training example construction and ensembles. We propose an approach to handle these factors. By experimenting on Wikipedia articles to extract the facts in Freebase (the top 92 relations), we show the impact of these three factors on the accuracy of DS and the remarkable improvement brought by the proposed approach.
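
The paper's pipeline is not reproduced here; the toy snippet below only illustrates two of the factors it discusses, valid entity type detection and negative example construction, using hand-made stand-ins for the knowledge base and the type inventory.

```python
# Toy illustration of distant-supervision labeling with a type filter:
# candidate entity pairs must have valid types for the relation, and pairs
# that co-occur but are absent from the KB become negative examples.
KB_FACTS = {("Barack Obama", "place_of_birth", "Honolulu")}
VALID_TYPES = {"place_of_birth": ("PERSON", "LOCATION")}
ENTITY_TYPES = {"Barack Obama": "PERSON", "Honolulu": "LOCATION", "Michelle Obama": "PERSON"}

def label_pair(subj, obj, relation="place_of_birth"):
    """Distant-supervision labeling sketch for one candidate pair."""
    want_subj, want_obj = VALID_TYPES[relation]
    if ENTITY_TYPES.get(subj) != want_subj or ENTITY_TYPES.get(obj) != want_obj:
        return None                      # invalid entity types: skip, do not mislabel
    if (subj, relation, obj) in KB_FACTS:
        return "positive"
    return "negative"                    # co-occurring pair not found in the KB

sentence_pairs = [("Barack Obama", "Honolulu"),
                  ("Michelle Obama", "Honolulu"),
                  ("Barack Obama", "Michelle Obama")]
for subj, obj in sentence_pairs:
    print(subj, "/", obj, "->", label_pair(subj, obj))
```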

Title: Cross lingual text classification by mining multilingual topics from Wikipedia
Keywords: Cross lingual text classification, Topic modeling, Universal-topics, Wikipedia
Published in: Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011
Language: English
Date: 2011
Abstract: This paper investigates how to effectively perform cross-lingual text classification by leveraging a large-scale, multilingual knowledge base, Wikipedia. Based on the observation that each Wikipedia concept is described by documents in different languages, we adapt existing topic modeling algorithms to mine multilingual topics from this knowledge base. The extracted topics have multiple types of representations, with each type corresponding to one language. In this work, we regard such topics extracted from Wikipedia documents as universal topics, since each topic carries the same semantic information across different languages. New documents of different languages can thus be represented in a space spanned by a group of universal topics, which we use for cross-lingual text classification. Given training data labeled for one language, we can train a text classifier to classify documents of another language by mapping all documents of both languages into the universal-topic space. This approach does not require any additional linguistic resources, such as bilingual dictionaries, machine translation tools, or labeled data for the target language. The evaluation results indicate that our topic modeling approach is effective for building cross-lingual text classifiers.
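
As a minimal stand-in for the universal-topic idea (not the authors' adapted topic model): each "concept" merges text in two languages into one document, LDA is fitted on the merged corpus, and a classifier trained on labeled English documents is applied to German documents in the shared topic space. All data, the choice of scikit-learn's LDA, and the tiny vocabulary are assumptions for illustration.

```python
# Stand-in for universal topics: fit LDA on bilingual "concept" documents,
# then classify documents of either language in the shared topic space.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

# Toy concept documents: English text plus German text about the same concept.
concept_docs = [
    "football match goal team fussball tor mannschaft spiel",
    "election vote parliament wahl stimme parlament regierung",
    "goal striker league tor stuermer liga verein",
    "minister policy government minister politik regierung gesetz",
]
vec = CountVectorizer()
X_concepts = vec.fit_transform(concept_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_concepts)

def to_topics(texts):
    return lda.transform(vec.transform(texts))

# Train on labeled English documents, then classify unlabeled German ones.
train_en = ["team scored a late goal", "parliament passed the new policy"]
labels = ["sports", "politics"]
test_de = ["die mannschaft schoss ein tor", "die regierung und das parlament"]

clf = LogisticRegression().fit(to_topics(train_en), labels)
print(clf.predict(to_topics(test_de)))   # illustrative output on toy data
```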

Title: Tag transformer
Keywords: Online user study, Structural web video recommendation, Tag cleaning, Tag transformer, Wikipedia category tree
Published in: MM'10 - Proceedings of the ACM Multimedia 2010 International Conference
Language: English
Date: 2010
Abstract: Human annotations (titles and tags) of web videos facilitate most web video applications. However, raw tags are noisy, sparse and structureless, which limits their effectiveness. In this paper, we propose a tag transformer schema to solve these problems. We first eliminate imprecise and meaningless tags with Wikipedia, and then transform the remaining tags into the Wikipedia category set to obtain a precise, complete and structured description of the tags. Our experimental results on web video categorization demonstrate the superiority of the transformed space. We also apply the tag transformer in the first study of using the Wikipedia category system to structurally recommend related videos. An online user study of the demo system suggests that our method can bring a fantastic experience to web users.
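
A toy sketch of the two steps the abstract describes, with a hand-made title set and title-to-category mapping standing in for Wikipedia: tags that match no Wikipedia title are dropped, and the survivors are mapped onto Wikipedia categories.

```python
# Sketch of tag cleaning and tag-to-category transformation; the title set and
# the category mapping are tiny hand-made stand-ins for Wikipedia data.
WIKI_TITLES = {"Lionel Messi", "Football", "Goal"}
TITLE_TO_CATEGORIES = {
    "Lionel Messi": {"Argentine footballers", "FC Barcelona players"},
    "Football": {"Ball games", "Team sports"},
    "Goal": {"Sports terminology"},
}

def transform_tags(raw_tags):
    cleaned = [t for t in raw_tags if t in WIKI_TITLES]   # tag cleaning
    categories = set()
    for tag in cleaned:                                   # tag -> category transform
        categories |= TITLE_TO_CATEGORIES.get(tag, set())
    return cleaned, categories

raw = ["Lionel Messi", "wooo", "best4ever", "Football"]
print(transform_tags(raw))
```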

Title: Top-down and bottom-up: A combined approach to slot filling
Keywords: Information extraction, Question answering, Slot Filling
Published in: Lecture Notes in Computer Science
Language: English
Date: 2010
Abstract: The slot filling task requires a system to automatically distill information from a large document collection and return answers for a query entity with specified attributes ('slots'), which can then be used to expand Wikipedia infoboxes. We describe two bottom-up Information Extraction style pipelines and a top-down Question Answering style pipeline to address this task. We propose several novel approaches to enhance these pipelines, including statistical answer re-ranking and Markov Logic Network-based cross-slot reasoning. We demonstrate that our system achieves state-of-the-art performance, with 3.1% higher precision and 2.6% higher recall compared with the best system in the KBP2009 evaluation.
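
The statistical re-ranking and the MLN reasoning are not reproduced here; the snippet below only illustrates the simplest form of combining candidates from several pipelines for one slot, a weighted vote with made-up pipeline weights and scores.

```python
# Toy combination of candidate slot fillers from several pipelines: each
# pipeline votes with a confidence score and candidates are ranked by a
# weighted sum (nothing like the paper's MLN-based reasoning).
from collections import defaultdict

PIPELINE_WEIGHTS = {"ie_pattern": 0.4, "ie_classifier": 0.3, "qa": 0.3}

candidates = {   # hypothetical candidates for one slot of one query entity
    "ie_pattern":    [("Honolulu", 0.9), ("Chicago", 0.2)],
    "ie_classifier": [("Honolulu", 0.7)],
    "qa":            [("Chicago", 0.6), ("Honolulu", 0.5)],
}

scores = defaultdict(float)
for pipeline, answers in candidates.items():
    for answer, confidence in answers:
        scores[answer] += PIPELINE_WEIGHTS[pipeline] * confidence

ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)   # best-scoring slot filler first
```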

Title: Mining multilingual topics from Wikipedia
Language: English
Date: 2009
Abstract: In this paper, we try to leverage a large-scale and multilingual knowledge base, Wikipedia, to help effectively analyze and organize Web information written in different languages. Based on the observation that one Wikipedia concept may be described by articles in different languages, we adapt an existing topic modeling algorithm to mine multilingual topics from this knowledge base. The extracted 'universal' topics have multiple types of representations, with each type corresponding to one language. Accordingly, new documents of different languages can be represented in a space using a group of universal topics, which makes various multilingual Web applications feasible.

Title: Understanding user's query intent with Wikipedia
Keywords: Query classification, Query intent, User intent, Wikipedia
Published in: WWW'09 - Proceedings of the 18th International World Wide Web Conference
Language: English
Date: 2009
Abstract: Understanding the intent behind a user's query can help a search engine automatically route the query to corresponding vertical search engines to obtain particularly relevant content, thus greatly improving user satisfaction. There are three major challenges to the query intent classification problem: (1) intent representation; (2) domain coverage; and (3) semantic interpretation. Current approaches to predicting the user's intent mainly utilize machine learning techniques. However, it is difficult, and often requires substantial human effort, to meet all these challenges with statistical machine learning approaches. In this paper, we propose a general methodology for query intent classification. With very little human effort, our method can discover large quantities of intent concepts by leveraging Wikipedia, one of the best human knowledge bases. Wikipedia concepts are used as the intent representation space; thus, each intent domain is represented as a set of Wikipedia articles and categories. The intent of any input query is identified by mapping the query into the Wikipedia representation space. Compared with previous approaches, our proposed method achieves much better coverage when classifying queries in an intent domain, even though the number of seed intent examples is very small. Moreover, the method is very general and can be easily applied to various intent domains. We demonstrate the effectiveness of this method in three different applications, i.e., travel, job, and person name. In each of the three cases, only a couple of seed intent queries are provided. We perform quantitative evaluations in comparison with two baseline methods, and the experimental results show that our method significantly outperforms other approaches in each intent domain.
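
A rough sketch of the mapping idea, with a few keyword lists standing in for the sets of Wikipedia articles and categories that represent each intent domain: the query and the domains share a TF-IDF space, and the query is assigned to the most similar domain.

```python
# Sketch: represent each intent domain by stand-in Wikipedia text, represent
# the query in the same TF-IDF space, and pick the most similar domain.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

DOMAIN_DOCS = {   # hand-made stand-ins for Wikipedia articles/categories per domain
    "travel": "flight hotel airline airport tourism itinerary booking",
    "job":    "employment resume salary recruiter hiring career vacancy",
    "person": "biography born actor politician scientist career life",
}

domains = list(DOMAIN_DOCS)
vec = TfidfVectorizer()
domain_matrix = vec.fit_transform([DOMAIN_DOCS[d] for d in domains])

def classify_intent(query):
    sims = cosine_similarity(vec.transform([query]), domain_matrix)[0]
    return max(zip(domains, sims), key=lambda pair: pair[1])

print(classify_intent("cheap flight and hotel to rome"))
print(classify_intent("software engineer salary and hiring process"))
```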

Title: Understanding user's query intent with Wikipedia
Keywords: Query classification, Query intent, User intent, Wikipedia
Published in: World Wide Web
Language: English
Date: 2009

Title: Using Wikipedia knowledge to improve text classification
Keywords: Text classification, Thesaurus, Wikipedia
Published in: Knowl. Inf. Syst.
Language: English
Date: 2009
Abstract: Text classification has been widely used to assist users with the discovery of useful information from the Internet. However, traditional classification methods are based on the "Bag of Words" (BOW) representation, which only accounts for term frequency in the documents and ignores important semantic relationships between key terms. To overcome this problem, previous work attempted to enrich text representation by means of manual intervention or automatic document expansion. The achieved improvement is unfortunately very limited, due to the poor coverage of the dictionary and the ineffectiveness of term expansion. In this paper, we automatically construct a thesaurus of concepts from Wikipedia. We then introduce a unified framework to expand the BOW representation with semantic relations (synonymy, hyponymy, and associative relations), and demonstrate its efficacy in enhancing previous approaches for text classification. Experimental results on several data sets show that the proposed approach, integrated with the thesaurus built from Wikipedia, can achieve significant improvements with respect to the baseline algorithm.
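
A minimal sketch of the expansion step, assuming a tiny hand-made thesaurus in place of the one mined from Wikipedia: each document's bag of words is augmented with related concepts before a standard classifier is trained.

```python
# Sketch: augment a document's bag of words with related concepts drawn from a
# toy thesaurus (standing in for one mined from Wikipedia), then classify.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

THESAURUS = {            # term -> related concepts (hand-made stand-in)
    "ipod": ["portable media player", "apple inc"],
    "mozart": ["composer", "classical music"],
    "symphony": ["classical music", "orchestra"],
}

def expand(text):
    extra = [c for term in text.lower().split() for c in THESAURUS.get(term, [])]
    return text + " " + " ".join(extra)

train = ["ipod battery replacement", "mozart symphony recording"]
labels = ["technology", "music"]
test = ["new ipod release", "symphony concert tonight"]

vec = TfidfVectorizer()
X_train = vec.fit_transform([expand(d) for d in train])
clf = MultinomialNB().fit(X_train, labels)
print(clf.predict(vec.transform([expand(d) for d in test])))
```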

Title: Enhancing text clustering by leveraging Wikipedia semantics
Language: English
Date: 2008
Abstract: Most traditional text clustering methods are based on the "bag of words" (BOW) representation, which relies on frequency statistics in a set of documents. BOW, however, ignores important information about the semantic relationships between key terms. To overcome this problem, several methods have been proposed to enrich text representation with external resources, such as WordNet. However, many of these approaches suffer from limitations: (1) WordNet has limited coverage and lacks an effective word-sense disambiguation ability; (2) most text representation enrichment strategies, which append or replace document terms with their hypernyms and synonyms, are overly simple. In this paper, to overcome these deficiencies, we first propose a way to build a concept thesaurus based on the semantic relations (synonymy, hypernymy, and associative relations) extracted from Wikipedia. Then, we develop a unified framework to leverage these semantic relations in order to enhance the traditional content similarity measure for text clustering. Experimental results on the Reuters and OHSUMED datasets show that, with the help of the Wikipedia thesaurus, the clustering performance of our method is improved compared to previous methods. In addition, with optimized weights for hypernym, synonym, and associative concepts, tuned with the help of a small amount of user-provided labeled data, the clustering performance can be further improved.
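
A sketch of the general idea rather than the paper's exact similarity measure: plain TF-IDF cosine similarity is blended with a concept-overlap similarity derived from a toy thesaurus, and the combined matrix is clustered.

```python
# Sketch: combine TF-IDF cosine similarity with a concept-overlap similarity
# from a toy Wikipedia-style mapping, then cluster the combined matrix.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import SpectralClustering

docs = ["ipod and iphone sales", "apple launches new ipod",
        "mozart symphony concert", "orchestra plays mozart"]
CONCEPTS = {"ipod": "Apple Inc", "iphone": "Apple Inc", "apple": "Apple Inc",
            "mozart": "Classical music", "symphony": "Classical music",
            "orchestra": "Classical music"}

def concepts(doc):
    return {CONCEPTS[w] for w in doc.lower().split() if w in CONCEPTS}

tfidf_sim = cosine_similarity(TfidfVectorizer().fit_transform(docs))
concept_sim = np.array([[1.0 if concepts(a) & concepts(b) else 0.0 for b in docs] for a in docs])
# Equal weights here; in practice the weights would be tuned. The small
# constant keeps the similarity graph connected for spectral clustering.
combined = 0.5 * tfidf_sim + 0.5 * concept_sim + 0.01

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(combined)
print(list(zip(docs, labels)))
```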

Title: Improving text classification by using encyclopedia knowledge
Published in: Proceedings - IEEE International Conference on Data Mining, ICDM
Language: English
Date: 2007
Abstract: The exponential growth of text documents available on the Internet has created an urgent need for accurate, fast, and general-purpose text classification algorithms. However, the "bag of words" representation used by these classification methods is often unsatisfactory, as it ignores relationships between important terms that do not co-occur literally. To deal with this problem, we integrate background knowledge - in our application, Wikipedia - into the process of classifying text documents. The experimental evaluation on Reuters newsfeeds and several other corpora shows that our classification results with encyclopedia knowledge are much better than those of the baseline "bag of words" methods.