Jian Hu

From WikiPapers

Jian Hu is an author.


Only those publications related to wikis are shown here.
Title Keyword(s) Published in Language Date Abstract R C
Cross lingual text classification by mining multilingual topics from Wikipedia Cross lingual text classification
Topic modeling
Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011 English 2011 This paper investigates how to effectively do cross lingual text classification by leveraging a large scale and multilingual knowledge base, Wikipedia. Based on the observation that each Wikipedia concept is described by documents of different languages, we adapt existing topic modeling algorithms for mining multilingual topics from this knowledge base. The extracted topics have multiple types of representations, with each type corresponding to one language. In this work, we regard such topics extracted from Wikipedia documents as universal-topics, since each topic corresponds to the same semantic information in different languages. Thus new documents of different languages can be represented in a space using a group of universal-topics. We use these universal-topics to do cross lingual text classification. Given the training data labeled for one language, we can train a text classifier to classify the documents of another language by mapping all documents of both languages into the universal-topic space. This approach does not require any additional linguistic resources, like bilingual dictionaries, machine translation tools, or labeling data for the target language. The evaluation results indicate that our topic modeling approach is effective for building a cross lingual text classifier. Copyright 2011 ACM. 0 0
The ontology of microbial phenotypes (OMP): A precomposed ontology based on cross products from multiple external ontologies that is used for guiding microbial phenotype annotation Annotation capture
Aristotelian definition
Bacterial phenotype
Cross product
Escherichia coli
Microbial phenotype
Phenotype annotation
CEUR Workshop Proceedings English 2011 The Ontology of Microbial Phenotypes (OMP) is being developed to standardize capture of phenotypic information, including both processes and physical characteristics, from microbes. The OMP team comprises ontologists, microbiologists, and annotators, and ontology development is being performed in conjunction with the development of a wiki designed for annotation capture. Term development is being guided by following, to as great an extent as possible, the structure of existing ontologies. All OMP terms have Aristotelian definitions, and, when appropriate, they have genus-differentia cross products composed of terms from external ontologies. Initially, OMP is being used to annotate the prokaryotic model organism Escherichia coli. Eventually we anticipate that diverse user groups will employ OMP for standardized annotation of various microbial phenotypes, much in the same way that the Gene Ontology has standardized the annotation of gene products. Definitions of phenotypes and links to the original literature will facilitate the experimental characterization of phenotypes. 0 0
Visualizing revisions and building semantic network in Wikipedia Semantic Network
Proceedings - 2011 International Conference on Cloud and Service Computing, CSC 2011 English 2011 Wikipedia, one of the largest online encyclopedias, is comparable to Britannica. Articles are subject to day-to-day changes by authors, and each such change is recorded as a new revision. In this paper, we visualize an article's revisions and build a semantic network between articles. First, we analyze the differences between an article's revisions and use color to show the changes. Second, using the articles' classification information, we construct a semantic network of the relationships between articles. 0 0
Mining multilingual topics from Wikipedia English 2009 In this paper, we try to leverage a large-scale and multilingual knowledge base, Wikipedia, to help effectively analyze and organize Web information written in different languages. Based on the observation that one Wikipedia concept may be described by articles in different languages, we adapt an existing topic modeling algorithm for mining multilingual topics from this knowledge base. The extracted 'universal' topics have multiple types of representations, with each type corresponding to one language. Accordingly, new documents of different languages can be represented in a space using a group of universal topics, which makes various multilingual Web applications feasible. 0 0
Understanding user's query intent with Wikipedia Query classification
Query intent
User intent
WWW'09 - Proceedings of the 18th International World Wide Web Conference English 2009 Understanding the intent behind a user's query can help a search engine automatically route the query to the corresponding vertical search engines to obtain particularly relevant content, thus greatly improving user satisfaction. There are three major challenges to the query intent classification problem: (1) intent representation; (2) domain coverage; and (3) semantic interpretation. Current approaches to predicting the user's intent mainly utilize machine learning techniques. However, it is difficult and often requires much human effort to meet all these challenges with statistical machine learning approaches. In this paper, we propose a general methodology for the problem of query intent classification. With very little human effort, our method can discover large quantities of intent concepts by leveraging Wikipedia, one of the best human knowledge bases. The Wikipedia concepts are used as the intent representation space; thus, each intent domain is represented as a set of Wikipedia articles and categories. The intent of any input query is identified by mapping the query into the Wikipedia representation space. Compared with previous approaches, our proposed method can achieve much better coverage when classifying queries in an intent domain even though the number of seed intent examples is very small. Moreover, the method is very general and can be easily applied to various intent domains. We demonstrate the effectiveness of this method in three different applications, i.e., travel, job, and person name. In each of the three cases, only a couple of seed intent queries are provided. We perform quantitative evaluations in comparison with two baseline methods, and the experimental results show that our method significantly outperforms other approaches in each intent domain. Copyright is held by the International World Wide Web Conference Committee (IW3C2). 0 0
Understanding user's query intent with wikipedia Query classification
Query intent
User intent
World Wide Web English 2009 0 0
Using Wikipedia knowledge to improve text classification Text classification
Knowl. Inf. Syst. English 2009 Text classification has been widely used to assist users with the discovery of useful information from the Internet. However, traditional classification methods are based on the "Bag of Words" (BOW) representation, which only accounts for term frequency in the documents, and ignores important semantic relationships between key terms. To overcome this problem, previous work attempted to enrich text representation by means of manual intervention or automatic document expansion. The achieved improvement is unfortunately very limited, due to the poor coverage capability of the dictionary, and to the ineffectiveness of term expansion. In this paper, we automatically construct a thesaurus of concepts from Wikipedia. We then introduce a unified framework to expand the BOW representation with semantic relations (synonymy, hyponymy, and associative relations), and demonstrate its efficacy in enhancing previous approaches for text classification. Experimental results on several data sets show that the proposed approach, integrated with the thesaurus built from Wikipedia, can achieve significant improvements with respect to the baseline algorithm. 0 0
Enhancing text clustering by leveraging Wikipedia semantics English 2008 Most traditional text clustering methods are based on the "bag of words" (BOW) representation, which relies on frequency statistics in a set of documents. BOW, however, ignores the important information on the semantic relationships between key terms. To overcome this problem, several methods have been proposed in the past to enrich text representation with external resources, such as WordNet. However, many of these approaches suffer from some limitations: 1) WordNet has limited coverage and lacks effective word-sense disambiguation ability; 2) most of the text representation enrichment strategies, which append or replace document terms with their hypernyms and synonyms, are overly simple. In this paper, to overcome these deficiencies, we first propose a way to build a concept thesaurus based on the semantic relations (synonym, hypernym, and associative relation) extracted from Wikipedia. Then, we develop a unified framework to leverage these semantic relations in order to enhance the traditional content similarity measure for text clustering. The experimental results on the Reuters and OHSUMED datasets show that, with the help of the Wikipedia thesaurus, the clustering performance of our method is improved as compared to previous methods. In addition, with optimized weights for hypernym, synonym, and associative concepts, tuned with the help of a small amount of user-provided labeled data, the clustering performance can be further improved. 0 0
Using Wikipedia for Co-clustering Based Cross-Domain Text Classification Cross-domain text classification
Transfer learning
ICDM English 2008 0 0
Improving text classification by using encyclopedia knowledge Proceedings - IEEE International Conference on Data Mining, ICDM English 2007 The exponential growth of text documents available on the Internet has created an urgent need for accurate, fast, and general purpose text classification algorithms. However, the "bag of words" representation used for these classification methods is often unsatisfactory as it ignores relationships between important terms that do not co-occur literally. In order to deal with this problem, we integrate background knowledge - in our application: Wikipedia - into the process of classifying text documents. The experimental evaluation on Reuters newsfeeds and several other corpora shows that our classification results with encyclopedia knowledge are much better than the baseline "bag of words" methods. 0 0