Haofen Wang


Haofen Wang is an author.

Publications

Only those publications related to wikis are shown here.

Improving semi-supervised text classification by using Wikipedia knowledge
Keywords: Clustering Based Classification; Semi-supervised Text Classification; Wikipedia
Published in: Lecture Notes in Computer Science (English, 2013)
Abstract: Semi-supervised text classification uses both labeled and unlabeled data to construct classifiers. The key issue is how to utilize the unlabeled data. Clustering-based classification outperforms other semi-supervised text classification algorithms, but its gains are still limited because the vector space model representation largely ignores the semantic relationships between words. In this paper, we propose a new approach that addresses this problem by using Wikipedia knowledge. We enrich the document representation with Wikipedia semantic features (concepts and categories), propose a new similarity measure based on the semantic relevance between Wikipedia features, and apply this similarity measure to clustering-based classification. Experimental results on several corpora show that the proposed method effectively improves semi-supervised text classification performance.
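
A minimal sketch of the kind of concept-enriched similarity this abstract describes, assuming per-document Wikipedia concept annotations are already available; the blending weight and all data below are illustrative, not the paper's actual feature extraction or weighting:

    # Blend word-level and Wikipedia-concept-level similarity (illustrative).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def concept_jaccard(c1, c2):
        """Overlap of Wikipedia concept annotations (assumed given per document)."""
        if not c1 or not c2:
            return 0.0
        return len(c1 & c2) / len(c1 | c2)

    def combined_similarity(docs, concepts, alpha=0.5):
        """alpha blends term similarity and concept similarity (a free parameter)."""
        tfidf = TfidfVectorizer().fit_transform(docs)
        word_sim = cosine_similarity(tfidf)
        sim = word_sim.copy()
        for i in range(len(docs)):
            for j in range(len(docs)):
                sim[i, j] = alpha * word_sim[i, j] + \
                            (1 - alpha) * concept_jaccard(concepts[i], concepts[j])
        return sim

    docs = ["the jaguar runs fast", "big cats hunt at night", "the new jaguar car model"]
    concepts = [{"Jaguar (animal)"}, {"Felidae"}, {"Jaguar Cars"}]
    print(combined_similarity(docs, concepts, alpha=0.5))

Such a blended matrix can then feed any clustering-based classifier in place of plain TF-IDF cosine similarity.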

A semantic approach to recommending text advertisements for images
Keywords: Crossmedia mining; Semantic matching; Visual contextual advertising
Published in: RecSys '12 - Proceedings of the 6th ACM Conference on Recommender Systems (English, 2012)
Abstract: In recent years, more and more images have been uploaded and published on the Web. Along with text Web pages, images have become an important medium for placing relevant advertisements. Visual contextual advertising, a young research area, refers to finding relevant text advertisements for a target image without any textual information (e.g., tags). There are two existing approaches: advertisement search based on image annotation and, more recently, advertisement matching based on feature translation between images and texts. However, the state of the art fails to achieve satisfactory results because recommended advertisements are syntactically matched but semantically mismatched. In this paper, we propose a semantic approach to improving the performance of visual contextual advertising. More specifically, we exploit a large high-quality image knowledge base (ImageNet) and a widely used text knowledge base (Wikipedia) to build a bridge between target images and advertisements. The image-advertisement match is built by mapping images and advertisements into the respective knowledge bases and then finding semantic matches between the two knowledge bases. The experimental results show that semantic matching significantly outperforms syntactic matching on test images from Flickr. We also show that our approach gives a large improvement of 16.4% in the precision of the top 10 matches over previous work, recommending more semantically relevant advertisements.
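
A toy sketch of the image-advertisement semantic match idea: image concepts (e.g., ImageNet labels) and ad concepts are compared through a shared concept hierarchy rather than by surface words. The hand-made hierarchy below is an assumption for illustration, not the paper's knowledge bases:

    # Score image/ad concept pairs by ancestor overlap in a tiny toy hierarchy.
    ancestors = {
        "golden retriever": {"dog", "animal"},
        "dog food":         {"dog", "pet supplies"},
        "car wax":          {"car", "automotive"},
    }

    def semantic_match(image_concept, ad_concept):
        a = ancestors.get(image_concept, set()) | {image_concept}
        b = ancestors.get(ad_concept, set()) | {ad_concept}
        return len(a & b) / len(a | b)

    print(semantic_match("golden retriever", "dog food"))  # shares "dog" -> nonzero
    print(semantic_match("golden retriever", "car wax"))   # no overlap -> 0.0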

Bricking Semantic Wikipedia by relation population and predicate suggestion
Keywords: Predicate suggestion; Relation classification; Relation population; Semantic Wikipedia
Published in: Web Intelligence and Agent Systems (English, 2012)
Abstract: Semantic Wikipedia aims to enhance Wikipedia by adding explicit semantics to links between Wikipedia entities. However, we have observed that it currently suffers from two limitations: a lack of semantic annotations and a lack of semantic annotators. In this paper, we resort to relation population, which automatically extracts relations between entity pairs to enrich the semantic data, and predicate suggestion, which recommends proper relation labels to facilitate semantic annotation. Both tasks leverage relation classification, which classifies extracted relation instances into predefined relations. However, due to the lack of labeled data and the abundance of noise in Semantic Wikipedia, existing approaches cannot be directly applied to these tasks to obtain high-quality annotations. To tackle these problems, we use a label propagation algorithm and exploit semantic features, such as domain and range constraints on categories, as well as linguistic features, such as dependency trees of context sentences in Wikipedia articles. Experimental results on 7 typical relation types show the effectiveness and efficiency of our approach on both tasks.
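
A minimal sketch of semi-supervised relation classification by label propagation, the core technique this abstract names. The paper's semantic (domain/range) and dependency-tree features are not reproduced; the clustered random vectors below are placeholders:

    # Propagate a handful of seed relation labels to unlabeled instances.
    import numpy as np
    from sklearn.semi_supervised import LabelSpreading

    rng = np.random.default_rng(0)
    # Placeholder feature vectors: two loose clusters of relation instances.
    X = np.vstack([rng.normal(-2, 1, size=(50, 8)),
                   rng.normal(2, 1, size=(50, 8))])
    y = np.full(100, -1)          # -1 marks unlabeled instances
    y[:3] = 0                     # a few seeds for relation type 0
    y[50:53] = 1                  # a few seeds for relation type 1

    model = LabelSpreading(kernel="rbf", gamma=0.5)
    model.fit(X, y)
    print(model.transduction_[:10], model.transduction_[50:60])  # propagated labels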

EachWiki: Facilitating wiki authoring by annotation suggestion
Keywords: Category suggestion; Link suggestion; Semantic relation suggestion
Published in: ACM Transactions on Intelligent Systems and Technology (English, 2012)
Abstract: Wikipedia, one of the best-known wikis and the world's largest free online encyclopedia, has embraced the power of collaborative editing to harness collective intelligence. However, using such a wiki to create high-quality articles is not as easy as people imagine, given, for instance, the difficulty of reusing knowledge already available in Wikipedia. As a result, the heavy burden of building up and maintaining the ever-growing online encyclopedia still rests on a small group of people. In this article, we aim to facilitate wiki authoring by providing annotation recommendations, thus lightening the burden on both contributors and administrators. We leverage the collective wisdom of the users by exploiting Semantic Web technologies with Wikipedia data, and adopt a unified algorithm to support link, category, and semantic relation recommendation. A prototype system named EachWiki is proposed and evaluated. The experimental results show that it achieves considerable improvements in effectiveness, efficiency, and usability. The proposed approach can also be applied to other wiki-based collaborative editing systems.

Infinite topic modelling for trend tracking: hierarchical Dirichlet process approaches with a Wikipedia semantic based method
Keywords: Hierarchical Dirichlet process; News; Temporal analysis; Topic modelling; Wikipedia
Published in: KDIR 2012 - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (English, 2012)
Abstract: The current affairs that people follow closely vary over time, and the evolution of trends is reflected in media reports. This paper considers tracking trends by combining non-parametric Bayesian approaches with temporal information, and presents two topic modelling methods. One utilizes an infinite temporal topic model that obtains the topic distribution over time by placing a time prior when discovering topics dynamically. To better organize event trends, we present a second, progressive superposed topic model, which simulates the whole evolutionary process of topics, including the generation of new topics, the evolution of stable topics, and the disappearance of old topics, via a series of superposed topic distributions generated by a hierarchical Dirichlet process. Both approaches aim at solving the real-world task while avoiding the Markov assumption and removing the fixed limit on the number of topics. Meanwhile, we employ Wikipedia-based semantic background knowledge to improve the discovered topics and their readability. The experiments are carried out on a corpus of BBC news about the American Forum. The results demonstrate better organized topics, evolutionary processes of topics over time, and the effectiveness of the models.
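
A minimal sketch of the non-parametric building block this abstract names, using gensim's hierarchical Dirichlet process implementation; the paper's temporal priors, topic superposition, and Wikipedia-based readability improvements are not reproduced:

    # HDP infers the number of topics from data instead of fixing it upfront.
    from gensim.corpora import Dictionary
    from gensim.models import HdpModel

    texts = [["election", "vote", "senate"],
             ["storm", "flood", "weather"],
             ["vote", "campaign", "election"],
             ["rain", "storm", "forecast"]]
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    hdp = HdpModel(corpus, id2word=dictionary)   # no fixed topic count
    for topic in hdp.print_topics(num_topics=3, num_words=3):
        print(topic)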

Model and simulation of collective collaborative article editing in Wikipedia based on CAS theory
Keywords: Collective collaboration article edit; Complex adaptive system (CAS); Golden section; Model and simulation; Web collective intelligence; Wikipedia
Published in: Shanghai Ligong Daxue Xuebao / Journal of University of Shanghai for Science and Technology (Chinese, 2012)
Abstract: Wikipedia users were classified into five agent types: content creator, content modifier, content cleaner, diverse editor, and content visitor, and a collective collaborative article editing model was established based on complex adaptive system theory. A multi-agent simulation of collective collaborative article editing was implemented in NetLogo, based on configurations of agent appearance probabilities for articles of different quality. Simulation results show that the diverse editor is an important driving force for the improvement of article quality: the higher the appearance probability of diverse editors, the higher the article quality. The self-modifying behavior of editors plays an important role in promoting article quality. When the configuration of agent appearance probabilities follows the golden section law, collective performance tends toward its maximum. There exists a process from seesaw-like complementarity to dynamic balance between word quantity and word meaning; the critical point of balance is close to the golden section point, and the evolution of article editing follows the golden section law. This research deepens knowledge of article editing evolution, web collective intelligence, and social computing.
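
A toy sketch of the five-agent editing model described above. The appearance probabilities and per-agent quality effects below are illustrative placeholders, not the paper's calibrated NetLogo configuration:

    # Simulate article quality as a stream of edits by five agent types.
    import random

    AGENTS = {"creator": 0.20, "modifier": 0.38, "cleaner": 0.10,
              "diverse_editor": 0.24, "visitor": 0.08}   # assumed probabilities
    EFFECT = {"creator": 0.5, "modifier": 0.3, "cleaner": 0.1,
              "diverse_editor": 0.8, "visitor": 0.0}     # assumed quality gains

    def simulate(steps=1000, seed=42):
        random.seed(seed)
        quality = 0.0
        names, weights = zip(*AGENTS.items())
        for _ in range(steps):
            agent = random.choices(names, weights=weights)[0]
            quality += EFFECT[agent]
        return quality / steps

    print(f"mean per-edit quality gain: {simulate():.3f}")

Raising the diverse-editor probability in AGENTS raises the mean gain, mirroring the paper's qualitative finding.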

Towards better understanding and utilizing relations in DBpedia
Keywords: DBpedia; Relation understanding; Relation utilization
Published in: Web Intelligence and Agent Systems (English, 2012)
Abstract: This paper is concerned with understanding the relations in automatically extracted semantic datasets such as DBpedia and utilizing them in semantic queries such as SPARQL. Although DBpedia has achieved great success in supporting convenient navigation and complex queries over the semantic data extracted from Wikipedia, the browsing mechanism and the organization of the relations in the extracted data are far from satisfactory. Some relations have anomalous names and are hard to understand, even for experts, from the relation names alone; there also exist synonymous and polysemous relations, which may cause incomplete or noisy query results. In this paper, we propose to solve these problems by 1) exploiting the Wikipedia category system to facilitate relation understanding and query constraint selection, and 2) exploring various relation representation models for similar-, super-, and sub-relation detection to help users select proper relations in their queries. A prototype system has been implemented, and extensive experiments illustrate the effectiveness of the proposed approach.
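
A minimal sketch of one plausible relation representation model for synonym detection, in the spirit of point 2 above: each relation is represented by the distribution of subject/object category pairs it connects, and near-duplicates surface via cosine similarity. The counts below are made up for illustration:

    # Compare relations by the category pairs of the entities they connect.
    from collections import Counter

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
        na = sum(v * v for v in a.values()) ** 0.5
        nb = sum(v * v for v in b.values()) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    rel_repr = {
        "birthPlace":   Counter({("Person", "City"): 90, ("Person", "Country"): 10}),
        "placeOfBirth": Counter({("Person", "City"): 80, ("Person", "Country"): 20}),
        "deathPlace":   Counter({("Person", "City"): 60, ("Person", "Country"): 40}),
    }
    # High similarity flags likely synonyms to merge or suggest together.
    print(cosine(rel_repr["birthPlace"], rel_repr["placeOfBirth"]))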

Lightweight integration of IR and DB for scalable hybrid search with integrated ranking support
Keywords: Hybrid search; Inverted index; IR and DB integration; Ranking; Scalable query processing
Published in: Journal of Web Semantics (English, 2011)
Abstract: The Web contains a large number of documents and an increasing quantity of structured data in the form of RDF triples. Many of these triples are annotations associated with documents. While structured queries constitute the principal means of retrieving structured data, keyword queries are typically used for document retrieval. Clearly, a form of hybrid search that seamlessly integrates these formalisms to query both textual and structured data can address more complex information needs. However, hybrid search in the large-scale Web environment faces several challenges. First, there is a need for repositories that can store and index a large amount of semantic data as well as textual data in documents, and manage them in an integrated way. Second, methods for hybrid query answering are needed to exploit the data in such an integrated repository. These methods should be fast and scalable and, in particular, should support flexible ranking schemes that return not all results but only the most relevant ones. In this paper, we present CE2, an integrated solution that leverages mature information retrieval and database technologies to support large-scale hybrid search. For scalable and integrated data management, CE2 combines off-the-shelf database solutions with inverted indexes. Efficient hybrid query processing is supported through novel data structures and algorithms that allow advanced ranking schemes to be tightly integrated. Furthermore, a concrete ranking scheme is proposed that takes features of both textual and structured data into account. Experiments conducted on DBpedia and Wikipedia show that CE2 provides good performance in terms of both effectiveness and efficiency.
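
A minimal sketch of hybrid query answering in the spirit of CE2: an inverted index answers the keyword part, a structured store answers the triple part, and the result sets are intersected. The data and the naive intersection are illustrative, not the system's actual design:

    # Intersect keyword hits (inverted index) with structured hits (triples).
    from collections import defaultdict

    docs = {1: "semantic web search engine", 2: "relational database indexing",
            3: "hybrid search over rdf and text"}
    triples = [(1, "type", "Paper"), (2, "type", "Paper"), (3, "type", "Demo")]

    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            inverted[term].add(doc_id)

    def hybrid_query(keywords, predicate, obj):
        text_hits = set.intersection(*(inverted[k] for k in keywords))
        struct_hits = {s for s, p, o in triples if p == predicate and o == obj}
        return sorted(text_hits & struct_hits)

    print(hybrid_query(["search"], "type", "Demo"))   # -> [3]

A real engine would merge scored posting lists rather than materialize full sets, which is where the integrated ranking support comes in.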

Towards effective short text deep classification
Keywords: Classification; Large scale hierarchy; Short text
Published in: SIGIR '11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (English, 2011)
Abstract: Recently, more and more short texts (e.g., ads, tweets) have appeared on the Web. Classifying short texts into a large taxonomy such as ODP or the Wikipedia category system has become an important mining task for improving the performance of many applications, such as contextual advertising and topic detection for micro-blogging. In this paper, we propose a novel multi-stage classification approach to solve the problem. First, explicit semantic analysis is used to add features to both short texts and categories. Second, we leverage information retrieval technologies to fetch the most relevant categories for an input short text from thousands of candidates. Finally, an SVM classifier is applied to only a few selected categories to return the final answer. Our experimental results show that the proposed method achieves significant improvements in classification accuracy compared with several existing state-of-the-art approaches.
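
A minimal sketch of the second stage of this pipeline, IR-style retrieval of candidate categories; the ESA enrichment and the final SVM stage are not reproduced, and the toy category descriptions below stand in for enriched representations:

    # Stage 2: retrieve the top-k candidate categories for a short text.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    categories = {
        "Sports":     "football basketball match team score league",
        "Politics":   "election vote government senate policy",
        "Technology": "software computer internet startup app",
    }

    vec = TfidfVectorizer()
    cat_names = list(categories)
    cat_matrix = vec.fit_transform(categories.values())

    def top_candidates(short_text, k=2):
        sims = cosine_similarity(vec.transform([short_text]), cat_matrix)[0]
        return sorted(zip(cat_names, sims), key=lambda x: -x[1])[:k]

    print(top_candidates("senate passes new internet policy"))

Only the k shortlisted categories would then be passed to the SVM, keeping the final classification cheap even with thousands of categories overall.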

Zhishi.me - Weaving Chinese linking open data
Published in: Lecture Notes in Computer Science (English, 2011)
Abstract: Linking Open Data (LOD) has become one of the most important community efforts to publish high-quality interconnected semantic data. Such data has been widely used in many applications to provide intelligent services such as entity search and personalized recommendation. While DBpedia, one of the LOD core data sources, contains resources described in multiple languages and semantic data in English is proliferating, there has been very little work on publishing Chinese semantic data. In this paper, we present Zhishi.me, the first effort to publish large-scale Chinese semantic data and link it together as a Chinese LOD (CLOD). More precisely, we identify important structural features of the three largest Chinese encyclopedia sites (Baidu Baike, Hudong Baike, and Chinese Wikipedia) for extraction, and propose several data-level mapping strategies for automatic link discovery. As a result, the CLOD contains more than 5 million distinct entities, and we link the CLOD to the existing LOD based on the multilingual nature of Wikipedia. Finally, we introduce three Web access entry points, namely a SPARQL endpoint, a lookup interface, and a detailed data view, which conform to the principles of publishing data sources to the LOD.
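
A toy sketch of data-level link discovery between encyclopedia sources, in the spirit of the mapping strategies mentioned above; real Zhishi.me mappings are richer than this label-normalization check:

    # Link entities across two sources when their normalized labels coincide.
    def normalize(label: str) -> str:
        return label.strip().lower().replace(" ", "")

    source_a = {"a1": "Shanghai", "a2": "Great Wall"}
    source_b = {"b1": "shanghai", "b2": "Yangtze River"}

    links = [(ia, ib) for ia, la in source_a.items()
                      for ib, lb in source_b.items()
                      if normalize(la) == normalize(lb)]
    print(links)   # -> [('a1', 'b1')]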

Wikipedia2Onto - Building concept ontology automatically, experimenting with web image retrieval
Keywords: Ontology; Semantic concept; Web image classification; Wikipedia
Published in: Informatica (Ljubljana) (English, 2010)
Abstract: Given its effectiveness for better understanding data, ontology has been used in various domains, including artificial intelligence, biomedical informatics, and library science. What we have tried to promote is the use of ontology to better understand media (in particular, images) on the World Wide Web. This paper describes our preliminary attempt to construct a large-scale multi-modality ontology, called AutoMMOnto, for web image classification. In particular, to automate text ontology construction, we take advantage of both structural and content features of Wikipedia and formalize real-world objects in terms of concepts and relationships. For the visual part, we train classifiers on both global and local features and generate mid-level concepts from the training images. A variant of the association rule mining algorithm is further developed to refine the built ontology. Our experimental results show that our method allows automatic construction of a large-scale multi-modality ontology with high accuracy from a challenging web image dataset.

CE2 - Towards a large scale hybrid search engine with integrated ranking support
Keywords: Annotations; Hybrid search; Ranking; Scalable storage
Published in: International Conference on Information and Knowledge Management, Proceedings (English, 2008)
Abstract: The Web contains a large number of documents and, increasingly, semantic data in the form of RDF triples. Many of these triples are annotations associated with documents. While structured queries are the principal means of retrieving semantic data, keyword queries are typically used for document retrieval. Clearly, a form of hybrid search that seamlessly integrates these formalisms to query both documents and semantic data can address more complex information needs. In this paper, we present CE2, an integrated solution that leverages mature database and information retrieval technologies to tackle the challenges of hybrid search at large scale. For scalable storage, CE2 integrates databases with inverted indices. Hybrid query processing is supported in CE2 through novel algorithms and data structures that allow advanced ranking schemes to be integrated tightly into the process. Experiments conducted on DBpedia and Wikipedia show that CE2 provides good performance in terms of both effectiveness and efficiency.

Catriple: Extracting triples from Wikipedia categories
Published in: Lecture Notes in Computer Science (English, 2008)
Abstract: As an important step towards bootstrapping the Semantic Web, many efforts have been made to extract triples from Wikipedia because of its wide coverage, good organization, and rich knowledge. One important kind of triple concerns Wikipedia articles and their non-isa properties, e.g., (Beijing, country, China). Previous work has tried to extract such triples from Wikipedia infoboxes, article text, and categories. The infobox-based and text-based extraction methods depend on the infoboxes and suffer from low article coverage. In contrast, the category-based extraction methods exploit the widespread categories. However, they rely on predefined properties, which is effort-consuming and explores only very limited knowledge in the categories. This paper automatically extracts properties and triples from the less explored Wikipedia categories so as to achieve wider article coverage with less manual effort. We realize this goal by utilizing the syntax and semantics of super-sub category pairs in Wikipedia. Our prototype implementation outputs about 10M triples at 12 confidence levels ranging from 47.0% to 96.4%, covering 78.2% of Wikipedia articles. Among them, 1.27M triples have a confidence of 96.4%. Applications can use the triples at a suitable confidence level on demand.
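
A minimal sketch of the category-pair idea behind Catriple: a super-category naming a property ("Songs by artist") combined with a sub-category naming its value ("The Beatles songs") yields one triple per member article. The single regex pattern below is far simpler than the paper's extraction rules:

    # Derive (article, property, value) triples from one super-sub category pair.
    import re

    def triples_from_category_pair(super_cat, sub_cat, members):
        m = re.match(r"(\w+) by (\w+)", super_cat)     # e.g. "Songs by artist"
        if not m:
            return []
        noun, prop = m.group(1).lower(), m.group(2)    # "songs", "artist"
        if not sub_cat.lower().endswith(noun):
            return []
        value = sub_cat[: -len(noun)].strip()          # "The Beatles"
        return [(article, prop, value) for article in members]

    print(triples_from_category_pair(
        "Songs by artist", "The Beatles songs", ["Hey Jude", "Let It Be"]))
    # -> [('Hey Jude', 'artist', 'The Beatles'), ('Let It Be', 'artist', 'The Beatles')]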

Making More Wikipedians: Facilitating Semantics Reuse for Wikipedia Authoring
Published in: The Semantic Web (English, 2008)
Abstract: Wikipedia, a killer application of Web 2.0, has embraced the power of collaborative editing to harness collective intelligence. It can also serve as an ideal Semantic Web data source due to its abundance, influence, high quality, and good structure. However, the heavy burden of building up and maintaining such an enormous and ever-growing online encyclopedic knowledge base still rests on a very small group of people, and many casual users find it difficult to write high-quality Wikipedia articles. In this paper, we use RDF graphs to model the key elements in Wikipedia authoring and propose an integrated solution, based on RDF graph matching, to make Wikipedia authoring easier, with the aim of making more Wikipedians. Our solution facilitates semantics reuse and provides users with: 1) a link suggestion module that suggests and auto-completes internal links between Wikipedia articles; and 2) a category suggestion module that helps the user place articles in the correct categories. A prototype system is implemented, and experimental results show significant improvements over existing solutions to the link and category suggestion tasks. The proposed enhancements can be applied to attract more contributors and relieve the burden on professional editors, thus enhancing the current Wikipedia to make it an even better Semantic Web data source.

Ontology enhanced web image retrieval: Aided by Wikipedia & spreading activation theory
Keywords: Ontology; Spreading activation; Wikipedia
Published in: Proceedings of the 1st International ACM Conference on Multimedia Information Retrieval, MIR 2008, co-located with the 2008 ACM International Conference on Multimedia, MM '08 (English, 2008)
Abstract: Ontology, as an effective approach to bridging the semantic gap in various domains, has attracted considerable interest from multimedia researchers. Among the numerous possibilities enabled by ontology, we are particularly interested in exploiting ontology for a better understanding of media (particularly images) on the World Wide Web. Two open issues are inevitably involved in achieving this goal: 1) how to avoid the tedious manual work of ontology construction, and 2) what the effective inference models are when using an ontology. Recent works [11, 16] on ontology learned from Wikipedia have been reported at conferences in the areas of knowledge management and artificial intelligence, and different inference models have also been investigated [5, 13, 15]. However, so far there has been no comprehensive solution. In this paper, we look at these challenges and attempt to provide a general solution to both questions. Through a careful analysis of the online encyclopedia Wikipedia's categorization and page content, we choose it as our knowledge source and propose an automatic ontology construction approach, and we show that it is a viable way to build ontologies for various domains. To address the inference model issue, we provide a novel understanding of the ontology, considering it as a type of semantic network similar to brain models in cognitive research. Spreading activation techniques, which have been shown to be a sound information processing model for semantic networks, are consequently introduced for inference. We have implemented a prototype system with the developed solutions for web image retrieval. Through comprehensive experiments on the canine category of the animal kingdom, we show that this is a scalable architecture for our proposed methods.
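
A minimal sketch of spreading activation over a small concept graph, the inference model adopted above; the graph, uniform edge weights, and decay factor are illustrative:

    # Spread activation from seed concepts across neighbors with decay.
    def spread_activation(graph, seeds, decay=0.5, iterations=3, threshold=0.01):
        activation = dict(seeds)                  # concept -> activation level
        for _ in range(iterations):
            new = dict(activation)
            for node, value in activation.items():
                if value < threshold:
                    continue
                neighbors = graph.get(node, [])
                for neighbor in neighbors:
                    new[neighbor] = new.get(neighbor, 0.0) + value * decay / len(neighbors)
            activation = new
        return activation

    graph = {"dog": ["canine", "pet"], "canine": ["wolf", "dog"], "pet": ["cat", "dog"]}
    print(spread_activation(graph, {"dog": 1.0}))   # related concepts get activated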

EachWiki: Suggest to be an easy-to-edit wiki interface for everyone
Published in: CEUR Workshop Proceedings (English, 2007)
Abstract: In this paper, we present EachWiki, an extension of Semantic MediaWiki characterized by an intelligent suggestion mechanism. It aims to facilitate wiki authoring by recommending the following elements: links, categories, and properties. We exploit the semantics of Wikipedia data and leverage the collective wisdom of web users to provide high-quality annotation suggestions. The proposed mechanism not only improves the usability of Semantic MediaWiki but also speeds up the convergence of its terminology. The suggestions relieve the burden of wiki authoring and attract more inexperienced contributors, thus making Semantic MediaWiki an even better Semantic Web prototype and data source.

Exploit semantic information for category annotation recommendation in Wikipedia
Keywords: Collaborative annotating; Semantic features; Vector space model; Wikipedia category
Published in: Natural Language Processing and Information Systems (Lecture Notes in Computer Science) (English, 2007)
Abstract: Compared with plain-text resources, those in "semi-semantic" web sites such as Wikipedia contain high-level semantic information that benefits various automatic annotation tasks on the resources themselves. In this paper, we propose a "collaborative annotating" approach that automatically recommends categories for a Wikipedia article by reusing the category annotations of its most similar articles and ranking these annotations by their confidence. In this approach, four typical semantic features in Wikipedia, namely incoming links, outgoing links, section headings, and template items, are investigated and exploited as the representation of articles fed into the similarity calculation. The experimental results not only show that these semantic features improve the performance of category annotation compared with the plain-text feature, but also demonstrate the strength of our approach in discovering missing annotations and annotations at the proper level for Wikipedia articles.
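
A minimal sketch of the "collaborative annotating" scheme above: articles are represented by one semantic feature (here just link sets), the most similar annotated articles are found, and their categories are ranked by a similarity-weighted vote. The feature data below is made up for illustration:

    # Recommend categories by voting over the most similar annotated articles.
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    annotated = {   # article -> (link set, category set)
        "Paris":  ({"France", "Seine", "Eiffel Tower"}, {"Capitals", "Cities in France"}),
        "Lyon":   ({"France", "Rhone"}, {"Cities in France"}),
        "Berlin": ({"Germany", "Spree"}, {"Capitals", "Cities in Germany"}),
    }

    def recommend_categories(links, k=2):
        ranked = sorted(annotated.items(), key=lambda kv: -jaccard(links, kv[1][0]))[:k]
        votes = {}
        for _, (neigh_links, cats) in ranked:
            for c in cats:
                votes[c] = votes.get(c, 0) + jaccard(links, neigh_links)
        return sorted(votes.items(), key=lambda kv: -kv[1])

    print(recommend_categories({"France", "Garonne"}))  # links of a new article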

Semplore: An IR approach to scalable hybrid query of Semantic Web data
Published in: Lecture Notes in Computer Science (English, 2007)
Abstract: As an extension of the current Web, the Semantic Web will contain not only structured data with machine-understandable semantics but also textual information. While structured queries can be used to find information more precisely on the Semantic Web, keyword searches are still needed to exploit textual information. It thus becomes very important to combine precise structured queries with imprecise keyword searches into a hybrid query capability. In addition, due to the huge volume of information on the Semantic Web, hybrid queries must be processed in a very scalable way. In this paper, we define such a hybrid query capability that combines unary tree-shaped structured queries with keyword searches. We show how existing information retrieval (IR) index structures and functions can be reused to index Semantic Web data and its textual information, and how hybrid queries are evaluated on the index structure using IR engines in an efficient and scalable manner. We implemented this IR approach in an engine called Semplore. Comprehensive experiments on its performance show that it is a promising approach, leading us to believe that it may be possible to evolve current web search engines to query and search the Semantic Web. Finally, we briefly describe how Semplore is used to search Wikipedia and an IBM customer's product information.
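
A toy sketch of the "reuse IR index structures for semantic data" idea: each entity becomes a virtual document whose indexed terms encode both its keywords and its type assertions, so a hybrid query reduces to a boolean intersection of posting lists. The field-prefixed term encoding is an assumption for illustration; Semplore's actual index layout and query operators are more elaborate:

    # Index entities as virtual documents mixing keyword and structured terms.
    from collections import defaultdict

    postings = defaultdict(set)
    entities = {
        "Tim_Berners-Lee": ["keyword:web", "keyword:inventor", "type:Person"],
        "W3C":             ["keyword:web", "keyword:standards", "type:Organization"],
    }
    for entity, terms in entities.items():
        for term in terms:
            postings[term].add(entity)

    def query(*terms):
        """Hybrid query = intersection of posting lists, as in an IR engine."""
        return set.intersection(*(postings[t] for t in terms))

    print(query("keyword:web", "type:Person"))   # -> {'Tim_Berners-Lee'}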