Yafang Wang

From WikiPapers

Yafang Wang is an author.

Publications

Only those publications related to wikis are shown here.

Title: Heterogeneous graph-based intent learning with queries, web pages and Wikipedia concepts
Keywords: Heterogeneous graph clustering; Search intent; Wikipedia
Published in: WSDM 2014 - Proceedings of the 7th ACM International Conference on Web Search and Data Mining
Language: English
Date: 2014
Abstract: The problem of learning user search intents has attracted intensive attention from both industry and academia. However, state-of-the-art intent learning algorithms suffer from different drawbacks when only a single type of data source is used. For example, query text has difficulty distinguishing ambiguous queries, and search logs are biased by the order of search results and by users' noisy click behavior. In this work, we leverage, for the first time, three types of objects, namely queries, web pages and Wikipedia concepts, collaboratively for learning generic search intents, and construct a heterogeneous graph to represent the multiple types of relationships between them. A novel unsupervised method called heterogeneous graph-based soft clustering is developed to derive an intent indicator for each object based on the constructed heterogeneous graph. With the proposed co-clustering method, one can enhance the quality of intent understanding by taking advantage of different types of data, which complement each other, and make implicit intents easier to interpret with explicit knowledge from Wikipedia concepts. Experiments on two real-world datasets demonstrate the power of the proposed method, which achieves a 9.25% improvement in NDCG on a search ranking task and a 4.67% improvement in Rand index on an object co-clustering task compared to the best state-of-the-art method.
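
As a rough illustration of the soft-clustering step described above, the following Python sketch propagates intent distributions over a tiny heterogeneous graph of queries, web pages and Wikipedia concepts. The toy graph, edge weights, seed concepts and simple averaging update are assumptions made for this sketch, not the paper's actual formulation.

    # Soft intent clustering on a toy heterogeneous graph (illustrative assumptions only).
    import numpy as np

    # Nodes are (type, name) pairs: queries, web pages, Wikipedia concepts.
    edges = {
        (("query", "jaguar speed"), ("page", "wiki/Jaguar")): 1.0,
        (("query", "jaguar speed"), ("concept", "Jaguar (animal)")): 0.8,
        (("query", "jaguar price"), ("concept", "Jaguar Cars")): 0.9,
        (("page", "wiki/Jaguar"), ("concept", "Jaguar (animal)")): 0.6,
    }

    nodes = sorted({n for e in edges for n in e})
    idx = {n: i for i, n in enumerate(nodes)}
    k = 2  # number of latent intents

    # Symmetric weighted adjacency matrix.
    W = np.zeros((len(nodes), len(nodes)))
    for (a, b), w in edges.items():
        W[idx[a], idx[b]] = W[idx[b], idx[a]] = w

    # Soft intent indicators, seeded on two concepts and propagated to the rest.
    F = np.full((len(nodes), k), 1.0 / k)
    F[idx[("concept", "Jaguar (animal)")]] = [1.0, 0.0]
    F[idx[("concept", "Jaguar Cars")]] = [0.0, 1.0]

    for _ in range(50):
        # Each node adopts the weighted average of its neighbours' intents,
        # then its distribution is renormalised (soft clustering).
        F = W @ F + 1e-9
        F /= F.sum(axis=1, keepdims=True)
        F[idx[("concept", "Jaguar (animal)")]] = [1.0, 0.0]  # keep seeds fixed
        F[idx[("concept", "Jaguar Cars")]] = [0.0, 1.0]

    for n in nodes:
        print(n, np.round(F[idx[n]], 2))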

Title: Automatically building templates for entity summary construction
Keywords: LDA; Pattern mining; Summary template
Published in: Information Processing and Management
Language: English
Date: 2013
Abstract: In this paper, we propose a novel approach to the automatic generation of summary templates from given collections of summary articles. We first develop an entity-aspect LDA model to simultaneously cluster both sentences and words into aspects. We then apply frequent subtree pattern mining on the dependency parse trees of the clustered and labeled sentences to discover sentence patterns that represent the aspects well. Finally, we use the generated templates to construct summaries for new entities. Key features of our method include automatic grouping of semantically related sentence patterns and automatic identification of template slots that need to be filled in. We also implement a new sentence compression algorithm that uses dependency trees instead of parse trees. We apply our method to five Wikipedia entity categories and compare it with three baseline methods. Both quantitative evaluation based on human judgment and qualitative comparison demonstrate the effectiveness and advantages of our method.
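
As a rough illustration of the first step (clustering sentences into aspects), the sketch below approximates it with off-the-shelf LDA from gensim, treating each sentence as a short document. The paper's entity-aspect LDA model jointly clusters sentences and words, so this stand-in, the toy sentences and the topic count are all assumptions.

    # Approximate "cluster sentences into aspects" with standard LDA (assumption,
    # not the paper's entity-aspect model).
    from gensim import corpora, models

    sentences = [
        "she was born in 1970 in boston",
        "he was born in 1965 in chicago",
        "she studied physics at mit",
        "he studied law at harvard",
        "she won the national book award",
        "he won a pulitzer prize",
    ]
    texts = [s.split() for s in sentences]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    lda = models.LdaModel(corpus, num_topics=3, id2word=dictionary,
                          passes=20, random_state=0)

    # Label each sentence with its dominant aspect; the labelled sentences would
    # then be parsed and fed to frequent-subtree pattern mining.
    for s, bow in zip(sentences, corpus):
        aspect = max(lda.get_document_topics(bow), key=lambda x: x[1])[0]
        print(aspect, "|", s)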

Title: Harvesting facts from textual web sources by constrained label propagation
Keywords: Knowledge harvesting; Label propagation; Temporal facts
Published in: International Conference on Information and Knowledge Management, Proceedings
Language: English
Date: 2011
Abstract: There have been major advances in automatically constructing large knowledge bases by extracting relational facts from Web and text sources. However, the world is dynamic: periodic events like sports competitions need to be interpreted with their respective timepoints, and facts such as coaching a sports team, holding political or business positions, and even marriages do not hold forever and should be augmented by their respective timespans. This paper addresses the problem of automatically harvesting temporal facts with such extended time-awareness. We employ pattern-based gathering techniques for fact candidates and construct a weighted pattern-candidate graph. Our key contribution is a system called PRAVDA based on a new kind of label propagation algorithm with a judiciously designed loss function, which iteratively processes the graph to label good temporal facts for a given set of target relations. Our experiments with online news and Wikipedia articles demonstrate the accuracy of this method.
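
A minimal sketch of label propagation over a weighted pattern-candidate graph in the spirit of the description above: seed facts keep their labels, and scores spread to patterns and unlabeled candidates through shared edges. The toy patterns, fact candidates, weights and the simple averaging update are assumptions; PRAVDA's actual algorithm optimizes a specifically designed loss function.

    # Toy label propagation between patterns and temporal fact candidates.
    from collections import defaultdict

    # Bipartite edges between textual patterns and fact candidates (weight = co-occurrence strength).
    edges = [
        ("<person> coached <team>", ("Mourinho", "coach", "Chelsea", "2004-2007"), 0.9),
        ("<person> took charge of <team>", ("Mourinho", "coach", "Chelsea", "2004-2007"), 0.5),
        ("<person> took charge of <team>", ("Hiddink", "coach", "Chelsea", "2009"), 0.7),
        ("<person> visited <team>", ("Obama", "coach", "Chelsea", "2011"), 0.8),
    ]

    # Seed labels: 1.0 = correct temporal fact for the target relation, 0.0 = incorrect.
    labels = {("Mourinho", "coach", "Chelsea", "2004-2007"): 1.0,
              ("Obama", "coach", "Chelsea", "2011"): 0.0}

    scores = defaultdict(lambda: 0.5)      # patterns and unlabeled candidates start neutral
    scores.update(labels)

    for _ in range(20):
        neigh = defaultdict(list)
        for pattern, cand, w in edges:
            neigh[pattern].append((w, scores[cand]))
            neigh[cand].append((w, scores[pattern]))
        for node, pairs in neigh.items():
            if node in labels:             # constraint: seed labels stay clamped
                continue
            scores[node] = sum(w * s for w, s in pairs) / sum(w for w, _ in pairs)

    for node, s in sorted(scores.items(), key=lambda x: -x[1]):
        print(round(s, 2), node)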

Title: Temporal latent semantic analysis for collaboratively generated content: Preliminary results
Keywords: Algorithm; Experimentation
Published in: SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval
Language: English
Date: 2011
Abstract: Latent semantic analysis (LSA) has been intensively studied because of its wide application to Information Retrieval and Natural Language Processing. Yet, traditional models such as LSA only examine one (current) version of the document. However, due to the recent proliferation of collaboratively generated content such as threads in online forums, Collaborative Question Answering archives, Wikipedia, and other versioned content, the document generation process is now directly observable. In this study, we explore how this additional temporal information about the document evolution could be used to enhance the identification of latent document topics. Specifically, we propose a novel hidden-topic modeling algorithm, temporal Latent Semantic Analysis (tLSA), which elegantly extends LSA to modeling document revision history using tensor decomposition. Our experiments show that tLSA outperforms LSA on word relatedness estimation using benchmark data, and explore applications of tLSA for other tasks.
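
A minimal sketch of the tLSA idea: instead of factoring a single term-document matrix, factor a term x document x revision tensor so that revision history contributes to the latent term space. The toy counts and the unfold-then-SVD step are assumptions; the paper uses a proper tensor decomposition.

    # Toy term x document x revision tensor, factored via mode-1 unfolding + SVD.
    import numpy as np

    terms = ["wiki", "edit", "music", "guitar"]
    # counts[t, d, r] = frequency of term t in revision r of document d
    counts = np.zeros((4, 2, 3))
    counts[0, 0] = [1, 2, 3]    # "wiki" grows across revisions of doc 0
    counts[1, 0] = [0, 1, 2]    # "edit" appears in later revisions of doc 0
    counts[2, 1] = [2, 2, 2]    # "music" stable in doc 1
    counts[3, 1] = [0, 1, 2]    # "guitar" added later in doc 1

    # Mode-1 unfolding: each term is described by its behaviour over (document, revision) cells.
    unfolded = counts.reshape(len(terms), -1)
    U, S, Vt = np.linalg.svd(unfolded, full_matrices=False)
    term_vecs = U[:, :2] * S[:2]          # 2-dimensional latent term space

    def relatedness(a, b):
        va, vb = term_vecs[terms.index(a)], term_vecs[terms.index(b)]
        return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-12))

    print(relatedness("wiki", "edit"), relatedness("wiki", "music"))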

Title: A classification algorithm of signed networks based on link analysis
Keywords: Node classification; Signed networks; Social network
Published in: 2010 International Conference on Communications, Circuits and Systems, ICCCAS 2010 - Proceedings
Language: English
Date: 2010
Abstract: In signed networks, the links between nodes can be either positive (the relation is friendship) or negative (the relation is rivalry or confrontation), which is very useful for analyzing real social networks. After studying data sets from the Wikipedia and Slashdot networks, we find that the signs of links in the underlying social networks can be used to classify the nodes and to forecast, with high accuracy, the signs of links that will emerge in the future, using models established across these diverse data sets. Based on these models, the algorithm proposed in this work provides insight into some of the underlying principles that can be extracted from signed links in the networks. At the same time, the algorithm sheds light on social computing applications in which the attitude of one person toward another can be predicted from evidence provided by the relationships of their surrounding friends.
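
A minimal sketch of sign prediction from surrounding relationships: each known link is described by the sign pattern of links to common neighbours, and a logistic regression predicts the sign of an unobserved link. The toy graph, the triad-style features and the classifier choice are assumptions, not the paper's exact model.

    # Predict link signs from the signs of links to common neighbours (toy data).
    from sklearn.linear_model import LogisticRegression

    # Known signed edges: (u, v) -> +1 (friendly) or -1 (hostile).
    signs = {("a", "b"): 1, ("a", "c"): 1, ("b", "c"): 1,
             ("a", "d"): -1, ("b", "d"): -1, ("c", "e"): 1,
             ("d", "e"): -1, ("b", "e"): 1}

    def sign(u, v):
        return signs.get((u, v)) or signs.get((v, u))

    def features(u, v):
        # Count common neighbours by the sign pattern of the two connecting edges.
        pp = pn = nn = 0
        nodes = {x for e in signs for x in e}
        for w in nodes - {u, v}:
            su, sv = sign(u, w), sign(v, w)
            if su is None or sv is None:
                continue
            if su > 0 and sv > 0:
                pp += 1
            elif su < 0 and sv < 0:
                nn += 1
            else:
                pn += 1
        return [pp, pn, nn]

    X = [features(u, v) for (u, v) in signs]
    y = list(signs.values())
    model = LogisticRegression().fit(X, y)

    # Predict the sign of a link that is not yet observed.
    print(model.predict([features("a", "e")]))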

Title: Generating templates of entity summaries with an entity-aspect model and pattern mining
Published in: ACL 2010 - 48th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
Language: English
Date: 2010
Abstract: In this paper, we propose a novel approach to the automatic generation of summary templates from given collections of summary articles. Such summary templates can be useful in various applications. We first develop an entity-aspect LDA model to simultaneously cluster both sentences and words into aspects. We then apply frequent subtree pattern mining on the dependency parse trees of the clustered and labeled sentences to discover sentence patterns that represent the aspects well. Key features of our method include automatic grouping of semantically related sentence patterns and automatic identification of template slots that need to be filled in. We apply our method to five Wikipedia entity categories and compare it with two baseline methods. Both quantitative evaluation based on human judgment and qualitative comparison demonstrate the effectiveness and advantages of our method.

Title: Timely YAGO: Harvesting, querying, and visualizing temporal knowledge from Wikipedia
Keywords: Knowledge harvesting; Knowledge management; Ontology; Temporal fact extraction; Temporal queries; Wikipedia
Published in: Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings
Language: English
Date: 2010
Abstract: Recent progress in information extraction has shown how to automatically build large ontologies from high-quality sources like Wikipedia. But knowledge evolves over time; facts have associated validity intervals. Therefore, ontologies should include time as a first-class dimension. In this paper, we introduce Timely YAGO, which extends our previously built knowledge base YAGO with temporal aspects. This prototype system extracts temporal facts from Wikipedia infoboxes, categories, and lists in articles, and integrates them into the Timely YAGO knowledge base. We also support querying temporal facts through temporal predicates in a SPARQL-style language. Visualization of query results is provided in order to better understand the dynamic nature of knowledge.
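
A minimal sketch of time-annotated facts and a temporal filter in the spirit of the temporal predicates mentioned above. The fact tuples, the year-granularity intervals and the 'during' operator are assumptions; the actual system answers such queries in a SPARQL-style language.

    # Toy temporal facts with validity intervals and a simple temporal query.
    from dataclasses import dataclass

    @dataclass
    class TemporalFact:
        subject: str
        predicate: str
        obj: str
        begin: int   # validity interval given as years for simplicity
        end: int

    facts = [
        TemporalFact("Zidane", "playsFor", "Real Madrid", 2001, 2006),
        TemporalFact("Zidane", "coaches", "Real Madrid", 2016, 2018),
        TemporalFact("Beckham", "playsFor", "Real Madrid", 2003, 2007),
    ]

    def during(fact, year):
        """True if the fact's validity interval contains the given year."""
        return fact.begin <= year <= fact.end

    # "Who was associated with Real Madrid in 2005?"
    for f in facts:
        if f.obj == "Real Madrid" and during(f, 2005):
            print(f.subject, f.predicate, f.obj, f.begin, "-", f.end)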

Title: Using the past to score the present: Extending term weighting models through Revision History Analysis
Keywords: Collaboratively generated content; Retrieval models; Term weighting
Published in: International Conference on Information and Knowledge Management, Proceedings
Language: English
Date: 2010
Abstract: The generative process underlies many information retrieval models, notably statistical language models. Yet these models only examine one (current) version of the document, effectively ignoring the actual document generation process. We posit that a considerable amount of information is encoded in the document authoring process, and this information is complementary to the word occurrence statistics upon which most modern retrieval models are based. We propose a new term weighting model, Revision History Analysis (RHA), which uses the revision history of a document (e.g., the edit history of a page in Wikipedia) to redefine term frequency - a key indicator of document topic/relevance for many retrieval models and text processing tasks. We then apply RHA to document ranking by extending two state-of-the-art text retrieval models, namely, BM25 and the generative statistical language model (LM). To the best of our knowledge, our paper is the first attempt to directly incorporate document authoring history into retrieval models. Empirical results show that RHA provides consistent improvements for state-of-the-art retrieval models, using standard retrieval tasks and benchmarks.
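
A minimal sketch of the RHA idea: replace raw term frequency with a revision-aware count that rewards terms persisting across a document's edit history, and feed it into a standard BM25-style weight. The persistence weighting, the toy revisions and the assumed IDF values are illustrative, not the paper's exact RHA formula.

    # Revision-aware term frequency plugged into a BM25-style weight (toy example).
    revisions = [                      # oldest to newest revision of one document
        "rome is a city",
        "rome is a city in italy",
        "rome is the capital city of italy famous for its history",
    ]

    def rha_tf(term, revisions):
        # Fraction of revisions containing the term: terms that persist across
        # the edit history get more credit than recently added ones.
        return sum(1.0 / len(revisions) for rev in revisions if term in rev.split())

    def bm25_weight(tf, doc_len, avg_doc_len, idf, k1=1.2, b=0.75):
        return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

    current = revisions[-1].split()
    doc_len, avg_doc_len = len(current), 10.0
    idf = {"rome": 2.0, "italy": 1.5, "history": 1.0}   # assumed corpus statistics

    for term in ["rome", "italy", "history"]:
        raw_tf = current.count(term)
        print(term,
              "raw:", round(bm25_weight(raw_tf, doc_len, avg_doc_len, idf[term]), 3),
              "rha:", round(bm25_weight(rha_tf(term, revisions) * raw_tf,
                                        doc_len, avg_doc_len, idf[term]), 3))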

Title: Comprehensive query-dependent fusion using regression-on-folksonomies: A case study of multimodal music search
Keywords: Folksonomy; Multimodal search; Music; Query-dependent fusion
Published in: MM'09 - Proceedings of the 2009 ACM Multimedia Conference, with Co-located Workshops and Symposiums
Language: English
Date: 2009
Abstract: The combination of heterogeneous knowledge sources has been widely regarded as an effective approach to boosting retrieval accuracy in many information retrieval domains. While various technologies have recently been developed for information retrieval, multimodal music search has not kept pace with the enormous growth of data on the Internet. In this paper, we study the problem of integrating multiple online information sources to conduct effective query-dependent fusion (QDF) of multiple search experts for music retrieval. We have developed a novel framework to construct a knowledge space of users' information needs from online folksonomy data. With this innovation, a large number of comprehensive queries can be automatically constructed to train a QDF system that generalizes better to unseen user queries. In addition, our framework models the QDF problem as regression of the optimal combination strategy on a query. In contrast to previous approaches, the regression model of QDF (RQDF) offers superior modeling capability with fewer constraints and more efficient computation. To validate our approach, a large-scale test collection has been collected from different online sources, such as Last.fm, Wikipedia, and YouTube. All test data will be released to the public for better research synergy in multimodal music search. Our performance study indicates that the accuracy, efficiency, and robustness of multimodal music search can be improved significantly by the proposed folksonomy-RQDF approach. In addition, since no human involvement is required to collect training examples, our approach offers great feasibility and practicality in system development.
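
A minimal sketch of regression-based query-dependent fusion: learn a mapping from query features to per-expert combination weights, then fuse the experts' scores with the predicted weights at query time. The query features, the "optimal" training weights and the linear model are assumptions made for this sketch.

    # Regress fusion weights on query features, then combine expert scores (toy data).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Offline training data: simple query features -> optimal fusion weights for
    # three search experts (e.g. text, audio, social-tag based), found beforehand.
    query_features = np.array([[1, 0],   # [looks like an artist name, looks like a genre]
                               [0, 1],
                               [1, 1],
                               [0, 0]])
    optimal_weights = np.array([[0.7, 0.1, 0.2],
                                [0.2, 0.3, 0.5],
                                [0.5, 0.2, 0.3],
                                [0.3, 0.4, 0.3]])

    rqdf = LinearRegression().fit(query_features, optimal_weights)

    # At query time: predict weights, then combine the experts' scores per document.
    expert_scores = np.array([[0.9, 0.1, 0.4],    # one row per candidate document
                              [0.2, 0.8, 0.6]])
    weights = np.clip(rqdf.predict([[1, 0]])[0], 0, None)
    fused = expert_scores @ weights
    print("fused scores:", fused)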

Title: Learning landmarks by exploiting social media
Published in: Lecture Notes in Computer Science
Language: English
Date: 2009
Abstract: This paper introduces methods for the automatic annotation of landmark photographs by learning textual tags and visual features of landmarks from photographs that are appropriately location-tagged in social media. By analyzing the spatial distributions of text tags from Flickr's geotagged photos, we identify thousands of tags that likely refer to landmarks. Further verification using Wikipedia articles filters out non-landmark tags. Association analysis is used to find containment relationships between landmark tags and other geographic names, thus forming a geographic hierarchy. Photographs relevant to each landmark tag were retrieved from Flickr and distinctive visual features were extracted from them. The results form an ontology for landmarks, including their names, equivalent names, geographic hierarchy, and visual features. We also propose an efficient indexing method for content-based landmark search. The resulting ontology can be used in tag suggestion and content-relevant re-ranking.
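
A minimal sketch of the tag-filtering step described above: tags whose geotagged photos are tightly concentrated in space become landmark candidates, and only those matching a Wikipedia title are kept. The toy photo coordinates, the spread threshold and the title normalization are assumptions.

    # Flag spatially concentrated, Wikipedia-verified tags as landmark candidates.
    from statistics import pstdev

    # tag -> list of (lat, lon) of photos carrying that tag
    photos = {
        "eiffeltower": [(48.8583, 2.2944), (48.8584, 2.2947), (48.8581, 2.2940)],
        "paris":       [(48.85, 2.35), (48.90, 2.30), (48.80, 2.40)],
        "vacation":    [(48.85, 2.35), (40.71, -74.00), (35.68, 139.69)],
    }
    wikipedia_titles = {"Eiffel Tower", "Paris"}
    normalized_titles = {t.lower().replace(" ", "") for t in wikipedia_titles}

    def spatial_spread(points):
        lats, lons = zip(*points)
        return pstdev(lats) + pstdev(lons)

    for tag, pts in photos.items():
        concentrated = spatial_spread(pts) < 0.01    # tightly clustered tag
        verified = tag in normalized_titles          # matches a Wikipedia title
        print(tag, "spread=%.4f" % spatial_spread(pts),
              "landmark:", concentrated and verified)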

Title: MagicCube: Choosing the best snippet for each aspect of an entity
Keywords: Entity; MagicCube; Snippet; Wiki
Published in: International Conference on Information and Knowledge Management, Proceedings
Language: English
Date: 2009
Abstract: Wikis are currently used in business to provide knowledge management systems, especially for individual organizations. However, building wikis manually is laborious and time-consuming work. To assist in founding wikis, this paper proposes a methodology to automatically select the best snippets for entities as their initial explanations. Our method consists of two steps. First, we extract snippets from a given set of web pages for each entity. Starting from a seed sentence, a snippet grows by adding the most relevant neighboring sentences to itself. The sentences are chosen by the Snippet Growth Model, which employs a distance function and an influence function to make decisions. Second, we pick out the best snippet for each aspect of an entity. The combination of all the selected snippets serves as the primary description of the entity. We present three progressively enhanced methods to handle the selection process. Experimental results on a real data set show that the proposed method works effectively in producing primary descriptions for entities such as employee names.
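
A minimal sketch of the snippet-growth idea: starting from a seed sentence, neighbouring sentences are absorbed while a combined influence/distance score stays above a threshold. The scoring functions and the threshold are assumptions, not the paper's Snippet Growth Model.

    # Grow a snippet around a seed sentence using toy influence/distance scores.
    sentences = [
        "The company was founded in 1998.",
        "Alice Smith joined the company as chief engineer.",
        "She leads the platform team and designed its build system.",
        "Unrelatedly, the cafeteria menu changed last week.",
    ]
    entity = "alice smith"
    seed = 1                       # index of the sentence mentioning the entity

    def influence(sentence):
        # Crude relevance: share of entity words appearing in the sentence,
        # plus a small bonus for pronouns that may co-refer.
        words = sentence.lower()
        hits = sum(w in words for w in entity.split())
        return hits / len(entity.split()) + (0.5 if " she " in " " + words else 0.0)

    def distance_penalty(i, j):
        return 1.0 / (1 + abs(i - j))      # nearer sentences matter more

    snippet = {seed}
    changed = True
    while changed:
        changed = False
        for i, s in enumerate(sentences):
            if i in snippet:
                continue
            score = max(influence(s) * distance_penalty(i, j) for j in snippet)
            if score > 0.2:                # growth threshold (assumed)
                snippet.add(i)
                changed = True

    print(" ".join(sentences[i] for i in sorted(snippet)))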

Title: An approach to deep web crawling by sampling
Published in: Proceedings - 2008 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2008
Language: English
Date: 2008
Abstract: Crawling the deep web is the process of collecting data from search interfaces by issuing queries. With the wide availability of programmable interfaces encoded as web services, deep web crawling has found a large variety of applications. One of the major challenges in crawling the deep web is selecting the queries so that most of the data can be retrieved at a low cost. We propose a general method in this regard. In order to minimize the duplicates retrieved, we reduce the problem of selecting an optimal set of queries from a sample of the data source to the well-known set-covering problem and adopt a classical algorithm to solve it. To verify that the queries selected from a sample also produce good results for the entire data source, we carried out a set of experiments on large corpora including Wikipedia and Reuters. We show that our sampling-based method is effective by empirically demonstrating that (1) the queries selected from samples can harvest most of the data in the original database; (2) queries with a low overlapping rate in the samples also result in a low overlapping rate in the original database; and (3) the size of the sample and the size of the term pool from which the queries are selected do not need to be very large.
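
A minimal sketch of the query-selection step: from a sampled set of documents, queries are chosen greedily so that each new query covers as many still-uncovered sample documents as possible, i.e. classic greedy set cover. The toy sample and term pool are assumptions.

    # Greedy set cover over a sample to choose deep-web queries (toy data).
    sample_docs = {
        1: {"wikipedia", "history", "rome"},
        2: {"wikipedia", "music"},
        3: {"reuters", "market", "music"},
        4: {"reuters", "rome"},
    }
    candidate_terms = ["wikipedia", "reuters", "music", "rome", "history", "market"]

    def matches(term):
        # Documents in the sample that the query 'term' would retrieve.
        return {d for d, words in sample_docs.items() if term in words}

    uncovered = set(sample_docs)
    selected = []
    while uncovered:
        # Pick the query covering the most still-uncovered documents.
        best = max((t for t in candidate_terms if t not in selected),
                   key=lambda t: len(matches(t) & uncovered))
        selected.append(best)
        uncovered -= matches(best)

    print("queries to issue:", selected)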

Title: EachWiki: Suggest to be an easy-to-edit wiki interface for everyone
Published in: CEUR Workshop Proceedings
Language: English
Date: 2007
Abstract: In this paper, we present EachWiki, an extension of Semantic MediaWiki characterized by an intelligent suggestion mechanism. It aims to facilitate wiki authoring by recommending the following elements: links, categories, and properties. We exploit the semantics of Wikipedia data and leverage the collective wisdom of web users to provide high-quality annotation suggestions. The proposed mechanism not only improves the usability of Semantic MediaWiki but also speeds up its convergence toward a shared terminology. The suggestions are applied to relieve the burden of wiki authoring and attract more inexperienced contributors, thus making Semantic MediaWiki an even better Semantic Web prototype and data source.

Title: Exploit semantic information for category annotation recommendation in Wikipedia
Keywords: Collaborative annotating; Semantic features; Vector space model; Wikipedia category
Published in: Lecture Notes in Computer Science
Language: English
Date: 2007
Abstract: Compared with plain-text resources, those in "semi-semantic" web sites such as Wikipedia contain high-level semantic information that benefits various automatic annotation tasks on them. In this paper, we propose a "collaborative annotating" approach to automatically recommend categories for a Wikipedia article by reusing category annotations from its most similar articles and ranking these annotations by their confidence. In this approach, four typical semantic features in Wikipedia, namely incoming links, outgoing links, section headings and template items, are investigated and exploited as the representation of articles to feed the similarity calculation. The experimental results not only show that these semantic features improve the performance of category annotation compared with the plain-text feature, but also demonstrate the strength of our approach in discovering missing annotations and annotations at the proper level for Wikipedia articles.
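
A minimal sketch of the recommendation step: represent articles by bags of semantic features, find the most similar annotated articles, and rank their categories by similarity-weighted confidence. The toy articles, the merged feature bags and the use of Jaccard similarity (instead of the paper's vector space model over four separate feature types) are assumptions.

    # Recommend categories by reusing annotations from the most similar articles.
    from collections import defaultdict

    # article -> set of semantic features (links, headings, template items, ...)
    features = {
        "Berlin": {"link:Germany", "link:Capital", "heading:History", "tpl:population"},
        "Paris":  {"link:France", "link:Capital", "heading:History", "tpl:population"},
        "Banana": {"link:Fruit", "heading:Cultivation", "tpl:taxobox"},
    }
    categories = {
        "Berlin": {"Capitals in Europe", "Cities in Germany"},
        "Paris":  {"Capitals in Europe", "Cities in France"},
        "Banana": {"Tropical fruit"},
    }

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    def recommend(new_article_features, k=2):
        # Find the k most similar annotated articles, then accumulate their
        # categories weighted by similarity as a confidence score.
        sims = sorted(((jaccard(new_article_features, f), name)
                       for name, f in features.items()), reverse=True)[:k]
        confidence = defaultdict(float)
        for sim, name in sims:
            for cat in categories[name]:
                confidence[cat] += sim
        return sorted(confidence.items(), key=lambda x: -x[1])

    # A new article about a European capital, described only by its features.
    print(recommend({"link:Spain", "link:Capital", "heading:History", "tpl:population"}))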