Thanh Tran

From WikiPapers

Thanh Tran is an author.

Publications

Only those publications related to wikis are shown here.
Title Keyword(s) Published in Language Date Abstract R C
Analysing the duration of trending topics in twitter using wikipedia Temporal analysis
Time series
Twitter
Wikipedia
WebSci 2014 - Proceedings of the 2014 ACM Web Science Conference English 2014 The analysis of trending topics in Twitter is a goldmine for a variety of studies and applications. However, the contents of topics vary greatly from daily routines to major public events, enduring from a few hours to weeks or months. It is thus helpful to distinguish trending topics related to real-world events from those originated within virtual communities. In this paper, we analyse trending topics in Twitter using Wikipedia as reference for studying the provenance of trending topics. We show that among different factors, the duration of a trending topic characterizes exogenous Twitter trending topics better than endogenous ones. 0 0
A hybrid method for detecting outdated information in Wikipedia infoboxes Entity Search
Information extraction
Pattern Learning
Wikipedia Update
Proceedings - 2013 RIVF International Conference on Computing and Communication Technologies: Research, Innovation, and Vision for Future, RIVF 2013 English 2013 Wikipedia has grown fast and become a major information resource for users as well as for many knowledge bases derived from it. However, it is still edited manually while the world is changing rapidly. In this paper, we propose a method to detect outdated attribute values in Wikipedia infoboxes by using facts extracted from the general Web. Our proposed method extracts new information by combining a pattern-based approach with an entity-search-based approach to deal with the diversity of natural language presentation forms of facts on the Web. Our experimental results show that the achieved accuracies of the proposed method are 70% and 82% respectively on the chief-executive-officer attribute and the number-of-employees attribute in company infoboxes. It significantly improves the accuracy of the single pattern-based or entity-search-based method. The results also reveal the striking truth about the outdated status of Wikipedia. 0 0
Lightweight integration of IR and DB for scalable hybrid search with integrated ranking support Hybrid search
Inverted index
IR and DB integration
Ranking
Scalable query processing
Journal of Web Semantics English 2011 The Web contains a large amount of documents and an increasing quantity of structured data in the form of RDF triples. Many of these triples are annotations associated with documents. While structured queries constitute the principal means to retrieve structured data, keyword queries are typically used for document retrieval. Clearly, a form of hybrid search that seamlessly integrates these formalisms to query both textual and structured data can address more complex information needs. However, hybrid search on the large scale Web environment faces several challenges. First, there is a need for repositories that can store and index a large amount of semantic data as well as textual data in documents, and manage them in an integrated way. Second, methods for hybrid query answering are needed to exploit the data from such an integrated repository. These methods should be fast and scalable, and in particular, they shall support flexible ranking schemes to return not all but only the most relevant results. In this paper, we present CE2, an integrated solution that leverages mature information retrieval and database technologies to support large scale hybrid search. For scalable and integrated management of data, CE2 integrates off-the-shelf database solutions with inverted indexes. Efficient hybrid query processing is supported through novel data structures and algorithms which allow advanced ranking schemes to be tightly integrated. Furthermore, a concrete ranking scheme is proposed to take features from both textual and structured data into account. Experiments conducted on DBpedia and Wikipedia show that CE2 can provide good performance in terms of both effectiveness and efficiency. © 2011 Elsevier B.V. All rights reserved. 0 0
Clustering XML documents using frequent subtrees Clustering
Frequent mining
Frequent subtrees
INEX
Structure and content
Wikipedia
XML document mining
Lecture Notes in Computer Science English 2009 This paper presents an experimental study conducted over the INEX 2008 Document Mining Challenge corpus using both the structure and the content of XML documents for clustering them. The concise common substructures known as the closed frequent subtrees are generated using the structural information of the XML documents. The closed frequent subtrees are then used to extract the constrained content from the documents. A matrix containing the term distribution of the documents in the dataset is developed using the extracted constrained content. The k-way clustering algorithm is applied to the matrix to obtain the required clusters. In spite of the large number of documents in the INEX 2008 Wikipedia dataset, the proposed frequent subtree-based clustering approach was successful in clustering the documents. This approach significantly reduces the dimensionality of the terms used for clustering without much loss in accuracy. 0 0
Semantic Wiki Search English 2009 Semantic wikis extend wiki platforms with the ability to represent structured information in a machine-processable way. On top of the structured information in the wiki, novel ways to search, browse, and present the wiki content become possible. However, while powerful query languages offer new opportunities for semantic search, the syntax of formal query languages is not adequate for end users. In this work we present an approach to semantic search that combines the expressiveness and capabilities of structured queries with the simplicity of keyword interfaces and faceted search. Users articulate their information need in keywords, which are translated into structured, conjunctive queries. This translation may result in multiple possible interpretations of the information need, which can then be selected and further refined by the user via facets. We have implemented this approach to semantic search as an extension to Semantic MediaWiki. The results of a user study in the SMW-based community portal semanticweb.org show the efficiency and effectiveness of the approach as well as its ease of use. 0 0
Utilizing the structure and content information for XML document clustering Clustering
INEX 2008
LSK
Wikipedia
Lecture Notes in Computer Science English 2009 This paper reports on the experiments and results of a clustering approach used in the INEX 2008 document mining challenge. The clustering approach utilizes both the structure and content information of the Wikipedia XML document collection. A latent semantic kernel (LSK) is used to measure the semantic similarity between XML documents based on their content features. The construction of a latent semantic kernel involves the computing of singular value decomposition (SVD). On a large feature space matrix, the computation of SVD is very expensive in terms of time and memory requirements. Thus in this clustering approach, the dimension of the document space of a term-document matrix is reduced before performing SVD. The document space reduction is based on the common structural information of the Wikipedia XML document collection. The proposed clustering approach has been shown to be effective on the Wikipedia collection in the INEX 2008 document mining challenge. 0 0
CE2 - Towards a large scale hybrid search engine with integrated ranking support Annotations
Hybrid search
Ranking
Scalable storage
International Conference on Information and Knowledge Management, Proceedings English 2008 The Web contains a large amount of documents and increasingly, also semantic data in the form of RDF triples. Many of these triples are annotations that are associated with documents. While structured queries are the principal means to retrieve semantic data, keyword queries are typically used for document retrieval. Clearly, a form of hybrid search that seamlessly integrates these formalisms to query both documents and semantic data can address more complex information needs. In this paper, we present CE2, an integrated solution that leverages mature database and information retrieval technologies to tackle challenges in hybrid search on the large scale. For scalable storage, CE2 integrates a database with inverted indices. Hybrid query processing is supported in CE2 through novel algorithms and data structures, which allow for advanced ranking schemes to be integrated more tightly into the process. Experiments conducted on DBpedia and Wikipedia show that CE2 can provide good performance in terms of both effectiveness and efficiency. 0 0
Clustering XML documents using closed frequent subtrees: A structural similarity approach Clustering
Frequent Mining
Frequent subtrees
INEX
Structural mining
XML document mining
Lecture Notes in Computer Science English 2008 This paper presents the experimental study conducted over the INEX 2007 Document Mining Challenge corpus employing a frequent subtree-based incremental clustering approach. Using the structural information of the XML documents, the closed frequent subtrees are generated. A matrix is then developed representing the closed frequent subtree distribution in documents. This matrix is used to progressively cluster the XML documents. In spite of the large number of documents in the INEX 2007 Wikipedia dataset, the proposed frequent subtree-based incremental clustering approach was successful in clustering the documents. 0 0
Document clustering using incremental and pairwise approaches Clustering
Content
INEX 2007
Structure
XML
Lecture Notes in Computer Science English 2008 This paper presents the experiments and results of a clustering approach for clustering of the large Wikipedia dataset in the INEX 2007 Document Mining Challenge. The clustering approach employed makes use of an incremental clustering method and a pairwise clustering method. The approach enables us to perform the clustering task on a large dataset by first reducing the dimension of the dataset to an undefined number of clusters using the incremental method. The lower-dimension dataset is then clustered to a required number of clusters using the pairwise method. In this way, clustering of the large number of documents is performed successfully and the accuracy of the clustering solution is achieved. 0 0