Clustering

From WikiPapers

Clustering is included as a keyword or extra keyword in 0 datasets, 0 tools, and 25 publications.

Datasets

There are no datasets for this keyword.

Tools

There are no tools for this keyword.


Publications

Each entry below gives the title, author(s), publication venue, language, and year, followed by the abstract and the wiki's R and C counters.

Conceptual clustering. Boubacar A., Niu Z. Lecture Notes in Electrical Engineering, English, 2014. (R: 0, C: 0)
Traditional clustering methods are unable to describe the clusters they generate. Conceptual clustering is an important and active research area that aims to cluster data efficiently while also explaining it. Previous conceptual clustering approaches provide descriptions that are not expressed in human-comprehensible terms. This paper presents an algorithm that uses Wikipedia concepts to drive the clustering. The generated clusters overlap each other and serve as the basis for an information retrieval system. The method has been implemented to improve the performance of the system and reduces its computation cost.

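As a toy illustration of the idea in this abstract (overlapping clusters described by Wikipedia concepts), the sketch below assigns each document to every concept whose term set it shares enough words with. The concept vocabulary and documents are hypothetical; a real system would derive concept terms from Wikipedia itself.

```python
# A minimal sketch of overlapping, concept-labelled clustering. The concept
# vocabulary below is hypothetical; a real system would derive it from
# Wikipedia concepts and their associated terms.

CONCEPT_TERMS = {
    "Machine learning": {"model", "training", "classifier", "features"},
    "Databases":        {"query", "index", "schema", "table"},
    "Networking":       {"packet", "router", "protocol", "latency"},
}

def concept_clusters(documents, min_overlap=2):
    """Assign each document to every concept sharing enough terms with it.

    Because a document may match several concepts, the resulting clusters
    overlap, and each cluster is described by a human-readable concept name.
    """
    clusters = {name: [] for name in CONCEPT_TERMS}
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        for name, terms in CONCEPT_TERMS.items():
            if len(words & terms) >= min_overlap:
                clusters[name].append(doc_id)
    return {name: docs for name, docs in clusters.items() if docs}

docs = {
    "d1": "training a classifier with sparse features",
    "d2": "query optimization over a relational schema and index",
    "d3": "a model that predicts router latency from packet traces",
}
print(concept_clusters(docs))
```
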
MIGSOM: A SOM algorithm for large scale hyperlinked documents inspired by neuronal migration. Kotaro Nakayama, Yutaka Matsuo. Lecture Notes in Computer Science, English, 2014. (R: 0, C: 0)
The SOM (Self-Organizing Map), one of the most popular unsupervised machine learning algorithms, maps high-dimensional vectors onto a low-dimensional map (usually 2-dimensional). The SOM is widely known as a "scalable" algorithm because of its capability to handle large numbers of records; however, it is effective only when the vectors are small and dense. Although a number of studies on making the SOM scalable have been conducted, technical issues of scalability and performance for sparse high-dimensional data such as hyperlinked documents still remain. In this paper, we introduce MIGSOM, an SOM algorithm inspired by a new discovery on neuronal migration. The two major advantages of MIGSOM are its scalability for sparse high-dimensional data and its clustering visualization functionality. We describe the algorithm and implementation in detail, and show the practicality of the algorithm in several experiments, applying MIGSOM not only to experimental data sets but also to a large-scale real data set: Wikipedia's hyperlink data.

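For readers unfamiliar with the baseline MIGSOM builds on, here is a minimal classical SOM training loop. It is not MIGSOM itself (the neuronal-migration mechanics are the paper's contribution), and it assumes small dense vectors, exactly the regime the abstract says the plain SOM is limited to.

```python
import numpy as np

# A minimal classical SOM training loop for intuition only; MIGSOM itself
# (the paper's neuronal-migration variant) is not reproduced here.

rng = np.random.default_rng(0)

def train_som(data, grid=(10, 10), epochs=20, lr0=0.5, radius0=3.0):
    """Fit a 2-D self-organizing map to `data` (n_samples x n_features)."""
    h, w = grid
    weights = rng.random((h, w, data.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            # Best matching unit: the grid cell whose weight vector is closest.
            dists = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(np.argmin(dists), (h, w))
            # Decay learning rate and neighbourhood radius over time.
            frac = step / n_steps
            lr = lr0 * (1 - frac)
            radius = max(radius0 * (1 - frac), 0.5)
            # Pull the BMU and its grid neighbours toward the sample.
            grid_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
            influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            weights += lr * influence[..., None] * (x - weights)
            step += 1
    return weights

data = rng.random((200, 8))  # toy dense vectors; sparse documents need more care
som = train_som(data)
print(som.shape)  # (10, 10, 8)
```
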
An automatic approach for generating tables in semantic wikis. Al-Husain L., El-Masri S. Journal of Theoretical and Applied Information Technology, English, 2012. (R: 0, C: 0)
Wikis are well-known content management systems. Semantic wikis extend classical wikis with semantic annotations that make their content more structured. Tabular representations of information have considerable value, especially in wikis, which are rich in content and contain large amounts of information. For this reason, we propose an approach for automatically generating tables that represent the semantic data contained in wiki articles. The proposed approach is composed of three steps: (1) extract the semantic data of Typed Links and Attributes from the wiki articles, calling them Article Properties; (2) cluster the collection of wiki articles based on the properties extracted in the first step; and (3) construct the table that aggregates the properties shared between articles and presents them in two dimensions. The proposed approach is based on a simple heuristic: the number of properties that are shared between wiki articles.

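A minimal sketch of step (2)'s heuristic, under the assumption that each article is reduced to a set of property names: articles are greedily merged whenever they share at least a given number of properties. The article names and properties below are hypothetical.

```python
from itertools import combinations

# A minimal sketch of the shared-property heuristic: group wiki articles
# whose property sets overlap. The articles and properties are hypothetical.

articles = {
    "Berlin":  {"country", "population", "mayor"},
    "Paris":   {"country", "population", "mayor"},
    "Python":  {"paradigm", "designer", "latest_release"},
    "Haskell": {"paradigm", "designer"},
}

def cluster_by_shared_properties(props, min_shared=2):
    """Greedily merge articles that share at least `min_shared` properties."""
    clusters = [{name} for name in props]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(range(len(clusters)), 2):
            shared = set.intersection(*(props[n] for n in clusters[a] | clusters[b]))
            if len(shared) >= min_shared:
                clusters[a] |= clusters[b]
                del clusters[b]
                merged = True
                break
    return clusters

for cluster in cluster_by_shared_properties(articles):
    shared = set.intersection(*(articles[n] for n in cluster))
    print(sorted(cluster), "share", sorted(shared))
```

Step (3) would then render each cluster as a table whose rows are the articles and whose columns are the shared properties.
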
Catching the drift - Indexing implicit knowledge in chemical digital libraries. Köhncke B., Tönnies S., Balke W.-T. Lecture Notes in Computer Science, English, 2012. (R: 0, C: 0)
In the domain of chemistry, the information gathering process is highly focused on chemical entities. But due to synonyms and different entity representations, the indexing of chemical documents is a challenging process. In the field of drug design, the task is even more complex: domain experts from this field are usually not interested in any chemical entity itself, but in representatives of some chemical class showing a specific reaction behavior. For describing such a reaction behavior of chemical entities, the most interesting parts are their functional groups. The restriction to a chemical class is also related to the entities' reaction behavior, but is further based on the chemist's implicit knowledge. In this paper we present an approach that deals with this implicit knowledge by clustering chemical entities based on their functional groups. However, since such clusters are generally too unspecific, containing chemical entities from different chemical classes, we further divide them into sub-clusters using fingerprint-based similarity measures. We analyze several uncorrelated fingerprint/similarity-measure combinations and show that the entities most similar to a query entity can be found in the respective sub-cluster. Furthermore, we use our approach for document retrieval, introducing a new similarity measure based on Wikipedia categories. Our evaluation shows that the sub-clustering leads to suitable results, enabling sophisticated document retrieval in chemical digital libraries.

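The sub-clustering step relies on fingerprint-based similarity. A common measure for bit-vector fingerprints is the Tanimoto (Jaccard) coefficient; the sketch below applies it to hypothetical fingerprints represented as sets of "on" bit positions. This illustrates the general technique, not the paper's specific fingerprint/measure combinations.

```python
# A minimal sketch of fingerprint-based similarity of the kind used for
# sub-clustering. Real chemical fingerprints (substructure-based bit
# vectors) are modelled here as plain Python sets of "on" bit positions;
# the molecule names and bits are hypothetical.

def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

fingerprints = {
    "aspirin_like":   {1, 4, 9, 17, 23},
    "ibuprofen_like": {1, 4, 9, 21},
    "glucose_like":   {2, 7, 30},
}

query = {1, 4, 9, 17}
ranked = sorted(fingerprints.items(),
                key=lambda kv: tanimoto(query, kv[1]),
                reverse=True)
for name, fp in ranked:
    print(f"{name}: {tanimoto(query, fp):.2f}")
```
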
Clustering Wikipedia infoboxes to discover their types. Nguyen T.H., Nguyen H.D., Viviane Moreira, Juliana Freire. ACM International Conference Proceeding Series, English, 2012. (R: 0, C: 0)
Wikipedia has emerged as an important source of structured information on the Web. But while the success of Wikipedia can be attributed in part to the simplicity of adding and modifying content, this has also created challenges when it comes to using, querying, and integrating the information. Even though authors are encouraged to select appropriate categories and provide infoboxes that follow pre-defined templates, many do not follow the guidelines or follow them loosely. This leads to undesirable effects, such as template duplication, heterogeneity, and schema drift. As a step towards addressing this problem, we propose a new unsupervised approach for clustering Wikipedia infoboxes. Instead of relying on manually assigned categories and template labels, we use the structured information available in infoboxes to group them and infer their entity types. Experiments using over 48,000 infoboxes indicate that our clustering approach is effective and produces high quality clusters.

Extracting knowledge from web search engine results. Kanavos A., Theodoridis E., Tsakalidis A. Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI, English, 2012. (R: 0, C: 0)
Nowadays, people frequently use search engines to find the information they need on the web. However, web search engines usually return web page references in a single global ranking, making it difficult for users to browse the different topics captured in the result set and thus to find the desired web pages quickly. There is a need for computational systems that discover knowledge in these web search results and give the user the possibility to browse the different topics contained in a given result set. In this paper, we focus on the problem of determining the different thematic groups in the results that existing web search engines provide. We propose a novel system that exploits a set of reformulation strategies to help users obtain results more relevant to their query. It additionally tries to discover different topic groups within the result set, according to the various meanings of the provided query. The proposed method utilizes a number of semantic annotation techniques that use knowledge bases such as WordNet and Wikipedia to identify the different senses of each query term. Finally, the method annotates the extracted topics using information derived from the clusters and presents them to the end user.

A self organizing document map algorithm for large scale hyperlinked data inspired by neuronal migration. Kotaro Nakayama, Yutaka Matsuo. Proceedings of the 20th International Conference Companion on World Wide Web, WWW 2011, English, 2011. (R: 0, C: 0)
Web document clustering is a research topic pursued continuously because of its wide variety of applications. Since web documents usually vary in domain, content, and quality, one of the technical difficulties is finding a reasonable number and size of clusters. In this research, we focus on SOMs (Self-Organizing Maps) because their visualized clustering helps users investigate the characteristics of the data in detail. The SOM is widely known as a "scalable" algorithm because of its capability to handle large numbers of records; however, it is effective only when the vectors are small and dense. Although several research efforts on making the SOM scalable have been conducted, technical issues of scalability and performance for sparse high-dimensional data such as hyperlinked documents still remain. In this paper, we introduce MIGSOM, an SOM algorithm inspired by a recent discovery on neuronal migration. The two major advantages of MIGSOM are its scalability for sparse high-dimensional data and its clustering visualization functionality. We describe the algorithm and implementation, and show the practicality of the algorithm by applying MIGSOM to a very large real data set: Wikipedia's hyperlink data.

Beyond the bag-of-words paradigm to enhance information retrieval applications. Paolo Ferragina. Proceedings - 4th International Conference on SImilarity Search and APplications, SISAP 2011, English, 2011. (R: 0, C: 0)
The typical IR approach to indexing, clustering, classification and retrieval, just to name a few, is based on the bag-of-words paradigm. It transforms a text into an array of terms, possibly weighted (with tf-idf scores or derivatives), and then represents that array as a point in high-dimensional space. It is therefore syntactic and unstructured, in the sense that different terms lead to different dimensions. Co-occurrence detection and other processing steps have been proposed (see e.g. LSI, spectral analysis [7]) to identify such relations, but everyone is aware of the limitations of this approach, especially in the expanding context of short (and thus poorly composed) texts, such as the snippets of search-engine results, the tweets of a Twitter channel, the items of a news feed, the posts of a blog, or advertisement messages. A good deal of recent work attempts to go beyond this paradigm by enriching the input text with additional structured annotations. This general idea has been pursued in the literature in two distinct ways. One consists of extending the classic term-based vector-space model with additional dimensions corresponding to features (concepts) extracted from an external knowledge base, such as DMOZ, Wikipedia, or even the whole Web (see e.g. [4, 5, 12]). The pro of this approach is that it extends the bag-of-words scheme with more concepts, possibly allowing the identification of related texts that are syntactically far apart. The con resides in the contamination of these vectors by unrelated (but common) concepts retrieved via the syntactic queries. The second way consists of identifying in the input text short and meaningful sequences of terms (aka spots), which are then connected to unambiguous concepts drawn from a catalog. The catalog can be formed either by a small set of specifically recognized types, most often people and locations (aka named entities, see e.g. [13, 14]), or by millions of concepts drawn from a large knowledge base, such as Wikipedia. This latter catalog is ever-expanding and currently offers the best trade-off between a catalog with a rigorous structure but low coverage (like WordNet, CYC, TAP) and a large text collection with wide coverage but unstructured and noisy content (like the whole Web). To understand how this annotation works, consider the following short news item: "Diego Maradona won against Mexico". The goal of the annotation is to detect "Diego Maradona" and "Mexico" as spots, and then hyperlink them to the Wikipedia pages about the former coach of Argentina and the football team of Mexico. The annotator uses as spots the anchor texts that occur in Wikipedia pages, and as possible concepts for each spot the (possibly many) pages pointed to in Wikipedia by that spot/anchor.

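The closing example (spotting "Diego Maradona" and "Mexico") can be made concrete with a longest-match scan over a dictionary of Wikipedia anchor texts. The anchor dictionary below is a tiny hypothetical stand-in for one harvested from real Wikipedia links.

```python
# A minimal sketch of anchor-based spotting: spans of the input are matched
# against a dictionary of Wikipedia anchor texts, longest match first. The
# anchors and candidate pages below are a hypothetical stand-in.

ANCHORS = {
    "diego maradona": ["Diego_Maradona"],
    "mexico": ["Mexico", "Mexico_national_football_team"],
    "maradona": ["Diego_Maradona"],
}
MAX_SPOT_WORDS = 3

def spot(text):
    """Return (spot, candidate_pages) pairs, preferring longer anchors."""
    words = text.lower().split()
    spots, i = [], 0
    while i < len(words):
        for n in range(min(MAX_SPOT_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in ANCHORS:
                spots.append((candidate, ANCHORS[candidate]))
                i += n
                break
        else:
            i += 1
    return spots

print(spot("Diego Maradona won against Mexico"))
# [('diego maradona', ['Diego_Maradona']),
#  ('mexico', ['Mexico', 'Mexico_national_football_team'])]
```

A disambiguation step would then pick one page per spot, for example by the commonness of each candidate or its coherence with the other spots.
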
Enhancing accessibility of microblogging messages using semantic knowledge. Hu X., Tang L., Hongyan Liu. International Conference on Information and Knowledge Management, Proceedings, English, 2011. (R: 0, C: 0)
The volume of microblogging messages is increasing exponentially with the popularity of microblogging services. The large number of messages appearing in user interfaces hinders users' access to the useful information buried in disorganized, incomplete, and unstructured text messages. To enhance accessibility, we propose to aggregate related microblogging messages into clusters and automatically assign them semantically meaningful labels. However, a distinctive feature of microblogging messages is that they are much shorter than conventional text documents, so they provide inadequate term co-occurrence information for capturing semantic associations. To address this problem, we propose a novel framework for organizing unstructured microblogging messages by transforming them into a semantically structured representation. The proposed framework first captures informative tree fragments by analyzing the parse tree of a message, and then exploits external knowledge bases (Wikipedia and WordNet) to enhance their semantic information. Empirical evaluation on a Twitter dataset shows that our framework significantly outperforms existing state-of-the-art methods.

Overview of the INEX 2010 XML mining track: Clustering and classification of XML documents. De Vries C.M., Nayak R., Kutty S., Shlomo Geva, Tagarelli A. Lecture Notes in Computer Science, English, 2011. (R: 0, C: 0)
This report explains the objectives, datasets and evaluation criteria of both the clustering and classification tasks set in the INEX 2010 XML Mining track. The report also describes the approaches and results obtained by participants.

WikiDev 2.0: Facilitating software development teams. Fokaefs M., Brendan Tansey, Veselin Ganev, Bauer K., Eleni Stroulia. Proceedings of the European Conference on Software Maintenance and Reengineering, CSMR, English, 2011. (R: 0, C: 0)
Software development is fundamentally a collaborative task. Developers, sometimes geographically distributed, collectively work on different parts of a project. Ensuring that their contributions consistently build on one another is a major challenge for collaborative development, with attendant concerns about effective communication, task administration, and the exchange of documents and information concerning the project. In this demo, we present WikiDev 2.0, a lightweight wiki-based tool suite that enhances collaboration within software development teams. WikiDev 2.0 integrates information from multiple development tools and displays the results through its wiki-based front-end. The tool also offers several analysis techniques and visualizations that improve the team's project-status awareness.

A content-based image retrieval system based on unsupervised topological learning. Rogovschi N., Grozavu N. Proc. - 6th Intl. Conference on Advanced Information Management and Service, IMS2010, with ICMIA2010 - 2nd International Conference on Data Mining and Intelligent Information Technology Applications, English, 2010. (R: 0, C: 0)
The Internet offers its users an ever-increasing amount of information. Among this, multimodal data (images, text, video, sound) are widely requested by users, and there is a strong need for effective ways to process and manage them. Most existing algorithms/frameworks only annotate images and search over those annotations, possibly combined with some clustering results, but most of them do not allow quick browsing of the images. Even if search is very fast, when the number of images is very large the system must give the user the possibility to browse the data. In this paper, an image retrieval system is presented, including detailed descriptions of the lwo-SOM (local weighting observations Self-Organizing Map) approach used and a new interactive learning process based on user feedback. We also show the use of unsupervised learning on an image dataset for which no labels are available, without taking into account the text accompanying the images. The real dataset used contains 17,812 images extracted from Wikipedia pages, each characterized by its color and texture.

Enhancing Short Text Clustering with Small External Repositories. Petersen H., Poon J. Conferences in Research and Practice in Information Technology Series, English, 2010. (R: 0, C: 0)
The automatic clustering of textual data according to their semantic concepts is a challenging, yet important task. Choosing an appropriate method to apply when clustering text depends on the nature of the documents being analysed. For example, traditional clustering algorithms can struggle to correctly model collections of very short text due to their extremely sparse nature. In recent times, much attention has been directed to finding methods for adequately clustering short text. Many popular approaches employ large, external document repositories, such as Wikipedia or the Open Directory Project, to incorporate additional world knowledge into the clustering process. However the sheer size of many of these external collections can make these techniques difficult or time consuming to apply. This paper also employs external document collections to aid short text clustering performance. The external collections are referred to in this paper as Background Knowledge. In contrast to most previous literature, a separate collection of Background Knowledge is obtained for each short text dataset. However, this Background Knowledge contains several orders of magnitude fewer documents than commonly used repositories like Wikipedia. A simple approach is described where the Background Knowledge is used to re-express short text in terms of a much richer feature space. A discussion of how best to cluster documents in this feature space is presented. A solution is proposed, and an experimental evaluation is performed that demonstrates significant improvement over clustering based on standard metrics with several publicly available datasets represented in the richer feature space.

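The re-expression idea (short texts described by their similarity to Background Knowledge documents) can be sketched in a few lines, assuming scikit-learn is available; the background corpus below is a hypothetical stand-in.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A minimal sketch of re-expressing short texts in a richer feature space:
# each short text becomes a vector of similarities to a small collection of
# Background Knowledge documents. The background documents are hypothetical
# stand-ins; scikit-learn is assumed.

background = [
    "football world cup teams and matches",
    "programming languages compilers and type systems",
    "planets orbits and the solar system",
]
short_texts = ["maradona scores in the cup", "haskell type inference"]

vectorizer = TfidfVectorizer().fit(background)
bg_matrix = vectorizer.transform(background)   # background doc vectors
st_matrix = vectorizer.transform(short_texts)  # short texts, same space

# Rich representation: one dimension per background document.
rich = cosine_similarity(st_matrix, bg_matrix)
print(np.round(rich, 2))
# Short texts can now be clustered in this denser, lower-dimensional space.
```
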
Overview of the INEX 2009 XML mining track: Clustering and classification of XML documents. Nayak R., De Vries C.M., Kutty S., Shlomo Geva, Ludovic Denoyer, Patrick Gallinari. Lecture Notes in Computer Science, English, 2010. (R: 0, C: 0)
This report explains the objectives, datasets and evaluation criteria of both the clustering and classification tasks set in the INEX 2009 XML Mining track. The report also describes the approaches and results obtained by the different participants.

The sustainability of corporate wikis: A time-series analysis of activity patterns. Ofer Arazy, Arie Croitoru. ACM Trans. Manage. Inf. Syst., English, 2010. (R: 0, C: 0)
(No abstract available.)

Clustering XML documents using frequent subtrees. Kutty S., Thanh Tran, Nayak R., Yanyan Li. Lecture Notes in Computer Science, English, 2009. (R: 0, C: 0)
This paper presents an experimental study conducted over the INEX 2008 Document Mining Challenge corpus using both the structure and the content of XML documents for clustering them. The concise common substructures known as the closed frequent subtrees are generated using the structural information of the XML documents. The closed frequent subtrees are then used to extract the constrained content from the documents. A matrix containing the term distribution of the documents in the dataset is developed using the extracted constrained content. The k-way clustering algorithm is applied to the matrix to obtain the required clusters. In spite of the large number of documents in the INEX 2008 Wikipedia dataset, the proposed frequent subtree-based clustering approach was successful in clustering the documents. This approach significantly reduces the dimensionality of the terms used for clustering without much loss in accuracy.

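The final step, k-way clustering of the term-distribution matrix, might look like the following, assuming scikit-learn; the frequent-subtree mining that produces the constrained content is out of scope here, so the matrix is a hypothetical toy.

```python
import numpy as np
from sklearn.cluster import KMeans

# A minimal sketch of the last stage described above: a k-way clustering
# applied to a term-distribution matrix. Rows are documents; columns are
# terms drawn from the constrained content. scikit-learn is assumed.

term_matrix = np.array([
    [3, 0, 1, 0],
    [2, 1, 0, 0],
    [0, 4, 0, 2],
    [0, 3, 1, 3],
], dtype=float)

# Normalize rows so clustering reflects term *distribution*, not length.
term_matrix /= term_matrix.sum(axis=1, keepdims=True)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(term_matrix)
print(labels)  # e.g. [0 0 1 1]: docs grouped by similar term distributions
```
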
Exploiting internal and external semantics for the clustering of short texts using world knowledge. Hu X., Sun N., Zhang C., Chua T.-S. International Conference on Information and Knowledge Management, Proceedings, English, 2009. (R: 0, C: 0)
Clustering short texts, such as snippets, presents great challenges for existing aggregated search techniques due to the problem of data sparseness and the complex semantics of natural language. Because short texts do not provide sufficient term-occurrence information, traditional text representation methods, such as the "bag of words" model, have several limitations when applied directly to short-text tasks. In this paper, we propose a novel framework to improve the performance of short-text clustering by exploiting the internal semantics of the original text and external concepts from world knowledge. The proposed method employs a hierarchical three-level structure to tackle the data sparsity problem of original short texts and reconstructs the corresponding feature space with the integration of multiple semantic knowledge bases: Wikipedia and WordNet. Empirical evaluation with Reuters and a real web dataset demonstrates that our approach achieves significant improvement over state-of-the-art methods.

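External enrichment with WordNet, one of the two knowledge bases the framework integrates, can be sketched as follows, assuming NLTK with the WordNet corpus downloaded (nltk.download('wordnet')). This is generic synonym/hypernym expansion, not the paper's exact three-level structure.

```python
from nltk.corpus import wordnet  # assumes the NLTK WordNet corpus is installed

# A minimal sketch of external semantic enrichment: expand a short text's
# feature set with WordNet synonyms and hypernyms so that related snippets
# share features even without shared surface terms.

def expand_terms(words, max_senses=2):
    """Add synonyms and direct hypernyms of each word's top senses."""
    expanded = set(words)
    for word in words:
        for synset in wordnet.synsets(word)[:max_senses]:
            expanded.update(l.name().lower() for l in synset.lemmas())
            for hyper in synset.hypernyms():
                expanded.update(l.name().lower() for l in hyper.lemmas())
    return expanded

snippet = ["car", "engine"]
print(sorted(expand_terms(snippet)))
# Now "car" and "automobile" snippets overlap in feature space.
```
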
Identifying document topics using the Wikipedia category network. Peter Schönhofen. Web Intelli. and Agent Sys., English, 2009. (R: 0, C: 1)
In the last few years, the size and coverage of Wikipedia, a community-edited, freely available on-line encyclopedia, has reached the point where it can be effectively used to identify topics discussed in a document, similarly to an ontology or taxonomy. In this paper we show that even a fairly simple algorithm that exploits only the titles and categories of Wikipedia articles can characterize documents by Wikipedia categories surprisingly well. We test the reliability of our method by predicting the categories of Wikipedia articles themselves based on their bodies, and also by performing classification and clustering on 20 Newsgroups and RCV1, representing documents by their Wikipedia categories instead of (or in addition to) their texts.

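The "fairly simple algorithm" described here lends itself to a voting sketch: document words that match Wikipedia article titles vote for those articles' categories. The title-to-category map below is a tiny hypothetical stand-in for the real category network.

```python
from collections import Counter

# A minimal sketch of title-and-category topic identification: words in a
# document vote for the categories of the Wikipedia articles whose titles
# they match. The map below is hypothetical.

TITLE_CATEGORIES = {
    "guitar": ["String instruments", "Music"],
    "piano":  ["Keyboard instruments", "Music"],
    "python": ["Programming languages", "Snakes"],
}

def document_topics(text, top_k=2):
    """Rank Wikipedia categories by how many document words vote for them."""
    votes = Counter()
    for word in text.lower().split():
        for category in TITLE_CATEGORIES.get(word, []):
            votes[category] += 1
    return votes.most_common(top_k)

print(document_topics("she plays guitar and piano"))
# [('Music', 2), ('String instruments', 1)]
```
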
The life cycle of corporate wikis: An analysis of activity patterns. Ofer Arazy, Arie Croitoru, Jang S. 19th Workshop on Information Technologies and Systems, WITS 2009, English, 2009. (R: 0, C: 0)
Following the success of wikis on the internet (e.g. Wikipedia), corporations have begun adopting wikis. Preliminary evidence suggests that the wiki is a sustainable collaboration tool and that wiki deployments are experiencing massive success. The objective of this paper is to provide a large-scale evaluation of corporate wiki life cycles. We analyze and categorize the temporal activity patterns of more than thirteen thousand wikis in one multinational organization over a 29-month period. This clustering problem poses some unique challenges and required the development of novel extensions to existing algorithms. We identified four clusters and their prototypical activity patterns. Our findings show that, contrary to what has been suggested in previous studies, most corporate wikis become inactive after a relatively short period, and less than 20% of wikis show continuous activity. Implications for research and practice are discussed.

Using Wikipedia as a Reference for Extracting Semantic Information from a Text. Andrea Prato, Marco Ronchetti. SEMAPRO, English, 2009. (R: 0, C: 0)
(No abstract available.)

Using Wikipedia as a reference for extracting semantic information from a text. Andrea Prato, Marco Ronchetti. 3rd International Conference on Advances in Semantic Processing - SEMAPRO 2009, English, 2009. (R: 0, C: 0)
In this paper we present an algorithm that, using Wikipedia as a reference, extracts semantic information from an arbitrary text. Our algorithm refines a procedure proposed by others, which mines all the text contained in the whole of Wikipedia. Our refinement, based on a clustering approach, exploits the semantic information contained in certain types of Wikipedia hyperlinks and also introduces an analysis based on multi-words. Our algorithm outperforms current methods in that its output contains far fewer false positives. We were also able to determine which (structural) parts of a text provide most of the semantic information extracted by the algorithm.

Utilizing the structure and content information for XML document clustering. Thanh Tran, Kutty S., Nayak R. Lecture Notes in Computer Science, English, 2009. (R: 0, C: 0)
This paper reports on the experiments and results of a clustering approach used in the INEX 2008 document mining challenge. The clustering approach utilizes both the structure and content information of the Wikipedia XML document collection. A latent semantic kernel (LSK) is used to measure the semantic similarity between XML documents based on their content features. Constructing a latent semantic kernel involves computing a singular value decomposition (SVD). On a large feature-space matrix, computing the SVD is very expensive in terms of time and memory requirements; thus, in this clustering approach, the dimension of the document space of the term-document matrix is reduced before performing the SVD. The document-space reduction is based on the common structural information of the Wikipedia XML document collection. The proposed clustering approach has been shown to be effective on the Wikipedia collection in the INEX 2008 document mining challenge.

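A latent semantic kernel can be sketched with a plain SVD, as below; the tiny term-document matrix is a toy, and the paper's structure-based document-space reduction is omitted.

```python
import numpy as np

# A minimal sketch of a latent semantic kernel: take the SVD of a
# term-document matrix, keep the top k singular directions, and compare
# documents in that latent space. The matrix is a hypothetical toy.

A = np.array([  # rows: terms, columns: documents
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_latent = (np.diag(s[:k]) @ Vt[:k]).T  # documents in k-dim latent space

def kernel(i, j):
    """Cosine similarity between documents i and j in latent space."""
    a, b = docs_latent[i], docs_latent[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(kernel(0, 2), 2), round(kernel(0, 1), 2))
# High similarity for documents sharing latent term structure (0 and 2),
# near zero for unrelated ones (0 and 1).
```
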
Clustering XML documents using closed frequent subtrees: A structural similarity approach. Kutty S., Thanh Tran, Nayak R., Yanyan Li. Lecture Notes in Computer Science, English, 2008. (R: 0, C: 0)
This paper presents the experimental study conducted over the INEX 2007 Document Mining Challenge corpus employing a frequent subtree-based incremental clustering approach. Using the structural information of the XML documents, the closed frequent subtrees are generated. A matrix is then developed representing the closed frequent subtree distribution in documents. This matrix is used to progressively cluster the XML documents. In spite of the large number of documents in the INEX 2007 Wikipedia dataset, the proposed frequent subtree-based incremental clustering approach was successful in clustering the documents.

Document clustering using incremental and pairwise approaches. Thanh Tran, Nayak R., Bruza P. Lecture Notes in Computer Science, English, 2008. (R: 0, C: 0)
This paper presents the experiments and results of a clustering approach for the large Wikipedia dataset in the INEX 2007 Document Mining Challenge. The approach combines an incremental clustering method with a pairwise clustering method. It enables the clustering task to be performed on a large dataset by first reducing the dimension of the dataset to an intermediate, not pre-specified number of clusters using the incremental method; the lower-dimensional dataset is then clustered to the required number of clusters using the pairwise method. In this way, clustering of the large number of documents is performed successfully with good accuracy.

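A generic reconstruction of the two-stage idea (not the authors' exact algorithms): an incremental "leader" pass forms however many micro-clusters a distance threshold yields, and a pairwise stage then merges the closest micro-clusters down to the requested k.

```python
import numpy as np

# A minimal two-stage clustering sketch under simple assumptions: stage 1
# is an incremental leader pass; stage 2 pairwise-merges centroids.

def leader_pass(X, threshold):
    """Incremental stage: assign each point to the first close-enough leader."""
    leaders, members = [], []
    for x in X:
        for i, c in enumerate(leaders):
            if np.linalg.norm(x - c) <= threshold:
                members[i].append(x)
                leaders[i] = np.mean(members[i], axis=0)  # update centroid
                break
        else:
            leaders.append(x.astype(float))
            members.append([x])
    return [np.array(m) for m in members]

def pairwise_merge(clusters, k):
    """Pairwise stage: repeatedly merge the two closest centroids."""
    clusters = list(clusters)
    while len(clusters) > k:
        cents = [c.mean(axis=0) for c in clusters]
        n = len(cents)
        pairs = [(np.linalg.norm(cents[i] - cents[j]), i, j)
                 for i in range(n) for j in range(i + 1, n)]
        _, i, j = min(pairs)
        clusters[i] = np.vstack([clusters[i], clusters.pop(j)])
    return clusters

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
micro = leader_pass(X, threshold=0.8)
final = pairwise_merge(micro, k=2)
print(len(micro), "micro-clusters merged into", len(final))
```
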
Which "Apple" are you talking about ? Rahurkar M.A.
Dan Roth
Huang T.S.
Proceeding of the 17th International Conference on World Wide Web 2008, WWW'08 English 2008 In a higher level task such as clustering of web results or word sense disambiguation, knowledge of all possible distinct concepts in which an ambiguous word can be expressed would be advantageous, for instance in determining the number of clusters in case of clustering web search results. We propose an algorithm to generate such a ranked list of distinct concepts associated with an ambiguous word. Concepts which are popular in terms of usage are ranked higher. We evaluate the coverage of the concepts inferred from our algorithm on the results retrieved by querying the ambiguous word using a major search engine and show a coverage of 85% for top 30 documents averaged over all keywords. 0 0
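One plausible way to realize such a ranking (an assumption for illustration, not necessarily the paper's method) is to count how often Wikipedia anchor texts for the ambiguous word link to each candidate concept page; the counts below are made up.

```python
from collections import Counter

# A minimal sketch of ranking an ambiguous word's distinct concepts by
# usage popularity, approximated here by hypothetical counts of Wikipedia
# link anchors pointing to each concept page.

ANCHOR_TARGETS = {  # (anchor text, target concept) -> observed link count
    ("apple", "Apple_Inc."): 5200,
    ("apple", "Apple_(fruit)"): 1900,
    ("apple", "Apple_Records"): 310,
    ("apple", "Apple_River_(Wisconsin)"): 12,
}

def ranked_concepts(word):
    """Return the word's concepts, most-used first."""
    counts = Counter()
    for (anchor, target), n in ANCHOR_TARGETS.items():
        if anchor == word:
            counts[target] += n
    return [c for c, _ in counts.most_common()]

senses = ranked_concepts("apple")
print(senses)
# ['Apple_Inc.', 'Apple_(fruit)', 'Apple_Records', 'Apple_River_(Wisconsin)']
# len(senses) can seed the number of clusters when grouping search results.
```
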