Data processing

From WikiPapers

Data processing is included as a keyword or extra keyword in 0 datasets, 14 tools and 32 publications.

Datasets

There are no datasets for this keyword.

Tools

Tool Operating System(s) Language(s) Programming language(s) License Description Image
DiffDB Java DiffDB is made up of DiffIndexer and DiffSearcher.
Ikiwiki Cross-platform English Ikiwiki supports storing a wiki as a Git repository.
Infobox2rdf Cross-platform English Perl GPL v3 infobox2rdf generates huge RDF datasets from the infobox data in Wikipedia dump files.
MediaWiki API MediaWiki API provides direct, high-level access to the data contained in MediaWiki databases (see the API query sketch after this table).
MediaWiki Utilities Cross-platform English Python MIT license MediaWiki Utilities is a collection of utilities for working with XML data dumps generated for Wikimedia projects and other MediaWiki wikis.
Sioc MediaWiki Cross-platform English Sioc MediaWiki is an RDF exporter for MediaWiki wikis.
Wiki Edit History Analyzer Cross-platform English Java Wiki Edit History Analyzer processes the MediaWiki revision history and produces summaries of edit actions performed. Basic edit actions include insert, delete, replace, and move; high-level edit actions include spelling correction, wikify, etc. (see the edit-action sketch after this table).
Wiki2XML parser Cross-platform English Python Wiki2XML parser parses Wikipedia dump files into well-structured XML.
WikiPrep Cross-platform English Perl GPL v2 WikiPrep is a Perl script for preprocessing Wikipedia XML dumps.
Wikia-census Cross-platform Python, Jupyter Notebooks wikia-census is a script to generate a census of all the Wikia wikis. Census data and analysis: https://www.kaggle.com/abeserra/wikia-census/ Source code: https://github.com/Grasia/wiki-scripts/tree/master/wikia_census
Wikihadoop Wikihadoop makes it possible to run MapReduce jobs with Hadoop on the compressed XML dump files (see the dump-processing sketch after this table).
Wikipedia Extractor Cross-platform English Python GPL v3
Wikipedia Miner
Wikipedia-map-reduce Cross-platform English Java Apache License 2.0 Wikipedia-map-reduce is a Java software library that allows analysis of Wikipedia at the revision-text level.
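
The MediaWiki API entry above refers to the standard Action API (api.php) that every MediaWiki wiki exposes. As a minimal sketch using only the Python standard library, the snippet below fetches the current wikitext of one page; the endpoint targets English Wikipedia and the page title is only an example, so adjust both for the wiki and pages of interest.

import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"  # any MediaWiki wiki exposes api.php

params = {
    "action": "query",            # standard query module
    "prop": "revisions",          # ask for revision data
    "titles": "Data processing",  # example page title (assumption)
    "rvprop": "content",
    "rvslots": "main",
    "format": "json",
}

url = API + "?" + urllib.parse.urlencode(params)
req = urllib.request.Request(url, headers={"User-Agent": "wikipapers-example/0.1"})
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

# Results are keyed by internal page id; print the first 200 characters of wikitext.
for page in data["query"]["pages"].values():
    print(page["title"], "->", page["revisions"][0]["slots"]["main"]["*"][:200])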
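
Several of the tools above (MediaWiki Utilities, Wiki2XML parser, Wikihadoop, Wikipedia-map-reduce) revolve around processing the XML dumps that Wikimedia publishes. The sketch below is a generic illustration of that kind of processing with the Python standard library, not the API of any of those tools; the dump file name is an assumption, and real dumps are usually bz2- or 7z-compressed and far larger.

import xml.etree.ElementTree as ET

# The XML namespace differs between dump schema versions (0.8, 0.10, 0.11, ...).
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def revisions_per_page(dump_path):
    """Stream a pages-meta-history style dump and yield (title, revision count)."""
    title, count = None, 0
    for _event, elem in ET.iterparse(dump_path, events=("end",)):
        if elem.tag == NS + "title":
            title = elem.text
        elif elem.tag == NS + "revision":
            count += 1
        elif elem.tag == NS + "page":
            yield title, count
            title, count = None, 0
            elem.clear()  # release the finished page subtree to keep memory bounded

if __name__ == "__main__":
    for page_title, revision_count in revisions_per_page("enwiki-pages-meta-history.xml"):
        print(revision_count, page_title)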
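
The Wiki Edit History Analyzer entry above summarizes revisions in terms of basic edit actions (insert, delete, replace, move). As a toy illustration of that idea, and not that tool's actual implementation, the snippet below derives insert/delete/replace actions between two revision texts with difflib at the word level.

import difflib

def edit_actions(old_text: str, new_text: str):
    """Return a list of basic edit actions that turn old_text into new_text."""
    old, new = old_text.split(), new_text.split()
    actions = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=old, b=new).get_opcodes():
        if op == "insert":
            actions.append(("insert", " ".join(new[j1:j2])))
        elif op == "delete":
            actions.append(("delete", " ".join(old[i1:i2])))
        elif op == "replace":
            actions.append(("replace", " ".join(old[i1:i2]), " ".join(new[j1:j2])))
    return actions

print(edit_actions("teh quick brown fox", "the quick brown fox jumps"))
# [('replace', 'teh', 'the'), ('insert', 'jumps')]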


Publications

Title Author(s) Published in Language Date Abstract R C
A customized Web portal for the genome of the ctenophore Mnemiopsis leidyi Moreland R.T.
Nguyen A.-D.
Ryan J.F.
Schnitzler C.E.
Koch B.J.
Siewert K.
Wolfsberg T.G.
Baxevanis A.D.
BMC Genomics English 2014 Background: Mnemiopsis leidyi is a ctenophore native to the coastal waters of the western Atlantic Ocean. A number of studies on Mnemiopsis have led to a better understanding of many key biological processes, and these studies have contributed to the emergence of Mnemiopsis as an important model for evolutionary and developmental studies. Recently, we sequenced, assembled, annotated, and performed a preliminary analysis on the 150-megabase genome of the ctenophore, Mnemiopsis. This sequencing effort has produced the first set of whole-genome sequencing data on any ctenophore species and is amongst the first wave of projects to sequence an animal genome de novo solely using next-generation sequencing technologies. Description: The Mnemiopsis Genome Project Portal (http://research.nhgri.nih.gov/mnemiopsis/) is intended both as a resource for obtaining genomic information on Mnemiopsis through an intuitive and easy-to-use interface and as a model for developing customized Web portals that enable access to genomic data. The scope of data available through this Portal goes well beyond the sequence data available through GenBank, providing key biological information not available elsewhere, such as pathway and protein domain analyses; it also features a customized genome browser for data visualization. Conclusions: We expect that the availability of these data will allow investigators to advance their own research projects aimed at understanding phylogenetic diversity and the evolution of proteins that play a fundamental role in metazoan development. The overall approach taken in the development of this Web site can serve as a viable model for disseminating data from whole-genome sequencing projects, framed in a way that best serves the specific needs of the scientific community. © 2014 Moreland et al.; licensee BioMed Central Ltd. 0 0
On the influence propagation of web videos Liu J.
Yang Y.
Huang Z.
Shen H.T.
IEEE Transactions on Knowledge and Data Engineering English 2014 We propose a novel approach to analyze how a popular video is propagated in the cyberspace, to identify if it originated from a certain sharing-site, and to identify how it reached the current popularity in its propagation. In addition, we also estimate their influences across different websites outside the major hosting website. Web video is gaining significance due to its rich and eye-ball grabbing content. This phenomenon is evidently amplified and accelerated by the advance of Web 2.0. When a video receives some degree of popularity, it tends to appear on various websites including not only video-sharing websites but also news websites, social networks or even Wikipedia. Numerous video-sharing websites have hosted videos that reached a phenomenal level of visibility and popularity in the entire cyberspace. As a result, it is becoming more difficult to determine how the propagation took place: was the video a piece of original work that was intentionally uploaded to its major hosting site by the authors, or did the video originate from some small site then reached the sharing site after already getting a good level of popularity, or did it originate from other places in the cyberspace but the sharing site made it popular. Existing study regarding this flow of influence is lacking. Literature that discuss the problem of estimating a video's influence in the whole cyberspace also remains rare. In this article we introduce a novel framework to identify the propagation of popular videos from its major hosting site's perspective, and to estimate its influence. We define a Unified Virtual Community Space (UVCS) to model the propagation and influence of a video, and devise a novel learning method called Noise-reductive Local-and-Global Learning (NLGL) to effectively estimate a video's origin and influence. Without losing generality, we conduct experiments on annotated dataset collected from a major video sharing site to evaluate the effectiveness of the framework. Surrounding the collected videos and their ranks, some interesting discussions regarding the propagation and influence of videos as well as user behavior are also presented. 0 0
Open domain question answering using Wikipedia-based knowledge model Ryu P.-M.
Jang M.-G.
Kim H.-K.
Information Processing and Management English 2014 This paper describes the use of Wikipedia as a rich knowledge source for a question answering (QA) system. We suggest multiple answer matching modules based on different types of semi-structured knowledge sources of Wikipedia, including article content, infoboxes, article structure, category structure, and definitions. These semi-structured knowledge sources each have their unique strengths in finding answers for specific question types, such as infoboxes for factoid questions, category structure for list questions, and definitions for descriptive questions. The answers extracted from multiple modules are merged using an answer merging strategy that reflects the specialized nature of the answer matching modules. Through an experiment, our system showed promising results, with a precision of 87.1%, a recall of 52.7%, and an F-measure of 65.6%, all of which are much higher than the results of a simple text analysis based system. © 2014 Elsevier Ltd. All rights reserved. 0 0
A framework for benchmarking entity-annotation systems Cornolti M.
Paolo Ferragina
Massimiliano Ciaramita
WWW 2013 - Proceedings of the 22nd International Conference on World Wide Web English 2013 In this paper we design and implement a benchmarking framework for fair and exhaustive comparison of entity-annotation systems. The framework is based upon the definition of a set of problems related to the entity-annotation task, a set of measures to evaluate systems performance, and a systematic comparative evaluation involving all publicly available datasets, containing texts of various types such as news, tweets and Web pages. Our framework is easily extensible with novel entity annotators, datasets and evaluation measures for comparing systems, and it has been released to the public as open source. We use this framework to perform the first extensive comparison among all available entity annotators over all available datasets, and draw many interesting conclusions upon their efficiency and effectiveness. We also draw conclusions between academic versus commercial annotators. Copyright is held by the International World Wide Web Conference Committee (IW3C2). 0 0
Automatic keyphrase annotation of scientific documents using Wikipedia and genetic algorithms Joorabchi A.
Mahdi A.E.
Journal of Information Science English 2013 Topical annotation of documents with keyphrases is a proven method for revealing the subject of scientific and research documents to both human readers and information retrieval systems. This article describes a machine learning-based keyphrase annotation method for scientific documents that utilizes Wikipedia as a thesaurus for candidate selection from documents' content. We have devised a set of 20 statistical, positional and semantical features for candidate phrases to capture and reflect various properties of those candidates that have the highest keyphraseness probability. We first introduce a simple unsupervised method for ranking and filtering the most probable keyphrases, and then evolve it into a novel supervised method using genetic algorithms. We have evaluated the performance of both methods on a third-party dataset of research papers. Reported experimental results show that the performance of our proposed methods, measured in terms of consistency with human annotators, is on a par with that achieved by humans and outperforms rival supervised and unsupervised methods. 0 0
Computing semantic relatedness using Wikipedia features Hadj Taieb M.A.
Ben Aouicha M.
Ben Hamadou A.
Knowledge-Based Systems English 2013 Measuring semantic relatedness is a critical task in many domains such as psychology, biology, linguistics, cognitive science and artificial intelligence. In this paper, we propose a novel system for computing semantic relatedness between words. Recent approaches have exploited Wikipedia as a huge semantic resource that showed good performances. Therefore, we utilized the Wikipedia features (articles, categories, Wikipedia category graph and redirection) in a system combining this Wikipedia semantic information in its different components. The approach is preceded by a pre-processing step to provide for each category pertaining to the Wikipedia category graph a semantic description vector including the weights of stems extracted from articles assigned to the target category. Next, for each candidate word, we collect its categories set using an algorithm for categories extraction from the Wikipedia category graph. Then, we compute the semantic relatedness degree using existing vector similarity metrics (Dice, Overlap and Cosine) and a new proposed metric that performed as well as the cosine formula. The basic system is followed by a set of modules in order to exploit Wikipedia features to quantify the semantic relatedness between words as well as possible. We evaluate our measure based on two tasks: comparison with human judgments using five datasets and a specific application "solving choice problem". Our resulting system shows good performance and sometimes outperforms the ESA (Explicit Semantic Analysis) and TSA (Temporal Semantic Analysis) approaches. © 2013 Elsevier B.V. All rights reserved. 0 0
An efficient voice enabled web content retrieval system for limited vocabulary Bharath Ram G.R.
Jayakumaur R.
Narayan R.
Shahina A.
Khan A.N.
Communications in Computer and Information Science English 2012 Retrieval of relevant information is becoming increasingly difficult owing to the presence of an ocean of information in the World Wide Web. Users in need of quick access to specific information are subjected to a series of web redirections before finally arriving at the page that contains the required information. In this paper, an optimal voice based web content retrieval system is proposed that makes use of an open source speech recognition engine to deal with voice inputs. The proposed system performs a quicker retrieval of relevant content from Wikipedia and instantly presents the textual information along with the related image to the user. This search is faster than the conventional web content retrieval technique. The current system is built with limited vocabulary but can be extended to support a larger vocabulary. Additionally, the system is also scalable to retrieve content from few other sources of information apart from Wikipedia. 0 0
Analysis and enhancement of wikification for microblogs with context expansion Cassidy T.
Ji H.
Lev Ratinov
Zubiaga A.
Houkuan Huang
24th International Conference on Computational Linguistics - Proceedings of COLING 2012: Technical Papers English 2012 Disambiguation to Wikipedia (D2W) is the task of linking mentions of concepts in text to their corresponding Wikipedia entries. Most previous work has focused on linking terms in formal texts (e.g. newswire) to Wikipedia. Linking terms in short informal texts (e.g. tweets) is difficult for systems and humans alike as they lack a rich disambiguation context. We first evaluate an existing Twitter dataset as well as the D2W task in general. We then test the effects of two tweet context expansion methods, based on tweet authorship and topic-based clustering, on a state-of-the-art D2W system and evaluate the results. 0 0
Analysis on construction of information commons of Wiki-based Olympic Library Ma Q. Proceedings - 2012 International Conference on Computer Science and Information Processing, CSIP 2012 English 2012 As one of the WEB2.0 technologies, wiki technology emerged in the early 21st century after Beijing Olympics. This study explores how to put such technology into application by effectively using the Olympic legacy of Beijing Olympic Games - Olympic Library. It uses literature methods, and combines with practical work experiences. Firstly, it introduces the overview of the Library of Capital Institute of Physical Education in the Olympic Library Project, and the development of Olympic Library in post-Olympic period; then, it collates the basic concepts of information commons (IC) within the industry, including the concept of IC, composing elements and the IC construction profiles in national libraries; finally, based on the existing conditions and the Olympic libraries advantages and combined with the rapid development of the digital environment, it discusses the application of wiki technology, the principles and ideas to achieve the innovative development of Olympic Library through the construction of information commons. 0 0
Annotating words using wordnet semantic glosses Szymanski J.
Duch W.
Lecture Notes in Computer Science English 2012 An approach to the word sense disambiguation (WSD) relying on the WordNet synsets is proposed. The method uses semantically tagged glosses to perform a process similar to the spreading activation in a semantic network, creating a ranking of the most probable meanings for word annotation. Preliminary evaluation shows quite promising results. Comparison with the state-of-the-art WSD methods indicates that the use of WordNet relations and semantically tagged glosses should enhance accuracy of word disambiguation methods. 0 0
Entity matching for semistructured data in the Cloud Paradies M.
Malaika S.
Simeon J.
Khatchadourian S.
Sattler K.-U.
Proceedings of the ACM Symposium on Applied Computing English 2012 The rapid expansion of available information, on the Web or inside companies, is increasing. With Cloud infrastructure maturing (including tools for parallel data processing, text analytics, clustering, etc.), there is more interest in integrating data to produce higher-value content. New challenges, notably include entity matching over large volumes of heterogeneous data. In this paper, we describe an approach for entity matching over large amounts of semistructured data in the Cloud. The approach combines ChuQL[4], a recently proposed extension of XQuery with MapReduce, and a blocking technique for entity matching which can be efficiently executed on top of MapReduce. We illustrate the proposed approach by applying it to extract automatically and enrich references in Wikipedia and report on an experimental evaluation of the approach. 0 0
Exploiting Wikipedia for cross-lingual and multilingual information retrieval Sorg P.
Philipp Cimiano
Data and Knowledge Engineering English 2012 In this article we show how Wikipedia as a multilingual knowledge resource can be exploited for Cross-Language and Multilingual Information Retrieval (CLIR/MLIR). We describe an approach we call Cross-Language Explicit Semantic Analysis (CL-ESA) which indexes documents with respect to explicit interlingual concepts. These concepts are considered as interlingual and universal and in our case correspond either to Wikipedia articles or categories. Each concept is associated to a text signature in each language which can be used to estimate language-specific term distributions for each concept. This knowledge can then be used to calculate the strength of association between a term and a concept which is used to map documents into the concept space. With CL-ESA we are thus moving from a Bag-Of-Words model to a Bag-Of-Concepts model that allows language-independent document representations in the vector space spanned by interlingual and universal concepts. We show how different vector-based retrieval models and term weighting strategies can be used in conjunction with CL-ESA and experimentally analyze the performance of the different choices. We evaluate the approach on a mate retrieval task on two datasets: JRC-Acquis and Multext. We show that in the MLIR settings, CL-ESA benefits from a certain level of abstraction in the sense that using categories instead of articles as in the original ESA model delivers better results. © 2012 Elsevier B.V. All rights reserved. 0 0
Horizontal search method for Wikipedia category grouping Myunggwon Hwang
Song S.K.
Kim D.J.
Hanmin Jung
Jeong D.H.
Ko H.
Proceedings - 2012 IEEE Int. Conf. on Green Computing and Communications, GreenCom 2012, Conf. on Internet of Things, iThings 2012 and Conf. on Cyber, Physical and Social Computing, CPSCom 2012 English 2012 Category hierarchies, which show the basic relationship between concepts, are utilized as fundamental clues for semantic information processing in diverse research fields. These research works have employed Wikipedia due to its high coverage of real-world concepts and data reliability. Wikipedia also constructs a category hierarchy, and defines various categories according to the common characteristics of a concept. However, some limitations have been uncovered in the use of a vertical search (especially top-down) to form a set of domain categories. In order to overcome these limitations, this paper proposes a horizontal search method, and uses Wikipedia components to measure the similarity between categories. In an experimental evaluation, we confirm that our method shows a wide coverage and high precision for similar (domain) category grouping. 0 0
MOTIF-RE: Motif-based hypernym/hyponym relation extraction from wikipedia links Wei B.
Liu J.
Jun Ma
Zheng Q.
Weinan Zhang
Feng B.
Lecture Notes in Computer Science English 2012 Hypernym/hyponym relation extraction plays an essential role in taxonomy learning. The conventional methods based on lexico-syntactic patterns or machine learning usually make use of content-related features. In this paper, we find that the proportions of hyperlinks with different semantic types vary markedly in different network motifs. Based on this observation, we propose MOTIF-RE, an algorithm for extracting hypernym/hyponym relations from Wikipedia hyperlinks. The extraction process consists of three steps: 1) Build a directed graph from a set of domain-specific Wikipedia articles. 2) Count the occurrences of hyperlinks in every three-node network motif and create a feature vector for every hyperlink. 3) Train a classifier to identify the semantic relation of hyperlinks. We created three domain-specific Wikipedia article sets to test MOTIF-RE. Experiments on individual datasets show that MOTIF-RE outperforms the baseline algorithm by about 30% in terms of F1-measure. Cross-domain experimental results are similar, which proves that MOTIF-RE has fairly good domain adaptation ability. 0 0
Mining a multilingual association dictionary from Wikipedia for cross-language information retrieval Ye Z.
Huang J.X.
He B.
Hong Lin
Journal of the American Society for Information Science and Technology English 2012 Wikipedia is characterized by its dense link structure and a large number of articles in different languages, which make it a notable Web corpus for knowledge extraction and mining, in particular for mining the multilingual associations. In this paper, motivated by a psychological theory of word meaning, we propose a graph-based approach to constructing a cross-language association dictionary (CLAD) from Wikipedia, which can be used in a variety of cross-language accessing and processing applications. In order to evaluate the quality of the mined CLAD, and to demonstrate how the mined CLAD can be used in practice, we explore two different applications of the mined CLAD to cross-language information retrieval (CLIR). First, we use the mined CLAD to conduct cross-language query expansion; and, second, we use it to filter out translation candidates with low translation probabilities. Experimental results on a variety of standard CLIR test collections show that the CLIR retrieval performance can be substantially improved with the above two applications of CLAD, which indicates that the mined CLAD is of sound quality. 0 0
Research on the construction of open education resources based on semantic wiki Mu S.
Xiaodan Zhang
Zuo P.
Lecture Notes in Computer Science English 2012 Since the MIT's OpenCourseWare project in 2001, open education resources movement has gone through more than ten years. Except for the fruitful results, some problems of resource construction are also exposed. Part of open education resources projects cannot be carried out or even were forced to drop out for a shortage of personnel or funds. A lack of uniform norms or standards leads to the duplication of resource construction and low resource utilization. Semantic media Wiki combines the openness, self-organization and collaboration of Wiki with the structured knowledge in the Semantic Web, which meets the needs of resource co-construction and sharing in open education resources movement. In this study, based on the online course Education Information Processing, we explore the Semantic MediaWiki's application in the open education resources construction. 0 0
Self organizing maps for visualization of categories Szymanski J.
Duch W.
Lecture Notes in Computer Science English 2012 Visualization of Wikipedia categories using Self Organizing Maps shows an overview of categories and their relations, helping to narrow down search domains. Selecting particular neurons this approach enables retrieval of conceptually similar categories. Evaluation of neural activations indicates that they form coherent patterns that may be useful for building user interfaces for navigation over category structures. 0 0
Categorization of wikipedia articles with spectral clustering Szymanski J. Lecture Notes in Computer Science English 2011 The article reports application of clustering algorithms for creating hierarchical groups within Wikipedia articles. We evaluate three spectral clustering algorithms based on datasets constructed using Wikipedia categories. The selected algorithm has been implemented in a system that categorizes Wikipedia search results on the fly. 0 0
Finding patterns in behavioral observations by automatically labeling forms of wikiwork in Barnstars David W. McDonald
Sara Javanmardi
Mark Zachry
WikiSym 2011 Conference Proceedings - 7th Annual International Symposium on Wikis and Open Collaboration English 2011 Our everyday observations about the behaviors of others around us shape how we decide to act or interact. In social media the ability to observe and interpret others' behavior is limited. This work describes one approach to leverage everyday behavioral observations to develop tools that could improve understanding and sense making capabilities of contributors, managers and researchers of social media systems. One example of behavioral observation is Wikipedia Barnstars. Barnstars are a type of award recognizing the activities of Wikipedia editors. We mine the entire English Wikipedia to extract barnstar observations. We develop a multi-label classifier based on a random forest technique to recognize and label distinct forms of observed and acknowledged activity. We evaluate the classifier through several means, including use of separate training and testing datasets and application of the classifier to previously unlabeled data. We use the classifier to identify Wikipedia editors who have been observed with some predominant types of behavior and explore whether those patterns of behavior are evident and how observers seem to be making the observations. We discuss how these types of activity observations can be used to develop tools and potentially improve understanding and analysis in wikis and other online communities. 0 1
Measuring similarities between technical terms based on Wikipedia Myunggwon Hwang
Jeong D.-H.
Seungwoo Lee
Hanmin Jung
Proceedings - 2011 IEEE International Conferences on Internet of Things and Cyber, Physical and Social Computing, iThings/CPSCom 2011 English 2011 Measuring similarities between terms is useful for semantic information processing such as query expansion and WSD (Word Sense Disambiguation). This study aims at identifying technologies closely related to emerging technologies. Thus, we propose a hybrid method using both category and internal link information in Wikipedia, which is the largest database whose contents anyone can share and edit. Comparative experimental results with a state-of-the-art WLM (Wikipedia Link-based Measure) show that this proposed method works better than each single method. 0 0
Mining fuzzy domain ontology based on concept vector from Wikipedia Category Network Lu C.-Y.
Ho S.-W.
Chung J.-M.
Hsu F.-Y.
Lee H.-M.
Ho J.-M.
Proceedings - 2011 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Workshops, WI-IAT 2011 English 2011 Ontology is essential in the formalization of domain knowledge for effective human-computer interactions (i.e., expert-finding). Many researchers have proposed approaches to measure the similarity between concepts by accessing fuzzy domain ontology. However, engineering of the construction of domain ontologies turns out to be labor intensive and tedious. In this paper, we propose an approach to mine domain concepts from Wikipedia Category Network, and to generate the fuzzy relation based on a concept vector extraction method to measure the relatedness between a single term and a concept. Our methodology can conceptualize domain knowledge by mining Wikipedia Category Network. An empirical experiment is conducted to evaluate the robustness by using TREC dataset. Experiment results show the constructed fuzzy domain ontology derived by proposed approach can discover robust fuzzy domain ontology with satisfactory accuracy in information retrieval tasks. 0 0
Self-organizing map representation for clustering Wikipedia search results Szymanski J. Lecture Notes in Computer Science English 2011 The article presents an approach to automated organization of textual data. The experiments have been performed on selected sub-set of Wikipedia. The Vector Space Model representation based on terms has been used to build groups of similar articles extracted from Kohonen Self-Organizing Maps with DBSCAN clustering. To warrant efficiency of the data processing, we performed linear dimensionality reduction of raw data using Principal Component Analysis. We introduce hierarchical organization of the categorized articles changing the granularity of SOM network. The categorization method has been used in implementation of the system that clusters results of keyword-based search in Polish Wikipedia. 0 0
Semantic relation extraction for automatically building domain ontology using a link grammar Choi J.
Choi C.
Choi D.
Koh J.
Kim P.
Proceedings of the 2011 ACM Research in Applied Computation Symposium, RACS 2011 English 2011 Ontology which is the fundamental for semantic web is getting important while the semantic information processing is developed. To build and extend ontologies, much of research has been proposed such as semi-automatic and full-automatic methods through analyzing raw text documents. However, the methods based on document set in specific domain have a limitation that the ontology depends on the domain. Therefore, this research has used Wikipedia document set for extension of generalized ontology. The Wikipedia contains unrestricted subjects and one document is filled with one subject in detail. Moreover, since the content is written by domain specialist, we can say that Wikipedia provides trustworthy contents. This research which deals with ontology extension, it extracts important sentences from Wikipedia documents and the sentences are structured by Link Grammar. Finally, the ontology extension is accomplished through conceptualization step which grasps subjects, objects, and predicates from the structured contents. In the performance evaluation which compared our method to the other method using Context-Free Grammar showed better accuracy more than 12% points. 0 0
Using Mahout for clustering Wikipedia's latest articles: A comparison between k-means and fuzzy c-means in the cloud Rong C.
Esteves R.M.
Proceedings - 2011 3rd IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2011 English 2011 This paper compares k-means and fuzzy c-means for clustering a noisy, realistic and big dataset. We made the comparison using a free cloud computing solution, Apache Mahout/Hadoop, and Wikipedia's latest articles. In the past the usage of these two algorithms was restricted to small datasets. As such, studies were based on artificial datasets that do not represent a real document clustering situation. With this ongoing research we found that in a noisy dataset, fuzzy c-means can lead to worse cluster quality than k-means. The convergence speed of k-means is not always faster. We found as well that Mahout is a promising clustering technology but the preprocessing tools are not developed enough for an efficient dimensionality reduction. From our experience the use of Apache Mahout is premature. 1 0
Wiki-induced cognitive elaboration in project teams: An empirical study YanChun Zhang
Fang Y.
He W.
International Conference on Information Systems 2011, ICIS 2011 English 2011 Researchers have exerted increasing efforts to understand how wikis can be used to improve team performance. Previous studies have mainly focused on the effect of the quantity of wiki use on performance in wiki-based communities; however, only inconclusive results have been obtained. Our study focuses on the quality of wiki use in a team context. We develop a construct of wiki-induced cognitive elaboration, and explore its nomological network in the team context. Integrating the literatures on wiki and distributed cognition, we propose that wiki-induced cognitive elaboration influences team performance through knowledge integration among team members. We also identify its team-based antecedents, including task involvement, critical norm, task reflexivity, time pressure and process accountability, by drawing on the motivated information processing literature. The research model is empirically tested using multiple-source survey data collected from 46 wiki-based student project teams. The theoretical and practical implications of our findings are also discussed. 0 0
Analysis of structural relationships for hierarchical cluster labeling Muhr M.
Roman Kern
Michael Granitzer
SIGIR 2010 Proceedings - 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval English 2010 Cluster label quality is crucial for browsing topic hierarchies obtained via document clustering. Intuitively, the hierarchical structure should influence the labeling accuracy. However, most labeling algorithms ignore such structural properties and therefore, the impact of hierarchical structures on the labeling accuracy is yet unclear. In our work we integrate hierarchical information, i.e. sibling and parent-child relations, in the cluster labeling process. We adapt standard labeling approaches, namely Maximum Term Frequency, Jensen-Shannon Divergence, χ² Test, and Information Gain, to make use of those relationships and evaluate their impact on 4 different datasets, namely the Open Directory Project, Wikipedia, TREC Ohsumed and the CLEF IP European Patent dataset. We show that hierarchical relationships can be exploited to increase labeling accuracy, especially on high-level nodes. 0 0
Retrieving landmark and non-landmark images from community photo collections Yannis Avrithis
Yannis Kalantidis
Giorgos Tolias
Evaggelos Spyrou
MM'10 - Proceedings of the ACM Multimedia 2010 International Conference English 2010 State of the art data mining and image retrieval in community photo collections typically focus on popular subsets, e.g. images containing landmarks or associated to Wikipedia articles. We propose an image clustering scheme that, seen as vector quantization, compresses a large corpus of images by grouping visually consistent ones while providing a guaranteed distortion bound. This allows us, for instance, to represent the visual content of all thousands of images depicting the Parthenon in just a few dozens of scene maps and still be able to retrieve any single, isolated, non-landmark image like a house or graffiti on a wall. Starting from a geo-tagged dataset, we first group images geographically and then visually, where each visual cluster is assumed to depict different views of the same scene. We align all views to one reference image and construct a 2D scene map by preserving details from all images while discarding repeating visual features. Our indexing, retrieval and spatial matching scheme then operates directly on scene maps. We evaluate the precision of the proposed method on a challenging one-million urban image dataset. 0 0
A web recommender system based on dynamic sampling of user information access behaviors Jilin Chen
Shtykh R.Y.
Jin Q.
Proceedings - IEEE 9th International Conference on Computer and Information Technology, CIT 2009 English 2009 In this study, we propose a Gradual Adaption Model for a Web recommender system. This model is used to track users' focus of interests and its transition by analyzing their information access behaviors, and recommend appropriate information. A set of concept classes are extracted from Wikipedia. The pages accessed by users are classified by the concept classes, and grouped into three terms of short, medium and long periods, and two categories of remarkable and exceptional for each concept class, which are used to describe users' focus of interests, and to establish reuse probability of each concept class in each term for each user by Full Bayesian Estimation as well. According to the reuse probability and period, the information that a user is likely to be interested in is recommended. In this paper, we propose a new approach by which short and medium periods are determined based on dynamic sampling of user information access behaviors. We further present experimental simulation results, and show the validity and effectiveness of the proposed system. 0 0
Measuring Wikipedia: A hands-on tutorial Luca de Alfaro
Felipe Ortega
WikiSym English 2009 This tutorial is an introduction to the best methodologies, tools and practices for Wikipedia research. The tutorial will be led by Luca de Alfaro (Wiki Lab at UCSC, California, USA) and Felipe Ortega (Libresoft, URJC, Madrid, Spain). Both have accumulated several years of practical experience exploring and processing Wikipedia data [1], [2], [3]. As well, their respective research groups have led the development of two cutting-edge software tools (WikiTrust and WikiXRay) for analyzing Wikipedia. WikiTrust implements an author reputation system, and a text trust system, for wikis. WikiXRay is a tool automating the quantitative analysis of any language version of Wikipedia (in general, any wiki based on MediaWiki). 0 0
Ontology enhanced web image retrieval: Aided by wikipedia & spreading activation theory Haofen Wang
Xing Jiang
Chia L.-T.
Tan A.-H.
Proceedings of the 1st International ACM Conference on Multimedia Information Retrieval, MIR2008, Co-located with the 2008 ACM International Conference on Multimedia, MM'08 English 2008 Ontology, as an effective approach to bridge the semantic gap in various domains, has attracted a lot of interest from multimedia researchers. Among the numerous possibilities enabled by ontology, we are particularly interested in exploiting ontology for a better understanding of media task (particularly, images) on the World Wide Web. To achieve our goal, two open issues are inevitably involved: 1) How to avoid the tedious manual work for ontology construction? 2) What are the effective inference models when using an ontology? Recent works [11, 16] about ontology learned from Wikipedia have been reported in conferences targeting the areas of knowledge management and artificial intelligence. There are also reports of different inference models being investigated [5, 13, 15]. However, so far there has not been any comprehensive solution. In this paper, we look at these challenges and attempt to provide a general solution to both questions. Through a careful analysis of the online encyclopedia Wikipedia's categorization and page content, we choose it as our knowledge source and propose an automatic ontology construction approach. We prove that it is a viable way to build ontology under various domains. To address the inference model issue, we provide a novel understanding of the ontology and consider it as a type of semantic network, which is similar to brain models in the cognitive research field. Spreading Activation Techniques, which have been proved to be a correct information processing model in the semantic network, are consequently introduced for inference. We have implemented a prototype system with the developed solutions for web image retrieval. By comprehensive experiments on the canine category of the animal kingdom, we show that this is a scalable architecture for our proposed methods. Copyright 2008 ACM. 0 0
Sub-symbolic mapping of cyc microtheories in data-driven "conceptual" spaces Pilato G.
Augello A.
Scriminaci M.
Vassallo G.
Gaglio S.
Lecture Notes in Computer Science English 2007 The presented work aims to combine statistical and cognitive-oriented approaches with symbolic ones so that a conceptual similarity relationship layer can be added to a Cyc KB microtheory. Given a specific microtheory, a LSA-inspired conceptual space is inferred from a corpus of texts created using both ad hoc extracted pages from the Wikipedia repository and the built-in comments about the concepts of the specific Cyc microtheory. Each concept is projected in the conceptual space and the desired layer of subsymbolic relationships between concepts is created. This procedure can help a user in finding the concepts that are "sub-symbolically conceptually related" to a new concept that he wants to insert in the microtheory. Experimental results involving two Cyc microtheories are also reported. 0 0
Proceedings of WikiSym'06 - 2006 International Symposium on Wikis No author name available Proceedings of WikiSym'06 - 2006 International Symposium on Wikis English 2006 The proceedings contain 26 papers. The topics discussed include: how and why wikipedia works; how and why wikipedia works: an interview with Angela Beesley, Elisabeth Bauer, and Kizu Naoko; intimate information: organic hypertext structure and incremental; the augmented wiki; wiki uses in teaching and learning; the future of wikis; translation the wiki way; the radeox wiki render engine; is there a space for the teacher in a WIKI?; wikitrails: augmenting wiki structure for collaborative, interdisciplinary learning; towards wikis as semantic hypermedia; constrained wiki: an oxymoron?; corporate wiki users: results of a survey; workshop on wikipedia research; wiki markup standard workshop; wiki-based knowledge engineering: second workshop on semantic wikis; semantic wikipedia; and ontowiki: community-driven ontology engineering and ontology usage based on wikis. 0 0