Wikipedia

From WikiPapers

Wikipedia is included as a keyword or extra keyword in 2 datasets, 2 tools, and 2,777 publications.

Datasets

Dataset Size Language Description
EPIC/Oxford Wikipedia quality assessment English This dataset comprises the full, anonymized set of responses from the blind assessment of a sample of Wikipedia articles across languages and disciplines by academic experts. The study was conducted in 2012 by EPIC and the University of Oxford and sponsored by the Wikimedia Foundation.
Wikipedia search data Logs of the search queries submitted by Wikipedia visitors.

Tools

Tool Operating System(s) Language(s) Programming language(s) License Description Image
Wikipedia Recent Changes Map Web English JavaScript Wikipedia Recent Changes Map is a web tool that displays a world map showing anonymous edits to Wikipedia, geolocated by IP.
WikipediaVision Web English WikipediaVision is a web-based tool that shows anonymous edits to Wikipedia in (almost) real time.
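Both tools build on Wikipedia's public recent-changes feed. Below is a minimal sketch (in Python rather than the tools' JavaScript) of how such a feed of anonymous edits can be polled through the standard MediaWiki API; the geolocation and map-rendering steps of the tools themselves are not shown.

```python
import requests

# Minimal sketch: fetch recent anonymous edits to English Wikipedia via the
# MediaWiki API (action=query&list=recentchanges). A tool such as the
# Wikipedia Recent Changes Map would additionally geolocate each IP address
# and stream the points onto a world map.
API = "https://en.wikipedia.org/w/api.php"

def recent_anonymous_edits(limit=25):
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcshow": "anon",          # only edits by unregistered (IP) users
        "rctype": "edit",
        "rcprop": "title|user|timestamp",
        "rclimit": limit,
        "format": "json",
    }
    data = requests.get(API, params=params, timeout=10).json()
    return data["query"]["recentchanges"]

if __name__ == "__main__":
    for change in recent_anonymous_edits():
        # change["user"] holds the editor's IP address for anonymous edits
        print(change["timestamp"], change["user"], change["title"])
```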


Publications

Title Author(s) Published in Language Date Abstract R C
Wikipédia et bibliothèques : agir en commun Sylvain Machefert Communs du savoir et bibliothèques French 2017 An overview of the actions libraries can take around the Wikimedia projects. 0 0
Similar Gaps, Different Origins? Women Readers and Editors at Greek Wikipedia Ioannis Protonotarios
Vasiliki Sarimpei
Jahna Otterbacher
Tenth International AAAI Conference on Web and Social Media English 17 May 2016 As a global, multilingual project, Wikipedia could serve as a repository for the world’s knowledge on an astounding range of topics. However, questions of participation and diversity among editors continue to be burning issues. We present the first targeted study of participants at Greek Wikipedia, with the goal of better understanding their motivations. Smaller Wikipedias play a key role in fostering the project’s global character, but typically receive little attention from researchers. We developed two survey instruments, administered in Greek, based on the 2011 Wikipedia Readership and Editors Surveys. Consistent with previous studies, we found a gender gap, with women making up only 38% and 15% of readers and editors, respectively, and with men editors being much more active. Our data suggest two salient explanations: 1) women readers more often lack confidence with respect to their knowledge and technical skills as compared to men, and 2) women’s behaviors may be driven by personal motivations such as enjoyment and learning, rather than by “leaving their mark” on the community, a concern more common among men. Interestingly, while similar proportions of men and women readers use multiple language editions, more women contribute to English Wikipedia in addition to the Greek language community. Future research should consider how this impacts their participation at Greek Wikipedia 11 0
Generating Article Placeholders from Wikidata for Wikipedia Lucie-Aimée Kaffee International Media and Computing, HTW Berlin 4 March 2016 The major objective of this thesis is to increase access to open and free knowledge in Wikipedia by developing a MediaWiki extension called ArticlePlaceholder. ArticlePlaceholders are content pages in Wikipedia auto-generated from information provided by Wikidata. The criteria for the extension were developed through a requirement analysis and subsequently implemented. The thesis introduces the fundamentals of the project and includes the personas, scenarios, user stories, and non-functional and functional requirements of the requirement analysis. The analysis was done in order to implement the features needed to achieve the goal of providing more information for under-resourced languages. The implementation of these requirements forms the main part of the thesis. 0 0
Competencias informacionales básicas y uso de Wikipedia en entornos educativos Jesús Tramullas Gestión de la Innovación en Educación Superior/Journal of Innovation Management in Higher Education Spanish 2016 This paper reviews the relationship of Wikipedia with educational processes, adopting an information literacy approach. To this end it is necessary to: a) review the common criticisms of Wikipedia; b) review the basic concepts of information literacy; c) propose a generic framework for integrating Wikipedia into teaching and learning. The paper aims to establish a blueprint for the design, implementation and development of information literacy processes and actions, integrated into the core educational process, using Wikipedia. Finally, it suggests the use of information literacy as a continuous element within the overall process of teaching and learning. 0 0
Postures d’opposition à Wikipédia en milieu intellectuel en France Alexandre Moatti Wikipédia, objet scientifique non identifié French 23 November 2015 6 0
Change in access after digitization: Ethnographic collections in Wikipedia Trilce Navarrete
Karol J. Borowiecki
ACEI Working Paper Series English October 2015 The raison d’être of memory institutions revolves around collecting, preserving and giving access to heritage collections. Increasingly, access takes place in social networked markets characterized by communities of users that serve to select and rank content to facilitate reuse. Publication of heritage in such digital medium transforms patterns of consumption. We performed a quantitative analysis on the access to a museum collection and compared results before and after publication on Wikimedia. Analysis of the difference in access showed two main results: first, access to collections increased substantially online. From a selection of the most viewed objects, access grew from an average of 156,000 onsite visitors per year (or 15.5 million in a century) to over 1.5 million views online per year (or 7.9 million in five years). Second, we find a long tail in both mediums, where 8% of objects were exhibited onsite and 11% of available objects online were used in Wikipedia articles (representing 1% of the total collection). We further document differences in consumer preference for type of object, favouring 3D onsite and 2D online, as well as topic and language preference, favouring Wikipedia articles about geography and in English. Online publication is hence an important complement to onsite exhibitions to increase access to collections. Results shed light on online consumption of heritage content by consumers who may not necessarily visit heritage sites. 0 0
A Platform for Visually Exploring the Development of Wikipedia Articles Erik Borra
David Laniado
Esther Weltevrede
Michele Mauri
Giovanni Magni
Tommaso Venturini
Paolo Ciuccarelli
Richard Rogers
Andreas Kaltenbrunner
ICWSM '15 - 9th International AAAI Conference on Web and Social Media English May 2015 When looking for information on Wikipedia, Internet users generally just read the latest version of an article. However, in its back-end there is much more: associated to each article are the edit history and talk pages, which together entail its full evolution. These spaces can typically reach thousands of contributions, and it is not trivial to make sense of them by manual inspection. This issue also affects Wikipedians, especially the less experienced ones, and constitutes a barrier for new editor engagement and retention. To address these limitations, Contropedia offers its users unprecedented access to the development of an article, using wiki links as focal points. 0 0
Societal Controversies in Wikipedia Articles Erik Borra
Esther Weltevrede
Paolo Ciuccarelli
Andreas Kaltenbrunner
David Laniado
Giovanni Magni
Michele Mauri
Richard Rogers
CHI '15 - Proceedings of the 33rd annual ACM conference on Human factors in computing systems English April 2015 Collaborative content creation inevitably reaches situations where different points of view lead to conflict. We focus on Wikipedia, the free encyclopedia anyone may edit, where disputes about content in controversial articles often reflect larger societal debates. While Wikipedia has a public edit history and discussion section for every article, the substance of these sections is difficult to fathom for Wikipedia users interested in the development of an article and in locating which topics were most controversial. In this paper we present Contropedia, a tool that augments Wikipedia articles and gives insight into the development of controversial topics. Contropedia uses an efficient language agnostic measure based on the edit history that focuses on wiki links to easily identify which topics within a Wikipedia article have been most controversial and when. 0 0
Factors That Influence the Quality of Crowdsourcing Al Sohibani M.
Al Osaimi N.
Al Ehaidib R.
Al Muhanna S.
Dahanayake A.
Advances in Intelligent Systems and Computing 2015 Crowdsourcing is a technique that aims to obtain data, ideas, and funds, conduct tasks, or even solve problems with the aid of a group of people. It is a useful technique for saving money and time. Data quality is an issue that confronts crowdsourcing websites, since the data are obtained from the crowd and the sites must somehow control their quality. Some crowdsourcing websites have implemented mechanisms to manage data quality, such as rating, reporting, or specific tools. In this paper, five crowdsourcing websites: Wikipedia, Amazon Mechanical Turk, YouTube, Rally Fighter, and Kickstarter are studied as cases in order to identify the quality assurance methods or techniques that are useful for crowdsourced data. A survey was conducted to gather general opinions about the reliability of crowdsourcing sites and about respondents' willingness to contribute to improving the content of these sites. Combining these with the available knowledge in crowdsourcing research, the paper highlights the factors that influence data quality in crowdsourcing. 0 0
Wikipedia como objeto de investigación Jesús Tramullas Anuario ThinkEPI Spanish 2015 This short paper analyzes Wikipedia as an object of scientific research, contrasting various studies dealing with that popular encyclopedia. The conclusion is that Wikipedia, as a manifestation of collaborative production and consumption of knowledge, is a valid subject of scientific research. 0 0
Motivations for Contributing to Health-Related Articles on Wikipedia: An Interview Study Farič N
Potts HWW
Journal of Medical Internet Research English 3 December 2014 Background: Wikipedia is one of the most accessed sources of health information online. The current English-language Wikipedia contains more than 28,000 articles pertaining to health.

Objective: The aim was to characterize individuals’ motivations for contributing to health content on the English-language Wikipedia.

Methods: A set of health-related articles were randomly selected and recent contributors invited to complete an online questionnaire and follow-up interview (by Skype, by email, or face-to-face). Interviews were transcribed and analyzed using thematic analysis and a realist grounded theory approach.

Results: A total of 32 Wikipedians (31 men) completed the questionnaire and 17 were interviewed. Those completing the questionnaire had a mean age of 39 (range 12-59) years; 16 had a postgraduate qualification, 10 had or were currently studying for an undergraduate qualification, 3 had no more than secondary education, and 3 were still in secondary education. In all, 15 were currently working in a health-related field (primarily clinicians). The median period for which they have been an active editing Wikipedia was 3-5 years. Of this group, 12 were in the United States, 6 were in the United Kingdom, 4 were in Canada, and the remainder from another 8 countries. Two-thirds spoke more than 1 language and 90% (29/32) were also active contributors in domains other than health. Wikipedians in this study were identified as health professionals, professionals with specific health interests, students, and individuals with health problems. Based on the interviews, their motivations for editing health-related content were summarized in 5 strongly interrelated categories: education (learning about subjects by editing articles), help (wanting to improve and maintain Wikipedia), responsibility (responsibility, often a professional responsibility, to provide good quality health information to readers), fulfillment (editing Wikipedia as a fun, relaxing, engaging, and rewarding activity), and positive attitude to Wikipedia (belief in the value of Wikipedia). An additional factor, hostility (from other contributors), was identified that negatively affected Wikipedians’ motivations.

Conclusions: Contributions to Wikipedia’s health-related content in this study were made by both health specialists and laypeople of varying editorial skills. Their motivations for contributing stem from an inherent drive based on values, standards, and beliefs. It became apparent that the community who most actively monitor and edit health-related articles is very small. Although some contributors correspond to a model of “knowledge philanthropists,” others were focused on maintaining articles (improving spelling and grammar, organization, and handling vandalism). There is a need for more people to be involved in Wikipedia’s health-related content.
0 0
Wikipédia et bibliothèques. Une production commune des savoirs? Rémi Mathis Bibliothèque(s) French 24 October 2014 A reference consulted spontaneously by hundreds of millions of Internet users, Wikipedia considers itself closer to a library than to an encyclopedia. This kinship is explained by a shared position in the field of the commons. 0 1
Fidarsi di Wikipedia Simone Dezaiacomo Italian 15 July 2014 The aim of the study is to understand the phenomena underlying users' trust in the online encyclopedia Wikipedia. To do so, it is first necessary to understand and model the organization and structure of the socio-productive processes behind the production of Wikipedia's content, and then to empirically verify and describe its capacity for self-correction. In addition to the approaches used in this study, the approaches and results reported in the literature are also described, covering the main studies that have addressed these topics over the years, albeit independently of one another.

To understand the structure of the community of Wikipedia editors, the existence of a core-periphery model was hypothesized. To study this model, analyses were carried out on data from a sample of pages of the Italian version of Wikipedia. The results obtained from this analysis form the basis for selecting the pages into which errors were injected, providing a method for estimating each page's probability of self-correction. As for Wikipedia's resilience, the results were obtained using an empirical approach, which consists of inserting errors into the sample of pages under specific methodological constraints and then evaluating how quickly and in what ways these errors are corrected.

A dedicated analysis was carried out to choose the types of error and the variables to consider when inserting them.

This analysis led to the definition of two distinct experiments, whose results lead to interesting conclusions both separately and in combination. Based on the results of these experiments, it was possible to discuss the self-correction capabilities of the system, a key element in studying the dynamics of trust in Wikipedia.
0 0
Situated Interaction in a Multilingual Spoken Information Access Framework Niklas Laxström
Kristiina Jokinen
Graham Wilcock
IWSDS 2014 English 18 January 2014 0 0
A case study of contributor behavior in Q&A site and tags: The importance of prominent profiles in community productivity Furtado A.
Oliveira N.
Andrade N.
Journal of the Brazilian Computer Society 2014 Background: Question-and-answer (Q&A) sites have shown to be a valuable resource for helping people to solve their everyday problems. These sites currently enable a large number of contributors to exchange expertise by different ways (creating questions, answers or comments, and voting in these), and it is noticeable that they contribute in diverse amounts and create content of varying quality. Methods: Concerned with diversity of behaviors, this paper advances present knowledge about Q&A sites by performing a cluster analysis with a multifaceted view of contributors that account for their motivations and abilities to identify the most common behavioral profiles in these sites. Results: By examining all contributors' activity from a large site named Super User, we unveil nine behavioral profiles that group users according to the quality and quantity of their contributions. Based on these profiles, we analyze the community composition and the importance of each profile in the site's productivity. Moreover, we also investigate seven tag communities from Super User aiming to experiment with the generality of our results. In this context, the same nine profiles were found, and it was also observed that there is a remarkable similarity between the composition and productivity of the communities defined by the seven tags and the site itself. Conclusions: The profiles uncovered enhance the overall understanding of how Q&A sites work and knowing these profiles can support the site's management. Furthermore, an analysis of particularities in the tag communities comparison relates the variation in behavior to the typical behavior of each tag community studied, what also draws implications for creating administrative strategies. © 2014 Furtado; licensee Springer. 0 0
A composite kernel approach for dialog topic tracking with structured domain knowledge from Wikipedia Soo-Hwan Kim
Banchs R.E.
Hua Li
52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference English 2014 Dialog topic tracking aims at analyzing and maintaining topic transitions in ongoing dialogs. This paper proposes a composite kernel approach for dialog topic tracking to utilize various types of domain knowledge obtained from Wikipedia. Two kernels are defined based on history sequences and context trees constructed based on the extracted features. The experimental results show that our composite kernel approach can significantly improve the performances of topic tracking in mixed-initiative human-human dialogs. 0 0
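The kernel combination described in the entry above can be illustrated with a small, hedged sketch: two precomputed Gram matrices stand in for the history-sequence and context-tree kernels (which are not reproduced here) and are mixed as a weighted sum before training an SVM. The features, labels, and mixing weight are invented placeholders.

```python
import numpy as np
from sklearn.svm import SVC

# Sketch: combine two kernels as a weighted sum and train an SVM on the
# resulting precomputed Gram matrix. K_history and K_context stand in for
# the paper's history-sequence and context-tree kernels (not reproduced here).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))             # placeholder dialog-turn features
y = np.array([0, 1] * 20)                # placeholder topic labels

def linear_kernel(A, B):
    return A @ B.T

K_history = linear_kernel(X, X)                      # stand-in kernel 1
K_context = linear_kernel(np.tanh(X), np.tanh(X))    # stand-in kernel 2

alpha = 0.6                              # mixing weight between the two kernels
K_composite = alpha * K_history + (1 - alpha) * K_context

clf = SVC(kernel="precomputed").fit(K_composite, y)
print(clf.predict(K_composite[:5]))      # predict with the same precomputed kernel
```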
A correlation-based semantic model for text search Sun J.
Bin Wang
Yang X.
Lecture Notes in Computer Science English 2014 With the exponential growth of texts on the Internet, text search is considered a crucial problem in many fields. Most traditional text search approaches are based on a "bag of words" representation built from frequency statistics. However, these approaches ignore the semantic correlation of words in the text, which may lead to inaccurate ranking of the search results. In this paper, we propose a new Wikipedia-based similar-text search approach in which the words of the indexed texts and of the query can be semantically correlated through Wikipedia. We propose a new text representation model and a new text similarity metric. Finally, experiments on a real dataset demonstrate the high precision, recall and efficiency of our approach. 0 0
A cross-cultural comparison on contributors' motivations to online knowledge sharing: Chinese vs. Germans Zhu B.
Gao Q.
Nohdurft E.
Lecture Notes in Computer Science English 2014 Wikipedia is the most popular online knowledge sharing platform in western countries. However, it is not widely accepted in eastern countries. This indicates that culture plays a key role in determining users' acceptance of online knowledge sharing platforms. The purpose of this study is to investigate the cultural differences between Chinese and Germans in motivations for sharing knowledge, and further examine the impacts of these motives on the actual behavior across two cultures. A questionnaire was developed to explore the motivation factors and actual behavior of contributors. 100 valid responses were received from Chinese and 34 responses from the Germans. The results showed that the motivations were significantly different between Chinese and Germans. The Chinese had more consideration for others and cared more about receiving reward and strengthening the relationship, whereas Germans had more concerns about losing competitiveness. The impact of the motives on the actual behavior was also different between Chinese and Germans. 0 0
A framework for automated construction of resource space based on background knowledge Yu X.
Peng L.
Huang Z.
Zhuge H.
Future Generation Computer Systems English 2014 Resource Space Model is a kind of data model which can effectively and flexibly manage the digital resources in cyber-physical system from multidimensional and hierarchical perspectives. This paper focuses on constructing resource space automatically. We propose a framework that organizes a set of digital resources according to different semantic dimensions combining human background knowledge in WordNet and Wikipedia. The construction process includes four steps: extracting candidate keywords, building semantic graphs, detecting semantic communities and generating resource space. An unsupervised statistical language topic model (i.e., Latent Dirichlet Allocation) is applied to extract candidate keywords of the facets. To better interpret meanings of the facets found by LDA, we map the keywords to Wikipedia concepts, calculate word relatedness using WordNet's noun synsets and construct corresponding semantic graphs. Moreover, semantic communities are identified by GN algorithm. After extracting candidate axes based on Wikipedia concept hierarchy, the final axes of resource space are sorted and picked out through three different ranking strategies. The experimental results demonstrate that the proposed framework can organize resources automatically and effectively.©2013 Published by Elsevier Ltd. All rights reserved. 0 0
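A hedged sketch of the first step of the framework above, extracting candidate keywords with LDA; the documents, topic count, and keyword cutoff are placeholders, and the later steps (mapping keywords to Wikipedia concepts, WordNet relatedness, community detection, axis ranking) are not shown.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Sketch: extract candidate facet keywords from a resource collection with LDA.
docs = [
    "wikipedia article about machine learning and classification",
    "wordnet synsets and lexical semantic relatedness",
    "community detection in semantic graphs",
    "topic models for organizing digital resources",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()

for topic_id, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]   # top-5 keywords per topic
    print(f"topic {topic_id}: {top}")
```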
A latent variable model for discourse-Aware concept and entity disambiguation Angela Fahrni
Michael Strube
14th Conference of the European Chapter of the Association for Computational Linguistics 2014, EACL 2014 English 2014 This paper takes a discourse-oriented perspective for disambiguating common and proper noun mentions with respect to Wikipedia. Our novel approach models the relationship between disambiguation and aspects of cohesion using Markov Logic Networks with latent variables. Considering cohesive aspects consistently improves the disambiguation results on various commonly used data sets. 0 0
A novel system for the semi automatic annotation of event images McParlane P.J.
Jose J.M.
SIGIR 2014 - Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval English 2014 With the rise in popularity of smart phones, taking and sharing photographs has never been more openly accessible. Further, photo sharing websites, such as Flickr, have made the distribution of photographs easy, resulting in an increase of visual content uploaded online. Due to the laborious nature of annotating images, however, a large percentage of these images are unannotated making their organisation and retrieval difficult. Therefore, there has been a recent research focus on the automatic and semi-automatic process of annotating these images. Despite the progress made in this field, however, annotating images automatically based on their visual appearance often results in unsatisfactory suggestions and as a result these models have not been adopted in photo sharing websites. Many methods have therefore looked to exploit new sources of evidence for annotation purposes, such as image context for example. In this demonstration, we instead explore the scenario of annotating images taken at a large scale events where evidences can be extracted from a wealth of online textual resources. Specifically, we present a novel tag recommendation system for images taken at a popular music festival which allows the user to select relevant tags from related Tweets and Wikipedia content, thus reducing the workload involved in the annotation process. Copyright 2014 ACM. 0 0
A perspective-aware approach to search: Visualizing perspectives in news search results Qureshi M.A.
O'Riordan C.
Pasi G.
SIGIR 2014 - Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval English 2014 The result set from a search engine for any user's query may exhibit an inherent perspective due to issues with the search engine or issues with the underlying collection. This demonstration paper presents a system that allows users to specify at query time a perspective together with their query. The system then presents results from well-known search engines with a visualization of the results which allows the users to quickly surmise the presence of the perspective in the returned set. 0 0
A piece of my mind: A sentiment analysis approach for online dispute detection Lei Wang
Cardie C.
52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference English 2014 We investigate the novel task of online dispute detection and propose a sentiment analysis solution to the problem: we aim to identify the sequence of sentence-level sentiments expressed during a discussion and to use them as features in a classifier that predicts the DISPUTE/NON-DISPUTE label for the discussion as a whole. We evaluate dispute detection approaches on a newly created corpus of Wikipedia Talk page disputes and find that classifiers that rely on our sentiment tagging features outperform those that do not. The best model achieves a very promising F1 score of 0.78 and an accuracy of 0.80. 0 0
A seed based method for dictionary translation Krajewski R.
Rybinski H.
Kozlowski M.
Lecture Notes in Computer Science English 2014 The paper refers to the topic of automatic machine translation. The proposed method enables translating a dictionary by means of mining repositories in the source and target repository, without any directly given relationships connecting two languages. It consists of two stages: (1) translation by lexical similarity, where words are compared graphically, and (2) translation by semantic similarity, where contexts are compared. Polish and English version of Wikipedia were used as multilingual corpora. The method and its stages are thoroughly analyzed. The results allow implementing this method in human-in-the-middle systems. 0 0
Academic opinions of Wikipedia and open access publishing Xiao L.
Askin N.
Online Information Review English 2014 Purpose - The purpose of this paper is to examine academics' awareness of and attitudes towards Wikipedia and Open Access journals for academic publishing to better understand the perceived benefits and challenges of these models. Design/methodology/approach - Bases for analysis include comparison of the models, enumeration of their advantages and disadvantages, and investigation of Wikipedia's web structure in terms of potential for academic publishing. A web survey was administered via department-based invitations and listservs. Findings - The survey results show that: Wikipedia has perceived advantages and challenges in comparison to the Open Access model; the academic researchers' increased familiarity is associated with increased comfort with these models; and the academic researchers' attitudes towards these models are associated with their familiarity, academic environment, and professional status. Research limitations/implications - The major limitation of the study is sample size. The result of a power analysis with GPower shows that authors could only detect big effects in this study at statistical power 0.95. The authors call for larger sample studies that look further into this topic. Originality/value - This study contributes to the increasing interest in adjusting methods of creating and disseminating academic knowledge by providing empirical evidence of the academics' experiences and attitudes towards the Open Access and Wikipedia publishing models. This paper provides a resource for researchers interested in scholarly communication and academic publishing, for research librarians, and for the academic community in general. Copyright © 2014 Emerald Group Publishing Limited. All rights reserved. 0 0
An automatic sameAs link discovery from Wikipedia Kagawa K.
Susumu Tamagawa
Takahira Yamaguchi
Lecture Notes in Computer Science English 2014 Spelling variants and word-sense ambiguity impose substantial costs on processes such as data integration, information search, and data pre-processing for data mining. To meet these demands, it is useful to construct relations between a word or phrase and a representative name of the entity. To reduce these costs, this paper discusses how to automatically discover "sameAs" and "meaningOf" links from the Japanese Wikipedia. To do so, we gathered relevant features such as IDF, string similarity, number of hypernyms, and so on. We identified a link-based score on these salient features based on SVM results over 960,000 anchor link pairs. Case studies show that our link discovery method achieves more than 70% precision/recall. 0 0
An evaluation framework for cross-lingual link discovery Tang L.-X.
Shlomo Geva
Andrew Trotman
Xu Y.
Itakura K.Y.
Information Processing and Management English 2014 Cross-Lingual Link Discovery (CLLD) is a new problem in Information Retrieval. The aim is to automatically identify meaningful and relevant hypertext links between documents in different languages. This is particularly helpful in knowledge discovery if a multi-lingual knowledge base is sparse in one language or another, or the topical coverage in each language is different; such is the case with Wikipedia. Techniques for identifying new and topically relevant cross-lingual links are a current topic of interest at NTCIR where the CrossLink task has been running since the 2011 NTCIR-9. This paper presents the evaluation framework for benchmarking algorithms for cross-lingual link discovery evaluated in the context of NTCIR-9. This framework includes topics, document collections, assessments, metrics, and a toolkit for pooling, assessment, and evaluation. The assessments are further divided into two separate sets: manual assessments performed by human assessors; and automatic assessments based on links extracted from Wikipedia itself. Using this framework we show that manual assessment is more robust than automatic assessment in the context of cross-lingual link discovery. 0 0
An information retrieval expansion model based on Wikipedia Gan L.X.
Tu W.
Advanced Materials Research English 2014 Query expansion is one of the key technologies for improving precision and recall in information retrieval. In order to overcome the limitations of a single corpus, in this paper the semantic characteristics of the Wikipedia corpus are combined with the standard corpus to extract a richer set of relationships between terms for the construction of a steady Markov semantic network. Information from the entity pages and disambiguation pages in Wikipedia is used to classify query terms and improve query classification accuracy. High-quality related candidates can then be used for query expansion after semantic pruning. The proposed approach improves retrieval performance and saves computational cost. 0 0
Analysing the duration of trending topics in twitter using wikipedia Thanh Tran
Georgescu M.
Zhu X.
Kanhabua N.
WebSci 2014 - Proceedings of the 2014 ACM Web Science Conference English 2014 The analysis of trending topics in Twitter is a goldmine for a variety of studies and applications. However, the contents of topics vary greatly from daily routines to major public events, enduring from a few hours to weeks or months. It is thus helpful to distinguish trending topics related to real-world events from those originating within virtual communities. In this paper, we analyse trending topics in Twitter using Wikipedia as a reference for studying the provenance of trending topics. We show that among different factors, the duration of a trending topic characterizes exogenous Twitter trending topics better than endogenous ones. 0 0
Analysis of the accuracy and readability of herbal supplement information on Wikipedia Phillips J.
Lam C.
Palmisano L.
Journal of the American Pharmacists Association English 2014 Objective: To determine the completeness and readability of information found in Wikipedia for leading dietary supplements and assess the accuracy of this information with regard to safety (including use during pregnancy/lactation), contraindications, drug interactions, therapeutic uses, and dosing. Design: Cross-sectional analysis of Wikipedia articles. Interventions: The contents of Wikipedia articles for the 19 top-selling herbal supplements were retrieved on July 24, 2012, and evaluated for organization, content, accuracy (as compared with information in two leading dietary supplement references) and readability. Main Outcome Measures: Accuracy of Wikipedia articles. Results: No consistency was noted in how much information was included in each Wikipedia article, how the information was organized, what major categories were used, and where safety and therapeutic information was located in the article. All articles in Wikipedia contained information on therapeutic uses and adverse effects but several lacked information on drug interactions, pregnancy, and contraindications. Wikipedia articles had 26%-75% of therapeutic uses and 76%-100% of adverse effects listed in the Natural Medicines Comprehensive Database and/or Natural Standard. Overall, articles were written at a 13.5-grade level, and all were at a ninth-grade level or above. Conclusion: Articles in Wikipedia in mid-2012 for the 19 top-selling herbal supplements were frequently incomplete, of variable quality, and sometimes inconsistent with reputable sources of information on these products. Safety information was particularly inconsistent among the articles. Patients and health professionals should not rely solely on Wikipedia for information on these herbal supplements when treatment decisions are being made. 0 0
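The abstract above reports grade levels without naming the exact readability formula used; as an illustration only, the sketch below computes the widely used Flesch-Kincaid grade level with a naive syllable counter.

```python
import re

# Sketch: estimate a reading grade level with the Flesch-Kincaid formula.
# The study's abstract reports grade levels without naming the exact metric,
# so this is an illustration, not a reproduction of the paper's method.
def count_syllables(word):
    # Naive heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

sample = ("St John's wort is a herbal supplement. It may interact with "
          "prescription antidepressants and reduce their effectiveness.")
print(round(flesch_kincaid_grade(sample), 1))
```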
Approach for building high-quality domain ontology based on the Chinese Wikipedia Wu T.
Tang Z.
Xiao K.
ICIC Express Letters English 2014 In this paper, we propose a new approach for building a high-quality domain ontology based on the Chinese Wikipedia. In contrast to traditional Wikipedia ontologies, such as DBpedia and YAGO, the domain ontology built in this paper consists of high-quality articles. We use the C4.5 algorithm to identify high-quality articles from a specific domain in Wikipedia, and a domain ontology is built accordingly. 0 0
Arabic text categorization based on arabic wikipedia Yahya A.
Salhi A.
ACM Transactions on Asian Language Information Processing English 2014 This article describes an algorithm for categorizing Arabic text, relying on highly categorized corpus-based datasets obtained from the Arabic Wikipedia by using manual and automated processes to build and customize categories. The categorization algorithm was built by adopting a simple categorization idea then moving forward to more complex ones. We applied tests and filtration criteria to reach the best and most efficient results that our algorithm can achieve. The categorization depends on the statistical relations between the input (test) text and the reference (training) data supported by well-defined Wikipedia-based categories. Our algorithm supports two levels for categorizing Arabic text; categories are grouped into a hierarchy of main categories and subcategories. This introduces a challenge due to the correlation between certain subcategories and overlap between main categories. We argue that our algorithm achieved good performance compared to other methods reported in the literature. 0 0
Are we all online content creators now? Web 2.0 and digital divides Brake D.R. Journal of Computer-Mediated Communication English 2014 Despite considerable interest in online content creation there has been comparatively little academic analysis of the distribution of such practices, both globally and among social groups within countries. Drawing on theoretical frameworks used in digital divide studies, I outline differences in motivation, access, skills, and usage that appear to underlie and perpetuate differences in online content creation practices between social groups. This paper brings together existing studies and new analyses of existing survey datasets. Together they suggest online content creators tend to be from relatively privileged groups and the content of online services based on their contributions may be biased towards what is most interesting or relevant to them. Some implications of these findings for policymakers and researchers are considered. 0 0
Augmenting concept definition in gloss vector semantic relatedness measure using wikipedia articles Pesaranghader A.
Rezaei A.
Lecture Notes in Electrical Engineering English 2014 Semantic relatedness measures are widely used in text mining and information retrieval applications. Considering these automated measures, in this research paper we attempt to improve Gloss Vector relatedness measure for more accurate estimation of relatedness between two given concepts. Generally, this measure, by constructing concepts definitions (Glosses) from a thesaurus, tries to find the angle between the concepts' gloss vectors for the calculation of relatedness. Nonetheless, this definition construction task is challenging as thesauruses do not provide full coverage of expressive definitions for the particularly specialized concepts. By employing Wikipedia articles and other external resources, we aim at augmenting these concepts' definitions. Applying both definition types to the biomedical domain, using MEDLINE as corpus, UMLS as the default thesaurus, and a reference standard of 68 concept pairs manually rated for relatedness, we show exploiting available resources on the Web would have positive impact on final measurement of semantic relatedness. 0 0
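A simplified, hedged sketch of the gloss-vector idea discussed above: each concept is represented by a vector built from its (Wikipedia-augmented) definition text, and relatedness is the cosine between the vectors. The original measure uses second-order co-occurrence vectors rather than the plain term frequencies shown here, and the example glosses are invented.

```python
import math
from collections import Counter

# Simplified sketch of the gloss-vector idea: represent each concept by a
# vector built from its definition text (plain term frequencies here, rather
# than the second-order co-occurrence vectors of the original measure) and
# take the cosine between the vectors. The glosses below are invented
# stand-ins for thesaurus definitions augmented with Wikipedia article text.
def gloss_vector(gloss):
    return Counter(gloss.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

gloss_a = "aspirin drug used to reduce pain fever inflammation"
gloss_b = "ibuprofen drug used to treat pain fever and inflammation"
print(round(cosine(gloss_vector(gloss_a), gloss_vector(gloss_b)), 3))
```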
Automatic extraction of property norm-like data from large text corpora Kelly C.
Devereux B.
Korhonen A.
Cognitive Science English 2014 Traditional methods for deriving property-based representations of concepts from text have focused on either extracting only a subset of possible relation types, such as hyponymy/hypernymy (e.g., car is-a vehicle) or meronymy/metonymy (e.g., car has wheels), or unspecified relations (e.g., car-petrol). We propose a system for the challenging task of automatic, large-scale acquisition of unconstrained, human-like property norms from large text corpora, and discuss the theoretical implications of such a system. We employ syntactic, semantic, and encyclopedic information to guide our extraction, yielding concept-relation-feature triples (e.g., car be fast, car require petrol, car cause pollution), which approximate property-based conceptual representations. Our novel method extracts candidate triples from parsed corpora (Wikipedia and the British National Corpus) using syntactically and grammatically motivated rules, then reweights triples with a linear combination of their frequency and four statistical metrics. We assess our system output in three ways: lexical comparison with norms derived from human-generated property norm data, direct evaluation by four human judges, and a semantic distance comparison with both WordNet similarity data and human-judged concept similarity ratings. Our system offers a viable and performant method of plausible triple extraction: Our lexical comparison shows comparable performance to the current state-of-the-art, while subsequent evaluations exhibit the human-like character of our generated properties. 0 0
Automatically detecting corresponding edit-turn-pairs in Wikipedia Daxenberger J.
Iryna Gurevych
52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference English 2014 In this study, we analyze links between edits in Wikipedia articles and turns from their discussion page. Our motivation is to better understand implicit details about the writing process and knowledge flow in collaboratively created resources. Based on properties of the involved edit and turn, we have defined constraints for corresponding edit-turn-pairs. We manually annotated a corpus of 636 corresponding and non-corresponding edit-turn-pairs. Furthermore, we show how our data can be used to automatically identify corresponding edit-turn-pairs. With the help of supervised machine learning, we achieve an accuracy of 87 for this task. 0 0
Behavioral aspects in the interaction between wikipedia and its users Reinoso A.J.
Ortega-Valiente J.
Studies in Computational Intelligence English 2014 Wikipedia continues to be the most well-known on-line encyclopedia and receives the visits of millions of users on a daily basis. Its contents correspond to almost all knowledge areas and are altruistically contributed by individuals and organizations. In addition, users are encouraged to add their own contributions according to Wikipedia's own supporting paradigm. Its progression to a mass phenomenon has prompted many studies and research initiatives. Topics such as the quality of the published contents or the authorship of its contributions have been widely developed. However, very little attention has been paid to the behavioral aspects characterizing the interaction between Wikipedia and its users. Hence, this chapter aims to determine the habits exhibited by users when browsing Wikipedia pages. In particular, we focus on visits and contributions, as they constitute the two most common forms of interaction. Our study is based on a sample of the requests submitted to Wikipedia, and its results are twofold: on the one hand, it provides different metrics concerning users' behavior and, on the other, it presents comparisons among different Wikipedia editions. 0 0
Beyond the encyclopedia: Collective memories in Wikipedia Michela Ferron
Paolo Massa
Memory Studies English 2014 Collective memory processes have been studied from many different perspectives. For example, while psychology has investigated collaborative recall in small groups, other research traditions have focused on flashbulb memories or on the cultural processes involved in the formation of collective memories of entire nations. In this article, considering the online encyclopedia Wikipedia as a global memory place, we analyze online commemoration patterns of traumatic events. We extracted 88 articles and talk pages related to traumatic events, and using logistic regression, we analyzed their edit activity comparing it with more than 370,000 other Wikipedia pages. Results show that the relative amount of edits during anniversaries can significantly distinguish between pages related to traumatic events and other pages. The logistic regression results, together with the transcription of a group of messages exchanged by the users during the anniversaries of the September 11 attacks and the Virginia Tech massacre, suggest that commemoration activities take place in Wikipedia, opening the way to the quantitative study of online collective memory building processes on a large scale. 0 0
Bipartite editing prediction in wikipedia Chang Y.-J.
Tsai Y.-C.
Kao H.-Y.
Journal of Information Science and Engineering English 2014 Link prediction problems aim to project future interactions among members in a social network that have not communicated with each other in the past. Classical approaches for link prediction usually use local information, which considers the similarity of two nodes, or structural information such as the immediate neighborhood. However, when using a bipartite graph to represent activity, there is no straightforward similarity measurement between two linking nodes. However, when a bipartite graph shows two nodes of different types, they will not have any common neighbors, so the local model will need to be adjusted if the users' goal is to predict bipartite relations. In addition to local information regarding similarity, when dealing with link predictions in a social network, it is natural to employ community information to improve the prediction accuracy. In this paper, we address the link prediction problem in the bipartite editing graph used in Wikipedia and also examine the structure of community in this edit graph. As Wikipedia is one of the successful member-maintained online communities, extracting the community information and solving its bipartite link prediction problem will shed light on the process of content creation. In addition, to the best of our knowledge, the problem of using community information in bipartite for predicting the link occurrence has not been clearly addressed. Hence we have designed and integrated two bipartite-specific approaches to predict the link occurrence: First, the supervised learning approach, which is built around the adjusted features of a local model and, second, the community-awareness approach, which utilizes community information. Experiments conducted on the Wikipedia collection show that in terms of F1-measure, our approaches generates an 11% improvement over the general methods based on the K-Nearest Neighbor. In addition to this, we also investigate the structure of communities in the editing network and suggest a different approach to examining the communities involved in Wikipedia. 0 0
Boosting terminology extraction through crosslingual resources Cajal S.
Rodriguez H.
Procesamiento de Lenguaje Natural English 2014 Terminology extraction is an important natural language processing task with multiple applications in many areas. The task has been approached from different points of view using different techniques, and language- and domain-independent systems have been proposed as well. Our contribution in this paper focuses on improving terminology extraction using cross-lingual resources, specifically Wikipedia, and on the use of a variant of PageRank for scoring the candidate terms. 0 0
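The abstract above does not detail the graph construction, so the sketch below only illustrates the scoring idea: standard PageRank run by power iteration over a small, invented co-occurrence graph of candidate terms.

```python
# Sketch: score candidate terms by running PageRank over a term co-occurrence
# graph, in the spirit of the PageRank variant mentioned above. The graph,
# damping factor, and iteration count are illustrative assumptions.
def pagerank(graph, damping=0.85, iterations=50):
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Sum the rank flowing in from every node m that links to n.
            incoming = sum(rank[m] / len(graph[m]) for m in nodes if n in graph[m])
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank

# Undirected co-occurrence graph over candidate terms (edges listed both ways).
cooccurrence = {
    "terminology": {"extraction", "wikipedia"},
    "extraction": {"terminology", "crosslingual"},
    "wikipedia": {"terminology", "crosslingual"},
    "crosslingual": {"extraction", "wikipedia"},
}

for term, score in sorted(pagerank(cooccurrence).items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
```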
Bootstrapping Wikipedia to answer ambiguous person name queries Gruetze T.
Gjergji Kasneci
Zuo Z.
Naumann F.
Proceedings - International Conference on Data Engineering English 2014 Some of the main ranking features of today's search engines reflect result popularity and are based on ranking models, such as PageRank, implicit feedback aggregation, and more. While such features yield satisfactory results for a wide range of queries, they aggravate the problem of search for ambiguous entities: Searching for a person yields satisfactory results only if the person in question is represented by a high-ranked Web page and all required information are contained in this page. Otherwise, the user has to either reformulate/refine the query or manually inspect low-ranked results to find the person in question. A possible approach to solve this problem is to cluster the results, so that each cluster represents one of the persons occurring in the answer set. However clustering search results has proven to be a difficult endeavor by itself, where the clusters are typically of moderate quality. A wealth of useful information about persons occurs in Web 2.0 platforms, such as Wikipedia, LinkedIn, Facebook, etc. Being human-generated, the information on these platforms is clean, focused, and already disambiguated. We show that when searching with ambiguous person names the information from Wikipedia can be bootstrapped to group the results according to the individuals occurring in them. We have evaluated our methods on a hand-labeled dataset of around 5,000 Web pages retrieved from Google queries on 50 ambiguous person names. 0 0
Bots, bespoke, code and the materiality of software platforms Geiger R.S. Information Communication and Society English 2014 This article introduces and discusses the role of bespoke code in Wikipedia, which is code that runs alongside a platform or system, rather than being integrated into server-side codebases by individuals with privileged access to the server. Bespoke code complicates the common metaphors of platforms and sovereignty that we typically use to discuss the governance and regulation of software systems through code. Specifically, the work of automated software agents (bots) in the operation and administration of Wikipedia is examined, with a focus on the materiality of code. As bots extend and modify the functionality of sites like Wikipedia, but must be continuously operated on computers that are independent from the servers hosting the site, they involve alternative relations of power and code. Instead of taking for granted the pre-existing stability of Wikipedia as a platform, bots and other bespoke code require that we examine not only the software code itself, but also the concrete, historically contingent material conditions under which this code is run. To this end, this article weaves a series of autobiographical vignettes about the author's experiences as a bot developer alongside more traditional academic discourse. 0 0
Bridging temporal context gaps using time-aware re-contextualization Ceroni A.
Tran N.K.
Kanhabua N.
Niederee C.
SIGIR 2014 - Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval English 2014 Understanding a text, which was written some time ago, can be compared to translating a text from another language. Complete interpretation requires a mapping, in this case, a kind of time-travel translation between present context knowledge and context knowledge at time of text creation. In this paper, we study time-aware re-contextualization, the challenging problem of retrieving concise and complementing information in order to bridge this temporal context gap. We propose an approach based on learning to rank techniques using sentence-level context information extracted from Wikipedia. The employed ranking combines relevance, complementarity and time-awareness. The effectiveness of the approach is evaluated by contextualizing articles from a news archive collection using more than 7,000 manually judged relevance pairs. To this end, we show that our approach is able to retrieve a significant number of relevant context information for a given news article. Copyright 2014 ACM. 0 0
Building distant supervised relation extractors Nunes T.
Schwabe D.
Proceedings - 2014 IEEE International Conference on Semantic Computing, ICSC 2014 English 2014 A well-known drawback in building machine learning semantic relation detectors for natural language is the lack of a large number of qualified training instances for the target relations in multiple languages. Even when good results are achieved, the datasets used by the state-of-the-art approaches are rarely published. In order to address these problems, this work presents an automatic approach to build multilingual semantic relation detectors through distant supervision combining two of the largest resources of structured and unstructured content available on the Web, DBpedia and Wikipedia. We map the DBpedia ontology back to the Wikipedia text to extract more than 100.000 training instances for more than 90 DBpedia relations for English and Portuguese languages without human intervention. First, we mine the Wikipedia articles to find candidate instances for relations described in the DBpedia ontology. Second, we preprocess and normalize the data filtering out irrelevant instances. Finally, we use the normalized data to construct regularized logistic regression detectors that achieve more than 80% of F-Measure for both English and Portuguese languages. In this paper, we also compare the impact of different types of features on the accuracy of the trained detector, demonstrating significant performance improvements when combining lexical, syntactic and semantic features. Both the datasets and the code used in this research are available online. 0 0
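A hedged sketch of the distant-supervision step described above: sentences that mention both entities of a known triple are taken as training instances for that relation, and a regularized logistic regression detector is trained on bag-of-words features. The triples, sentences, and matching heuristic are invented placeholders, not DBpedia or Wikipedia data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Sketch of distant supervision: a sentence that mentions both entities of a
# known (subject, relation, object) triple is labelled with that relation.
# Triples and sentences below are invented placeholders, not DBpedia data.
triples = [
    ("Berlin", "capitalOf", "Germany"),
    ("Lisbon", "capitalOf", "Portugal"),
    ("Albert Einstein", "birthPlace", "Ulm"),
    ("Fernando Pessoa", "birthPlace", "Lisbon"),
]
sentences = [
    "Berlin has been the capital of Germany since reunification.",
    "Lisbon is the capital and largest city of Portugal.",
    "Albert Einstein was born in Ulm in 1879.",
    "Fernando Pessoa was born in Lisbon and wrote in Portuguese.",
]

X, y = [], []
for (subj, relation, obj), sentence in zip(triples, sentences):
    if subj.split()[0] in sentence and obj in sentence:   # crude entity matching
        X.append(sentence)
        y.append(relation)

detector = make_pipeline(TfidfVectorizer(), LogisticRegression(C=1.0))  # L2-regularized
detector.fit(X, y)
print(detector.predict(["Vienna is the capital of Austria."]))
```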
Capturing scholar's knowledge from heterogeneous resources for profiling in recommender systems Amini B.
Ibrahim R.
Othman M.S.
Selamat A.
Expert Systems with Applications 2014 In scholars' recommender systems, knowledge acquisition for constructing profiles is crucial because profiles provide the fundamental information for accurate recommendation. Despite the availability of various knowledge resources, identifying and collecting extensive knowledge in an unobtrusive manner is not straightforward. In order to capture scholars' knowledge, some questions must be answered: which knowledge resources are appropriate for profiling, how knowledge items can be unobtrusively captured, and how heterogeneity among different knowledge resources should be resolved. To address these issues, we first model scholars' academic behavior and extract different knowledge items diffused over the Web, including mediated profiles in digital libraries, and then integrate those heterogeneous knowledge items through Wikipedia. Additionally, we analyze the correlation between knowledge items and partition the scholars' research areas for multi-disciplinary profiling. Compared to the state of the art, the results of an empirical evaluation show the efficiency of our approach in terms of completeness and accuracy. 0 0
Changes in college students' perceptions of use of web-based resources for academic tasks with Wikipedia projects: A preliminary exploration Traphagan T.
Traphagan J.
Neavel Dickens L.
Resta P.
Interactive Learning Environments English 2014 Motivated by the need to facilitate Net Generation students' information literacy (IL), or more specifically, to promote student understanding of legitimate, effective use of Web-based resources, this exploratory study investigated how analyzing, writing, posting, and monitoring Wikipedia entries might help students develop critical perspectives related to the legitimacy of Wikipedia and other publicly accessible Web-based resources for academic tasks. Results of survey and interview data analyses from two undergraduate courses indicated that undergraduate students typically prefer using publicly accessible Web-based resources to traditional academic resources, such as scholarly journal articles and books both in print and digital form; furthermore, they view the former as helpful academic tools with various utilities. Results also suggest that the Wikipedia activity, integrated into regular course curriculum, led students to gain knowledge about processes of Web-based information creation, become more critical of information on the Web, and evaluate the use of publicly accessible Web-based resources for academic purposes. Such changes appear more conspicuous with first year than with upper division students. The findings suggest that experiential opportunities to grapple with the validity of publicly accessible Web-based resources may prepare students better for their college and professional careers. The study results also indicate the need for integrating multiple existing frameworks for IL into one comprehensive framework to better understand various aspects of students' knowledge, use, and production of information from cognitive and technical perspectives and for a variety of purposes. 0 0
Cheap talk and editorial control Newton J. B.E. Journal of Theoretical Economics English 2014 This paper analyzes simple models of editorial control. Starting from the framework developed by Krishna and Morgan (2001a), we analyze two-sender models of cheap talk where one or more of the senders has the power to veto messages before they reach the receiver. A characterization of the most informative equilibria of such models is given. It is shown that editorial control never aids communication and that for small biases in the senders' preferences relative to those of the receiver, necessary and sufficient conditions for information transmission to be adversely affected are (i) that the senders have opposed preferences relative to the receiver and (ii) that both senders have powers of editorial control. It is shown that the addition of further senders beyond two weakly decreases information transmission when senders exercising editorial control are anonymous, and weakly increases information transmission when senders exercising editorial control are observed. 0 0
Chinese and Korean cross-lingual issue news detection based on translation knowledge of Wikipedia Zhao S.
Tsolmon B.
Lee K.-S.
Lee Y.-S.
Lecture Notes in Electrical Engineering English 2014 Detecting cross-lingual issue news and analyzing the news content is an important and challenging task. The core of cross-lingual research is the process of translation. In this paper, we focus on extracting cross-lingual issue news from Chinese and Korean Twitter data. We propose a translation knowledge method for Wikipedia concepts as well as the Chinese and Korean cross-lingual inter-Wikipedia link relations. The relevance relations are extracted from the categories and the page titles of Wikipedia. The evaluation achieved a performance of 83% average precision for the top 10 extracted issue news. The result indicates that our method is effective for cross-lingual issue news detection. 0 0
Collaborative projects (social media application): About Wikipedia, the free encyclopedia Kaplan A.
Haenlein M.
Business Horizons English 2014 Collaborative projects-defined herein as social media applications that enable the joint and simultaneous creation of knowledge-related content by many end-users-have only recently received interest among a larger group of academics. This is surprising since applications such as wikis, social bookmarking sites, online forums, and review sites are probably the most democratic form of social media and reflect well the idea of user-generated content. The purpose of this article is to provide insight regarding collaborative projects; the concept of wisdom of crowds, an essential condition for their functioning; and the motivation of readers and contributors. Specifically, we provide advice on how firms can leverage collaborative projects as an essential element of their online presence to communicate both externally with stakeholders and internally among employees. We also discuss how to address situations in which negative information posted on collaborative projects can become a threat and PR crisis for firms. 0 0
Collective memory in Poland: A reflection in street names Radoslaw Nielek
Wawer A.
Adam Wierzbicki
Lecture Notes in Computer Science English 2014 Our article starts with an observation that street names fall into two general types: generic and historically inspired. We analyse the distributions of street names of the second type as a window onto nation-level collective memory in Poland. The process of selecting street names is determined socially, as the selections reflect the symbols considered important to the nation-level society, but it has strong historical motivations and determinants. In the article, we seek these relationships in the available data sources. We use Wikipedia articles to match street names with their textual descriptions and assign them to points in time. We then apply selected text mining and statistical techniques to reach quantitative conclusions. We also present a case study: the geographical distribution of two particular street names in Poland, demonstrating the binding between history and the political orientation of regions. 0 0
Comparative analysis of text representation methods using classification Szymanski J. Cybernetics and Systems English 2014 In our work, we review and empirically evaluate five different raw methods of text representation that allow automatic processing of Wikipedia articles. The main contribution of the article - an evaluation of approaches to text representation for machine learning tasks - indicates that the choice of text representation is fundamental for achieving good categorization results: a poor representation creates a baseline that cannot be compensated for even by sophisticated machine learning algorithms. This confirms the thesis that proper data representation is a prerequisite for achieving high-quality results of data analysis. Evaluation of the text representations was performed within the Wikipedia repository by examining classification parameters observed during automatic reconstruction of human-made categories. For that purpose, we use a classifier based on the support vector machines method, extended with multilabel and multiclass functionalities. During classifier construction we observed parameters such as learning time, representation size, and classification quality that allow us to draw conclusions about the text representations. For the experiments presented in the article, we use data sets created from Wikipedia dumps. We describe our software, called Matrixu, which allows a user to build computational representations of Wikipedia articles. The software is the second contribution of our research, because it is a universal tool for converting Wikipedia from a human-readable form to a form that can be processed by a machine. Results generated using Matrixu can be used in a wide range of applications that involve usage of Wikipedia data. 0 0
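As an illustration of the representation-then-classification pipeline evaluated in the entry above, here is a minimal sketch (not the authors' Matrixu tool): a TF-IDF representation of a few toy texts is fed to a linear support vector machine that tries to reconstruct category labels, assuming scikit-learn is available.

 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.pipeline import make_pipeline
 from sklearn.svm import LinearSVC
 
 # Toy stand-ins for Wikipedia article texts and their human-made categories.
 texts = [
     "dog wolf domesticated animal pet",
     "cat feline domesticated animal pet",
     "python java programming language code",
     "haskell functional programming language compiler",
 ]
 labels = ["animals", "animals", "computing", "computing"]
 
 # The representation (here: TF-IDF over unigrams) is the experimental variable;
 # the classifier attempts to reconstruct the categories from it.
 model = make_pipeline(TfidfVectorizer(), LinearSVC())
 model.fit(texts, labels)
 print(model.predict(["wolf and dog are animals"]))  # expected: ['animals']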
Comparing the pulses of categorical hot events in Twitter and Weibo Shuai X.
Xiaojiang Liu
Xia T.
Wu Y.
Guo C.
HT 2014 - Proceedings of the 25th ACM Conference on Hypertext and Social Media English 2014 The fragility and interconnectivity of the planet argue compellingly for a greater understanding of how different communities make sense of their world. One such critical demand involves comparing China with the rest of the world (e.g., the United States), where communities' ideological and cultural backgrounds can be significantly different. While traditional studies aim to learn the similarities and differences between these communities via high-cost user studies, in this paper we propose a much more efficient method to compare different communities by utilizing social media. Specifically, Weibo and Twitter, the two largest microblogging systems, are employed to represent the target communities, i.e. China and the Western world (mainly the United States), respectively. Meanwhile, through analysis of the Wikipedia page-click log, we identify a set of categorical 'hot events' for one month in 2012 and search for those hot events in the Weibo and Twitter corpora, along with timestamps, via information retrieval methods. We further compare, quantitatively and qualitatively, users' responses to those events on Twitter and Weibo in terms of three aspects: popularity, temporal dynamics, and information diffusion. The comparative results show that although the popularity rankings of those events are very similar, the patterns of temporal dynamics and information diffusion can be quite different. 0 0
Computer-supported collaborative accounts of major depression: Digital rhetoric on Quora and Wikipedia Rughinis C.
Huma B.
Matei S.
Rughinis R.
Iberian Conference on Information Systems and Technologies, CISTI English 2014 We analyze digital rhetoric in two computer-supported collaborative settings of writing and learning, focusing on major depression: Wikipedia and Quora. We examine the procedural rhetoric of access to and interaction with information, and the textual rhetoric of individual and aggregated entries. Through their different organization of authorship, publication and reading, the two settings create divergent accounts of depression. Key points of difference include: focus on symptoms and causes vs. experiences and advice, use of lists vs. metaphors and narratives, a/temporal structure, and personal and relational knowledge. 0 0
Conceptual clustering Boubacar A.
Niu Z.
Lecture Notes in Electrical Engineering English 2014 Traditional clustering methods are unable to describe the clusters they generate. Conceptual clustering is an important and active research area that aims to efficiently cluster and explain the data. Previous conceptual clustering approaches provide descriptions that do not use human-comprehensible knowledge. This paper presents an algorithm which uses Wikipedia concepts in the clustering process. The generated clusters overlap each other and serve as a basis for an information retrieval system. The method has been implemented in order to improve the performance of the system and to reduce the computation cost. 0 0
Continuous temporal Top-K query over versioned documents Lan C.
YanChun Zhang
Chunxiao Xing
Chenliang Li
Lecture Notes in Computer Science English 2014 The management of versioned documents has attracted researchers' attention in recent years. Based on the observation that decision-makers are often interested in finding the set of objects that exhibit continuous behavior over time, we study the problem of continuous temporal top-k queries. Given a query, continuous temporal top-k search finds the documents that frequently rank in the top-k during a time period, taking the weights of different time intervals into account. Existing work on querying versioned documents has focused on adding a time constraint, but has not considered the continuous ranking of objects or the weights of time intervals. We propose a new interval window-based method to address this problem. Our method obtains the continuous temporal top-k results while using interval windows to support time and weight constraints simultaneously. We use data from Wikipedia to evaluate our method. 0 0
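To make the query semantics above concrete, here is a hedged sketch (not the paper's interval-window index): every document is scored by the summed weights of the time intervals in which it appears in the top-k, so documents that rank highly both frequently and in heavily weighted intervals come out on top. The rankings and weights are invented toy data.

 from collections import defaultdict
 
 # Per-interval rankings (best first) and interval weights -- toy data only.
 rankings = {
     "2007": ["doc_a", "doc_b", "doc_c", "doc_d"],
     "2008": ["doc_b", "doc_a", "doc_d", "doc_c"],
     "2009": ["doc_b", "doc_c", "doc_a", "doc_d"],
 }
 weights = {"2007": 0.2, "2008": 0.3, "2009": 0.5}  # e.g. recent intervals matter more
 
 def continuous_topk(rankings, weights, k=2, n=2):
     score = defaultdict(float)
     for interval, ranked_docs in rankings.items():
         for doc in ranked_docs[:k]:   # documents in the top-k of this interval
             score[doc] += weights[interval]
     return sorted(score.items(), key=lambda kv: kv[1], reverse=True)[:n]
 
 print(continuous_topk(rankings, weights))  # [('doc_b', 1.0), ('doc_a', 0.5)]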
Counter narratives and controversial crimes: The Wikipedia article for the 'Murder of Meredith Kercher' Page R. Language and Literature English 2014 Narrative theorists have long recognised that narrative is a selective mode of representation. There is always more than one way to tell a story, which may alter according to its teller, audience and the social or historical context in which the story is told. But multiple versions of the 'same' events are not always valued in the same way: some versions may become established as dominant accounts, whilst others may be marginalised or resist hegemony as counter narratives (Bamberg and Andrews, 2004). This essay explores the potential of Wikipedia as a site for positioning counter and dominant narratives. Through the analysis of linearity and tellership (Ochs and Capps, 2001) as exemplified through revisions of a particular article ('Murder of Meredith Kercher'), I show how structural choices (open versus closed sequences) and tellership (single versus multiple narrators) function as mechanisms to prioritise different dominant narratives over time and across different cultural contexts. The case study points to the dynamic and relative nature of dominant and counter narratives. In the 'Murder of Meredith Kercher' article the counter narratives of the suspects' guilt or innocence and their position as villains or victims depended on national context, and changed over time. The changes in the macro-social narratives are charted in the micro-linguistic analysis of structure, citations and quoted speech in four selected versions of the article, taken from the English and Italian Wikipedias. 0 0
Creating a phrase similarity graph from wikipedia Stanchev L. Proceedings - 2014 IEEE International Conference on Semantic Computing, ICSC 2014 English 2014 The paper addresses the problem of modeling the relationship between phrases in English using a similarity graph. The mathematical model stores data about the strength of the relationship between phrases, expressed as a decimal number. Both structured data from Wikipedia, such as the fact that the Wikipedia page with title 'Dog' belongs to the Wikipedia category 'Domesticated animals', and textual descriptions, such as the fact that the Wikipedia page with title 'Dog' contains the word 'wolf' thirty-one times, are used in creating the graph. The quality of the graph data is validated by comparing the similarity of pairs of phrases, computed by our software that uses the graph, with the results of studies that were performed with human subjects. To the best of our knowledge, our software produces better correlation with the results of both the Miller and Charles study and the WordSimilarity-353 study than any other published research. 0 0
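A toy sketch of the kind of graph described above, assuming the networkx library: the nodes, edge weights and the path-product notion of similarity are illustrative choices, not the paper's exact model, and the weights stand in for evidence derived from Wikipedia categories and word counts.

 import networkx as nx
 
 g = nx.Graph()
 # Structured evidence: the page 'Dog' belongs to the category 'Domesticated animals'.
 g.add_edge("Dog", "Domesticated animals", weight=0.9)
 g.add_edge("Cat", "Domesticated animals", weight=0.9)
 # Textual evidence: the 'Dog' page mentions the word 'wolf' many times.
 g.add_edge("Dog", "Wolf", weight=0.6)
 
 def similarity(a, b):
     """Strength of the strongest short path between two phrases (product of edge weights)."""
     best = 0.0
     for path in nx.all_simple_paths(g, a, b, cutoff=3):
         strength = 1.0
         for u, v in zip(path, path[1:]):
             strength *= g[u][v]["weight"]
         best = max(best, strength)
     return best
 
 print(similarity("Cat", "Dog"))   # ~0.81 via 'Domesticated animals'
 print(similarity("Cat", "Wolf"))  # ~0.49 via 'Domesticated animals' -> 'Dog'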
Cross-language and cross-encyclopedia article linking using mixed-language topic model and hypernym translation Wang Y.-C.
Wu C.-K.
Tsai R.T.-H.
52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference English 2014 Creating cross-language article links among different online encyclopedias is now an important task in the unification of multilingual knowledge bases. In this paper, we propose a cross-language article linking method using a mixed-language topic model and hypernym translation features based on an SVM model to link English Wikipedia and Chinese Baidu Baike, the most widely used wiki-like encyclopedia in China. To evaluate our approach, we compile a data set from the top 500 Baidu Baike articles and their corresponding English Wikipedia articles. The evaluation results show that our approach achieves 80.95% in MRR and 87.46% in recall. Our method does not depend heavily on linguistic characteristics and can be easily extended to generate cross-language article links among different online encyclopedias in other languages. 0 0
Crowd-based appraisal and description of archival records at the State Archives Baden-Württemberg Naumann K.
Ziwes F.-J.
Archiving 2014 - Final Program and Proceedings English 2014 Appraisal and description are core processes at historical archives. This article gives an account of innovative methodologies in this field using crowd-sourced information to (1st) identify which files are of interest for the public, (2nd) enable agency staff to extract and transfer exactly those files selected for permanent retention and (3rd) ease the description and cataloguing of the transferred objects. It defines the extent of outsourcing used at the State Archives (Landesarchiv Baden-Württemberg LABW), describes case studies and touches issues of change management. Data sources are government databases and geodatabases, commercial data on court decisions, the name tags of German Wikipedia, and bio-bibliographical metadata of the State Libraries and the German National Library. 0 0
Designing a trust evaluation model for open-knowledge communities Yang X.
Qiang Qiu
Yu S.
Tahir H.
British Journal of Educational Technology English 2014 The openness of open-knowledge communities (OKCs) leads to concerns about the knowledge quality and reliability of such communities. This confidence crisis has become a major factor limiting the healthy development of OKCs. Earlier studies on trust evaluation for Wikipedia considered disadvantages such as inadequate influencing factors and separated the treatment of trustworthiness for users and resources. A new trust evaluation model for OKCs - the two-way interactive feedback model - is developed in this study. The model has two core components: resource trustworthiness (RT) and user trustworthiness (UT). The model is based on more interaction data, considers the interrelation between RT and UT, and better represents the features of interpersonal trust in reality. Experimental simulation and trial operation for the Learning Cell System, a novel open-knowledge community developed for ubiquitous learning, show that the model accurately evaluates RT and UT in this example OKC environment. 0 0
Designing information savvy societies: An introduction to assessability Andrea Forte
Andalibi N.
Park T.
Willever-Farr H.
Conference on Human Factors in Computing Systems - Proceedings English 2014 This paper provides first steps toward an empirically grounded design vocabulary for assessable design as an HCI response to the global need for better information literacy skills. We present a framework for synthesizing literatures called the Interdisciplinary Literacy Framework and use it to highlight gaps in our understanding of information literacy that HCI as a field is particularly well suited to fill. We report on two studies that lay a foundation for developing guidelines for assessable information system design. The first is a study of Wikipedians', librarians', and laypersons' information assessment practices from which we derive two important features of assessable designs: Information provenance and stewardship. The second is an experimental study in which we operationalize these concepts in designs and test them using Amazon Mechanical Turk (MTurk). 0 0
Developing creativity competency of engineers Waychal P.K. ASEE Annual Conference and Exposition, Conference Proceedings English 2014 The complete agreement of all stakeholders on the importance of developing the creativity competency of engineering graduates motivated us to undertake this study. We chose a senior-level course in Software Testing and Quality Assurance, which offered an excellent platform for the experiment as both testing and quality assurance activities can be executed using either routine, mechanical methods or highly creative ones. The earlier attempts reported in the literature to develop the creativity competency do not appear to be systematic, i.e. they do not follow the measurement -> action plan -> measurement cycle. The measurements, wherever done, are based on the Torrance Tests of Creative Thinking (TTCT) and the Myers Briggs Type Indicator (MBTI). We found these tests costly and decided to search for an appropriate alternative, which led us to the Felder-Solomon Index of Learning Styles (ILS). The Sensing/Intuition dimension of the ILS, like the MBTI, originates in Carl Jung's Theory of Psychological Types. Since a number of MBTI studies have used the dimension for assessing creativity, we posited that the same ILS dimension could be used to measure the competency. We carried out a pre-ILS assessment, designed and delivered the course with a variety of activities that could potentially enhance creativity, and carried out a course-end post-ILS assessment. Although major changes would not normally be expected after a one-semester course, a hypothesis in the study was that a shift from sensing toward intuition on learning style profiles would be observed, and indeed it was. A paired t-test indicated that the pre-post change in the average sensing/intuition preference score was statistically significant (p = 0.004). While more research and direct assessment of the competency is needed to be able to draw definitive conclusions about both the use of the instrument for measuring creativity and the efficacy of the course structure and contents in developing the competency, the results suggest that the approach is worth exploring. 0 0
Development of a semantic and syntactic model of natural language by means of non-negative matrix and tensor factorization Anisimov A.
Marchenko O.
Taranukha V.
Vozniuk T.
Lecture Notes in Computer Science English 2014 A method for developing a structural model of natural language syntax and semantics is proposed. Syntactic and semantic relations between parts of a sentence are presented in the form of a recursive structure called a control space. Numerical characteristics of these data are stored in multidimensional arrays. After factorization, the arrays serve as the basis for the development of procedures for analyses of natural language semantics and syntax. 0 0
Editing beyond articles: Diversity & dynamics of teamwork in open collaborations Morgan J.T.
Gilbert M.
David W. McDonald
Mark Zachry
English 2014 We report a study of Wikipedia in which we use a mixed-methods approach to understand how participation in specialized workgroups called WikiProjects has changed over the life of the encyclopedia. While previous work has analyzed the work of WikiProjects in supporting the development of articles within particular subject domains, the collaborative role of WikiProjects that do not fit this conventional mold has not been empirically examined. We combine content analysis, interviews, and analysis of edit logs to identify and characterize these alternative WikiProjects and the work they do. Our findings suggest that WikiProject participation reflects community concerns and shifts in the community's conception of valued work over the past six years. We discuss implications for other open collaborations that need flexible, adaptable coordination mechanisms to support a range of content creation, curation, and community maintenance tasks. 0 0
Effectively detecting topic boundaries in a news video by using wikipedia Kim J.W.
Cho S.-H.
International Journal of Software Engineering and its Applications English 2014 With the development of internet technology, traditional TV news providers have started sharing their news videos on the Web. As the number of TV news videos on the Web is constantly increasing, there is a pressing need for effective mechanisms that are able to significantly reduce the navigational overhead over a given collection of TV news videos. Naturally, a TV news video contains a series of stories that are not related to each other, and thus building indexing structures based on its entire contents might be ineffective. An alternative and more promising strategy is to first find topic boundaries in a given news video based on topical coherence, and then build index structures for each coherent unit. Thus, the main goal of this paper is to develop an effective technique to detect the topic boundaries of a given news video. The topic boundaries identified by our algorithm are then used to build indexing structures in order to support effective navigation guides and searches. The proposed method leverages Wikipedia to map the original contents of a news video from the keyword space into the concept space, and finds topic boundaries using the contents represented in the concept space. The experimental results show that the proposed technique provides significant precision gains in finding the topic boundaries of a news video. 0 0
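The boundary-detection idea can be illustrated with a minimal sketch: each transcript segment is mapped to a set of concepts (here hand-assigned, hypothetical Wikipedia concepts), and a topic boundary is placed wherever adjacent segments barely overlap in concept space. This simplification conveys the principle only and is not a reimplementation of the paper's method.

 def concept_overlap(a, b):
     """Jaccard overlap between two concept sets."""
     return len(a & b) / len(a | b) if a | b else 0.0
 
 # Toy segments already mapped from keywords to (hypothetical) Wikipedia concepts.
 segments = [
     {"Election", "Parliament", "Vote"},
     {"Election", "Candidate", "Vote"},
     {"Earthquake", "Tsunami", "Japan"},
     {"Earthquake", "Nuclear power plant", "Japan"},
 ]
 
 THRESHOLD = 0.1  # below this, adjacent segments are treated as different topics
 boundaries = [
     i + 1
     for i, (a, b) in enumerate(zip(segments, segments[1:]))
     if concept_overlap(a, b) < THRESHOLD
 ]
 print(boundaries)  # [2] -> a topic boundary between the election and earthquake stories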
Elite size and resilience impact on global system structuration in social media Matei S.A.
Tan W.
Mingjie Zhu
Che-Hung Liu
Bertino E.
Foote J.
2014 International Conference on Collaboration Technologies and Systems, CTS 2014 English 2014 The paper examines the role played by the most productive members of social media systems in leading the project and influencing the degree of project structuration. The paper focuses on the findings of a large computational social science project that examines Wikipedia. 0 0
Encoding document semantic into binary codes space Yu Z.
Xuan Zhao
Lei Wang
Lecture Notes in Computer Science English 2014 We develop a deep neural network model to encode document semantics into compact binary codes, with the elegant property that semantically similar documents have similar embedding codes. The deep learning model is constructed with three stacked auto-encoders. The input of the lowest auto-encoder is the word-count vector representation of a document, while the learned hidden features of the deepest auto-encoder are thresholded to produce binary codes that represent the document semantics. Retrieving similar documents is very efficient: we simply return the documents whose codes have small Hamming distances to that of the query document. We illustrate the effectiveness of our model on two public real datasets - 20NewsGroup and Wikipedia - and the experiments demonstrate that the compact binary codes sufficiently embed the semantics of the documents and improve retrieval accuracy. 0 0
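The retrieval step described above is simple to sketch once binary codes exist; the codes below are random stand-ins rather than learned auto-encoder outputs, and numpy is assumed.

 import numpy as np
 
 rng = np.random.default_rng(0)
 codes = rng.integers(0, 2, size=(1000, 32), dtype=np.uint8)    # 1000 docs, 32-bit codes
 query = codes[42] ^ np.array([1] + [0] * 31, dtype=np.uint8)   # near-duplicate of doc 42
 
 hamming = (codes ^ query).sum(axis=1)   # XOR then popcount per document
 nearest = np.argsort(hamming)[:5]       # the five most similar documents
 print(nearest, hamming[nearest])        # doc 42 should come first, at distance 1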
Entity ranking based on Wikipedia for related entity finding Jinghua Zhang
Qu Y.
Shui Y.
Tian S.
Jisuanji Yanjiu yu Fazhan/Computer Research and Development Chinese 2014 Entity ranking is a very important step for related entity finding (REF). Although researchers have done much work on entity ranking based on Wikipedia for REF, some issues remain: the semi-automatic acquisition of the target type, the coarse-grained target type, the binary judgment of entity-type relevancy, and ignoring the effect of stop words in the calculation of entity-relation relevancy. This paper designs a framework which ranks entities through the calculation of a triple combination (of entity relevancy, entity-type relevancy and entity-relation relevancy) and acquires the best combination method through comparison of experimental results. A novel approach is proposed to calculate the entity-type relevancy. It can automatically acquire the fine-grained target type and the discriminative rules of its hyponym Wikipedia categories through inductive learning, and calculate entity-type relevancy by counting the number of categories which meet the discriminative rules. This paper also proposes a "cut stop words to rebuild relation" approach to calculate the entity-relation relevancy between a candidate entity and the source entity. Experimental results demonstrate that the proposed approaches can effectively improve the entity-ranking results and reduce the time consumed in the calculation. 0 0
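As a hedged illustration of combining the three signals named above (entity relevancy, entity-type relevancy and entity-relation relevancy), the sketch below ranks candidate entities by a weighted sum; the scores and weights are invented, not the paper's learned combination.

 candidates = {
     # entity: (entity_rel, type_rel, relation_rel) -- made-up scores in [0, 1]
     "Tim Berners-Lee": (0.8, 0.9, 0.7),
     "CERN":            (0.7, 0.2, 0.6),
     "HTML":            (0.5, 0.1, 0.4),
 }
 WEIGHTS = (0.4, 0.3, 0.3)  # one possible triple combination
 
 def combined(scores, weights=WEIGHTS):
     return sum(w * s for w, s in zip(weights, scores))
 
 ranked = sorted(candidates, key=lambda e: combined(candidates[e]), reverse=True)
 print(ranked)  # ['Tim Berners-Lee', 'CERN', 'HTML']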
Evaluating the helpfulness of linked entities to readers Yamada I.
Ito T.
Usami S.
Takagi S.
Hideaki Takeda
Takefuji Y.
HT 2014 - Proceedings of the 25th ACM Conference on Hypertext and Social Media English 2014 When we encounter an interesting entity (e.g., a person's name or a geographic location) while reading text, we typically search and retrieve relevant information about it. Entity linking (EL) is the task of linking entities in a text to the corresponding entries in a knowledge base, such as Wikipedia. Recently, EL has received considerable attention. EL can be used to enhance a user's text reading experience by streamlining the process of retrieving information on entities. Several EL methods have been proposed, though they tend to extract all of the entities in a document including unnecessary ones for users. Excessive linking of entities can be distracting and degrade the user experience. In this paper, we propose a new method for evaluating the helpfulness of linking entities to users. We address this task using supervised machine-learning with a broad set of features. Experimental results show that our method significantly outperforms baseline methods by approximately 5.7%-12% F1. In addition, we propose an application, Linkify, which enables developers to integrate EL easily into their web sites. 0 0
Evaluation of gastroenterology and hepatology articles on Wikipedia: Are they suitable as learning resources for medical students? Samy A. Azer (Eur J Gastroenterol Hepatol. 2014 Feb;26(2):155-63) doi:10.1097/MEG.0000000000000003 2014 BACKGROUND: With the changes introduced to medical curricula, medical students use learning resources on the Internet such as Wikipedia. However, the credibility of the medical content of Wikipedia has been questioned and there is no evidence to respond to these concerns. The aim of this paper was to critically evaluate the accuracy and reliability of the gastroenterology and hepatology information that medical students retrieve from Wikipedia. METHODS: The Wikipedia website was searched for articles on gastroenterology and hepatology on 28 May 2013. Copies of these articles were evaluated by three assessors independently using an appraisal form modified from the DISCERN instrument. The articles were scored for accuracy of content, readability, frequency of updating, and quality of references. RESULTS: A total of 39 articles were evaluated. Although the articles appeared to be well cited and reviewed regularly, several problems were identified with regard to depth of discussion of mechanisms and pathogenesis of diseases, as well as poor elaboration on different investigations. Analysis of the content showed a score ranging from 15.6±0.6 to 43.6±3.2 (mean±SD). The total number of references in all articles was 1233, and the number of references varied from 4 to 144 (mean±SD, 31.6±27.3). The number of citations from peer-reviewed journals published in the last 5 years was 242 (28%); however, several problems were identified in the list of references and citations made. The readability of articles was in the range of -8.0±55.7 to 44.4±1.4; for all articles the readability was 26±9.0 (mean±SD). The concordance between the assessors on applying the criteria had mean κ scores in the range of 0.61 to 0.79. CONCLUSION: Wikipedia is not a reliable source of information for medical students searching for gastroenterology and hepatology articles. Several limitations, deficiencies, and scientific errors have been identified in the articles examined. 0 0
Experimental comparison of semantic word clouds Barth L.
Kobourov S.G.
Pupyrev S.
Lecture Notes in Computer Science English 2014 We study the problem of computing semantics-preserving word clouds in which semantically related words are close to each other. We implement three earlier algorithms for creating word clouds and three new ones. We define several metrics for quantitative evaluation of the resulting layouts. Then the algorithms are compared according to these metrics, using two data sets of documents from Wikipedia and research papers. We show that two of our new algorithms outperform all the others by placing many more pairs of related words so that their bounding boxes are adjacent. Moreover, this improvement is not achieved at the expense of significantly worsened measurements for the other metrics. 0 0
Explaining authors' contribution to pivotal artifacts during mass collaboration in the Wikipedia's knowledge base Iassen Halatchliyski
Johannes Moskaliuk
Joachim Kimmerle
Ulrike Cress
International Journal of Computer-Supported Collaborative Learning English 2014 This article discusses the relevance of large-scale mass collaboration for computer-supported collaborative learning (CSCL) research, adhering to a theoretical perspective that views collective knowledge both as substance and as participatory activity. In an empirical study using the German Wikipedia as a data source, we explored collective knowledge as manifested in the structure of artifacts that were created through the collaborative activity of authors with different levels of contribution experience. Wikipedia's interconnected articles were considered at the macro level as a network and analyzed using a network analysis approach. The focus of this investigation was the relation between the authors' experience and their contribution to two types of articles: central pivotal articles within the artifact network of a single knowledge domain and boundary-crossing pivotal articles within the artifact network of two adjacent knowledge domains. Both types of pivotal articles were identified by measuring the network position of artifacts based on network analysis indices of topological centrality. The results showed that authors with specialized contribution experience in one domain predominantly contributed to central pivotal articles within that domain. Authors with generalized contribution experience in two domains predominantly contributed to boundary-crossing pivotal articles between the knowledge domains. Moreover, article experience (i.e., the number of articles in both domains an author had contributed to) was positively related to the contribution to both types of pivotal articles, regardless of whether an author had specialized or generalized domain experience. We discuss the implications of our findings for future studies in the field of CSCL. © 2013 International Society of the Learning Sciences, Inc. and Springer Science+Business Media New York. 0 0
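A small sketch of finding pivotal articles by topological centrality, assuming networkx: the tiny link graph below merely stands in for Wikipedia's article network (it is not data from the study), and betweenness centrality is used as one representative centrality index.

 import networkx as nx
 
 links = [
     ("Probability", "Statistics"), ("Statistics", "Regression"),
     ("Statistics", "Machine learning"), ("Machine learning", "Neural network"),
     ("Machine learning", "Algorithm"), ("Algorithm", "Computer science"),
 ]
 g = nx.Graph(links)
 
 centrality = nx.betweenness_centrality(g)
 pivotal = sorted(centrality, key=centrality.get, reverse=True)[:2]
 print(pivotal)  # articles bridging the domains score highest, e.g. ['Machine learning', 'Statistics']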
Exploiting Twitter and Wikipedia for the annotation of event images McParlane P.J.
Jose J.M.
SIGIR 2014 - Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval English 2014 With the rise in popularity of smart phones, there has been a recent increase in the number of images taken at large social (e.g. festivals) and world (e.g. natural disasters) events which are uploaded to image sharing websites such as Flickr. As with all online images, they are often poorly annotated, resulting in a difficult retrieval scenario. To overcome this problem, many photo tag recommendation methods have been introduced, however, these methods all rely on historical Flickr data which is often problematic for a number of reasons, including the time lag problem (i.e. in our collection, users upload images on average 50 days after taking them, meaning "training data" is often out of date). In this paper, we develop an image annotation model which exploits textual content from related Twitter and Wikipedia data which aims to overcome the discussed problems. The results of our experiments show and highlight the merits of exploiting social media data for annotating event images, where we are able to achieve recommendation accuracy comparable with a state-of-the-art model. Copyright 2014 ACM. 0 0
Exploiting Wikipedia for Evaluating Semantic Relatedness Mechanisms Ferrara F.
Tasso C.
Communications in Computer and Information Science English 2014 The semantic relatedness between two concepts is a measure that quantifies the extent to which two concepts are semantically related. In the area of digital libraries, several mechanisms based on semantic relatedness methods have been proposed. Visualization interfaces, information extraction mechanisms, and classification approaches are just some examples of mechanisms where semantic relatedness methods can play a significant role and have been successfully integrated. Due to the growing interest of researchers in areas like Digital Libraries, Semantic Web, Information Retrieval, and NLP, various approaches have been proposed for automatically computing semantic relatedness. However, despite the growing number of proposed approaches, there are still significant difficulties in evaluating the results returned by different methods. The limitations of existing evaluation mechanisms prevent an effective evaluation, and several works in the literature emphasize that the exploited approaches are rather inconsistent. In order to overcome this limitation, we propose a new evaluation methodology where people provide feedback about the semantic relatedness between concepts explicitly defined in digital encyclopedias. In this paper, we specifically exploit Wikipedia for generating a reliable dataset. 0 0
Exploiting the wisdom of the crowds for characterizing and connecting heterogeneous resources Kawase R.
Siehndel P.
Pereira Nunes B.
Herder E.
Wolfgang Nejdl
HT 2014 - Proceedings of the 25th ACM Conference on Hypertext and Social Media English 2014 Heterogeneous content is an inherent problem for cross-system search, recommendation and personalization. In this paper we investigate differences in topic coverage and the impact of topics in different kinds of Web services. We use entity extraction and categorization to create fingerprints that allow for meaningful comparison. As a basis taxonomy, we use the 23 main categories of Wikipedia Category Graph, which has been assembled over the years by the wisdom of the crowds. Following a proof of concept of our approach, we analyze differences in topic coverage and topic impact. The results show many differences between Web services like Twitter, Flickr and Delicious, which reflect users' behavior and the usage of each system. The paper concludes with a user study that demonstrates the benefits of fingerprints over traditional textual methods for recommendations of heterogeneous resources. 0 0
Exploratory search with semantic transformations using collaborative knowledge bases Yegin Genc WSDM 2014 - Proceedings of the 7th ACM International Conference on Web Search and Data Mining English 2014 Sometimes we search for simple facts. Other times we search for relationships between concepts. While existing information retrieval systems work well for simple searches, they are less satisfying for complex inquiries because of the ill-structured nature of many searches and the cognitive load involved in the search process. Search can be improved by leveraging the network of concepts that is maintained by collaborative knowledge bases such as Wikipedia. By treating exploratory search inquiries as networks of concepts - and then mapping documents to these concepts - exploratory search performance can be improved. This method is applied to an exploratory search task: given a journal abstract, abstracts are ranked based on their relevance to the seed abstract. The results show relevance scores comparable to state-of-the-art techniques, while at the same time providing better diversity. 0 0
Extended cognition and the explosion of knowledge Ludwig D. Philosophical Psychology English 2014 The aim of this article is to show that externalist accounts of cognition such as Clark and Chalmers' (1998) "active externalism" lead to an explosion of knowledge that is caused by online resources such as Wikipedia and Google. I argue that externalist accounts of cognition imply that subjects who integrate mobile Internet access in their cognitive routines have millions of standing beliefs on unexpected issues such as the birth dates of Moroccan politicians or the geographical coordinates of villages in southern Indonesia. Although many externalists propose criteria for the bounds of cognition that are designed to avoid this explosion of knowledge, I argue that these criteria are flawed and that active externalism has to accept that information resources such as Wikipedia and Google constitute extended cognitive processes. 0 0
Extracting Ontologies from Arabic Wikipedia: A Linguistic Approach Al-Rajebah N.I.
Al-Khalifa H.S.
Arabian Journal for Science and Engineering English 2014 As one of the important aspects of the semantic web, building ontological models has become a driving demand for developing a variety of semantic web applications. Over the years, much research has been conducted to investigate the process of generating ontologies automatically from semi-structured knowledge sources such as Wikipedia. Different ontology building techniques have been investigated, e.g., NLP tools and pattern matching, infoboxes, and structured knowledge sources (Cyc and WordNet). Looking at the results of previous approaches, we can see that the vast majority of employed techniques did not consider the linguistic aspect of Wikipedia. In this article, we present our solution to extract ontologies from Wikipedia using a linguistic approach based on the semantic field theory introduced by Jost Trier. Linguistic ontologies are significant in many applications for both linguists and Web researchers. We applied the proposed approach to the Arabic version of Wikipedia. The semantic relations were extracted from infoboxes, hyperlinks within infoboxes, and the lists of categories that articles belong to. Our system successfully extracted approximately 760,000 triples from the Arabic Wikipedia. We conducted three experiments to evaluate the system output, namely: a validation test, crowd evaluation, and domain experts' evaluation. The system output achieved an average precision of 65%. 0 0
Extracting semantic concept relations from Wikipedia Arnold P.
Rahm E.
ACM International Conference Proceeding Series English 2014 Background knowledge as provided by repositories such as WordNet is of critical importance for linking or mapping ontologies and related tasks. Since current repositories are quite limited in their scope and currentness, we investigate how to automatically build up improved repositories by extracting semantic relations (e.g., is-a and part-of relations) from Wikipedia articles. Our approach uses a comprehensive set of semantic patterns, finite state machines and NLP-techniques to process Wikipedia definitions and to identify semantic relations between concepts. Our approach is able to extract multiple relations from a single Wikipedia article. An evaluation for different domains shows the high quality and effectiveness of the proposed approach. 0 0
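A deliberately rough, regex-only sketch of the extraction idea above (the paper relies on a richer pattern set, finite state machines and NLP tools): take the defining sentence of an article and read off an is-a relation from the 'X is a/an/the ... Y' pattern. The definition sentences are examples, and treating the last word before 'of' or 'that' as the hypernym head is a simplifying assumption.

 import re
 
 definitions = {
     "Dog": "The dog is a domesticated descendant of the wolf.",
     "Paris": "Paris is the capital and most populous city of France.",
     "Python": "Python is a high-level, general-purpose programming language.",
 }
 
 def extract_is_a(title, sentence):
     """Crude heuristic: hypernym = last word of the phrase following 'is a/an/the'."""
     match = re.search(r"\bis (?:a|an|the)\b(.*?)(?:\bof\b|\bthat\b|[.;])", sentence)
     if not match:
         return None
     words = re.findall(r"[A-Za-z-]+", match.group(1))
     return (title, "is-a", words[-1]) if words else None
 
 for title, sentence in definitions.items():
     print(extract_is_a(title, sentence))
 # ('Dog', 'is-a', 'descendant'), ('Paris', 'is-a', 'city'), ('Python', 'is-a', 'language')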
From open-source software to Wikipedia: 'Backgrounding' trust by collective monitoring and reputation tracking De Laat P.B. Ethics and Information Technology English 2014 Open-content communities that focus on co-creation without requirements for entry have to face the issue of institutional trust in contributors. This research investigates the various ways in which these communities manage this issue. It is shown that communities of open-source software continue to rely mainly on hierarchy (reserving write access for higher echelons), which substitutes for (the need for) trust. Encyclopedic communities, though, largely avoid this solution. In the particular case of Wikipedia, which is confronted with persistent vandalism, another arrangement has been pioneered instead. Trust (i.e. full write access) is 'backgrounded' by means of a permanent mobilization of Wikipedians to monitor incoming edits. Computational approaches have been developed for this purpose, yielding both sophisticated monitoring tools that are used by human patrollers and bots that operate autonomously. Measures of reputation are also under investigation within Wikipedia; their incorporation in monitoring efforts, as an indicator of the trustworthiness of editors, is envisaged. These collective monitoring efforts are interpreted as focusing on avoiding possible damage being inflicted on Wikipedian spaces, thereby allowing the discretionary powers of editing to remain intact for all users. Further, the essential differences between backgrounding and substituting trust are elaborated. Finally it is argued that the Wikipedian monitoring of new edits, especially through its heavy reliance on computational tools, raises a number of moral questions that need to be answered urgently. 0 0
Fuzzy ontology alignment using background knowledge Todorov K.
Hudelot C.
Adrian Popescu
Geibel P.
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems English 2014 We propose an ontology alignment framework with two core features: the use of background knowledge and the ability to handle vagueness in the matching process and the resulting concept alignments. The procedure is based on the use of a generic reference vocabulary, which is used for fuzzifying the ontologies to be matched. The choice of this vocabulary is in general problem-dependent, although Wikipedia represents a general-purpose source of knowledge that can be used in many cases and even allows cross-language matching. In the first step of our approach, each domain concept is represented as a fuzzy set of reference concepts. In the next step, the fuzzified domain concepts are matched to one another, resulting in fuzzy descriptions of the matches of the original concepts. Based on these concept matches, we propose an algorithm that produces a merged fuzzy ontology that captures what is common to the source ontologies. The paper describes experiments in the multimedia domain using ontologies containing tagged images, as well as an evaluation of the approach in an information retrieval setting. The fuzzy approach has been compared to a classical crisp alignment with the help of a ground truth that was created based on human judgment. 0 0
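The fuzzification step above can be sketched as follows: each domain concept becomes a fuzzy set over reference concepts (membership values below are invented), and two concepts are compared with a fuzzy Jaccard score, which is one simple choice of fuzzy set similarity rather than necessarily the measure used in the paper.

 def fuzzy_jaccard(a, b):
     """Sum of minima over sum of maxima of the membership values."""
     keys = set(a) | set(b)
     num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
     den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
     return num / den if den else 0.0
 
 # A concept from ontology A and one from ontology B, fuzzified over reference concepts.
 automobile = {"Car": 0.9, "Vehicle": 0.7, "Engine": 0.4}
 motorcar   = {"Car": 0.8, "Vehicle": 0.6, "Road": 0.3}
 
 print(round(fuzzy_jaccard(automobile, motorcar), 3))  # 0.609 -> a strong match candidate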
Heterogeneous graph-based intent learning with queries, web pages and Wikipedia concepts Ren X.
Yafang Wang
Yu X.
Yan J.
Zheng Chen
Jangwhan Han
WSDM 2014 - Proceedings of the 7th ACM International Conference on Web Search and Data Mining English 2014 The problem of learning user search intents has attracted intensive attention from both industry and academia. However, state-of-the-art intent learning algorithms suffer from different drawbacks when using only a single type of data source. For example, query text has difficulty in distinguishing ambiguous queries; search logs are biased by the order of search results and users' noisy click behaviors. In this work, we for the first time leverage three types of objects, namely queries, web pages and Wikipedia concepts, collaboratively for learning generic search intents, and construct a heterogeneous graph to represent multiple types of relationships between them. A novel unsupervised method called heterogeneous graph-based soft-clustering is developed to derive an intent indicator for each object based on the constructed heterogeneous graph. With the proposed co-clustering method, one can enhance the quality of intent understanding by taking advantage of different types of data, which complement each other, and make the implicit intents easier to interpret with explicit knowledge from Wikipedia concepts. Experiments on two real-world datasets demonstrate the power of the proposed method: it achieves a 9.25% improvement in terms of NDCG on a search ranking task and a 4.67% improvement in terms of Rand index on an object co-clustering task compared to the best state-of-the-art method. 0 0
How collective intelligence emerges: Knowledge creation process in Wikipedia from microscopic viewpoint Kangpyo Lee Proceedings of the Workshop on Advanced Visual Interfaces AVI English 2014 Wikipedia, one of the richest human knowledge repositories on the Internet, has been developed by collective intelligence. To gain insight into Wikipedia, one may ask how initial ideas emerge and develop into a concrete article through the online collaborative process. Led by this question, the author performed a microscopic observation of the knowledge creation process for the recent article "Fukushima Daiichi nuclear disaster." The author not only collected the revision history of the article but also investigated interactions between collaborators by building a user-paragraph network to reveal the intellectual intervention of multiple authors. The knowledge creation process for the Wikipedia article was categorized into 4 major steps and 6 phases, from the beginning to the intellectual balance point where only revisions were made. To represent this phenomenon, the author developed a visaphor (digital visual metaphor) to digitally represent the article's evolving concepts and characteristics. The author then created a dynamic digital information visualization using particle effects and network graph structures. The visaphor reveals the interaction between users and their collaborative efforts as they created and revised paragraphs and debated aspects of the article. 0 0
Identifying the topic of queries based on domain specify ontology ChienTa D.C.
Thi T.P.
WIT Transactions on Information and Communication Technologies English 2014 In order to identify the topic of queries, a large number of past studies have relied on lexico-syntactic and handcrafted knowledge sources in Machine Learning and Natural Language Processing (NLP). In contrast, in this paper we introduce an application system that detects the topic of queries based on a domain-specific ontology. For this system, we focus on building the domain-specific ontology, which is composed of instances automatically extracted from available resources such as Wikipedia, WordNet, and the ACM Digital Library. The experimental evaluation with many cases of queries related to the information technology area shows that this system considerably outperforms a matching-and-identifying approach. 0 0
Improving contextual advertising matching by using Wikipedia thesaurus knowledge GuanDong Xu
ZongDa Wu
Li G.
Chen E.
Knowledge and Information Systems English 2014 As a prevalent type of Web advertising, contextual advertising refers to the placement of the most relevant commercial ads within the content of a Web page, to provide a better user experience and as a result increase the user's ad-click rate. However, due to the intrinsic problems of homonymy and polysemy, the low intersection of keywords, and a lack of sufficient semantics, traditional keyword matching techniques are not able to effectively handle contextual matching and retrieve relevant ads for the user, resulting in an unsatisfactory performance in ad selection. In this paper, we introduce a new contextual advertising approach to overcome these problems, which uses Wikipedia thesaurus knowledge to enrich the semantic expression of a target page (or an ad). First, we map each page into a keyword vector, upon which two additional feature vectors, the Wikipedia concept and category vector derived from the Wikipedia thesaurus structure, are then constructed. Second, to determine the relevant ads for a given page, we propose a linear similarity fusion mechanism, which combines the above three feature vectors in a unified manner. Last, we validate our approach using a set of real ads, real pages along with the external Wikipedia thesaurus. The experimental results show that our approach outperforms the conventional contextual advertising matching approaches and can substantially improve the performance of ad selection. © 2014 Springer-Verlag London. 0 0
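To illustrate the linear similarity fusion described above, the sketch below computes cosine similarity separately on the keyword, Wikipedia-concept and Wikipedia-category vectors of a page and an ad, then combines the three scores with fixed weights; all vectors and weights are toy values, not the paper's.

 import math
 
 def cosine(u, v):
     keys = set(u) | set(v)
     dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in keys)
     nu = math.sqrt(sum(x * x for x in u.values()))
     nv = math.sqrt(sum(x * x for x in v.values()))
     return dot / (nu * nv) if nu and nv else 0.0
 
 page = {
     "keywords":   {"camera": 3, "lens": 2, "travel": 1},
     "concepts":   {"Digital camera": 0.8, "Photography": 0.6},
     "categories": {"Photography": 0.7, "Consumer electronics": 0.5},
 }
 ad = {
     "keywords":   {"camera": 2, "zoom": 1},
     "concepts":   {"Digital camera": 0.9, "Photography": 0.4},
     "categories": {"Consumer electronics": 0.8},
 }
 weights = {"keywords": 0.5, "concepts": 0.3, "categories": 0.2}
 
 score = sum(weights[f] * cosine(page[f], ad[f]) for f in weights)
 print(round(score, 3))  # overall page-ad relevance, used to rank candidate ads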
Inferring attitude in online social networks based on quadratic correlation Chao Wang
Bulatov A.A.
Lecture Notes in Computer Science English 2014 The structure of an online social network in most cases cannot be described just by links between its members. We study online social networks, in which members may have certain attitude, positive or negative, toward each other, and so the network consists of a mixture of both positive and negative relationships. Our goal is to predict the sign of a given relationship based on the evidences provided in the current snapshot of the network. More precisely, using machine learning techniques we develop a model that after being trained on a particular network predicts the sign of an unknown or hidden link. The model uses relationships and influences from peers as evidences for the guess, however, the set of peers used is not predefined but rather learned during the training process. We use quadratic correlation between peer members to train the predictor. The model is tested on popular online datasets such as Epinions, Slashdot, and Wikipedia. In many cases it shows almost perfect prediction accuracy. Moreover, our model can also be efficiently updated as the underlying social network evolves. 0 0
Information overload and virtual institutions Memmi D. AI and Society English 2014 The Internet puts at our disposal an unprecedented wealth of information. Unfortunately much of this information is unreliable and its very quantity exceeds our cognitive capacity. To deal with the resulting information overload requires knowledge evaluation procedures that have traditionally been performed by social institutions, such as the press or universities. But the Internet has also given rise to a new type of social institution operating online, such as Wikipedia. We will analyze these virtual institutions to understand how they function, and to determine to what extent they can help manage the information overload. Their distributed and collaborative nature, their agility and low cost make them not only a very interesting social model, but also a rather fragile one. To be durable, virtual institutions probably need strong rules and norms, as well as an appropriate social framework. 0 0
Iranian EFL learners' vocabulary development through wikipedia Khany R.
Khosravian F.
English Language Teaching English 2014 Language teaching has come a long way in search of a remedy for language learners and teachers. Countless theories, approaches, and methods have been recommended. Nevertheless, more inclusive L2 theories and models ought to be considered to arrive at real classroom practices. One such crucial practice is authenticity, which is readily found in web-based materials in general and Wikipedia texts and tasks in particular. Along the same lines, and based on sound theoretical underpinnings, this study investigates the place of Wikipedia as a prospective tool for teaching and learning a major language component, namely vocabulary knowledge, with practical procedures. To this end, 36 intermediate Iranian EFL students assigned to a control and an experimental group took part in the study. The results of the tests administered revealed that the learners in the Wikipedia group surpassed those in the control group. Hence, Wikipedia is considered an encouraging authentic resource to assist EFL learners in improving their vocabulary knowledge. Implications of the present findings and suggestions for further research are discussed. 0 0
Kondenzer: Exploration and visualization of archived social media Alonso O.
Khandelwal K.
Proceedings - International Conference on Data Engineering English 2014 Modern social networks such as Twitter provide a platform for people to express their opinions on a variety of topics ranging from personal to global. While the factual part of this information and the opinions of various experts are archived by sources such as Wikipedia and reputable news articles, the opinion of the general public is drowned out in a sea of noise and 'un-interesting' information. In this demo we present Kondenzer - an offline system for condensing, archiving and visualizing social data. Specifically, we create digests of social data using a combination of filtering, duplicate removal and efficient clustering. This gives a condensed set of high quality data which is used to generate facets and create a collection that can be visualized using the PivotViewer control. 0 0
La connaissance est un réseau: Perspective sur l’organisation archivistique et encyclopédique Martin Grandjean Les Cahiers du Numérique French 2014 Network analysis does not revolutionize our objects of study; it revolutionizes the researcher's perspective on them. Organized as a network, information becomes relational. It makes the creation of new knowledge potentially possible, as with an encyclopedia whose links between entries weave a web that can be analyzed in terms of structural characteristics, or with an archive directory whose hierarchy is fundamentally altered by an index recomposing the information exchange network within a group of people. On the basis of two examples of tools for managing, preserving and promoting knowledge, the online encyclopedia Wikipedia and the archives of the Intellectual Cooperation of the League of Nations, this paper discusses the relationship between researchers and their object understood as a whole.
[Preprint version available]. 0 0
La négociation contre la démocratie : le cas Wikipedia Pierre-Carl Langlais Négociations French 2014 The first pillar of Wikipedia stresses that « Wikipedia is not a democracy ». The wikipedian communities tend to view democracy and polling as the alter ego (if not the nemesis) of negotiation and consensual thought. This article questions the validity and the motives of such a specific conception. Using the conceptual framework of Arend Lijphart, it describes the emergence of a joint system, which includes elements of majoritarian democracy in the general setting of a consensual democracy. The unconditional rejection of the democratic interpretation seems to have its own social use: it allows a pragmatic acclimation of pre-existing procedures within the static political system. 0 0
Large-scale author verification: Temporal and topical influences Van Dam M.
Claudia Hauff
SIGIR 2014 - Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval English 2014 The task of author verification is concerned with the question whether or not someone is the author of a given piece of text. Algorithms that extract writing style features from texts are used to determine how close in style different documents are. Currently, evaluations of author verification algorithms are restricted to small-scale corpora with usually less than one hundred test cases. In this work, we present a methodology to derive a large-scale author verification corpus based on Wikipedia Talkpages. We create a corpus based on English Wikipedia which is significantly larger than existing corpora. We investigate two dimensions on this corpus which so far have not received sufficient attention: the influence of topic and the influence of time on author verification accuracy. Copyright 2014 ACM. 0 0
Learning a lexical simplifier using Wikipedia Horn C.
Manduca C.
David Kauchak
52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference English 2014 In this paper we introduce a new lexical simplification approach. We extract over 30K candidate lexical simplifications by identifying aligned words in a sentence-aligned corpus of English Wikipedia with Simple English Wikipedia. To apply these rules, we learn a feature-based ranker using SVM rank trained on a set of labeled simplifications collected using Amazon's Mechanical Turk. Using human simplifications for evaluation, we achieve a precision of 76% with changes in 86% of the examples. 0 0
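The entry above describes learning a feature-based ranker over candidate simplifications. As a rough, hedged illustration of that general idea only (not the authors' code or features), the following Python sketch trains a linear SVM on pairwise difference vectors, a common way to approximate SVM-rank style learning; the three-dimensional feature vectors and gold rankings are invented.

# Illustrative sketch (not the authors' implementation): ranking candidate
# lexical simplifications with a linear model trained on pairwise preferences.
import numpy as np
from sklearn.svm import LinearSVC

# Each candidate substitution is described by a small feature vector,
# e.g. (corpus frequency, word length, context similarity) -- all toy values.
def pairwise_training_data(ranked_lists):
    """Turn gold rankings (best-first) into difference vectors for a pairwise ranker."""
    X, y = [], []
    for candidates in ranked_lists:
        for i in range(len(candidates)):
            for j in range(i + 1, len(candidates)):
                diff = np.asarray(candidates[i]) - np.asarray(candidates[j])
                X.append(diff)
                y.append(1)        # candidate i preferred over candidate j
                X.append(-diff)
                y.append(-1)       # mirrored pair
    return np.array(X), np.array(y)

# Toy gold data: two target words, candidates listed best-first.
gold = [
    [[9.1, 4, 0.8], [5.0, 7, 0.6], [2.2, 11, 0.5]],
    [[8.0, 5, 0.9], [3.1, 9, 0.4]],
]
X, y = pairwise_training_data(gold)
ranker = LinearSVC(C=1.0).fit(X, y)

def rank(candidates):
    """Score unseen candidates with the learned weights; higher means simpler."""
    scores = np.asarray(candidates) @ ranker.coef_.ravel()
    return np.argsort(-scores)

print(rank([[6.5, 6, 0.7], [9.5, 3, 0.85], [1.0, 12, 0.3]]))

In the paper the features would come from the aligned Wikipedia/Simple English Wikipedia corpus and the gold preferences from Mechanical Turk annotations.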
Learning to compute semantic relatedness using knowledge from wikipedia Zheng C.
Zhe Wang
Bie R.
Zhou M.
Lecture Notes in Computer Science English 2014 Recently, Wikipedia has become a very important resource for computing semantic relatedness (SR) between entities. Several approaches have already been proposed to compute SR based on Wikipedia. Most of the existing approaches use certain kinds of information in Wikipedia (e.g. links, categories, and texts) and compute the SR by empirically designed measures. We have observed that these approaches produce very different results for the same entity pair in some cases. Therefore, how to select appropriate features and measures to best approximate the human judgment on SR becomes a challenging problem. In this paper, we propose a supervised learning approach for computing SR between entities based on Wikipedia. Given two entities, our approach first maps entities to articles in Wikipedia; then different kinds of features of the mapped articles are extracted from Wikipedia, which are then combined with different relatedness measures to produce nine raw SR values of the entity pair. A supervised learning algorithm is proposed to learn the optimal weights of different raw SR values. The final SR is computed as the weighted average of raw SRs. Experiments on benchmark datasets show that our approach outperforms baseline methods. 0 0
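The abstract above combines several raw relatedness scores into a weighted average whose weights are learned from human judgments. Below is a minimal sketch of that combination step, assuming the raw scores are already computed; three toy measures and invented numbers stand in for the nine Wikipedia-based ones.

# Illustrative only: learn non-negative weights that best fit human judgments,
# then compute the final SR as the weighted average of the raw measures.
import numpy as np
from scipy.optimize import nnls

# raw[i] holds the raw SR values produced by different Wikipedia-based
# measures (links, categories, text, ...) for the i-th entity pair.
raw = np.array([
    [0.82, 0.65, 0.70],
    [0.10, 0.20, 0.05],
    [0.55, 0.40, 0.61],
    [0.95, 0.90, 0.88],
])
human = np.array([0.75, 0.10, 0.50, 0.92])   # gold human SR judgments

weights, _ = nnls(raw, human)                # non-negative least squares fit
weights = weights / weights.sum()            # normalise so the result is a weighted average

def semantic_relatedness(raw_scores):
    """Final SR as the learned weighted average of the raw measures."""
    return float(np.dot(weights, raw_scores))

print(weights, semantic_relatedness([0.60, 0.58, 0.66]))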
Les jeunes, leurs enseignants et Wikipédia : représentations en tension autour d’un objet documentaire singulier Sahut Gilles (Documentaliste-Sciences de l'information. 2014 June;51(2):p. 70-79) DOI: 10.3917/docsi.512.0070 2014 The collaborative encyclopedia Wikipedia is a heavily used resource, especially by high school and college students, whether for school work or personal reasons. However, for most teachers and information professionals, the jury is still out on the validity of its contents. Are young persons aware of its controversial reputation? What opinions, negative or positive, do they hold? How much confidence do they place in this information resource? This survey of high school and college students provides an opportunity to grasp the diversity of attitudes towards Wikipedia and also how these evolve as the students move up the grade ladder. More widely, this article studies the factors that condition the degree of acceptability of the contents of this unusual source of information. 0 0
Leveraging open source tools for Web mining Pennete K.C. Lecture Notes in Electrical Engineering English 2014 Web mining is a heavily pursued and often challenging research area. Using web mining, corporations and individuals alike seek to unravel the hidden knowledge underneath the diverse, gargantuan volumes of web data. This paper tries to present how a researcher can leverage the colossal knowledge available in open-access sites such as Wikipedia as a source of information rather than subscribing to closed networks of knowledge, and use open source tools rather than prohibitively priced commercial mining tools to do web mining. The paper illustrates step-by-step usage of R and RapidMiner in web mining to enable a novice to understand the concepts as well as apply them in the real world. 0 0
Lexical speaker identification in TV shows Roy A.
Bredin H.
Hartmann W.
Le V.B.
Barras C.
Gauvain J.-L.
Multimedia Tools and Applications English 2014 It is possible to use lexical information extracted from speech transcripts for speaker identification (SID), either on its own or to improve the performance of standard cepstral-based SID systems upon fusion. This has previously been established, typically using isolated speech from single speakers (NIST SRE corpora, parliamentary speeches). In contrast, this work applies lexical approaches for SID to a different type of data. It uses the REPERE corpus consisting of unsegmented multiparty conversations, mostly debates, discussions and Q&A sessions from TV shows. It is hypothesized that people give out clues to their identity when speaking in such settings, which this work aims to exploit. The impact on SID performance of the diarization front-end required to pre-process the unsegmented data is also measured. Four lexical SID approaches are studied in this work, including TFIDF, BM25 and LDA-based topic modeling. Results are analysed in terms of TV shows and speaker roles. Lexical approaches achieve low error rates for certain speaker roles such as anchors and journalists, sometimes lower than a standard cepstral-based Gaussian Supervector - Support Vector Machine (GSV-SVM) system. Also, in certain cases, the lexical system shows modest improvement over the cepstral-based system performance using score-level sum fusion. To highlight the potential of using lexical information not just to improve upon cepstral-based SID systems but as an independent approach in its own right, initial studies on cross-media SID are briefly reported. Instead of using speech data as all cepstral systems require, this approach uses Wikipedia texts to train lexical speaker models which are then tested on speech transcripts to identify speakers. © 2014 Springer Science+Business Media New York. 0 0
Lightweight domain ontology learning from texts: Graph theory-based approach using wikipedia Ahmed K.B.
Toumouh A.
Widdows D.
International Journal of Metadata, Semantics and Ontologies English 2014 Ontology engineering is the backbone of the Semantic Web. However, the construction of formal ontologies is a tough exercise that requires time and heavy costs. Ontology learning is thus a solution to this requirement. Since texts are massively available everywhere, embodying experts' knowledge and know-how, it is of great value to capture the knowledge existing within such texts. Our approach addresses the challenge of creating concept hierarchies from textual data, taking advantage of the Wikipedia encyclopaedia to achieve good-quality results. This paper presents a novel approach which essentially uses plain-text Wikipedia instead of its category system and works with a simplified algorithm to infer a domain taxonomy from a graph. © 2014 Inderscience Enterprises Ltd. 0 0
MIGSOM: A SOM algorithm for large scale hyperlinked documents inspired by neuronal migration Kotaro Nakayama
Yutaka Matsuo
Lecture Notes in Computer Science English 2014 The SOM (Self-Organizing Map), one of the most popular unsupervised machine learning algorithms, maps high-dimensional vectors into low-dimensional data (usually a 2-dimensional map). The SOM is widely known as a "scalable" algorithm because of its capability to handle large numbers of records. However, it is effective only when the vectors are small and dense. Although a number of studies on making the SOM scalable have been conducted, technical issues of scalability and performance for sparse high-dimensional data such as hyperlinked documents still remain. In this paper, we introduce MIGSOM, an SOM algorithm inspired by a new discovery on neuronal migration. The two major advantages of MIGSOM are its scalability for sparse high-dimensional data and its clustering visualization functionality. In this paper, we describe the algorithm and implementation in detail, and show the practicality of the algorithm in several experiments. We applied MIGSOM not only to experimental data sets but also to a large-scale real data set: Wikipedia's hyperlink data. 0 0
Massive query expansion by exploiting graph knowledge bases for image retrieval Guisado-Gamez J.
Dominguez-Sal D.
Larriba-Pey J.-L.
ICMR 2014 - Proceedings of the ACM International Conference on Multimedia Retrieval 2014 English 2014 Annotation-based techniques for image retrieval suffer from sparse and short image textual descriptions. Moreover, users are often not able to describe their needs with the most appropriate keywords. This situation is a breeding ground for a vocabulary mismatch problem, resulting in poor retrieval precision. In this paper, we propose a query expansion technique for queries expressed as keywords and short natural language descriptions. We present a new massive query expansion strategy that enriches queries using a graph knowledge base by identifying the query concepts and adding relevant synonyms and semantically related terms. We propose a topological graph enrichment technique that analyzes the network of relations among the concepts, and suggests semantically related terms by path and community detection analysis of the knowledge graph. We perform our expansions using two versions of Wikipedia as the knowledge base, achieving improvements in the system's precision of more than 27%. Copyright 2014 ACM. 0 0
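The entry above expands queries with synonyms and semantically related terms found by path and community analysis over a knowledge graph. The following sketch is only a loose illustration of that idea using networkx on a tiny hand-made graph; the real system works over a Wikipedia-derived knowledge base and a more elaborate enrichment strategy.

# A minimal sketch (assumptions throughout): expand a keyword query with
# nearby nodes and community members from a small hand-made concept graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([
    ("jaguar", "big cat"), ("big cat", "leopard"), ("big cat", "felidae"),
    ("jaguar", "rainforest"), ("rainforest", "amazon"),
    ("car", "vehicle"), ("vehicle", "engine"),
])

communities = list(greedy_modularity_communities(G))

def expand(query_terms, hops=2):
    expanded = set(query_terms)
    for term in query_terms:
        if term not in G:
            continue
        # 1) terms reachable within a few hops in the knowledge graph
        expanded.update(nx.single_source_shortest_path_length(G, term, cutoff=hops))
        # 2) terms that fall in the same detected community
        for com in communities:
            if term in com:
                expanded.update(com)
    return expanded

print(expand({"jaguar"}))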
Maturity assessment of Wikipedia medical articles Conti R.
Marzini E.
Spognardi A.
Matteucci I.
Mori P.
Petrocchi M.
Proceedings - IEEE Symposium on Computer-Based Medical Systems English 2014 Recent studies report that Internet users are increasingly looking for health information through the Wikipedia Medicine Portal, a collaboratively edited multitude of articles with contents often comparable to professionally edited material. Automatic quality assessment of Wikipedia medical articles has not received much attention from academia, and it presents distinctive open challenges. In this paper, we propose to tag the medical articles on the Wikipedia Medicine Portal, clearly stating their maturity degree, intended as a summarizing measure of several article properties. For this purpose, we adopt the Analytic Hierarchy Process, a well-known methodology for decision making, and we evaluate the maturity degree of more than 24,000 Wikipedia medical articles. The obtained results show how a qualitative analysis of medical content does not always overlap with a quantitative analysis (an example of which is shown in the paper), since important properties of an article can hardly be synthesized by quantitative features. This seems particularly true when the analysis considers the concept of maturity, as defined and verified in this work. 0 0
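The entry above applies the Analytic Hierarchy Process (AHP) to combine several article properties into a maturity degree. As a hedged sketch of the generic AHP recipe, rather than the authors' exact criteria or weights, one can derive priority weights from a pairwise-comparison matrix and score an article as a weighted sum:

# Generic AHP sketch: criterion weights come from the principal eigenvector of
# a pairwise-comparison matrix; the criteria and numbers here are invented.
import numpy as np

# Example criteria: completeness, referencing quality, currency.
# pairwise[i, j] says how much more important criterion i is than j (Saaty scale).
pairwise = np.array([
    [1.0, 3.0, 5.0],
    [1/3., 1.0, 2.0],
    [1/5., 1/2., 1.0],
])

eigvals, eigvecs = np.linalg.eig(pairwise)
principal = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
weights = principal / principal.sum()        # AHP priority vector

def maturity(properties):
    """properties: criterion scores for one article, already normalised to [0, 1]."""
    return float(np.dot(weights, properties))

print(weights, maturity([0.9, 0.6, 0.4]))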
Mining hidden concepts: Using short text clustering and wikipedia knowledge Yang C.-L.
Benjamasutin N.
Chen-Burger Y.-H.
Proceedings - 2014 IEEE 28th International Conference on Advanced Information Networking and Applications Workshops, IEEE WAINA 2014 English 2014 In recent years, there has been a rapidly increasing use of social networking platforms in the form of short-text communication. However, due to the short length of the texts used, the precise meaning and context of these texts are often ambiguous. To address this problem, we have devised a new community mining approach that is an adaptation and extension of text clustering, using Wikipedia as background knowledge. Based on this method, we are able to achieve a high level of precision in identifying the context of communication. Using the same methods, we are also able to efficiently identify hidden concepts in Twitter texts. Using Wikipedia as background knowledge considerably improved the performance of short text clustering. 0 0
Mining knowledge on relationships between objects from the web Xiaodan Zhang
Yasuhito Asano
Masatoshi Yoshikawa
IEICE Transactions on Information and Systems English 2014 How do global warming and agriculture influence each other? It is possible to answer the question by searching knowledge about the relationship between global warming and agriculture. As exemplified by this question, strong demands exist for searching relationships between objects. Mining knowledge about relationships on Wikipedia has been studied. However, it is desirable to search more diverse knowledge about relationships on the Web. By utilizing the objects constituting relationships mined from Wikipedia, we propose a new method to search images with surrounding text that include knowledge about relationships on the Web. Experimental results show that our method is effective and applicable in searching knowledge about relationships. We also construct a relationship search system named "Enishi" based on the proposed new method. Enishi supplies a wealth of diverse knowledge, including images with surrounding text, to help users understand relationships deeply, by complementarily utilizing knowledge from Wikipedia and the Web. 0 0
Mining the personal interests of microbloggers via exploiting wikipedia knowledge Fan M.
Zhou Q.
Zheng T.F.
Lecture Notes in Computer Science English 2014 This paper focuses on an emerging research topic: mining microbloggers' personalized interest tags from the microblogs they have posted. It is based on the intuition that microblogs reflect the daily interests and concerns of microbloggers. Previous studies regarded the microblogs posted by one microblogger as a whole document and adopted traditional keyword extraction approaches to select high-weighting nouns without considering the characteristics of microblogs. Given the limited textual information in microblogs and the implicit interest expression of microbloggers, we propose a new research framework for mining microbloggers' interests by exploiting Wikipedia, a huge online encyclopedia of word knowledge, to take up those challenges. Based on the semantic graph constructed from Wikipedia, the proposed semantic spreading model (SSM) can discover and leverage semantically related interest tags which do not occur in one's microblogs. Based on SSM, an interest mining system has been implemented and deployed on the biggest microblogging platform in China (Sina Weibo). We have also specified a suite of new evaluation metrics to make up for the shortage of evaluation functions in this research topic. Experiments conducted on a real-time dataset demonstrate that our approach outperforms state-of-the-art methods in identifying microbloggers' interests. 0 0
Multilinguals and wikipedia editing Hale S.A. WebSci 2014 - Proceedings of the 2014 ACM Web Science Conference English 2014 This article analyzes one month of edits to Wikipedia in order to examine the role of users editing multiple language editions (referred to as multilingual users). Such multilingual users may serve an important function in diffusing information across different language editions of the encyclopedia, and prior work has suggested this could reduce the level of self-focus bias in each edition. This study finds multilingual users are much more active than their single-edition (monolingual) counterparts. They are found in all language editions, but smaller-sized editions with fewer users have a higher percentage of multilingual users than larger-sized editions. About a quarter of multilingual users always edit the same articles in multiple languages, while just over 40% of multilingual users edit different articles in different languages. When non-English users do edit a second language edition, that edition is most frequently English. Nonetheless, several regional and linguistic cross-editing patterns are also present. 0 0
Mutual disambiguation for entity linking Charton E.
Meurs M.-J.
Jean-Louis L.
Marie-Pierre Gagnon
52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference English 2014 The disambiguation algorithm presented in this paper is implemented in SemLinker, an entity linking system. First, named entities are linked to candidate Wikipedia pages by a generic annotation engine. Then, the algorithm re-ranks candidate links according to mutual relations between all the named entities found in the document. The evaluation is based on experiments conducted on the test corpus of the TAC-KBP 2012 entity linking task. 0 0
Named entity evolution analysis on wikipedia Holzmann H.
Risse T.
WebSci 2014 - Proceedings of the 2014 ACM Web Science Conference English 2014 Accessing Web archives raises a number of issues caused by their temporal characteristics. Additional knowledge is needed to find and understand older texts. Especially entities mentioned in texts are subject to change. Most severe in terms of information retrieval are name changes. In order to find entities that have changed their name over time, search engines need to be aware of this evolution. We tackle this problem by analyzing Wikipedia in terms of entity evolutions mentioned in articles. We present statistical data on excerpts covering name changes, which will be used to discover similar text passages and extract evolution knowledge in future work. 0 0
No praise without effort: Experimental evidence on how rewards affect Wikipedia's contributor community Restivo M.
Van de Rijt A.
Information Communication and Society English 2014 The successful provision of public goods through mass volunteering over the Internet poses a puzzle to classic social science theories of human cooperation. A solution suggested by recent studies proposes that informal rewards (e.g. a thumbs-up, a badge, an editing award, etc.) can motivate participants by raising their status in the community, which acts as a selective incentive to continue contributing. Indeed, a recent study of Wikipedia found that receiving a reward had a large positive effect on the subsequent contribution levels of highly active contributors. While these findings are suggestive, they only pertained to already highly active contributors. Can informal rewards also serve as a mechanism to increase participation among less active contributors by initiating a virtuous cycle of work and reward? We conduct a field experiment on the online encyclopedia Wikipedia in which we bestowed rewards on randomly selected editors of varying productivity levels. Analysis of post-treatment activity shows that despite greater room for less active contributors to increase their productive efforts, rewards yielded increases in work only among already highly productive editors. On the other hand, rewards were associated with lower retention of less active contributors. These findings suggest that the incentive structure in peer production is broadly meritocratic, as highly active contributors accumulate the most rewards. However, this may also contribute to the divide between the stable core of highly prodigious producers and a peripheral population of less active contributors with shorter volunteer tenures. 0 0
Okinawa in Japanese and English Wikipedia Hale S.A. Conference on Human Factors in Computing Systems - Proceedings English 2014 This research analyzes edits by foreign-language users in Wikipedia articles about Okinawa, Japan, in the Japanese and English editions of the encyclopedia. Okinawa, home to both English and Japanese speaking users, provides a good case to look at content differences and cross-language editing in a small geographic area on Wikipedia. Consistent with prior work, this research finds large differences in the representations of Okinawa in the content of the two editions. The number of users crossing the language boundary to edit both editions is also extremely small. When users do edit in a non-primary language, they most frequently edit articles that have cross-language (interwiki) links, articles that are edited more by other users, and articles that have more images. Finally, the possible value of edits from foreign-language users and design possibilities to motivate wider contributions from foreign-language users are discussed. 0 0
On the influence propagation of web videos Liu J.
Yang Y.
Huang Z.
Shen H.T.
IEEE Transactions on Knowledge and Data Engineering English 2014 We propose a novel approach to analyze how a popular video is propagated in cyberspace, to identify whether it originated from a certain sharing site, and to identify how it reached its current popularity during its propagation. In addition, we also estimate its influence across different websites outside the major hosting website. Web video is gaining significance due to its rich and attention-grabbing content. This phenomenon is evidently amplified and accelerated by the advance of Web 2.0. When a video receives some degree of popularity, it tends to appear on various websites, including not only video-sharing websites but also news websites, social networks or even Wikipedia. Numerous video-sharing websites have hosted videos that reached a phenomenal level of visibility and popularity across the entire cyberspace. As a result, it is becoming more difficult to determine how the propagation took place: was the video a piece of original work that was intentionally uploaded to its major hosting site by the authors, did the video originate from some small site and then reach the sharing site after already gaining a good level of popularity, or did it originate from other places in cyberspace with the sharing site making it popular? Existing studies regarding this flow of influence are lacking. Literature that discusses the problem of estimating a video's influence across the whole of cyberspace also remains rare. In this article we introduce a novel framework to identify the propagation of popular videos from the major hosting site's perspective, and to estimate their influence. We define a Unified Virtual Community Space (UVCS) to model the propagation and influence of a video, and devise a novel learning method called Noise-reductive Local-and-Global Learning (NLGL) to effectively estimate a video's origin and influence. Without loss of generality, we conduct experiments on an annotated dataset collected from a major video-sharing site to evaluate the effectiveness of the framework. Around the collected videos and their ranks, some interesting discussions regarding the propagation and influence of videos, as well as user behavior, are also presented. 0 0
Ontology construction using multiple concept lattices Wang W.C.
Lu J.
Advanced Materials Research English 2014 The paper proposes an ontology construction approach that combines Fuzzy Formal Concept Analysis, Wikipedia and WordNet in a process that constructs multiple concept lattices for sub-domains divided from the target domain. The multiple-concept-lattice approach can mine concepts and determine relations between concepts automatically, and construct the domain ontology accordingly. This approach is suitable for large or complex domains that contain distinct sub-domains. 0 0
Open collaboration for innovation: Principles and performance Levine S.S.
Prietula M.J.
Organization Science English 2014 The principles of open collaboration for innovation (and production), once distinctive to open source software, are now found in many other ventures. Some of these ventures are Internet based: for example, Wikipedia and online communities. Others are off-line: they are found in medicine, science, and everyday life. Such ventures have been affecting traditional firms and may represent a new organizational form. Despite the impact of such ventures, their operating principles and performance are not well understood. Here we define open collaboration (OC), the underlying set of principles, and propose that it is a robust engine for innovation and production. First, we review multiple OC ventures and identify four defining principles. In all instances, participants create goods and services of economic value, they exchange and reuse each other's work, they labor purposefully with just loose coordination, and they permit anyone to contribute and consume. These principles distinguish OC from other organizational forms, such as firms or cooperatives. Next, we turn to performance. To understand the performance of OC, we develop a computational model, combining innovation theory with recent evidence on human cooperation. We identify and investigate three elements that affect performance: the cooperativeness of participants, the diversity of their needs, and the degree to which the goods are rival (subtractable). Through computational experiments, we find that OC performs well even in seemingly harsh environments: when cooperators are a minority, free riders are present, diversity is lacking, or goods are rival. We conclude that OC is viable and likely to expand into new domains. The findings also inform the discussion on new organizational forms, collaborative and communal. 0 0
Open domain question answering using Wikipedia-based knowledge model Ryu P.-M.
Jang M.-G.
Kim H.-K.
Information Processing and Management English 2014 This paper describes the use of Wikipedia as a rich knowledge source for a question answering (QA) system. We suggest multiple answer matching modules based on different types of semi-structured knowledge sources of Wikipedia, including article content, infoboxes, article structure, category structure, and definitions. These semi-structured knowledge sources each have their unique strengths in finding answers for specific question types, such as infoboxes for factoid questions, category structure for list questions, and definitions for descriptive questions. The answers extracted from multiple modules are merged using an answer merging strategy that reflects the specialized nature of the answer matching modules. Through an experiment, our system showed promising results, with a precision of 87.1%, a recall of 52.7%, and an F-measure of 65.6%, all of which are much higher than the results of a simple text analysis based system. © 2014 Elsevier Ltd. All rights reserved. 0 0
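The entry above merges answers produced by specialised modules (infoboxes, categories, definitions, article text) with a strategy that reflects each module's strengths per question type. The sketch below shows one simple, hypothetical weighting scheme for such a merge; the module names, weights and scores are invented and not taken from the paper.

# Hypothetical merge of answer candidates from specialised QA modules,
# weighting each module according to the question type it handles best.
from collections import defaultdict

MODULE_WEIGHTS = {
    "factoid":     {"infobox": 0.6, "article_text": 0.3, "definition": 0.1},
    "list":        {"category": 0.6, "article_text": 0.3, "infobox": 0.1},
    "descriptive": {"definition": 0.7, "article_text": 0.3},
}

def merge(question_type, module_candidates):
    """module_candidates: {module_name: [(answer, score), ...]}"""
    weights = MODULE_WEIGHTS[question_type]
    merged = defaultdict(float)
    for module, candidates in module_candidates.items():
        for answer, score in candidates:
            merged[answer] += weights.get(module, 0.0) * score
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

candidates = {
    "infobox":      [("1889", 0.9)],
    "article_text": [("1889", 0.5), ("1887", 0.4)],
}
print(merge("factoid", candidates))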
Opportunities for using Wiki technologies in building digital library models Mammadov E.C.O. Library Hi Tech News English 2014 Purpose: The purpose of this article is to research open-access, encyclopedia-structured methodologies for building digital libraries. In Azerbaijani libraries, one of the most challenging topics is organizing digital resources (books, audio-video materials, etc.). Wiki technologies introduce easy, collaborative and open tools which make it possible to implement this in digital library building. Design/methodology/approach: This paper looks at current practices and at ways of organizing information resources to make them more systematized, open and accessible. These activities are valuable for rural libraries, which are smaller and less well funded than main and central libraries in cities. Findings: The main finding of this article is how to organize digital resource management in libraries using the Wiki ideology. Originality/value: Wiki technologies determine ways of building digital library network models which are structurally different from already known models, as well as new directions in forming the information society and solving the problems encountered. 0 0
Preferences in Wikipedia abstracts: Empirical findings and implications for automatic entity summarization Xu D.
Cheng G.
Qu Y.
Information Processing and Management English 2014 The volume of entity-centric structured data grows rapidly on the Web. The description of an entity, composed of property-value pairs (a.k.a. features), has become very large in many applications. To avoid information overload, efforts have been made to automatically select a limited number of features to be shown to the user based on certain criteria, which is called automatic entity summarization. However, to the best of our knowledge, there is a lack of extensive studies on how humans rank and select features in practice, which can provide empirical support and inspire future research. In this article, we present a large-scale statistical analysis of the descriptions of entities provided by DBpedia and the abstracts of their corresponding Wikipedia articles, to empirically study, along several different dimensions, which kinds of features are preferable when humans summarize. Implications for automatic entity summarization are drawn from the findings. © 2013 Elsevier Ltd. All rights reserved. 0 0
Ranking Wikipedia article's data quality by learning dimension distributions Jangwhan Han
Chen K.
International Journal of Information Quality English 2014 As the largest free user-generated knowledge repository, Wikipedia has attracted great attention to its data quality in recent years. Automatic assessment of Wikipedia articles' data quality is a pressing concern. We observe that every Wikipedia quality class exhibits specific characteristics along different first-class quality dimensions, including accuracy, completeness, consistency and minimality. We propose to extract quality dimension values from an article's content and editing history using dynamic Bayesian network (DBN) and information extraction techniques. Next, we employ multivariate Gaussian distributions to model quality dimension distributions for each quality class, and combine multiple trained classifiers to predict an article's quality class, which can distinguish different quality classes effectively and robustly. Experiments demonstrate that our approach achieves good performance. 0 0
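The entry above models each quality class with a multivariate Gaussian over quality dimensions and predicts the class of a new article. Below is a minimal sketch of that classification step, assuming the dimension vectors (accuracy, completeness, consistency, minimality) have already been extracted; all numbers are invented.

# Sketch: fit one multivariate Gaussian per quality class from toy training
# vectors and predict the class with the highest density for a new article.
import numpy as np
from scipy.stats import multivariate_normal

training = {
    "featured": np.array([[0.90, 0.90, 0.80, 0.70], [0.95, 0.85, 0.90, 0.75],
                          [0.88, 0.92, 0.85, 0.80], [0.92, 0.90, 0.87, 0.72]]),
    "stub":     np.array([[0.40, 0.20, 0.50, 0.90], [0.35, 0.15, 0.55, 0.95],
                          [0.45, 0.25, 0.50, 0.85], [0.30, 0.20, 0.60, 0.90]]),
}

models = {
    label: multivariate_normal(mean=X.mean(axis=0),
                               cov=np.cov(X, rowvar=False),
                               allow_singular=True)
    for label, X in training.items()
}

def predict(dimensions):
    """Return the quality class whose Gaussian assigns the vector the highest density."""
    return max(models, key=lambda label: models[label].pdf(dimensions))

print(predict([0.85, 0.80, 0.82, 0.74]))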
Reader preferences and behavior on Wikipedia Janette Lehmann
Claudia Muller-Birn
David Laniado
Lalmas M.
Andreas Kaltenbrunner
HT 2014 - Proceedings of the 25th ACM Conference on Hypertext and Social Media English 2014 Wikipedia is a collaboratively-edited online encyclopaedia that relies on thousands of editors to both contribute articles and maintain their quality. Over the last years, research has extensively investigated this group of users, while another group of Wikipedia users, the readers, their preferences and their behavior have not been much studied. This paper makes this group and its activities visible and valuable to Wikipedia's editor community. We carried out a study on two datasets covering a 13-month period to obtain insights on users' preferences and reading behavior in Wikipedia. We show that the most read articles do not necessarily correspond to those frequently edited, suggesting some degree of non-alignment between user reading preferences and author editing preferences. We also identified that popular and often edited articles are read according to four main patterns, and that how an article is read may change over time. We illustrate how this information can provide valuable insights to Wikipedia's editor community. 0 0
Reading about explanations enhances perceptions of inevitability and foreseeability: A cross-cultural study with Wikipedia articles Oeberst A.
Von Der Beck I.
Nestler S.
Cognitive Processing English 2014 In hindsight, people often perceive events to be more inevitable and foreseeable than in foresight. According to Causal Model Theory (Nestler et al. in J Exp Psychol Learn Mem Cogn 34: 1043-1054, 2008), causal explanations are crucial for such hindsight distortions to occur. The present study provides further empirical support for this notion but extends previous findings in several ways. First, ecologically valid materials were used. Second, the effect of causal information on hindsight distortions was investigated in the realm of previously known events. Third, cross-cultural differences in reasoning (analytic vs. holistic) were taken into account. Specifically, German and Vietnamese participants in our study were presented with Wikipedia articles about the nuclear power plant in Fukushima Daiichi, Japan. They read either the version that existed before the nuclear disaster unfolded (Version 1) or the article that existed 8 weeks after the catastrophe commenced (Version 2). Only the latter contained elaborations on causal antecedents and therefore provided an explanation for the disaster. Reading that version led participants to perceive the nuclear disaster to be more likely inevitable and foreseeable when compared to reading Version 1. Cultural background did not exert a significant effect on these perceptions. Hence, hindsight distortions were obtained for ecologically valid materials even if the event was already known. Implications and directions for future research are discussed. 0 0
Research on XML data mining model based on multi-level technology Zhu J.-X. Advanced Materials Research English 2014 The era of Web 2.0 has arrived, and more and more Web 2.0 applications, such as social networks and Wikipedia, have emerged. As an industrial standard of Web 2.0, the XML technique has also attracted more and more researchers. However, how to mine valuable information from massive XML documents is still in its infancy. In this paper, we study a basic problem of XML data mining: the XML data mining model. We design a multi-level XML data mining model, propose a multi-level data mining method, and list some research issues in the implementation of XML data mining systems. 0 0
Revision graph extraction in Wikipedia based on supergram decomposition and sliding update Wu J.
Mizuho Iwaihara
IEICE Transactions on Information and Systems English 2014 As one of the popular social media platforms that many people have turned to in recent years, the collaborative encyclopedia Wikipedia provides information from a more "Neutral Point of View" than others. Towards this core principle, plenty of effort has been put into collaborative contribution and editing. The trajectories of how such collaboration unfolds across revisions are valuable for group dynamics and social media research, which suggests that we should extract the underlying derivation relationships among revisions from the chronologically sorted revision history in a precise way. In this paper, we propose a revision graph extraction method based on supergram decomposition in a document collection of near-duplicates. The plain text of each revision is measured by its frequency distribution of supergrams, i.e., variable-length token sequences that stay the same across revisions. We show that this method performs the task more effectively than existing methods. 0 0
Revision history: Translation trends in Wikipedia McDonough Dolmaya J. Translation Studies English 2014 Wikipedia is a well-known example of a website with content developed entirely through crowdsourcing. It has over 4 million articles in English alone, and content in 284 other language versions. While the articles in the different versions are often written directly in the respective target-language, translations also take place. Given that a previous study suggested that many of English Wikipedia's translators had neither formal training in translation nor professional work experience as translators, it is worth examining the quality of the translations produced. This paper uses Mossop's taxonomy of editing and revising procedures to explore a corpus of translated Wikipedia articles to determine how often transfer and language/style problems are present in these translations and assess how these problems are addressed. © 2014 Taylor & Francis. 0 0
SCooL: A system for academic institution name normalization Jacob F.
Javed F.
Zhao M.
McNair M.
2014 International Conference on Collaboration Technologies and Systems, CTS 2014 English 2014 Named Entity Normalization involves normalizing recognized entities to a concrete, unambiguous real-world entity. Within the purview of the online job posting domain, academic institution name normalization provides a beneficial opportunity for CareerBuilder (CB). Accurate and detailed normalization of academic institutions is important for performing sophisticated labor market dynamics analysis. In this paper we present and discuss the design and implementation of sCooL, an academic institution name normalization system designed to supplant the existing manually maintained mapping system at CB. We also discuss the specific challenges that led to the design of sCooL. sCooL leverages Wikipedia to create academic institution name mappings from a school database which is created from job applicant resumes posted on our website. The mappings created are utilized to build a database which is then used for normalization. sCooL provides the flexibility to integrate mappings collected from different curated and non-curated sources. The system is able to identify malformed data and to distinguish K-12 schools from universities and colleges. We conduct an extensive comparative evaluation of the semi-automated sCooL system against the existing manual mapping implementation and show that sCooL provides better coverage with improved accuracy. 0 0
Semantic full-text search with broccoli Holger Bast
Baurle F.
Buchhold B.
Haussmann E.
SIGIR 2014 - Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval English 2014 We combine search in triple stores with full-text search into what we call semantic full-text search. We provide a fully functional web application that allows the incremental construction of complex queries on the English Wikipedia combined with the facts from Freebase. The user is guided by context-sensitive suggestions of matching words, instances, classes, and relations after each keystroke. We also provide a powerful API, which may be used for research tasks or as a back end, e.g., for a question answering system. Our web application and public API are available under http://broccoli.cs.uni-freiburg.de. 0 0
Semi-automatic construction of plane geometry ontology based-on WordNet and Wikipedia Fu H.-G.
LeBo Liu
Zhong X.-Q.
Jiang Y.
Sun Y.-Y.
Dianzi Keji Daxue Xuebao/Journal of the University of Electronic Science and Technology of China Chinese 2014 Ontology occupies a central position in the Semantic Web's hierarchical structure. In the current state of ontology construction research, manual construction makes it difficult to ensure efficiency and scalability, while automatic construction makes it hard to guarantee interoperability. This paper presents a semi-automatic domain ontology construction method based on WordNet and Wikipedia. First, we construct the top-level ontology and then reuse the WordNet structure to expand the terminology and terminology levels in the depth of the ontology. Furthermore, we expand the relationships and supplement the terminology in the breadth of the ontology by referring to page information from Wikipedia. Finally, this ontology construction method is applied to the elementary geometry domain. The experiments show that this method can greatly improve the efficiency of ontology construction and ensure the quality of the ontology to some degree. 0 0
Sentence similarity by combining explicit semantic analysis and overlapping n-grams Vu H.H.
Villaneau J.
Said F.
Marteau P.-F.
Lecture Notes in Computer Science English 2014 We propose a similarity measure between sentences which combines a knowledge-based measure, that is a lighter version of ESA (Explicit Semantic Analysis), and a distributional measure, Rouge. We used this hybrid measure with two French domain-orientated corpora collected from the Web and we compared its similarity scores to those of human judges. In both domains, ESA and Rouge perform better when they are mixed than they do individually. Besides, using the whole Wikipedia base in ESA did not prove necessary since the best results were obtained with a low number of well selected concepts. 0 0
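The entry above mixes a concept-based (ESA-style) similarity with an n-gram overlap score. The sketch below illustrates such a combination with a tiny invented term-concept table standing in for ESA concept vectors and a Jaccard-style bigram overlap standing in for Rouge; the mixing weight is also an assumption, not the paper's setting.

# Hedged sketch: hybrid sentence similarity = alpha * concept-vector cosine
# + (1 - alpha) * bigram overlap. Concept weights and alpha are invented.
import numpy as np

CONCEPT_WEIGHTS = {            # term -> weights over three toy "concepts"
    "wine":    [0.9, 0.1, 0.0],
    "grape":   [0.8, 0.2, 0.0],
    "harvest": [0.3, 0.7, 0.0],
    "price":   [0.0, 0.2, 0.8],
}

def concept_vector(sentence):
    vec = np.zeros(3)
    for tok in sentence.lower().split():
        vec += np.array(CONCEPT_WEIGHTS.get(tok, [0.0, 0.0, 0.0]))
    return vec

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def ngram_overlap(s1, s2, n=2):
    grams = lambda s: {tuple(s.lower().split()[i:i + n])
                       for i in range(len(s.split()) - n + 1)}
    g1, g2 = grams(s1), grams(s2)
    return len(g1 & g2) / max(len(g1 | g2), 1)

def hybrid_similarity(s1, s2, alpha=0.6):
    return alpha * cosine(concept_vector(s1), concept_vector(s2)) \
         + (1 - alpha) * ngram_overlap(s1, s2)

print(hybrid_similarity("the grape harvest was early",
                        "wine grape harvest started early"))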
Shades: Expediting Kademlia's lookup process Einziger G.
Friedman R.
Kantor Y.
Lecture Notes in Computer Science English 2014 Kademlia is considered to be one of the most effective key based routing protocols. It is nowadays implemented in many file sharing peer-to-peer networks such as BitTorrent, KAD, and Gnutella. This paper introduces Shades, a combined routing/caching scheme that significantly shortens the average lookup process in Kademlia and improves its load handling. The paper also includes an extensive performance study demonstrating the benefits of Shades and compares it to other suggested alternatives using both synthetic workloads and traces from YouTube and Wikipedia. 0 0
Shrinking digital gap through automatic generation of WordNet for Indian languages Jain A.
Tayal D.K.
Rai S.
AI & SOCIETY English 2014 Hindi ranks fourth in the world in terms of number of speakers. In spite of that, it has less than 0.1% presence on the web due to a lack of competent lexical resources, a key reason behind the digital gap caused by the language barrier among Indian masses. Following in the footsteps of the renowned lexical resource English WordNet, 18 Indian languages initiated building WordNets under the project IndoWordNet. India is a multilingual country with around 122 languages and 234 mother tongues. Many Indian languages still do not have any reliable lexical resource, and the coverage of the numerous WordNets in progress is still far from the average value of 25,792. The tedious manual process and high cost are major reasons behind the unsatisfactory coverage and limping progress. In this paper, we discuss the socio-cultural and economic impact of providing Internet accessibility and present an approach for the automatic generation of WordNets to tackle the lack of competent lexical resources. Problems such as accuracy, association of language-specific glosses/examples, and incorrect back-translations, which arise when deviating from the traditional approach of compilation by lexicographers, are resolved by utilising the Wikipedia editions available for Indian languages. © 2014 Springer-Verlag London. 0 0
Snuggle: Designing for efficient socialization and ideological critique Aaron Halfaker
Geiger R.S.
Loren Terveen
Conference on Human Factors in Computing Systems - Proceedings English 2014 Wikipedia, the encyclopedia "anyone can edit", has become increasingly less so. Recent academic research and popular discourse illustrate the often aggressive ways newcomers are treated by veteran Wikipedians. These are complex sociotechnical issues, bound up in infrastructures based on problematic ideologies. In response, we worked with a coalition of Wikipedians to design, develop, and deploy Snuggle, a new user interface that served two critical functions: making the work of newcomer socialization more effective, and bringing visibility to instances in which Wikipedians' current practice of gatekeeping socialization breaks down. Snuggle supports positive socialization by helping mentors quickly find newcomers whose good-faith mistakes were reverted as damage. Snuggle also supports ideological critique and reflection by bringing visibility to the consequences of viewing newcomers through a lens of suspiciousness. 0 0
Sticky wikis Berghel H. Computer English 2014 After 20-plus years of observing and developing online reference websites, it is clear that the biggest hurdle to reliability still has not been overcome. 0 0
Supply chains under strain Harris S. Engineering and Technology English 2014 The article discusses how to tackle the impact of climate change on supply chain risk. Businesses and governments need to start planning for a world with a changed climate. In particular, industries dependent on food, water, energy or ecosystem services need to scrutinize the resilience and viability of their supply chains. The researchers' vision is for the website to eventually host data that cover hundreds of industrial sectors across individual states, provinces and cities, allowing users to track the flows of specific goods at a scale appropriate for the effects of natural disasters. For example, users could find out exactly how many batteries are shipped from Osaka to California, or investigate the impact of a flood in Bangalore on particular industries worldwide. This resource will soon serve as the 'Wikipedia' for supply chain information, and with it the researchers intend to illustrate the potential impact of climate change to politicians and global businesses. 0 0
Supporting navigation in Wikipedia by information visualization: Extended evaluation measures Wu I.-C.
Vakkari P.
Journal of Documentation English 2014 Purpose: The authors introduce two semantics-based navigation applications that facilitate information-seeking activities in internal link-based web sites in Wikipedia. These applications aim to help users find concepts within a topic and related articles on a given topic quickly, and then gain topical knowledge from internal link-based encyclopedia web sites. The paper aims to discuss these issues. Design/methodology/approach: The WNavis application consists of three information visualization (IV) tools: a topic network, a hierarchical topic tree and summaries for topics. The WikiMap application consists of a topic network. The goal of the topic network and topic tree tools is to help users find the major concepts of a topic and identify relationships between these major concepts easily. In addition, in order to locate specific information and enable users to explore and read topic-related articles quickly, the topic tree and the summaries for topics tools support users in gaining topical knowledge. The authors then apply the k-clique cohesion indicator to analyze the sub-topics of the seed query and determine the best clustering results via the cosine measure. The authors utilize four metrics, namely correctness, time cost, usage behaviors, and satisfaction, to evaluate the three interfaces. These metrics measure both the outputs and outcomes of the applications. As a baseline system for evaluation, the authors used a traditional Wikipedia interface. For the evaluation, the authors conducted an experimental user study with 30 participants. Findings: The results indicate that both WikiMap and WNavis supported users in identifying concepts and their relations better compared to the baseline. In topical tasks, WNavis outperformed both WikiMap and the baseline system. Although there were no time differences in finding concepts or answering topical questions, the test systems provided users with a greater gain per time unit. The users of WNavis leaned on the hierarchy tree instead of other tools, whereas WikiMap users used the topic map. Research limitations/implications: The findings have implications for the design of IR support tools in knowledge-intensive web sites that help users explore topics and concepts. Originality/value: The authors explored to what extent the use of each IV support tool contributed to successful exploration of topics in search tasks. The authors propose extended task-based evaluation measures to understand how each application provides useful context for users to accomplish the tasks and attain the search goals. That is, the authors not only evaluate the output of the search results, e.g. the number of relevant items retrieved, but also the outcome provided by the system for assisting users to attain the search goal. 0 0
Tagging Scientific Publications Using Wikipedia and Natural Language Processing Tools Lopuszynski M.
Bolikowski L.
Communications in Computer and Information Science English 2014 In this work, we compare two simple methods of tagging scientific publications with labels reflecting their content. As the first source of labels, Wikipedia is employed; the second label set is constructed from the noun phrases occurring in the analyzed corpus. We examine the statistical properties and the effectiveness of both approaches on a dataset consisting of the abstracts of 0.7 million scientific documents deposited in the arXiv preprint collection. We believe that the obtained tags can later be applied as useful document features in various machine learning tasks (document similarity, clustering, topic modelling, etc.). 0 0
Term impact-based web page ranking Al-Akashi F.H.
Inkpen D.
ACM International Conference Proceeding Series English 2014 Indexing Web pages based on content is a crucial step in a modern search engine. A variety of methods and approaches exist to support web page ranking. In this paper, we describe a new approach for obtaining measures for Web page ranking. Unlike other recent approaches, it exploits meta-terms extracted from titles and URLs for indexing the contents of web documents. We use the term impact to correlate each meta-term with the document's content, rather than term frequency and other similar techniques. Our approach also uses the structural knowledge available in Wikipedia for better query expansion and formulation. Evaluation with automatic metrics provided by TREC reveals that our approach is effective for building the index and for retrieval. We present retrieval results from the ClueWeb collection, for a set of test queries, for two tasks: an ad hoc retrieval task and a diversity task (which aims at retrieving relevant pages that cover different aspects of the queries). 0 0
Text summarization using Wikipedia Sankarasubramaniam Y.
Krishnan Ramanathan
Ghosh S.
Information Processing and Management English 2014 Automatic text summarization has been an active field of research for many years. Several approaches have been proposed, ranging from simple position and word-frequency methods to learning and graph-based algorithms. The advent of human-generated knowledge bases like Wikipedia offers a further possibility in text summarization - they can be used to understand the input text in terms of salient concepts from the knowledge base. In this paper, we study a novel approach that leverages Wikipedia in conjunction with graph-based ranking. Our approach is to first construct a bipartite sentence-concept graph, and then rank the input sentences using iterative updates on this graph. We consider several models for the bipartite graph, and derive convergence properties under each model. Then, we take up personalized and query-focused summarization, where the sentence ranks additionally depend on user interests and queries, respectively. Finally, we present a Wikipedia-based multi-document summarization algorithm. An important feature of the proposed algorithms is that they enable real-time incremental summarization - users can first view an initial summary, and then request additional content if interested. We evaluate the performance of our proposed summarizer using the ROUGE metric, and the results show that leveraging Wikipedia can significantly improve summary quality. We also present results from a user study, which suggests that using incremental summarization can help in better understanding news articles. © 2014 Elsevier Ltd. All rights reserved. 0 0
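The entry above ranks sentences by iterative updates on a bipartite sentence-concept graph. As a hedged, simplified stand-in for that scheme (not the authors' models or convergence analysis), the sketch below runs a mutual-reinforcement iteration on a toy incidence matrix; in the paper the concepts would be salient Wikipedia concepts.

# Simplified mutual reinforcement on a bipartite sentence-concept graph:
# sentences gain score from their concepts and vice versa, iterated to a
# stable ranking. The sentences and incidence matrix are toy data.
import numpy as np

sentences = [
    "Sentence about solar panels and energy.",
    "Sentence about energy policy.",
    "Sentence about football.",
]
# A[i, j] = 1 if sentence i mentions concept j (concepts: solar, energy, sport)
A = np.array([
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
], dtype=float)

def rank_sentences(A, iterations=50):
    s = np.ones(A.shape[0])              # sentence scores
    c = np.ones(A.shape[1])              # concept scores
    for _ in range(iterations):
        s = A @ c                        # sentences gain from their concepts
        c = A.T @ s                      # concepts gain from their sentences
        s /= np.linalg.norm(s)           # keep the iteration bounded
        c /= np.linalg.norm(c)
    return np.argsort(-s)

for idx in rank_sentences(A)[:2]:        # a 2-sentence "summary"
    print(sentences[idx])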
The anyone-can-edit syndrome: Intercreation stories of three featured articles on wikipedia Mattus M. Nordicom Review English 2014 The user-generated wiki encyclopedia Wikipedia was launched in January 2001 by Jimmy Wales and Larry Sanger. Wikipedia has become the world’s largest wiki encyclopedia, and behind many of its entries are interesting stories of creation, or rather intercreation, since Wikipedia is produced by a large number of contributors. Using the slogan “the free encyclopedia that anyone can edit” (Wikipedia 2013), Wikipedia invites everyone to participate, but the participants do not necessarily represent all kinds of individuals or interests – there might be an imbalance affecting the content as well as the perspective conveyed. As a phenomenon Wikipedia is quite complex, and can be studied from many different angles, for instance through the articles’ history and the edits to them. This paper is based on a study of Featured Articles from the Swedish Wikipedia. Three articles, Fri vilja [Free will], Fjäll [Fell], and Edgar Allan Poe, are chosen from a list of Featured Articles that belongs to the subject field culture. The articles’ development has been followed from their very first versions in 2003/2004 to edits made at the end of 2012. The aim is to examine the creation, or intercreation, processes of the articles, and the collaborative production. The data come from non-article material such as revision history pages, article material, and some complementary statistics. Principally the study has a qualitative approach, but with some quantitative elements. 0 0
The business and politics of search engines: A comparative study of Baidu and Google's search results of Internet events in China Jiang M. New Media and Society English 2014 Despite growing interest in search engines in China, relatively few empirical studies have examined their sociopolitical implications. This study fills several research gaps by comparing query results (N = 6320) from China's two leading search engines, Baidu and Google, focusing on accessibility, overlap, ranking, and bias patterns. Analysis of query results of 316 popular Chinese Internet events reveals the following: (1) after Google moved its servers from Mainland China to Hong Kong, its results are equally if not more likely to be inaccessible than Baidu's, and Baidu's filtering is much subtler than the Great Firewall's wholesale blocking of Google's results; (2) there is low overlap (6.8%) and little ranking similarity between Baidu's and Google's results, implying different search engines, different results and different social realities; and (3) Baidu rarely links to its competitors Hudong Baike or Chinese Wikipedia, while their presence in Google's results is much more prominent, raising search bias concerns. These results suggest search engines can be architecturally altered to serve political regimes, arbitrary in rendering social realities and biased toward self-interest. 0 0
The last click: Why users give up information network navigation Scaria A.T.
Philip R.M.
Robert West
Leskovec J.
WSDM 2014 - Proceedings of the 7th ACM International Conference on Web Search and Data Mining English 2014 An important part of finding information online involves clicking from page to page until an information need is fully satisfied. This is a complex task that can easily be frustrating and force users to give up prematurely. An empirical analysis of what makes users abandon click-based navigation tasks is hard, since most passively collected browsing logs do not specify the exact target page that a user was trying to reach. We propose to overcome this problem by using data collected via Wikispeedia, a Wikipedia-based human-computation game, in which users are asked to navigate from a start page to an explicitly given target page (both Wikipedia articles) by only tracing hyperlinks between Wikipedia articles. Our contributions are two-fold. First, by analyzing the differences between successful and abandoned navigation paths, we aim to understand what types of behavior are indicative of users giving up their navigation task. We also investigate how users make use of back clicks during their navigation. We find that users prefer backtracking to high-degree nodes that serve as landmarks and hubs for exploring the network of pages. Second, based on our analysis, we build statistical models for predicting whether a user will finish or abandon a navigation task, and if the next action will be a back click. Being able to predict these events is important as it can potentially help us design more human-friendly browsing interfaces and retain users who would otherwise have given up navigating a website. 0 0
The reasons why people continue editing Wikipedia content - task value confirmation perspective Lai C.-Y.
Yang H.-L.
Behaviour and Information Technology English 2014 Recently, Wikipedia has garnered increasing public attention. However, few studies have examined the intentions of individuals who edit Wikipedia content. Furthermore, previous studies ascribed a 'knowledge sharing' label to Wikipedia content editors. However, in this work, Wikipedia can be viewed as a platform that allows individuals to show their expertise. This study investigates the underlying reasons that drive individuals to edit Wikipedia content. Based on expectation-confirmation theory and expectancy-value theory for achievement motivations, we propose an integrated model that incorporates psychological and contextual perspectives. Wikipedians from the English-language Wikipedia site were invited to take part in a survey. Partial least squares was applied to test our proposed model. Analytical results indicated and confirmed that subjective task value, commitment, and procedural justice significantly affected the satisfaction of Wikipedians, and that satisfaction significantly influenced continuance intention to edit Wikipedia content. © 2014 Taylor & Francis. 0 0
Tibetan-Chinese named entity extraction based on comparable corpus Sun Y.
Zhao Q.
Applied Mechanics and Materials English 2014 Tibetan-Chinese named entity extraction is the foundation of Tibetan-Chinese information processing, which provides the basis for machine translation and cross-language information retrieval research. We used the multi-language links of Wikipedia to obtain a Tibetan-Chinese comparable corpus, and combined sentence length, word matching and entity boundary words to carry out sentence alignment. We then extracted Tibetan-Chinese named entities from the aligned comparable corpus in three ways: (1) natural labeling information extraction; (2) extraction of the links between Tibetan entries and Chinese entries; (3) a sequence intersection method, which takes each sentence as a word sequence, recognizes Chinese named entities in the Chinese sentences and intersects them with the aligned Tibetan sentences. Finally, the experimental results show that the extraction method based on the comparable corpus is effective. 0 0
Title named entity recognition using wikipedia and abbreviation generation Park Y.
Kang S.
Seo J.
2014 International Conference on Big Data and Smart Computing, BIGCOMP 2014 English 2014 In this paper, we propose a title named entity recognition model using Wikipedia and abbreviation generation. The proposed model automatically extracts title named entities from Wikipedia, so that constant renewal is possible without additional cost. Also, in order to establish a dictionary of title named entity abbreviations, generation rules are used to produce abbreviation candidates and abbreviations are selected through web search methods. We also propose a statistical model that recognizes title named entities using CRFs (Conditional Random Fields). The proposed model uses lexical information, a named entity dictionary, and an abbreviation dictionary, and achieves title named entity recognition performance of 82.1% in our experiments. 0 0
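As a rough illustration of the abbreviation-dictionary step described in the entry above (the generation rules, the selection heuristic and all names below are hypothetical, not the authors' code), one can generate rule-based abbreviation candidates for a title and keep the candidate that co-occurs most often with the full title in a document collection, standing in for the paper's web-search selection step. A minimal Python sketch:

    # Sketch: rule-based abbreviation candidates plus a simple corpus-based selection.
    def abbreviation_candidates(title):
        """Generate candidate abbreviations for a multi-word title (illustrative rules)."""
        words = title.split()
        initials = "".join(w[0] for w in words).upper()               # "Game of Thrones" -> "GOT"
        no_stop = "".join(w[0] for w in words
                          if w.lower() not in {"of", "the", "a", "an"}).upper()
        return {initials, no_stop}

    def select_abbreviation(title, candidates, documents):
        """Pick the candidate that most often co-occurs with the full title in the corpus,
        a stand-in for the web-search frequency check described in the abstract."""
        return max(candidates,
                   key=lambda c: sum(1 for d in documents if c in d and title in d))

    docs = ["Game of Thrones (GOT) returns this spring", "GOT fans rejoice"]
    cands = abbreviation_candidates("Game of Thrones")
    print(select_abbreviation("Game of Thrones", cands, docs))        # -> "GOT"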
Topic modeling approach to named entity linking Huai B.-X.
Bao T.-F.
Zhu H.-S.
Qiaoling Liu
Ruan Jian Xue Bao/Journal of Software Chinese 2014 Named entity linking (NEL) is an advanced technology which links a given named entity to an unambiguous entity in the knowledge base, and thus plays an important role in a wide range of Internet services, such as online recommender systems and Web search engines. However, with the explosive increase of online information and applications, traditional NEL solutions face growing challenges to linking accuracy due to the large number of online entities. Moreover, entities are usually associated with different semantic topics (e.g., the entity "Apple" could be either a fruit or a brand), whereas the latent topic distributions of words and entities in the same documents should be similar. To address this issue, this paper proposes a novel topic modeling approach to named entity linking. Different from existing works, the new approach provides a comprehensive framework for NEL and can uncover the semantic relationship between documents and named entities. Specifically, it first builds a knowledge base of unambiguous entities with the help of Wikipedia. Then, it proposes a novel bipartite topic model to capture the latent topic distribution between entities and documents. Therefore, given a new named entity, the new approach can link it to the unambiguous entity in the knowledge base by calculating their semantic similarity with respect to latent topics. Finally, the paper conducts extensive experiments on a real-world data set to evaluate our approach for named entity linking. Experimental results clearly show that the proposed approach outperforms other state-of-the-art baselines by a significant margin. 0 0
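As a rough illustration of topic-based linking (a much simpler stand-in for the bipartite topic model described above, with made-up numbers), the sketch below assumes we already have a latent-topic distribution for the mention's document and one per candidate entity, and links the mention to the candidate whose distribution is closest by cosine similarity:

    import numpy as np

    def link_entity(doc_topics, candidate_topics):
        """Return the candidate whose topic distribution is most similar to the document's."""
        def cosine(a, b):
            a, b = np.asarray(a, float), np.asarray(b, float)
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        return max(candidate_topics, key=lambda e: cosine(doc_topics, candidate_topics[e]))

    # Toy distributions over three latent topics (fruit, technology, music).
    doc = [0.1, 0.8, 0.1]                                   # "Apple" mentioned in a tech article
    candidates = {"Apple (fruit)": [0.7, 0.1, 0.2],
                  "Apple Inc.":    [0.1, 0.8, 0.1]}
    print(link_entity(doc, candidates))                     # -> "Apple Inc."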
Topic modeling for wikipedia link disambiguation Skaggs B.
Getoor L.
ACM Transactions on Information Systems English 2014 Many articles in the online encyclopedia Wikipedia have hyperlinks to ambiguous article titles; these ambiguous links should be replaced with links to unambiguous articles, a process known as disambiguation. We propose a novel statistical topic model based on link text, which we refer to as the Link Text Topic Model (LTTM), that we use to suggest new link targets for ambiguous links. To evaluate our model, we describe a method for extracting ground truth for this link disambiguation task from edits made to Wikipedia in a specific time period. We use this ground truth to demonstrate the superiority of LTTM over other existing link- and content-based approaches to disambiguating links in Wikipedia. Finally, we build a web service that uses LTTM to make suggestions to human editors wanting to fix ambiguous links in Wikipedia. 0 0
Topic ontology-based efficient tag recommendation approach for blogs Subramaniyaswamy V.
Pandian S.C.
International Journal of Computational Science and Engineering English 2014 Efficient tag recommendation systems are required to help users in the task of searching, indexing and browsing appropriate blog content. Tag generation has become popular for annotating web content, other blogs, photos, videos and music. Tag recommendation is the task of suggesting valuable and informative tags for a new item based on its content. We propose a novel approach based on topic ontology for tag recommendation. The proposed approach intelligently generates tag suggestions for blogs. In this paper, we construct a topic ontology based on Wikipedia categories and WordNet semantic relationships to make the ontology more meaningful and reliable. A spreading activation algorithm is applied to assign interest scores to existing blog content and tags. High-quality tags are suggested based on the significance of the interest score. Evaluation shows that applying the topic ontology with the spreading activation algorithm makes tag recommendation more effective compared to collaborative tag recommendation. Our proposed approach also offers solutions to tag spamming, sentiment analysis and popularity. Finally, we report the results of an experiment which improves the performance of the tag recommendation approach. 0 0
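Spreading activation, used in the entry above to score concepts, is easy to sketch: activation starts at concepts matched in the blog post and decays as it propagates to neighbouring concepts in the ontology graph. The graph, seeds and decay factor below are illustrative only (no normalization or other refinements):

    def spread_activation(graph, seeds, decay=0.5, iterations=2):
        """Propagate interest scores from seed concepts to their neighbours.
        graph: {concept: [neighbour, ...]}, seeds: {concept: initial_score}."""
        scores = dict(seeds)
        for _ in range(iterations):
            updates = {}
            for node, score in scores.items():
                for neighbour in graph.get(node, []):
                    updates[neighbour] = updates.get(neighbour, 0.0) + decay * score
            for node, extra in updates.items():
                scores[node] = scores.get(node, 0.0) + extra
        return scores

    ontology = {"Python (programming language)": ["Programming languages", "Software"],
                "Programming languages": ["Computer science"]}
    print(spread_activation(ontology, {"Python (programming language)": 1.0}))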
Towards linking libraries and Wikipedia: Automatic subject indexing of library records with Wikipedia concepts Joorabchi A.
Mahdi A.E.
Journal of Information Science English 2014 In this article, we first argue for the importance of and timely need for linking libraries and Wikipedia to improve the quality of their services to information consumers, as such linkage will enrich the quality of Wikipedia articles and at the same time increase the visibility of library resources, which are currently overlooked to a large degree. We then describe the development of an automatic system for subject indexing of library metadata records with Wikipedia concepts as an important step towards library-Wikipedia integration. The proposed system is based on first identifying all Wikipedia concepts occurring in the metadata elements of library records. This is then followed by training and deploying generic machine learning algorithms to automatically select those concepts which most accurately reflect the core subjects of the library materials whose records are being indexed. We have assessed the performance of the developed system using standard information retrieval measures of precision, recall and F-score on a dataset consisting of 100 library metadata records manually indexed with a total of 469 Wikipedia concepts. The evaluation results show that the developed system is capable of achieving an averaged F-score as high as 0.92. 0 0
Towards twitter user recommendation based on user relations and taxonomical analysis Slabbekoorn K.
Noro T.
Tokuda T.
Frontiers in Artificial Intelligence and Applications English 2014 Twitter is one of the largest social media platforms in the world. Although Twitter can be used as a tool for getting valuable information related to a topic of interest, it is a hard task for us to find users to follow for this purpose. In this paper, we present a method for Twitter user recommendation based on user relations and taxonomical analysis. This method first finds some users to follow related to the topic of interest by giving keywords representing the topic, then picks up users who continuously provide related tweets from the user list. In the first phase we rank users based on user relations obtained from tweet behaviour of each user such as retweet and mention (reply), and we create topic taxonomies of each user from tweets posted during different time periods in the second phase. Experimental results show that our method is very effective in recommending users who post tweets related to the topic of interest all the time rather than users who post related tweets just temporarily. 0 0
Tracking topics on revision graphs of wikipedia edit history Li B.
Wu J.
Mizuho Iwaihara
Lecture Notes in Computer Science English 2014 Wikipedia is known as the largest online encyclopedia, in which articles are constantly contributed and edited by users. Past revisions of articles after edits are also accessible to the public for examining the edit process. However, the degree of similarity between revisions is very high, making it difficult to generate summaries for these small changes from revision graphs of the Wikipedia edit history. In this paper, we propose an approach to give a concise summary to a given scope of revisions, by utilizing supergrams, which are consecutive unchanged term sequences. 0 0
Trendspedia: An Internet observatory for analyzing and visualizing the evolving web Kang W.
Tung A.K.H.
Chen W.
Li X.
Song Q.
Zhang C.
Fei Zhao
Xiaofeng Zhou
Proceedings - International Conference on Data Engineering English 2014 The popularity of social media services has been changing the way information is acquired in modern society. Meanwhile, massive amounts of information are generated every single day. To extract useful knowledge, much effort has been invested in analyzing social media contents, e.g., (emerging) topic discovery. With these findings, however, users may still find it hard to obtain knowledge of interest that matches their preferences. In this paper, we present a novel system which brings proper context to continuously incoming social media contents, such that mass information can be indexed, organized and analyzed around Wikipedia entities. Four data analytics tools are employed in the system. Three of them aim to enrich each Wikipedia entity by analyzing the relevant contents while the other one builds an information network among the most relevant Wikipedia entities. With our system, users can easily pinpoint valuable information and knowledge they are interested in, as well as navigate to other closely related entities through the information network for further exploration. 0 0
TripBuilder: A tool for recommending sightseeing tours Brilhante I.
MacEdo J.A.
Nardini F.M.
Perego R.
Renso C.
Lecture Notes in Computer Science English 2014 We propose TripBuilder, a user-friendly and interactive system for planning a time-budgeted sightseeing tour of a city on the basis of the points of interest and the patterns of movements of tourists mined from user-contributed data. The knowledge needed to build the recommendation model is entirely extracted in an unsupervised way from two popular collaborative platforms: Wikipedia and Flickr. TripBuilder interacts with the user by means of a friendly Web interface that allows her to easily specify personal interests and time budget. The proposed sightseeing tour can then be explored and modified. We present the main components of the system. 0 0
Trust, but verify: Predicting contribution quality for knowledge base construction and curation Tan C.H.
Agichtein E.
Ipeirotis P.
Evgeniy Gabrilovich
WSDM 2014 - Proceedings of the 7th ACM International Conference on Web Search and Data Mining English 2014 The largest publicly available knowledge repositories, such as Wikipedia and Freebase, owe their existence and growth to volunteer contributors around the globe. While the majority of contributions are correct, errors can still creep in, due to editors' carelessness, misunderstanding of the schema, malice, or even lack of accepted ground truth. If left undetected, inaccuracies often degrade the experience of users and the performance of applications that rely on these knowledge repositories. We present a new method, CQUAL, for automatically predicting the quality of contributions submitted to a knowledge base. Significantly expanding upon previous work, our method holistically exploits a variety of signals, including the user's domains of expertise as reflected in her prior contribution history, and the historical accuracy rates of different types of facts. In a large-scale human evaluation, our method exhibits precision of 91% at 80% recall. Our model verifies whether a contribution is correct immediately after it is submitted, significantly alleviating the need for post-submission human reviewing. 0 0
Twelve years of wikipedia research Judit Bar-Ilan
Noa Aharony
WebSci 2014 - Proceedings of the 2014 ACM Web Science Conference English 2014 Wikipedia was formally launched in 2001, but the first research papers mentioning it appeared only in 2002. Since then it has raised a huge amount of interest in the research community. At first mainly the content creation processes and the quality of the content were studied, but later on it was picked up as a valuable source for data mining and for testing. In this paper we present preliminary results that characterize the research done on and using Wikipedia since 2002. 0 0
Two is bigger (and better) than one: The wikipedia bitaxonomy project Flati T.
Vannella D.
Pasini T.
Roberto Navigli
52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference English 2014 We present WiBi, an approach to the automatic creation of a bitaxonomy for Wikipedia, that is, an integrated taxonomy of Wikipedia pages and categories. We leverage the information available in either one of the taxonomies to reinforce the creation of the other taxonomy. Our experiments show higher quality and coverage than state-of-the-art resources like DBpedia, YAGO, MENTA, WikiNet and WikiTaxonomy. 0 0
Uneven Geographies of User-Generated Information: Patterns of Increasing Informational Poverty Mark Graham
Hogan B.
Straumann R.K.
Medhat A.
Annals of the Association of American Geographers English 2014 Geographies of codified knowledge have always been characterized by stark core-periphery patterns, with some parts of the world at the center of global voice and representation and many others invisible or unheard. Many have pointed to the potential for radical change, however, as digital divides are bridged and 2.5 billion people are now online. With a focus on Wikipedia, which is one of the world's most visible, most used, and most powerful repositories of user-generated content, we investigate whether we are now seeing fundamentally different patterns of knowledge production. Even though Wikipedia consists of a massive cloud of geographic information about millions of events and places around the globe put together by millions of hours of human labor, the encyclopedia remains characterized by uneven and clustered geographies: There is simply not a lot of content about much of the world. The article then moves to describe the factors that explain these patterns, showing that although just a few conditions can explain much of the variance in geographies of information, some parts of the world remain well below their expected values. These findings indicate that better connectivity is only a necessary but not a sufficient condition for the presence of volunteered geographic information about a place. We conclude by discussing the remaining social, economic, political, regulatory, and infrastructural barriers that continue to disadvantage many of the world's informational peripheries. The article ultimately shows that, despite many hopes that a democratization of connectivity will spur a concomitant democratization of information production, Internet connectivity is not a panacea and can only ever be one part of a broader strategy to deepen the informational layers of places. 0 0
User interests identification on Twitter using a hierarchical knowledge base Kapanipathi P.
Jain P.
Venkataramani C.
Sheth A.
Lecture Notes in Computer Science English 2014 Twitter, due to its massive growth as a social networking platform, has been in focus for the analysis of its user generated content for personalization and recommendation tasks. A common challenge across these tasks is identifying user interests from tweets. Semantic enrichment of Twitter posts, to determine user interests, has been an active area of research in the recent past. These approaches typically use available public knowledge-bases (such as Wikipedia) to spot entities and create entity-based user profiles. However, exploitation of such knowledge-bases to create richer user profiles is yet to be explored. In this work, we leverage hierarchical relationships present in knowledge-bases to infer user interests expressed as a Hierarchical Interest Graph. We argue that the hierarchical semantics of concepts can enhance existing systems to personalize or recommend items based on a varied level of conceptual abstractness. We demonstrate the effectiveness of our approach through a user study which shows an average of approximately eight of the top ten weighted hierarchical interests in the graph being relevant to a user's interests. 0 0
Using linked data to mine RDF from Wikipedia's tables Munoz E.
Hogan A.
Mileo A.
WSDM 2014 - Proceedings of the 7th ACM International Conference on Web Search and Data Mining English 2014 The tables embedded in Wikipedia articles contain rich, semi-structured encyclopaedic content. However, the cumulative content of these tables cannot be queried against. We thus propose methods to recover the semantics of Wikipedia tables and, in particular, to extract facts from them in the form of RDF triples. Our core method uses an existing Linked Data knowledge-base to find pre-existing relations between entities in Wikipedia tables, suggesting the same relations as holding for other entities in analogous columns on different rows. We find that such an approach extracts RDF triples from Wikipedia's tables at a raw precision of 40%. To improve the raw precision, we define a set of features for extracted triples that are tracked during the extraction phase. Using a manually labelled gold standard, we then test a variety of machine learning methods for classifying correct/incorrect triples. One such method extracts 7.9 million unique and novel RDF triples from over one million Wikipedia tables at an estimated precision of 81.5%. 0 0
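The core idea of the entry above, taking a relation that already holds between the entities of one table row and suggesting it for the entities in the other rows, can be sketched in a few lines. Everything here (the toy knowledge base, the table and the predicate name) is invented for illustration:

    # Toy knowledge base of (subject, predicate, object) triples.
    known_triples = {("Dublin", "dbo:country", "Ireland"),
                     ("Paris", "dbo:country", "France")}

    table = [("Dublin", "Ireland"),    # a row where a relation is already known
             ("Berlin", "Germany"),
             ("Madrid", "Spain")]

    def candidate_predicates(table, kb):
        """Predicates that already link column 0 to column 1 in some row of the table."""
        return {p for (s, p, o) in kb for (a, b) in table if (s, o) == (a, b)}

    def extract_triples(table, kb):
        """Suggest every known predicate as holding for every row of the table."""
        return {(a, p, b) for p in candidate_predicates(table, kb) for (a, b) in table}

    for triple in sorted(extract_triples(table, known_triples)):
        print(triple)    # new triples for Berlin/Germany and Madrid/Spain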
Validating and extending semantic knowledge bases using video games with a purpose Vannella D.
Jurgens D.
Scarfini D.
Toscani D.
Roberto Navigli
52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference English 2014 Large-scale knowledge bases are important assets in NLP. Frequently, such resources are constructed through automatic mergers of complementary resources, such as WordNet and Wikipedia. However, manually validating these resources is prohibitively expensive, even when using methods such as crowdsourcing. We propose a cost-effective method of validating and extending knowledge bases using video games with a purpose. Two video games were created to validate concept-concept and concept-image relations. In experiments comparing with crowdsourcing, we show that video game-based validation consistently leads to higher-quality annotations, even when players are not compensated. 0 0
Virtual tutorials, Wikipedia books, and multimedia-based teaching for blended learning support in a course on algorithms and data structures Knackmuss J.
Creutzburg R.
Proceedings of SPIE - The International Society for Optical Engineering English 2014 The aim of this paper is to describe the benefit and support of virtual tutorials, Wikipedia books and multimedia-based teaching in a course on Algorithms and Data Structures. We describe our work and the experience gained from virtual tutorials held in Netucate iLinc sessions and from the use of various multimedia and animation elements to support a deeper understanding of the ordinary classroom lectures on Algorithms and Data Structures for undergraduate computer science students. We describe the benefits, form, style and contents of those virtual tutorials. Furthermore, we mention the advantage of Wikipedia books for supporting the blended learning process using modern mobile devices. Finally, we give some first statistical measures of improved students' scores after introducing this new form of teaching support. 0 0
Visualizing large-scale human collaboration in Wikipedia Biuk-Aghai R.P.
Pang C.-I.
Si Y.-W.
Future Generation Computer Systems English 2014 Volunteer-driven large-scale human-to-human collaboration has become common in the Web 2.0 era. Wikipedia is one of the foremost examples of such large-scale collaboration, involving millions of authors writing millions of articles on a wide range of subjects. The collaboration on some popular articles numbers hundreds or even thousands of co-authors. We have analyzed the co-authoring across entire Wikipedias in different languages and have found it to follow a geometric distribution in all the language editions we studied. In order to better understand the distribution of co-author counts across different topics, we have aggregated content by category and visualized it in a form resembling a geographic map. The visualizations produced show that there are significant differences of co-author counts across different topics in all the Wikipedia language editions we visualized. In this article we describe our analysis and visualization method and present the results of applying our method to the English, German, Chinese, Swedish and Danish Wikipedias. We have evaluated our visualization against textual data and found it to be superior in usability, accuracy, speed and user preference. 0 0
Ways of worldmaking in Wikipedia: Reality, legitimacy and collaborative knowledge making Fullerton L.
Ettema J.
Media, Culture and Society English 2014 The on-going social construction of reality, according to Berger and Luckmann's classic treatise, entails both an explanation of the social order which ascribes "cognitive validity to its objectivated meanings" and a justification of that order which provides "a normative dignity to its practical imperatives." The implication is that our knowledge of social reality integrates cognitive facts and normative values to continuously legitimize that reality. We explore this integration of fact and value in an unexpected setting: the "talk pages" of the online encyclopedia Wikipedia in which discussions of article creation are recorded. Our analysis of these discussions draws on Nelson Goodman's Ways of Worldmaking, another classic on the social construction of reality, which catalogues strategies for producing a worldview. We utilize Goodman's theories in four cases of Wikipedia article creation - two histories, "Iraq War" and "Afghanistan War," and two biographies, "George W. Bush" and "Barack Obama" - all of which reveal how knowledge products are created. 0 0
What influences online deliberation? A wikipedia study Xiao L.
Askin N.
Journal of the Association for Information Science and Technology English 2014 In this paper we describe a study aimed at evaluating and improving the quality of online deliberation. We consider the rationales used by participants in deletion discussions on Wikipedia in terms of the literature on democratic and online deliberation and collaborative information quality. Our findings suggest that most participants in these discussions were concerned with the notability and credibility of the topics presented for deletion, and that most presented rationales rooted in established site policies. We found that article topic and unanimity (or lack thereof) were among the factors that tended to affect the outcome of the debate. Our results also suggested that the blackout of the site in response to the proposed Stop Online Piracy Act (SOPA) law affected the decisions of deletion debates that occurred close to the event. We conclude by suggesting implications of this study for broader considerations of online information quality and democratic deliberation. 0 0
What makes a good team of Wikipedia editors? A preliminary statistical analysis Bukowski L.
Jankowski-Lorek M.
Jaroszewicz S.
Sydow M.
Lecture Notes in Computer Science English 2014 The paper studies the quality of teams of Wikipedia authors with a statistical approach. We report the preparation of a dataset containing numerous behavioural and structural attributes and its subsequent analysis and use to predict team quality. We have performed exploratory analysis using partial regression to remove the influence of attributes not related to the team itself. The analysis confirmed that a key factor significantly influencing an article's quality is the discussion between team members. The second part of the paper successfully uses machine learning models to predict good articles based on features of the teams that created them. 0 0
WikiReviz: An edit history visualization for wiki systems Wu J.
Mizuho Iwaihara
Lecture Notes in Computer Science English 2014 Wikipedia maintains a linear record of edit history with article content and meta-information for each article, which conceals precious information on how each article has evolved. This demo describes the motivation and features of WikiReviz, a visualization system for analyzing edit history in Wikipedia and other wiki systems. From the official exported edit history of a single Wikipedia article, WikiReviz reconstructs the derivation relationships among revisions precisely and efficiently by revision graph extraction and indicates meaningful article evolution progress by edit summarization. 0 0
WikiWho: Precise and Efficient Attribution of Authorship of Revisioned Content Fabian Flöck
Maribel Acosta
World Wide Web Conference 2014 English 2014 Revisioned text content is present in numerous collaboration platforms on the Web, most notably Wikis. To track authorship of text tokens in such systems has many potential applications; the identification of main authors for licensing reasons or tracing collaborative writing patterns over time, to name some. In this context, two main challenges arise. First, it is critical for such an authorship tracking system to be precise in its attributions, to be reliable for further processing. Second, it has to run efficiently even on very large datasets, such as Wikipedia. As a solution, we propose a graph-based model to represent revisioned content and an algorithm over this model that tackles both issues effectively. We describe the optimal implementation and design choices when tuning it to a Wiki environment. We further present a gold standard of 240 tokens from English Wikipedia articles annotated with their origin. This gold standard was created manually and confirmed by multiple independent users of a crowdsourcing platform. It is the first gold standard of this kind and quality and our solution achieves an average of 95% precision on this data set. We also perform a first-ever precision evaluation of the state-of-the-art algorithm for the task, exceeding it by over 10% on average. Our approach outperforms the execution time of the state-of-the-art by one order of magnitude, as we demonstrate on a sample of over 240 English Wikipedia articles. We argue that the increased size of an optional materialization of our results by about 10% compared to the baseline is a favorable trade-off, given the large advantage in runtime performance. 0 0
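Token-level authorship attribution can be illustrated with a deliberately crude rule (this is not the WikiWho graph algorithm, just the underlying idea): attribute each token to the author of the earliest revision in which it appears.

    def attribute_tokens(revisions):
        """revisions: list of (author, text) in chronological order.
        Returns {token: author of the first revision containing it}.
        A first-appearance heuristic, far simpler than WikiWho's graph model."""
        origin = {}
        for author, text in revisions:
            for token in text.split():
                origin.setdefault(token, author)
        return origin

    history = [("alice", "Wikipedia is an encyclopedia"),
               ("bob",   "Wikipedia is a free online encyclopedia")]
    print(attribute_tokens(history))
    # tokens introduced by bob ("a", "free", "online") are attributed to him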
Wikipedia-based Kernels for dialogue topic tracking Soo-Hwan Kim
Banchs R.E.
Hua Li
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings English 2014 Dialogue topic tracking aims to segment on-going dialogues into topically coherent sub-dialogues and predict the topic category for each next segment. This paper proposes a kernel method for dialogue topic tracking that utilizes various types of information obtained from Wikipedia. The experimental results show that our proposed approach can significantly improve the performance of the task in mixed-initiative human-human dialogues. 0 0
Wikipedia-based query performance prediction Gilad Katz
Shtok A.
Kurland O.
Bracha Shapira
Lior Rokach
SIGIR 2014 - Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval English 2014 The query-performance prediction task is to estimate retrieval effectiveness with no relevance judgments. Pre-retrieval prediction methods operate prior to retrieval time. Hence, these predictors are often based on analyzing the query and the corpus upon which retrieval is performed. We propose a corpus-independent approach to pre-retrieval prediction which relies on information extracted from Wikipedia. Specifically, we present Wikipedia-based features that can attest to the effectiveness of retrieval performed in response to a query regardless of the corpus upon which search is performed. Empirical evaluation demonstrates the merits of our approach. As a case in point, integrating the Wikipedia-based features with state-of-the-art pre-retrieval predictors that analyze the corpus yields prediction quality that is consistently better than that of using the latter alone. 0 0
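Purely to illustrate what corpus-independent, Wikipedia-based pre-retrieval features might look like (the concrete features below are invented, not the paper's), one can compute, say, whether the query matches a Wikipedia article title and how many query terms do:

    def wikipedia_features(query, wikipedia_titles):
        """Toy pre-retrieval features computed from a set of Wikipedia article titles."""
        terms = query.lower().split()
        titles = {t.lower() for t in wikipedia_titles}
        exact_title = 1.0 if query.lower() in titles else 0.0
        term_coverage = sum(t in titles for t in terms) / len(terms)
        return {"exact_title_match": exact_title, "term_title_coverage": term_coverage}

    titles = {"Information retrieval", "Retrieval", "Wikipedia"}
    print(wikipedia_features("information retrieval", titles))
    # {'exact_title_match': 1.0, 'term_title_coverage': 0.5}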
Wikipedia: An info-communist manifesto Sylvain Firer-Blaess
Fuchs C.
Television and New Media English 2014 The task of this article is to analyze the political economy of Wikipedia. We discuss the specifics of Wikipedia's mode of production. The basic principles of what we call the info-communist mode of production will be presented. Our analysis is grounded in Marxist philosophy and Marxist political economy, and is connected to the current discourse about the renewal and reloading of the idea of communism that is undertaken by thinkers like Slavoj Žižek and Alain Badiou. We explore to which extent Wikipedia encompasses principles that go beyond the capitalist mode of production and represent the info-communist mode of production. We present the subjective dimension of the mode of production (cooperative labor), the objective dimension of the mode of production (common ownership of the means of production), and the subject-object dimension of the mode of production (the effects and products of the mode of production). 0 0
Wikipedia: What it is and why it matters for healthcare Rasberry L. BMJ (Online) English 2014 [No abstract available] 0 0
X-REC: Cross-category entity recommendation Milchevski D.
Berberich K.
Proceedings of the 5th Information Interaction in Context Symposium, IIiX 2014 English 2014 We demonstrate X-Rec, a novel system for entity recommendation. In contrast to other systems, X-Rec can recommend entities from diverse categories including goods (e.g., books), other physical entities (e.g., actors), but also immaterial entities (e.g., ideologies). Further, it does so only based on publicly available data sources, including the revision history of Wikipedia, using an easily extensible approach for recommending entities. We describe X-Rec's architecture, showing how its components interact with each other. Moreover, we outline our demonstration, which foresees different modes for users to interact with the system. 0 0
Wikipedia’s Economic Value Jonathan Band
Jonathan Gerafi
Policybandwidth English 7 October 2013 3 0
Art History on Wikipedia, a Macroscopic Observation Doron Goldfarb
Max Arends
Josef Froschauer
Dieter Merkl
ArXiv English 20 April 2013 How are articles about art historical actors interlinked within Wikipedia? Led by this question, we seek an overview of the link structure of a domain-specific subset of Wikipedia articles. We use an established domain-specific person name authority, the Getty Union List of Artist Names (ULAN), in order to externally identify relevant actors. Besides containing consistent biographical person data, this database also provides associative relationships between its person records, serving as a reference link structure for comparison. As a first step, we use mappings between the ULAN and English DBpedia provided by the Virtual International Authority File (VIAF). This way, we are able to identify 18,002 relevant person articles. Examining the link structure between these resources reveals interesting insights about the high-level structure of art historical knowledge as it is represented on Wikipedia. 4 1
Jointly They Edit: Examining the Impact of Community Identification on Political Interaction in Wikipedia Jessica J. Neff
David Laniado
Karolin E. Kappler
Yana Volkovich
Pablo Aragón
Andreas Kaltenbrunner
PLOS ONE English 3 April 2013 Background

In their 2005 study, Adamic and Glance coined the memorable phrase ‘divided they blog’, referring to a trend of cyberbalkanization in the political blogosphere, with liberal and conservative blogs tending to link to other blogs with a similar political slant, and not to one another. As political discussion and activity increasingly moves online, the power of framing political discourses is shifting from mass media to social media.

Methodology/Principal Findings

Continued examination of political interactions online is critical, and we extend this line of research by examining the activities of political users within the Wikipedia community. First, we examined how users in Wikipedia choose to display their political affiliation. Next, we analyzed the patterns of cross-party interaction and community participation among those users proclaiming a political affiliation. In contrast to previous analyses of other social media, we did not find strong trends indicating a preference to interact with members of the same political party within the Wikipedia community.

Conclusions/Significance

Our results indicate that users who proclaim their political affiliation within the community tend to proclaim their identity as a ‘Wikipedian’ even more loudly. It seems that the shared identity of ‘being Wikipedian’ may be strong enough to triumph over other potentially divisive facets of personal identity, such as political affiliation.
0 0
(Re)triggering Backlash: Responses to news about Wikipedia's gender gap Eckert S.
Steiner L.
Journal of Communication Inquiry English 2013 Wikipedia, the free encyclopedia that anyone can edit, has been enormously successful. But while it is read nearly equally by women and men, women are only 8.5 to 12.6% of those who edit or write Wikipedia articles. We analyzed coverage of Wikipedia's gender gap by 42 U.S. news organizations and blogs as well as 1,336 comments posted online by readers. We also interviewed Wikimedia Foundation executive director Sue Gardner. Commentators questioned Wikipedia's epistemology and culture and associated the gap with societal issues and/or (perceived) gender differences regarding time management, self-confidence, and expertise, as well as personality and interests. Yet, many commentators denied the gap was a problem; they blamed women for not joining, suggested it was women's choice, or mocked girly interests. The belittling of the disparity as feminist ideology arguably betrays an antifeminist backlash. 0 0
LaTeXML 2012 - A year of LaTeXML Ginev D.
Miller B.R.
Lecture Notes in Computer Science English 2013 LaTeXML, a TeX to XML converter, is being used in a wide range of MKM applications. In this paper, we present a progress report for the 2012 calendar year. Noteworthy enhancements include: increased coverage such as Wikipedia syntax; enhanced capabilities such as embeddable JavaScript and CSS resources and RDFa support; a web service for remote processing via web-sockets; along with general accuracy and reliability improvements. The outlook for a LaTeXML 0.8.0 release in mid-2013 is also discussed. 0 0
3D Wikipedia: Using online text to automatically label and navigate reconstructed geometry Russell B.C.
Martin-Brualla R.
Butler D.J.
Seitz S.M.
Zettlemoyer L.
ACM Transactions on Graphics English 2013 We introduce an approach for analyzing Wikipedia and other text, together with online photos, to produce annotated 3D models of famous tourist sites. The approach is completely automated, and leverages online text and photo co-occurrences via Google Image Search. It enables a number of new interactions, which we demonstrate in a new 3D visualization tool. Text can be selected to move the camera to the corresponding objects, 3D bounding boxes provide anchors back to the text describing them, and the overall narrative of the text provides a temporal guide for automatically flying through the scene to visualize the world as you read about it. We show compelling results on several major tourist sites. 0 0
A Wikipedia based hybrid ranking method for taxonomic relation extraction Zhong X. Lecture Notes in Computer Science English 2013 This paper proposes a hybrid ranking method for taxonomic relation extraction (i.e., selecting the best position) in an existing taxonomy. This method is capable of effectively combining two resources, an existing taxonomy and Wikipedia, in order to select the most appropriate position for a term candidate in the existing taxonomy. Previous methods mainly focus on complex inference methods to select the best position among all the possible positions in the taxonomy. In contrast, our algorithm, a simple but effective one, leverages two kinds of information, the expression of a term candidate and its ranking information, to select the best position for the term candidate (the hypernym of the term candidate in the existing taxonomy). We evaluate our approach on the agricultural domain and the experimental results indicate that performance is significantly improved. 0 0
A bookmark recommender system based on social bookmarking services and wikipedia categories Yoshida T.
Inoue U.
SNPD 2013 - 14th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing English 2013 Social bookmarking services allow users to add bookmarks of web pages with freely chosen keywords as tags. Personalized recommender systems recommend new and useful bookmarks added by other users. We propose a new method to find similar users and to select relevant bookmarks in a social bookmarking service. Our method is lightweight, because it uses a small set of important tags for each user to find useful bookmarks to recommend. Our method is also powerful, because it employs the Wikipedia category database to deal with the diversity of tags among users. The evaluation using the Hatena bookmark service in Japan shows that our method significantly increases the number of relevant bookmarks recommended without a notable increase of irrelevant bookmarks. 0 0
A case study of a course including Wikipedia editing activity for undergraduate students Mori Y.
Egi H.
Ozawa S.
Proceedings of the 21st International Conference on Computers in Education, ICCE 2013 English 2013 Editing Wikipedia can increase participants' understanding of subjects, while making valuable contributions to the information society. In this study, we designed an online course for undergraduate students that included a Wikipedia editing activity. The results of a content analysis of the term papers revealed that the suggestions made by the e-mentor and the teacher were highly supportive for the students in our case study, and that it is important for Japanese students to check Wikipedia in English before making their edits in Japanese. 0 0
A cloud of FAQ: A highly-precise FAQ retrieval system for the Web 2.0 Romero M.
Moreo A.
Castro J.L.
Knowledge-Based Systems English 2013 FAQ (Frequently Asked Questions) lists have attracted increasing attention from companies and organizations. There is thus a need for high-precision and fast methods able to manage large FAQ collections. In this context, we present a FAQ retrieval system as part of a FAQ exploiting project. Following the growing trend towards Web 2.0, we aim to provide users with mechanisms to navigate through the domain of knowledge and to facilitate both learning and searching, beyond classic FAQ retrieval algorithms. To this purpose, our system involves two different modules: an efficient and precise FAQ retrieval module, and a tag cloud generation module designed to help users to complete the comprehension of the retrieved information. Empirical results evidence the validity of our approach with respect to a number of state-of-the-art algorithms in terms of the most popular metrics in the field. 0 0
A collaboration effectiveness and Easiness Evaluation Method for RE-specific wikis based on Cognition-Behavior Consistency Decision Triangle Peng R.
Sun D.
Lai H.
Jisuanji Xuebao/Chinese Journal of Computers Chinese 2013 Wiki technology, represented by Wikipedia, has attracted serious attention due to its capability to support collaborative online content creation in a flexible and simple manner. Guided by Wiki technology, developing specific wiki-based requirements management tools, namely RE-specific wikis, by extending various open source wikis to support distributed requirements engineering activities has become a hot research topic. Many RE-specific wikis, such as RE-Wiki, SOP-Wiki and WikiWinWin, have been developed, but how to evaluate their collaboration effectiveness and easiness still needs further study. Based on the Cognition-Behavior Consistency Decision Triangle (CBCDT), a Collaboration Effectiveness and Easiness Evaluation Method (CE3M) for evaluating RE-specific wikis is proposed. For a given RE-specific wiki, it evaluates the consistency among three aspects: the expectations of its designers, the cognitions of its users and the behavior significations of its users. Specifically, the expectations of the designers and the cognitions of the users are obtained through surveys, while the behavior significations are obtained from expert assessment of statistical data on the users' collaboration behaviors. Consistency evaluations based on statistical hypothesis testing are then performed. A case study shows that CE3M is appropriate for discovering the similarities and differences among expectations, cognitions and behaviors. The insights gained can be used as objective evidence for decisions about the evolution of an RE-specific wiki. 0 0
A comparative study of academic and wikipedia ranking Shuai X.
Jiang Z.
Xiaojiang Liu
Bollen J.
Proceedings of the ACM/IEEE Joint Conference on Digital Libraries English 2013 In addition to its broad popularity, Wikipedia is also widely used for scholarly purposes. Many Wikipedia pages pertain to academic papers, scholars and topics, providing a rich ecology for scholarly uses. Scholarly references and mentions on Wikipedia may thus shape the "societal impact" of a certain scholarly communication item, but it is not clear whether they shape actual "academic impact". In this paper we compare the impact of papers, scholars, and topics according to two different measures, namely scholarly citations and Wikipedia mentions. Our results show that academic and Wikipedia impact are positively correlated. Papers, authors, and topics that are mentioned on Wikipedia have higher academic impact than those that are not mentioned. Our findings validate the hypothesis that Wikipedia can help assess the impact of scholarly publications and underpin relevance indicators for scholarly retrieval or recommendation systems. 0 0
A computational approach to politeness with application to social factors Cristian Danescu-Niculescu-Mizil
Sudhof M.
Dan J.
Leskovec J.
Potts C.
ACL 2013 - 51st Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference English 2013 We propose a computational framework for identifying linguistic aspects of politeness. Our starting point is a new corpus of requests annotated for politeness, which we use to evaluate aspects of politeness theory and to uncover new interactions between politeness markers and context. These findings guide our construction of a classifier with domain-independent lexical and syntactic features operationalizing key components of politeness theory, such as indirection, deference, impersonalization and modality. Our classifier achieves close to human performance and is effective across domains. We use our framework to study the relationship between politeness and social power, showing that polite Wikipedia editors are more likely to achieve high status through elections, but, once elevated, they become less polite. We see a similar negative correlation between politeness and power on Stack Exchange, where users at the top of the reputation scale are less polite than those at the bottom. Finally, we apply our classifier to a preliminary analysis of politeness variation by gender and community. 0 0
A content analysis of wikiproject discussions: Toward a typology of coordination language used by virtual teams Morgan J.T.
Mcdonald D.W.
Gilbert M.
Mark Zachry
English 2013 Understanding the role of explicit coordination in virtual teams allows for a more meaningful understanding of how people work together online. We describe a new content analysis for classifying discussions within Wikipedia WikiProjects (voluntary, self-directed teams of editors), present preliminary findings, and discuss potential applications and future research directions. 0 0
A content-context-centric approach for detecting vandalism in Wikipedia Lakshmish Ramaswamy
Tummalapenta R.S.
Li K.
Calton Pu
Proceedings of the 9th IEEE International Conference on Collaborative Computing: Networking, Applications and Worksharing, COLLABORATECOM 2013 English 2013 Collaborative online social media (CSM) applications such as Wikipedia have not only revolutionized the World Wide Web, but they have also had a hugely positive effect on modern free societies. Unfortunately, Wikipedia has also become a target of a wide variety of vandalism attacks. Most existing vandalism detection techniques rely upon simple textual features such as the existence of abusive language or spammy words. These techniques are ineffective against sophisticated vandal edits, which often do not contain the tell-tale markers associated with vandalism. In this paper, we argue for a context-aware approach for vandalism detection. This paper proposes a content-context-aware vandalism detection framework. The main idea is to quantify how well the words contained in the edit fit into the topic and the existing content of the Wikipedia article. We present two novel metrics, called WWW co-occurrence probability and top-ranked co-occurrence probability, for this purpose. We also develop efficient mechanisms for evaluating these two metrics, and machine learning-based schemes that utilize these metrics. The paper presents a range of experiments to demonstrate the effectiveness of the proposed approach. 0 0
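The content-context idea above, scoring how well an edit's words fit the article they are added to, can be loosely illustrated with a trivial overlap score; the paper's WWW and top-ranked co-occurrence probabilities are considerably more involved, so the function below is only a proxy:

    def context_fit(edit_text, article_text):
        """Fraction of the edit's words that already occur in the article,
        a crude proxy for the co-occurrence metrics described in the abstract."""
        edit_words = set(edit_text.lower().split())
        article_words = set(article_text.lower().split())
        if not edit_words:
            return 0.0
        return len(edit_words & article_words) / len(edit_words)

    article = "the battle of hastings took place in 1066 between norman and english armies"
    print(context_fit("the norman armies won the battle", article))   # 0.8, plausible edit
    print(context_fit("buy cheap watches online now", article))       # 0.0, looks like spam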
A framework for benchmarking entity-annotation systems Cornolti M.
Paolo Ferragina
Massimiliano Ciaramita
WWW 2013 - Proceedings of the 22nd International Conference on World Wide Web English 2013 In this paper we design and implement a benchmarking framework for fair and exhaustive comparison of entity-annotation systems. The framework is based upon the definition of a set of problems related to the entity-annotation task, a set of measures to evaluate systems performance, and a systematic comparative evaluation involving all publicly available datasets, containing texts of various types such as news, tweets and Web pages. Our framework is easily extensible with novel entity annotators, datasets and evaluation measures for comparing systems, and it has been released to the public as open source. We use this framework to perform the first extensive comparison among all available entity annotators over all available datasets, and draw many interesting conclusions upon their efficiency and effectiveness. We also draw comparisons between academic and commercial annotators. 0 0
A framework for detecting public health trends with Twitter Parker J.
Wei Y.
Yates A.
Frieder O.
Goharian N.
Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2013 English 2013 Traditional public health surveillance requires regular clinical reports and considerable effort by health professionals to analyze data. Therefore, a low cost alternative is of great practical use. As a platform used by over 500 million users worldwide to publish their ideas about many topics, including health conditions, Twitter provides researchers the freshest source of public health conditions on a global scale. We propose a framework for tracking public health condition trends via Twitter. The basic idea is to use frequent term sets from highly purified health-related tweets as queries into a Wikipedia article index - treating the retrieval of medically-related articles as an indicator of a health-related condition. By observing fluctuations in frequent term sets and in turn medically-related articles over a series of time slices of tweets, we detect shifts in public health conditions and concerns over time. Compared to existing approaches, our framework provides a general a priori identification of emerging public health conditions rather than a specific illness (e.g., influenza) as is commonly done. 0 0
A framework for the calibration of social simulation models Ciampaglia G.L. Advances in Complex Systems English 2013 Simulation with agent-based models is increasingly used in the study of complex socio-technical systems and in social simulation in general. This paradigm offers a number of attractive features, namely the possibility of modeling emergent phenomena within large populations. As a consequence, often the quantity in need of calibration may be a distribution over the population whose relation with the parameters of the model is analytically intractable. Nevertheless, we can simulate. In this paper we present a simulation-based framework for the calibration of agent-based models with distributional output based on indirect inference. We illustrate our method step by step on a model of norm emergence in an online community of peer production, using data from three large Wikipedia communities. Model fit and diagnostics are discussed. 0 0
A game theoretic analysis of collaboration in Wikipedia Anand S.
Ofer Arazy
Mandayam N.B.
Oded Nov
Lecture Notes in Computer Science English 2013 Peer production projects such as Wikipedia or open-source software development allow volunteers to collectively create knowledge-based products. The inclusive nature of such projects poses difficult challenges for ensuring trustworthiness and combating vandalism. Prior studies in the area deal with descriptive aspects of peer production, failing to capture the idea that while contributors collaborate, they also compete for status in the community and for imposing their views on the product. In this paper, we investigate collaborative authoring in Wikipedia, where contributors append and overwrite previous contributions to a page. We assume that a contributor's goal is to maximize ownership of content sections, such that content owned (i.e. originated) by her survived the most recent revision of the page. We model contributors' interactions to increase their content ownership as a non-cooperative game, where a player's utility is associated with content owned and cost is a function of effort expended. Our results capture several real-life aspects of contributors' interactions within peer-production projects. Namely, we show that at the Nash equilibrium there is an inverse relationship between the effort required to make a contribution and the survival of a contributor's content. In other words, the majority of the content that survives is necessarily contributed by experts who expend relatively less effort than non-experts. An empirical analysis of Wikipedia articles provides support for our model's predictions. Implications for research and practice are discussed in the context of trustworthy collaboration as well as vandalism. 0 0
A generalized flow-based method for analysis of implicit relationships on wikipedia Xiaodan Zhang
Yasuhito Asano
Masatoshi Yoshikawa
IEEE Transactions on Knowledge and Data Engineering English 2013 We focus on measuring relationships between pairs of objects in Wikipedia whose pages can be regarded as individual objects. Two kinds of relationships between two objects exist: in Wikipedia, an explicit relationship is represented by a single link between the two pages for the objects, and an implicit relationship is represented by a link structure containing the two pages. Some of the previously proposed methods for measuring relationships are cohesion-based methods, which underestimate objects having high degrees, although such objects could be important in constituting relationships in Wikipedia. The other methods are inadequate for measuring implicit relationships because they use only one or two of the following three important factors: distance, connectivity, and cocitation. We propose a new method using a generalized maximum flow which reflects all the three factors and does not underestimate objects having high degree. We confirm through experiments that our method can measure the strength of a relationship more appropriately than these previously proposed methods do. Another remarkable aspect of our method is mining elucidatory objects, that is, objects constituting a relationship. We explain that mining elucidatory objects would open a novel way to deeply understand a relationship. 0 0
A generic open world named entity disambiguation approach for tweets Habib M.B.
Van Keulen M.
IC3K 2013; KDIR 2013 - 5th International Conference on Knowledge Discovery and Information Retrieval and KMIS 2013 - 5th International Conference on Knowledge Management and Information Sharing, Proc. English 2013 Social media is a rich source of information. To make use of this information, it is sometimes required to extract and disambiguate named entities. In this paper, we focus on named entity disambiguation (NED) in Twitter messages. NED in tweets is challenging in two ways. First, the limited length of a tweet makes it hard to have enough context, while many disambiguation techniques depend on it. Second, many named entities in tweets do not exist in a knowledge base (KB). We share ideas from information retrieval (IR) and NED to propose solutions for both challenges. For the first problem we make use of the gregarious nature of tweets to get enough context needed for disambiguation. For the second problem we look for an alternative home page if there is no Wikipedia page that represents the entity. Given a mention, we obtain a list of Wikipedia candidates from the YAGO KB in addition to top-ranked pages from the Google search engine. We use a Support Vector Machine (SVM) to rank the candidate pages and find the best representative entities. Experiments conducted on two data sets show better disambiguation results compared with the baselines and a competitor. 0 0
A history of newswork on wikipedia Brian C. Keegan Proceedings of the 9th International Symposium on Open Collaboration, WikiSym + OpenSym 2013 English 2013 Wikipedia's coverage of current events blurs the boundaries of what it means to be an encyclopedia. Drawing on Gieryn's concept of "boundary work", this paper explores how Wikipedia's response to the 9/11 attacks expanded the role of the encyclopedia to include newswork, excluded content like the 9/11 Memorial Wiki that became problematic following this expansion, and legitimized these changes through the adoption of news-related policies and routines like promoting "In the News" content on the homepage. However, a second case exploring WikiNews illustrates the pitfalls of misappropriating professional newswork norms as well as the challenges of sustaining online communities. These cases illuminate the social construction of new technologies as they confront the boundaries of traditional professional identities and also reveal how newswork is changing in response to new forms of organizing enabled by these technologies. 0 0
A likelihood-based framework for the analysis of discussion threads Gomez V.
Kappen H.J.
Litvak N.
Andreas Kaltenbrunner
World Wide Web English 2013 Online discussion threads are conversational cascades in the form of posted messages that can be generally found in social systems that comprise many-to-many interaction such as blogs, news aggregators or bulletin board systems. We propose a framework based on generative models of growing trees to analyse the structure and evolution of discussion threads. We consider the growth of a discussion to be determined by an interplay between popularity, novelty and a trend (or bias) to reply to the thread originator. The relevance of these features is estimated using a full likelihood approach and allows us to characterise the habits and communication patterns of a given platform and/or community. We apply the proposed framework to four popular websites: Slashdot, Barrapunto (a Spanish version of Slashdot), Meneame (a Spanish Digg-clone) and the article discussion pages of the English Wikipedia. Our results provide significant insight into understanding how discussion cascades grow and have potential applications in broader contexts such as community management or design of communication platforms. 0 0
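To make the generative ingredients of such thread-growth models concrete, here is a small simulation in which each new comment picks its parent with probability proportional to a mix of popularity, novelty, and a bias toward the thread originator. The parameter names and functional forms are illustrative choices, not the likelihood model estimated in the paper.

```python
# Minimal sketch of a generative thread-growth model: each new comment replies to
# an existing post with probability proportional to popularity (replies received),
# novelty (recency) and a bias toward the root. alpha, beta, tau and root_bias are
# illustrative parameters, not the paper's notation.
import math
import random

def grow_thread(n_comments, alpha=1.0, beta=1.0, tau=2.0, root_bias=3.0):
    replies = {0: 0}     # node id -> number of replies received (node 0 is the root)
    arrival = {0: 0}     # node id -> arrival time
    parent = {}
    for t in range(1, n_comments + 1):
        nodes = list(replies)
        weights = []
        for v in nodes:
            popularity = alpha * (1 + replies[v])
            novelty = beta * math.exp(-(t - arrival[v]) / tau)
            bias = root_bias if v == 0 else 0.0
            weights.append(popularity + novelty + bias)
        chosen = random.choices(nodes, weights=weights, k=1)[0]
        parent[t] = chosen
        replies[chosen] += 1
        replies[t] = 0
        arrival[t] = t
    return parent          # maps each comment to the post it replied to

print(grow_thread(10))
```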
A linguistic consensus model for Web 2.0 communities Alonso S.
Perez I.J.
Cabrerizo F.J.
Herrera-Viedma E.
Applied Soft Computing Journal English 2013 Web 2.0 communities are a quite recent phenomenon which involve large numbers of users and where communication between members is carried out in real time. Despite these good characteristics, there is still a need to develop tools to help users reach decisions with a high level of consensus in these new virtual environments. In this contribution a new consensus reaching model is presented which uses linguistic preferences and is designed to minimize the main problems that this kind of organization presents (low and intermittent participation rates, difficulty of establishing trust relations and so on) while incorporating the benefits that a Web 2.0 community offers (rich and diverse knowledge due to a large number of users, real-time communication, etc.). The model includes some delegation and feedback mechanisms to improve the speed of the process and its convergence towards a consensus solution. Its possible application to some of the decision making processes that are carried out in Wikipedia is also shown. 0 0
A method for recommending the most appropriate expansion of acronyms using wikipedia Choi D.
Shin J.
Lee E.
Kim P.
Proceedings - 7th International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing, IMIS 2013 English 2013 Over the years, many researchers have studied how to detect expansions of acronyms in texts using linguistic and syntactic approaches in order to overcome disambiguation problems. An acronym is an abbreviation composed of the initial components of one or more words. These abbreviated forms cause considerable errors when a machine tries to extract meaning from a given text. Detecting expansions of acronyms is no longer the main difficulty; the real problem is polysemous acronyms. To address this, this paper proposes a method to recommend the most related expansion of an acronym by analyzing co-occurring words using Wikipedia. Our goal is not to find acronym definitions or expansions but to recommend the most appropriate expansion of a given acronym. 0 0
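A minimal sketch of the co-occurrence idea: each candidate expansion is scored by how many of the words surrounding the acronym also co-occur with that expansion (for example, on its Wikipedia article), and the best-scoring expansion is recommended. The candidate expansions and co-occurrence vocabularies below are toy data rather than material harvested from Wikipedia.

```python
# Minimal sketch of recommending the most appropriate expansion of an acronym by
# comparing the words around the acronym with words that co-occur with each
# candidate expansion. Candidates and their vocabularies here are hand-made toys.
from collections import Counter

def recommend_expansion(context_words, candidates):
    """candidates: dict mapping expansion -> iterable of co-occurring words."""
    context = Counter(w.lower() for w in context_words)
    best, best_score = None, -1.0
    for expansion, cooccurring in candidates.items():
        vocab = set(w.lower() for w in cooccurring)
        score = sum(count for w, count in context.items() if w in vocab)
        if score > best_score:
            best, best_score = expansion, score
    return best, best_score

context = "the ACL conference publishes papers on parsing and semantics".split()
candidates = {
    "Association for Computational Linguistics": ["conference", "parsing", "corpus"],
    "Anterior Cruciate Ligament": ["knee", "injury", "surgery"],
}
print(recommend_expansion(context, candidates))
```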
A multilingual and multiplatform application for medicinal plants prescription from medical symptoms Ruiz-Rico F.
Rubio-Sanchez M.-C.
Tomas D.
Vicedo J.-L.
SIGIR 2013 - Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval English 2013 This paper presents an application for medicinal plants prescription based on text classification techniques. The system receives as an input a free text describing the symptoms of a user, and retrieves a ranked list of medicinal plants related to those symptoms. In addition, a set of links to Wikipedia are also provided, enriching the information about every medicinal plant presented to the user. In order to improve the accessibility to the application, the input can be written in six different languages, adapting the results accordingly. The application interface can be accessed from different devices and platforms. 0 0
A new approach for building domain-specific corpus with wikipedia Zhang X.Y.
Li X.
Ruan Z.J.
Applied Mechanics and Materials English 2013 A domain-specific corpus can be used to build a domain ontology, which is used in many areas such as IR, NLP and web mining. We propose a multi-root method to build a domain-specific corpus using Wikipedia resources. First we select some top-level nodes (Wikipedia category articles) as root nodes and traverse Wikipedia with a BFS-like algorithm. After the traversal, we obtain a directed Wikipedia graph (Wiki-graph). An algorithm based mainly on Kosaraju's algorithm is then proposed to remove the cycles in the Wiki-graph. Finally, a topological sort is used to traverse the Wiki-graph, with ranking and filtering performed during the process. When computing a node's ranking score, both its own in-degree and the out-degree of its parents are considered. The experimental evaluation shows that our method can produce a high-quality domain-specific corpus. 0 0
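The pipeline in this abstract (multi-root traversal, cycle removal, topological ranking) can be sketched on a toy category graph as follows; the scoring formula combining a node's in-degree with its parents' out-degree is an illustrative stand-in for the paper's ranking function, and networkx's condensation is used in place of a hand-rolled Kosaraju implementation.

```python
# Minimal sketch: traverse a toy category graph from several roots, collapse
# cycles into strongly connected components, and rank nodes in topological order.
import networkx as nx

def build_ranked_corpus(wiki_graph, roots):
    # 1) BFS-like traversal from multiple roots to collect the reachable subgraph
    reachable = set()
    for r in roots:
        reachable |= {r} | nx.descendants(wiki_graph, r)
    sub = wiki_graph.subgraph(reachable).copy()
    # 2) collapse strongly connected components to remove cycles
    dag = nx.condensation(sub)                  # nodes are SCCs, the result is a DAG
    # 3) topological traversal with a simple illustrative ranking score
    scores = {}
    for scc in nx.topological_sort(dag):
        parent_out = sum(dag.out_degree(p) for p in dag.predecessors(scc))
        scores[scc] = dag.in_degree(scc) + 0.5 * parent_out
    ranked = [(scores[s], set(members)) for s, members in dag.nodes(data="members")]
    return sorted(ranked, key=lambda pair: pair[0], reverse=True)

# toy category graph containing a cycle between "Algorithms" and "Databases"
g = nx.DiGraph([("Computing", "Algorithms"), ("Algorithms", "Sorting"),
                ("Computing", "Databases"), ("Databases", "Algorithms"),
                ("Algorithms", "Databases")])
print(build_ranked_corpus(g, ["Computing"]))
```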
A new approach to detecting content anomalies in Wikipedia Sinanc D.
Yavanoglu U.
Proceedings - 2013 12th International Conference on Machine Learning and Applications, ICMLA 2013 English 2013 The rapid growth of the web has made large amounts of data available, but that data is only useful if its content is well organized. Although Wikipedia is the biggest encyclopedia on the web, its quality is suspect due to its Open Editing Schemas (OES). In this study, zoology and botany pages were selected from the English Wikipedia, their HTML content was converted to text, and an Artificial Neural Network (ANN) was used for classification to help prevent disinformation or misinformation. After the training phase, irrelevant words about politics or terrorism were added to the content in proportion to the size of the text. In the window between unsuitable content being added to a page and the moderators' intervention, the proposed system detects the error through incorrect categorization. The results show that when the added words amount to 2% of the content, the anomaly rate begins to cross the 50% threshold. 0 0
A new text representation scheme combining Bag-of-Words and Bag-of-Concepts approaches for automatic text classification Alahmadi A.
Joorabchi A.
Mahdi A.E.
2013 7th IEEE GCC Conference and Exhibition, GCC 2013 English 2013 This paper introduces a new approach to creating text representations and applies it to standard text classification collections. The approach is based on supplementing the well-known Bag-of-Words (BOW) representational scheme with a concept-based representation that utilises Wikipedia as a knowledge base. The proposed representations are used to generate a Vector Space Model, which in turn is fed into a Support Vector Machine classifier to categorise a collection of textual documents from two publicly available datasets. Experimental results are reported that evaluate the performance of our model in comparison to a standard BOW scheme, a concept-based scheme, and recently reported similar text representations that augment the standard BOW approach with concept-based representations. 0 0
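As a rough sketch of the combined representation, the snippet below concatenates TF-IDF Bag-of-Words features with features computed over concept labels before training an SVM; the word-to-concept mapping is a hand-made stand-in for a Wikipedia-derived concept index.

```python
# Minimal sketch of combining Bag-of-Words features with Bag-of-Concepts features
# before training an SVM classifier. The word->concept mapping is a toy stand-in
# for a Wikipedia-based concept index.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from scipy.sparse import hstack

concept_index = {"striker": "Association_football", "goal": "Association_football",
                 "inflation": "Economics", "interest": "Economics"}

def to_concepts(text):
    return " ".join(concept_index[w] for w in text.lower().split() if w in concept_index)

docs = ["The striker scored a late goal", "Inflation and interest rates rose again"]
labels = ["sport", "finance"]

bow = TfidfVectorizer()
boc = TfidfVectorizer()
X = hstack([bow.fit_transform(docs), boc.fit_transform(to_concepts(d) for d in docs)])
clf = LinearSVC().fit(X, labels)

test = ["The goal came from the striker"]
X_test = hstack([bow.transform(test), boc.transform(to_concepts(d) for d in test)])
print(clf.predict(X_test))
```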
A novel map-based visualization method based on liquid modelling Biuk-Aghai R.P.
Ao W.H.
ACM International Conference Proceeding Series English 2013 Many applications produce large amounts of data, and information visualization has been successfully applied to help make sense of this data. Recently geographic maps have been used as a metaphor for visualization, given that most people are familiar with reading maps, and several visualization methods based on this metaphor have been developed. In this paper we present a new visualization method that aims to improve on existing map-like visualizations. It is based on the metaphor of liquids poured onto a surface that expand outwards until they touch each other, forming larger areas. We present the design of our visualization method and an evaluation we have carried out to compare it with an existing visualization. Our new visualization has better usability, leading to higher accuracy and greater speed of task performance. 0 0
A portable multilingual medical directory by automatic categorization of wikipedia articles Ruiz-Rico F.
Rubio-Sanchez M.-C.
Tomas D.
Vicedo J.-L.
SIGIR 2013 - Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval English 2013 Wikipedia has become one of the most important sources of information available all over the world. However, the categorization of Wikipedia articles is not standardized and the searches are mainly performed on keywords rather than concepts. In this paper we present an application that builds a hierarchical structure to organize all Wikipedia entries, so that medical articles can be reached from general to particular, using the well known Medical Subject Headings (MeSH) thesaurus. Moreover, the language links between articles will allow using the directory created in different languages. The final system can be packed and ported to mobile devices as a standalone offline application. 0 0
A preliminary study of Croatian language syllable networks Ban K.
Ivakic I.
Mestrovic A.
2013 36th International Convention on Information and Communication Technology, Electronics and Microelectronics, MIPRO 2013 - Proceedings English 2013 This paper presents preliminary results of Croatian syllable network analysis. We analyzed networks of syllables generated from texts collected from the Croatian Wikipedia and blogs. Different syllable networks are constructed in such a way that each node is a syllable, and links are established between two syllables if they appear together in the same word (co-occurrence network) or if they appear as neighbours in a word (neighbour network). As our main tool we use network analysis methods, which provide mechanisms that can reveal new patterns in a complex language structure. We aim to show that syllable networks differ from Erdős-Rényi random networks, which may indicate that language has its own rules and self-organization structure. Furthermore, our results have been compared with other studies on the syllable networks of Portuguese and Chinese. The results indicate that Croatian syllable networks exhibit certain properties of small-world networks. 0 0
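The two network constructions described here are easy to reproduce on a hand-syllabified word list; the sketch below builds both the co-occurrence network and the neighbour network with networkx (a real pipeline would need a Croatian syllabifier).

```python
# Minimal sketch of building a syllable co-occurrence network (syllables sharing a
# word) and a neighbour network (adjacent syllables within a word) from a small,
# hand-syllabified word list.
import itertools
import networkx as nx

def syllable_networks(syllabified_words):
    cooccurrence = nx.Graph()
    neighbour = nx.Graph()
    for syllables in syllabified_words:
        for a, b in itertools.combinations(set(syllables), 2):
            cooccurrence.add_edge(a, b)
        for a, b in zip(syllables, syllables[1:]):
            neighbour.add_edge(a, b)
    return cooccurrence, neighbour

words = [["wi", "ki", "pe", "di", "ja"], ["en", "ci", "klo", "pe", "di", "ja"]]
co, nb = syllable_networks(words)
print(sorted(co.edges()))
print(nx.average_clustering(nb))   # one of the small-world indicators mentioned above
```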
A preliminary study on the effects of barnstars on wikipedia editing Lim K.H.
Anwitaman Datta
Wise M.
Proceedings of the 9th International Symposium on Open Collaboration, WikiSym + OpenSym 2013 English 2013 This paper presents a preliminary study into the awarding of barnstars among Wikipedia editors to better understand their motivations in contributing to Wikipedia articles. We crawled the talk pages of all active Wikipedia editors and retrieved 21,299 barnstars that were awarded among 14,074 editors. In particular, we found that editors do not award and receive barnstars in equal (or similar) quantities. Also, editors were more active in editing articles before awarding or receiving barnstars. 0 0
A quick tour of BabelNet 1.1 Roberto Navigli Lecture Notes in Computer Science English 2013 In this paper we present BabelNet 1.1, a brand-new release of the largest "encyclopedic dictionary", obtained from the automatic integration of the most popular computational lexicon of English, i.e. WordNet, and the largest multilingual Web encyclopedia, i.e. Wikipedia. BabelNet 1.1 covers 6 languages and comes with a renewed Web interface, graph explorer and programmatic API. BabelNet is available online at http://www.babelnet.org. 0 0
A social contract for virtual institutions Memmi D. AI and Society English 2013 Computer-mediated social groups, often known as virtual communities, are now giving rise to a more durable and more abstract phenomenon: the emergence of virtual institutions. These social institutions operating mostly online exhibit very interesting qualities. Their distributed, collaborative, low-cost and reactive nature makes them very useful. Yet they are also probably more fragile than classical institutions and in need of appropriate support mechanisms. We will analyze them as social institutions, and then resort to social contract theory to determine adequate support measures. We will argue that virtual institutions can be greatly helped by making explicit and publicly available online their norms, rules and procedures, so as to improve the collaboration between their members. 0 0
A support framework for argumentative discussions management in the web Cabrio E.
Villata S.
Fabien Gandon
Lecture Notes in Computer Science English 2013 On the Web, wiki-like platforms allow users to provide arguments in favor of or against issues proposed by other users. The increasing content of these platforms, as well as the high number of revisions of that content through pro and con arguments, makes it difficult for community managers to understand and manage these discussions. In this paper, we propose an automatic framework to support the management of argumentative discussions in wiki-like platforms. Our framework is composed of (i) a natural language module, which automatically detects the arguments in natural language and returns the relations among them, and (ii) an argumentation module, which provides an overall view of the argumentative discussion in the form of a directed graph highlighting the accepted arguments. Experiments on the history of Wikipedia show the feasibility of our approach. 0 0
A verification method for MASOES Perozo N.
Aguilar J.
Teran O.
Molina H.
IEEE Transactions on Cybernetics English 2013 MASOES is a 3-agent architecture for designing and modeling self-organizing and emergent systems. This architecture describes the elements, relationships, and mechanisms, both at the individual and the collective levels, that favor the analysis of the self-organizing and emergent phenomenon without mathematically modeling the system. In this paper, a method is proposed for verifying MASOES from the point of view of design in order to study the self-organizing and emergent behaviors of the modeled systems. The verification criteria are set according to what is proposed in MASOES for modeling self-organizing and emergent systems and the principles of the wisdom of crowds paradigm and the fuzzy cognitive map (FCM) theory. The verification method for MASOES has been implemented in a tool called FCM Designer and has been tested to model a community of free software developers that works in the bazaar style, as well as a Wikipedia community, in order to study their behavior and determine their self-organizing and emergent capacities. 0 0
Accessible online content creation by end users Kuksenok K.
Brooks M.
Mankoff J.
Conference on Human Factors in Computing Systems - Proceedings English 2013 Like most online content, user-generated content (UGC) poses accessibility barriers to users with disabilities. However, the accessibility difficulties pervasive in UGC warrant discussion and analysis distinct from other kinds of online content. Content authors, community culture, and the authoring tool itself all affect UGC accessibility. The choices, resources available, and strategies in use to ensure accessibility are different than for other types of online content. We contribute case studies of two UGC communities with accessible content: Wikipedia, where authors focus on access to visual materials and navigation, and an online health support forum where users moderate the cognitive accessibility of posts. Our data demonstrate real world moderation strategies and illuminate factors affecting success, such as community culture. We conclude with recommended strategies for creating a culture of accessibility around UGC. 0 0
Acronym-expansion recognition based on knowledge map system Jeong D.-H.
Myunggwon Hwang
Jihie Kim
Hanmin Jung
Sung W.-K.
Information (Japan) English 2013 In this paper, we present a method for instance mapping and URI resolving to merge two heterogeneous resources and construct a new semantic network from the viewpoint of acronym-expansion. Acronym-expansion information extracted from two unstructured large datasets can be remapped by using linkage information between instances and measuring string similarity. Finally we evaluate the acronym discrimination performance based on the proposed knowledge map system. The results showed that the noun-phrase-based feature selection method achieved 89.6% micro-averaged precision, outperforming the single-noun-based one by 20.1%. The experiment on acronym-expansion recognition suggests the possibility of interoperability between heterogeneous databases. 0 0
Aemoo: Exploring knowledge on the Web Nuzzolese A.G.
Valentina Presutti
Aldo Gangemi
Alberto Musetti
Paolo Ciancarini
Proceedings of the 3rd Annual ACM Web Science Conference, WebSci 2013 English 2013 Aemoo is a Semantic Web application supporting knowledge exploration on the Web. Through a keyword-based search interface, users can gather an effective summary of the knowledge about an entity, according to Wikipedia, Twitter, and Google News. Summaries are designed by applying lenses based on a set of empirically discovered knowledge patterns. 0 0
An approach for deriving semantically related category hierarchies from Wikipedia category graphs Hejazy K.A.
El-Beltagy S.R.
Advances in Intelligent Systems and Computing English 2013 Wikipedia is the largest online encyclopedia known to date. Its rich content and semi-structured nature has made it into a very valuable research tool used for classification, information extraction, and semantic annotation, among others. Many applications can benefit from the presence of a topic hierarchy in Wikipedia. However, what Wikipedia currently offers is a category graph built through hierarchical category links, the semantics of which are undefined. Because of this lack of semantics, a sub-category in Wikipedia does not necessarily comply with the concept of a sub-category in a hierarchy. Instead, all it signifies is that there is some sort of relationship between the parent category and its sub-category. As a result, traversing the category links of any given category can often result in surprising results. For example, following the category of "Computing" down its sub-category links, the totally unrelated category of "Theology" appears. In this paper, we introduce a novel algorithm that through measuring the semantic relatedness between any given Wikipedia category and nodes in its sub-graph is capable of extracting a category hierarchy containing only nodes that are relevant to the parent category. The algorithm has been evaluated by comparing its output with a gold standard data set. The experimental setup and results are presented. 0 0
An approach for using wikipedia to measure the flow of trends across countries Tinati R.
Tiropanis T.
Leslie Carr
WWW 2013 Companion - Proceedings of the 22nd International Conference on World Wide Web English 2013 Wikipedia has grown to become the most successful online encyclopedia on the Web, containing over 24 million articles, offered in over 240 languages. In just over 10 years Wikipedia has transformed from being just an encyclopedia of knowledge, to a wealth of facts and information, from articles discussing trivia, political issues, geographies and demographics, to popular culture, news articles, and social events. In this paper we explore the use of Wikipedia for identifying the flow of information and trends across the world. We start with the hypothesis that, given that Wikipedia is a resource that is globally available in different languages across countries, access to its articles could be a reflection of human activity. To explore this hypothesis we try to establish metrics on the use of Wikipedia in order to identify potential trends and to establish whether or how those trends flow from one country to another. We subsequently compare the outcome of this analysis to that of more established methods that are based on online social media or traditional media. We explore this hypothesis by applying our approach to a subset of Wikipedia articles and also a specific worldwide social phenomenon that occurred during 2012; we investigate whether access to relevant Wikipedia articles correlates to the viral success of the South Korean pop song "Gangnam Style" and the associated artist "PSY" as evidenced by traditional and online social media. Our analysis demonstrates that Wikipedia can indeed provide a useful measure for detecting social trends and events, and in the case that we studied, it could have been possible to identify the specific trend more quickly than with other established trend identification services such as Google Trends. 0 0
An approach of filtering wrong-type entities for entity ranking Jinghua Zhang
Qu Y.
Gong S.
Tian S.
Sun H.
IEICE Transactions on Information and Systems English 2013 Entity is an important information carrier in Web pages. Users would like to directly get a list of relevant entities instead of a list of documents when they submit a query to the search engine. So the research of related entity finding (REF) is a meaningful work. In this paper we investigate the most important task of REF: Entity Ranking. The wrong-type entities which don't belong to the target-entity type will pollute the ranking result. We propose a novel method to filter wrong-type entities. We focus on the acquisition of seed entities and automatically extracting the common Wikipedia categories of target-entity type. Also we demonstrate how to filter wrong-type entities using the proposed model. The experimental results show our method can filter wrong-type entities effectively and improve the results of entity ranking. 0 0
An automatic approach for ontology-based feature extraction from heterogeneous textual resources Vicient C.
Sanchez D.
Moreno A.
Engineering Applications of Artificial Intelligence English 2013 Data mining algorithms such as data classification or clustering methods exploit features of entities to characterise, group or classify them according to their resemblance. In the past, many feature extraction methods focused on the analysis of numerical or categorical properties. In recent years, motivated by the success of the Information Society and the WWW, which has made available enormous amounts of textual electronic resources, researchers have proposed semantic data classification and clustering methods that exploit textual data at a conceptual level. To do so, these methods rely on pre-annotated inputs in which text has been mapped to their formal semantics according to one or several knowledge structures (e.g. ontologies, taxonomies). Hence, they are hampered by the bottleneck introduced by the manual semantic mapping process. To tackle this problem, this paper presents a domain-independent, automatic and unsupervised method to detect relevant features from heterogeneous textual resources, associating them to concepts modelled in a background ontology. The method has been applied to raw text resources and also to semi-structured ones (Wikipedia articles). It has been tested in the Tourism domain, showing promising results. 0 0
An efficient incentive compatible mechanism to motivate wikipedia contributors Pramod M.
Mukhopadhyay S.
Gosh D.
Advances in Intelligent Systems and Computing English 2013 Wikipedia is the world's largest collaboratively edited repository of encyclopedic information, consisting of almost 1.5 million articles and more than 90,000 contributors. Although the number of contributors has been huge since its inception in 2001, a study conducted in 2009 found that members may initially contribute to the site for pleasure or because they are motivated by an internal drive to share their knowledge, but later lack the motivation to edit related articles so that their quality can be improved [1] [5]. In this paper we address the above problem from an economics perspective. We propose a novel scheme to motivate Wikipedia contributors using mechanism design theory, currently one of the most prominent tools for addressing situations in which data is privately held by the agents. 0 0
An empirical study on faculty perceptions and teaching practices of wikipedia Llados J.
Eduard Aibar
Lerga M.
Meseguer A.
Minguillon J.
Proceedings of the European Conference on e-Learning, ECEL English 2013 Some faculty members from different universities around the world have begun to use Wikipedia as a teaching tool in recent years. These experiences show, in most cases, very satisfactory results and a substantial improvement in various basic skills, as well as a positive influence on the students' motivation. Nevertheless and despite the growing importance of e-learning methodologies based on the use of the Internet for higher education, the use of Wikipedia as a teaching resource remains scarce among university faculty. Our investigation tries to identify which are the main factors that determine acceptance or resistance to that use. We approach the decision to use Wikipedia as a teaching tool by analyzing both the individual attributes of faculty members and the characteristics of the environment where they develop their teaching activity. From a specific survey sent to all faculty of the Universitat Oberta de Catalunya (UOC), pioneer and leader in online education in Spain, we have tried to infer the influence of these internal and external elements. The questionnaire was designed to measure different constructs: perceived quality of Wikipedia, teaching practices involving Wikipedia, use experience, perceived usefulness and use of 2.0 tools. Control items were also included for gathering information on gender, age, teaching experience, academic rank, and area of expertise. Our results reveal that academic rank, teaching experience, age or gender, are not decisive factors in explaining the educational use of Wikipedia. Instead, the decision to use it is closely linked to the perception of Wikipedia's quality, the use of other collaborative learning tools, an active attitude towards web 2.0 applications, and connections with the professional non-academic world. Situational context is also very important, since the use is higher when faculty members have got reference models in their close environment and when they perceive it is positively valued by their colleagues. As far as these attitudes, practices and cultural norms diverge in different scientific disciplines, we have also detected clear differences in the use of Wikipedia among areas of academic expertise. As a consequence, a greater application of Wikipedia both as a teaching resource and as a driver for teaching innovation would require much more active institutional policies and some changes in the dominant academic culture among faculty members. 0 0
An exploration of discussion threads in social news sites: A case study of the Reddit community Weninger T.
Zhu X.A.
Jangwhan Han
Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2013 English 2013 Social news and content aggregation Web sites have become massive repositories of valuable knowledge on a diverse range of topics. Millions of Web users are able to leverage these platforms to submit, view and discuss nearly anything. The users themselves exclusively curate the content with an intricate system of submissions, voting and discussion. Furthermore, the data on social news Web sites is extremely well organized by its user base, which opens the door for opportunities to leverage this data for other purposes, just as Wikipedia data has been used for many other purposes. In this paper we study a popular social news Web site called Reddit. Our investigation looks at the dynamics of its discussion threads, and asks two main questions: (1) to what extent do discussion threads resemble a topical hierarchy? and (2) can discussion threads be used to enhance Web search? We show interesting results for these questions on a very large snapshot of several sub-communities of the Reddit Web site. Finally, we discuss the implications of these results and suggest ways by which social news Web sites can be used to perform other tasks. 0 0
An index for efficient semantic full-text search Holger Bast
Buchhold B.
International Conference on Information and Knowledge Management, Proceedings English 2013 In this paper we present a novel index data structure tailored towards semantic full-text search. Semantic full-text search, as we call it, deeply integrates keyword-based full-text search with structured search in ontologies. Queries are SPARQL-like, with additional relations for specifying word-entity co-occurrences. In order to build such queries the user needs to be guided. We believe that incremental query construction with context-sensitive suggestions in every step serves that purpose well. Our index has to answer queries and provide such suggestions in real time. We achieve this through a novel kind of posting lists and query processing, avoiding very long (intermediate) result lists and expensive (non-local) operations on these lists. In an evaluation of 8000 queries on the full English Wikipedia (40 GB XML dump) and the YAGO ontology (26.6 million facts), we achieve average query and suggestion times of around 150ms. 0 0
An investigation of the relationship between the amount of extra-textual data and the quality of Wikipedia articles Himoro M.Y.
Hanada R.
Marco Cristo
Pimentel M.D.G.C.
WebMedia 2013 - Proceedings of the 19th Brazilian Symposium on Multimedia and the Web English 2013 Wikipedia, a web-based collaboratively maintained free encyclopedia, is emerging as one of the most important websites on the internet. However, its openness raises many concerns about the quality of the articles and how to assess it automatically. In the Portuguese-speaking Wikipedia, articles can be rated by bots and by the community. In this paper, we investigate the correlation between these ratings and the count of media items (namely images and sounds) through a series of experiments. Our results show that article ratings and the count of media items are correlated. 0 0
An open-source toolkit for mining Wikipedia Milne D.
Witten I.H.
Artificial Intelligence English 2013 The online encyclopedia Wikipedia is a vast, constantly evolving tapestry of interlinked articles. For developers and researchers it represents a giant multilingual database of concepts and semantic relations, a potential resource for natural language processing and many other research areas. This paper introduces the Wikipedia Miner toolkit, an open-source software system that allows researchers and developers to integrate Wikipedia's rich semantics into their own applications. The toolkit creates databases that contain summarized versions of Wikipedia's content and structure, and includes a Java API to provide access to them. Wikipedia's articles, categories and redirects are represented as classes, and can be efficiently searched, browsed, and iterated over. Advanced features include parallelized processing of Wikipedia dumps, machine-learned semantic relatedness measures and annotation features, and XML-based web services. Wikipedia Miner is intended to be a platform for sharing data mining techniques. 0 1
Analysis and forecasting of trending topics in online media streams Althoff T.
Borth D.
Hees J.
Andreas Dengel
MM 2013 - Proceedings of the 2013 ACM Multimedia Conference English 2013 Among the vast information available on the web, social media streams capture what people currently pay attention to and how they feel about certain topics. Awareness of such trending topics plays a crucial role in multimedia systems such as trend aware recommendation and automatic vocabulary selection for video concept detection systems. Correctly utilizing trending topics requires a better understanding of their various characteristics in different social media streams. To this end, we present the first comprehensive study across three major online and social media streams, Twitter, Google, and Wikipedia, covering thousands of trending topics during an observation period of an entire year. Our results indicate that depending on one's requirements one does not necessarily have to turn to Twitter for information about current events and that some media streams strongly emphasize content of specific categories. As our second key contribution, we further present a novel approach for the challenging task of forecasting the life cycle of trending topics in the very moment they emerge. Our fully automated approach is based on a nearest neighbor forecasting technique exploiting our assumption that semantically similar topics exhibit similar behavior. We demonstrate on a large-scale dataset of Wikipedia page view statistics that forecasts by the proposed approach are about 9-48k views closer to the actual viewing statistics compared to baseline methods and achieve a mean average percentage error of 45-19% for time periods of up to 14 days. 0 0
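The forecasting idea, predicting a new topic's life cycle from the most similar historical topics, can be illustrated with a small nearest-neighbour sketch over toy page-view series; the distance measure and averaging scheme below are simplifying assumptions, not the paper's exact method.

```python
# Minimal sketch of nearest-neighbour forecasting of a trending topic's life cycle:
# find the k historical page-view curves whose first days look most like the new
# topic's first days, and forecast by averaging their continuations. The toy
# series below stand in for real Wikipedia page-view statistics.
import numpy as np

def knn_forecast(history, prefix, horizon, k=2):
    """history: array (n_topics, total_days); prefix: observed early views."""
    p = len(prefix)
    dists = np.linalg.norm(history[:, :p] - np.asarray(prefix), axis=1)
    neighbours = np.argsort(dists)[:k]
    return history[neighbours, p:p + horizon].mean(axis=0)

history = np.array([[10, 50, 200, 120, 60, 30],
                    [12, 55, 210, 150, 80, 40],
                    [300, 280, 250, 200, 150, 100]], dtype=float)
print(knn_forecast(history, prefix=[11, 52], horizon=3))
```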
Analysis of cluster structure in large-scale English Wikipedia category networks Klaysri T.
Fenner T.
Lachish O.
Mark Levene
Papapetrou P.
Lecture Notes in Computer Science English 2013 In this paper we propose a framework for analysing the structure of a large-scale social media network, a topic of significant recent interest. Our study is focused on the Wikipedia category network, where nodes correspond to Wikipedia categories and edges connect two nodes if the nodes share at least one common page within the Wikipedia network. Moreover, each edge is given a weight that corresponds to the number of pages shared between the two categories that it connects. We study the structure of category clusters within the three complete English Wikipedia category networks from 2010 to 2012. We observe that category clusters appear in the form of well-connected components that are naturally clustered together. For each dataset we obtain a graph, which we call the t-filtered category graph, by retaining just a single edge linking each pair of categories for which the weight of the edge exceeds some specified threshold t. Our framework exploits this graph structure and identifies connected components within the t-filtered category graph. We studied the large-scale structural properties of the three Wikipedia category networks using the proposed approach. We found that the number of categories, the number of clusters of size two, and the size of the largest cluster within the graph all appear to follow power laws in the threshold t. Furthermore, for each network we found the value of the threshold t for which increasing the threshold to t + 1 caused the "giant" largest cluster to diffuse into two or more smaller clusters of significant size and studied the semantics behind this diffusion. 0 0
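The t-filtered category graph is straightforward to reproduce: keep only edges whose weight exceeds a threshold t and inspect the resulting connected components as t grows. The sketch below does this on a tiny weighted graph standing in for a full Wikipedia category network.

```python
# Minimal sketch of the t-filtered category graph: retain edges whose weight
# (number of shared pages) exceeds t, then look at the connected components.
import networkx as nx

def t_filtered_components(weighted_graph, t):
    filtered = nx.Graph(
        (u, v) for u, v, w in weighted_graph.edges(data="weight") if w > t
    )
    return sorted(nx.connected_components(filtered), key=len, reverse=True)

g = nx.Graph()
g.add_weighted_edges_from([("Physics", "Mathematics", 40),
                           ("Physics", "Astronomy", 25),
                           ("Mathematics", "Logic", 12),
                           ("Logic", "Philosophy", 3)])
for t in (0, 10, 30):
    comps = t_filtered_components(g, t)
    print(t, [len(c) for c in comps])   # watch the largest cluster break apart as t grows
```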
Analyzing and Predicting Quality Flaws in User-generated Content: The Case of Wikipedia Maik Anderka Bauhaus-Universität Weimar, Germany English 2013 Web applications that are based on user-generated content are often criticized for containing low-quality information; a popular example is the online encyclopedia Wikipedia. The major points of criticism pertain to the accuracy, neutrality, and reliability of information. The identification of low-quality information is an important task since for a huge number of people around the world it has become a habit to first visit Wikipedia in case of an information need. Existing research on quality assessment in Wikipedia either investigates only small samples of articles, or else deals with the classification of content into high-quality or low-quality. This thesis goes further, it targets the investigation of quality flaws, thus providing specific indications of the respects in which low-quality content needs improvement. The original contributions of this thesis, which relate to the fields of user-generated content analysis, data mining, and machine learning, can be summarized as follows:

(1) We propose the investigation of quality flaws in Wikipedia based on user-defined cleanup tags. Cleanup tags are commonly used in the Wikipedia community to tag content that has some shortcomings. Our approach is based on the hypothesis that each cleanup tag defines a particular quality flaw.

(2) We provide the first comprehensive breakdown of Wikipedia's quality flaw structure. We present a flaw organization schema, and we conduct an extensive exploratory data analysis which reveals (a) the flaws that actually exist, (b) the distribution of flaws in Wikipedia, and, (c) the extent of flawed content.

(3) We present the first breakdown of Wikipedia's quality flaw evolution. We consider the entire history of the English Wikipedia from 2001 to 2012, which comprises more than 508 million page revisions, summing up to 7.9 TB. Our analysis reveals (a) how the incidence and the extent of flaws have evolved, and, (b) how the handling and the perception of flaws have changed over time.

(4) We are the first who operationalize an algorithmic prediction of quality flaws in Wikipedia. We cast quality flaw prediction as a one-class classification problem, develop a tailored quality flaw model, and employ a dedicated one-class machine learning approach. A comprehensive evaluation based on human-labeled Wikipedia articles underlines the practical applicability of our approach.
0 0
Analyzing multi-dimensional networks within mediawikis Brian C. Keegan
Ceni A.
Smith M.A.
Proceedings of the 9th International Symposium on Open Collaboration, WikiSym + OpenSym 2013 English 2013 The MediaWiki platform supports popular socio-technical systems such as Wikipedia as well as thousands of other wikis. This software encodes and records a variety of relationships about the content, history, and editors of its articles, such as hyperlinks between articles, discussions among editors, and editing histories. These relationships can be analyzed using standard techniques from social network analysis; however, extracting relational data from Wikipedia has traditionally required specialized knowledge of its API, information retrieval, network analysis, and data visualization that has inhibited scholarly analysis. We present a software library called the NodeXL MediaWiki Importer that extracts a variety of relationships from the MediaWiki API and integrates with the popular NodeXL network analysis and visualization software. This library allows users to query and extract a variety of multidimensional relationships from any MediaWiki installation with a publicly-accessible API. We present a case study examining the similarities and differences between different relationships for the Wikipedia articles about "Pope Francis" and "Social media." We conclude by discussing the implications this library has for both theoretical and methodological research as well as community management, and outline future work to expand the capabilities of the library. 0 0
Arabic WordNet semantic relations enrichment through morpho-lexical patterns Boudabous M.M.
Chaaben Kammoun N.
Khedher N.
Belguith L.H.
Sadat F.
2013 1st International Conference on Communications, Signal Processing and Their Applications, ICCSPA 2013 English 2013 Arabic WordNet (AWN) ontology is one of the most interesting lexical resources for Modern Standard Arabic. Although, its development is based on Princeton WordNet, it suffers from some weaknesses such as the absence of some words and some semantic relations between synsets. In this paper we propose a linguistic method based on morpho-lexical patterns to add semantic relations between synsets in order to improve the AWN performance. This method relies on two steps: morpho-lexical patterns definition and Semantic relations enrichment. We will take advantage of defined patterns to propose a hybrid method for building Arabic ontology based on Wikipedia. 0 0
Arguments about deletion: How experience improves the acceptability of arguments in ad-hoc online task groups Jodi Schneider
Samp K.
Alexandre Passant
Stefan Decker
English 2013 Increasingly, ad-hoc online task groups must make decisions about jointly created artifacts such as open source software and Wikipedia articles. Time-consuming and laborious attention to textual discussions is needed to make such decisions, for which computer support would be beneficial. Yet there has been little study of the argumentation patterns that distributed ad-hoc online task groups use in evaluation and decision-making. In a corpus of English Wikipedia deletion discussions, we investigate the argumentation schemes used, the role of the arguer's experience, and which arguments are acceptable to the audience. We report three main results: First, the most prevalent patterns are the Rules and Evidence schemes from Walton's catalog of argumentation schemes [34], which comprise 36% of arguments. Second, we find that familiarity with community norms correlates with the novices' ability to craft persuasive arguments. Third, acceptable arguments use community-appropriate rhetoric that demonstrates knowledge of policies and community values, while problematic arguments are based on personal preference and inappropriate analogy to other cases. 0 0
Assessing quality score of wikipedia articles using mutual evaluation of editors and texts Yu Suzuki
Masatoshi Yoshikawa
International Conference on Information and Knowledge Management, Proceedings English 2013 In this paper, we propose a method for assessing quality scores of Wikipedia articles by mutually evaluating editors and texts. The survival-ratio-based approach is a major approach to assessing article quality. In this approach, when a text survives beyond multiple edits, the text is assessed as good quality, because poor quality texts have a high probability of being deleted by editors. However, many vandals, i.e. low quality editors, frequently delete good quality texts, which improperly decreases the survival ratios of good quality texts. As a result, many good quality texts are unfairly assessed as poor quality. In our method, we consider editor quality scores when calculating text quality scores, which reduces the impact of vandals on text quality. Using this improvement, the accuracy of the text quality score should be improved. However, an inherent problem with this idea is that the editor quality scores are themselves calculated from the text quality scores. To solve this problem, we mutually calculate the editor and text quality scores until they converge. In this paper, we prove that the text quality score converges. We conducted an experimental evaluation and confirmed that our proposed method can accurately assess text quality scores. 0 0
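A minimal sketch of the mutual-evaluation idea, with update rules that are illustrative rather than the paper's exact formulas: text scores are averages over the editors who retained the text, editor scores are averages over the texts they worked on, and the two are iterated toward a fixed point.

```python
# Minimal sketch of mutually reinforcing quality scores: a text's score depends on
# the editors who kept it, and an editor's score depends on the texts they touched.
import numpy as np

def mutual_quality(kept, touched, iterations=50):
    """kept[i, j] = 1 if editor j retained text i; touched[i, j] = 1 if editor j edited text i."""
    n_texts, n_editors = kept.shape
    text_q = np.ones(n_texts)
    editor_q = np.ones(n_editors)
    for _ in range(iterations):
        # text quality: average quality of the editors who retained it
        text_q = (kept @ editor_q) / np.maximum(kept.sum(axis=1), 1)
        # editor quality: average quality of the texts they worked on
        editor_q = (touched.T @ text_q) / np.maximum(touched.sum(axis=0), 1)
        editor_q /= max(editor_q.max(), 1e-9)   # normalise to keep the iteration bounded
    return text_q, editor_q

kept = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 0]], dtype=float)
touched = np.array([[1, 1, 1], [0, 1, 1], [1, 0, 1]], dtype=float)
print(mutual_quality(kept, touched))
```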
Assessing trustworthiness in collaborative environments Segall J.
Mayhew M.J.
Atighetchi M.
Greenstadt R.
ACM International Conference Proceeding Series English 2013 Collaborative environments, specifically those concerning information creation and exchange, increasingly demand notions of trust and accountability. In the absence of explicit authority, the quality of information is often unknown. Using Wikipedia edit sequences as a use case scenario, we detail experiments in the determination of community-based user and document trust. Our results show success in answering the first of many research questions: Provided a user's edit history, is a given edit to a document positively contributing to its content? We detail how the ability to answer this question provides a preliminary framework towards a better model for collaborative trust and discuss subsequent areas of research necessary to broaden its utility and scope. 0 0
Attributing authorship of revisioned content Luca de Alfaro
Shavlovsky M.
WWW 2013 - Proceedings of the 22nd International Conference on World Wide Web English 2013 A considerable portion of web content, from wikis to collaboratively edited documents, to code posted online, is revisioned. We consider the problem of attributing authorship to such revisioned content, and we develop scalable attribution algorithms that can be applied to very large bodies of revisioned content, such as the English Wikipedia. Since content can be deleted, only to be later re-inserted, we introduce a notion of authorship that requires comparing each new revision with the entire set of past revisions. For each portion of content in the newest revision, we search the entire history for content matches that are statistically unlikely to occur spontaneously, thus denoting common origin. We use these matches to compute the earliest possible attribution of each word (or each token) of the new content. We show that this "earliest plausible attribution" can be computed efficiently via compact summaries of the past revision history. This leads to an algorithm that runs in time proportional to the sum of the size of the most recent revision, and the total amount of change (edit work) in the revision history. This amount of change is typically much smaller than the total size of all past revisions. The resulting algorithm can scale to very large repositories of revisioned content, as we show via experimental data over the English Wikipedia. 0 0
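The notion of earliest plausible attribution can be illustrated at the level of single tokens, attributing each token of the newest revision to the earliest revision in which it appears; the actual algorithm matches longer, statistically significant content runs and works from compact summaries of the history rather than rescanning every revision.

```python
# Minimal sketch of earliest-plausible attribution at the level of single tokens:
# each token in the newest revision is attributed to the earliest revision in which
# it appears anywhere in the history.
def attribute_tokens(revisions):
    """revisions: list of token lists, oldest first. Returns (token, revision index) pairs."""
    origin = {}
    for idx, tokens in enumerate(revisions):
        for tok in tokens:
            origin.setdefault(tok, idx)
    newest = revisions[-1]
    return [(tok, origin[tok]) for tok in newest]

history = [
    "alan turing was a mathematician".split(),
    "alan turing was a mathematician and cryptanalyst".split(),
    "turing was a british mathematician and cryptanalyst".split(),
]
print(attribute_tokens(history))
```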
Automated Decision support for human tasks in a collaborative system: The case of deletion in wikipedia Gelley B.S.
Suel T.
Proceedings of the 9th International Symposium on Open Collaboration, WikiSym + OpenSym 2013 English 2013 Wikipedia's low barriers to participation have the unintended effect of attracting a large number of articles whose topics do not meet Wikipedia's inclusion standards. Many are quickly deleted, often causing their creators to stop contributing to the site. We collect and make available several datasets of deleted articles, heretofore inaccessible, and use them to create a model that can predict with high precision whether or not an article will be deleted. We report precision of 98.6% and recall of 97.5% in the best case and high precision with lower, but still useful, recall, in the most difficult case. We propose to deploy a system utilizing this model on Wikipedia as a set of decision-support tools to help article creators evaluate and improve their articles before posting, and new article patrollers make more informed decisions about which articles to delete and which to improve. 0 0
Automated non-content word list generation using hLDA Krug W.
Tomlinson M.T.
FLAIRS 2013 - Proceedings of the 26th International Florida Artificial Intelligence Research Society Conference English 2013 In this paper, we present a language-independent method for the automatic, unsupervised extraction of non-content words from a corpus of documents. This method permits the creation of word lists that may be used in place of traditional function word lists in various natural language processing tasks. As an example we generated lists of words from a corpus of English, Chinese, and Russian posts extracted from Wikipedia articles and Wikipedia Wikitalk discussion pages. We applied these lists to the task of authorship attribution on this corpus to compare the effectiveness of lists of words extracted with this method to expert-created function word lists and frequent word lists (a common alternative to function word lists). hLDA lists perform comparably to frequent word lists. The trials also show that corpus-derived lists tend to perform better than more generic lists, and both sets of generated lists significantly outperformed the expert lists. Additionally, we evaluated the performance of an English expert list on machine translations of our Chinese and Russian documents, showing that our method also outperforms this alternative. 0 0
Automated query learning with Wikipedia and genetic programming Pekka Malo
Pyry Siitari
Ankur Sinha
Artificial Intelligence English 2013 Most of the existing information retrieval systems are based on bag-of-words model and are not equipped with common world knowledge. Work has been done towards improving the efficiency of such systems by using intelligent algorithms to generate search queries, however, not much research has been done in the direction of incorporating human-and-society level knowledge in the queries. This paper is one of the first attempts where such information is incorporated into the search queries using Wikipedia semantics. The paper presents Wikipedia-based Evolutionary Semantics (Wiki-ES) framework for generating concept based queries using a set of relevance statements provided by the user. The query learning is handled by a co-evolving genetic programming procedure. To evaluate the proposed framework, the system is compared to a bag-of-words based genetic programming framework as well as to a number of alternative document filtering techniques. The results obtained using Reuters newswire documents are encouraging. In particular, the injection of Wikipedia semantics into a GP-algorithm leads to improvement in average recall and precision, when compared to a similar system without human knowledge. A further comparison against other document filtering frameworks suggests that the proposed GP-method also performs well when compared with systems that do not rely on query-expression learning. 0 1
Automatic extraction of Polish language errors from text edition history Grundkiewicz R. Lecture Notes in Computer Science English 2013 There are no large error corpora for a number of languages, despite the fact that they have multiple applications in natural language processing. The main reason underlying this situation is the high cost of manual corpus creation. In this paper we present methods for the automatic extraction of various kinds of errors, such as spelling, typographical, grammatical, syntactic, semantic, and stylistic ones, from text edition histories. By applying these methods to Wikipedia's article revision history, we created a large and publicly available corpus of naturally-occurring language errors for Polish, called PlEWi. Finally, we analyse and evaluate the detected error categories in our corpus. 0 0
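One simple way to mine such corrections is to take word-level diffs between consecutive revisions and keep the replaced spans as candidate (error, correction) pairs, as sketched below; a real pipeline along the lines of the paper would add substantial filtering to separate genuine language errors from content edits and vandalism.

```python
# Minimal sketch of harvesting candidate error corrections from an edit history:
# word-level diffs between consecutive revisions yield (before, after) pairs that
# can later be filtered into spelling, grammatical, or stylistic error categories.
import difflib

def candidate_corrections(old_text, new_text):
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words)
    pairs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            pairs.append((" ".join(old_words[i1:i2]), " ".join(new_words[j1:j2])))
    return pairs

rev1 = "Wikipedia is a free encyclopedia writen by volunteers"
rev2 = "Wikipedia is a free encyclopedia written by volunteers"
print(candidate_corrections(rev1, rev2))   # [('writen', 'written')]
```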
Automatic keyphrase annotation of scientific documents using Wikipedia and genetic algorithms Joorabchi A.
Mahdi A.E.
Journal of Information Science English 2013 Topical annotation of documents with keyphrases is a proven method for revealing the subject of scientific and research documents to both human readers and information retrieval systems. This article describes a machine learning-based keyphrase annotation method for scientific documents that utilizes Wikipedia as a thesaurus for candidate selection from documents' content. We have devised a set of 20 statistical, positional and semantical features for candidate phrases to capture and reflect various properties of those candidates that have the highest keyphraseness probability. We first introduce a simple unsupervised method for ranking and filtering the most probable keyphrases, and then evolve it into a novel supervised method using genetic algorithms. We have evaluated the performance of both methods on a third-party dataset of research papers. Reported experimental results show that the performance of our proposed methods, measured in terms of consistency with human annotators, is on a par with that achieved by humans and outperforms rival supervised and unsupervised methods. 0 0
Automatic readability classification of crowd-sourced data based on linguistic and information-theoretic features Zahurul Islam
Alexander Mehler
Computacion y Sistemas English 2013 This paper presents a classifier of text readability based on information-theoretic features. The classifier was developed based on a linguistic approach to readability that explores lexical, syntactic and semantic features. For this evaluation we extracted a corpus of 645 articles from Wikipedia together with their quality judgments. We show that information-theoretic features perform as well as their linguistic counterparts even if we explore several linguistic levels at once. 0 0
Automatic topic ontology construction using semantic relations from wordnet and wikipedia Subramaniyaswamy V. International Journal of Intelligent Information Technologies English 2013 Due to the explosive growth of web technology, a huge amount of information is available as web resources over the Internet. Therefore, in order to access the relevant content from the web resources effectively, considerable attention is paid on the semantic web for efficient knowledge sharing and interoperability. Topic ontology is a hierarchy of a set of topics that are interconnected using semantic relations, which is being increasingly used in the web mining techniques. Reviews of the past research reveal that semiautomatic ontology is not capable of handling high usage. This shortcoming prompted the authors to develop an automatic topic ontology construction process. However, in the past many attempts have been made by other researchers to utilize the automatic construction of ontology, which turned out to be challenging due to time, cost and maintenance. In this paper, the authors have proposed a corpus based novel approach to enrich the set of categories in the ODP by automatically identifying the concepts and their associated semantic relationship with corpus based external knowledge resources, such as Wikipedia and WordNet. This topic ontology construction approach relies on concept acquisition and semantic relation extraction. A Jena API framework has been developed to organize the set of extracted semantic concepts, while Protégé provides the platform to visualize the automatically constructed topic ontology. To evaluate the performance, web documents were classified using SVM classifier based on ODP and topic ontology. The topic ontology based classification produced better accuracy than ODP. 0 0
Automatically building templates for entity summary construction Li P.
Yafang Wang
Jian Jiang
Information Processing and Management English 2013 In this paper, we propose a novel approach to automatic generation of summary templates from given collections of summary articles. We first develop an entity-aspect LDA model to simultaneously cluster both sentences and words into aspects. We then apply frequent subtree pattern mining on the dependency parse trees of the clustered and labeled sentences to discover sentence patterns that well represent the aspects. Finally, we use the generated templates to construct summaries for new entities. Key features of our method include automatic grouping of semantically related sentence patterns and automatic identification of template slots that need to be filled in. Also, we implement a new sentence compression algorithm which uses dependency trees instead of parse trees. We apply our method to five Wikipedia entity categories and compare our method with three baseline methods. Both quantitative evaluation based on human judgment and qualitative comparison demonstrate the effectiveness and advantages of our method. 0 0
Automating document annotation using open source knowledge Apoorv Singhal
Kasturi R.
Srivastava J.
Proceedings - 2013 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2013 English 2013 Annotating documents with relevant and comprehensive keywords offers invaluable assistance to readers in quickly getting an overview of a document. The problem of document annotation is addressed in the literature under two broad classes of techniques, namely key phrase extraction and key phrase abstraction. In this paper, we propose a novel approach to generate summary phrases for research documents. Given the dynamic nature of scientific research, it has become important to incorporate new and popular scientific terminologies in document annotations. For this purpose, we have used crowd-sourced knowledge bases like Wikipedia and WikiCFP (an open-source information source for calls for papers) for automating key phrase generation. Also, we have taken into account the lack of availability of the document's content (due to protective policies) and developed a global context based key-phrase identification approach. We show that given only the title of a document, the proposed approach generates its global context information using academic search engines like Google Scholar. We evaluated the performance of the proposed approach on a real-world dataset obtained from a computer science research document corpus, and compared it quantitatively with two baseline approaches. 0 0
Beyond open source software: Framework and implications for open content research Chitu Okoli
Carillo K.D.A.
ECIS 2013 - Proceedings of the 21st European Conference on Information Systems English 2013 The same open source philosophy that has been traditionally applied to software development can be applied to the collaborative creation of non-software information products, such as books, music and video. Such products are generically referred to as open content. Due largely to the success of large projects such as Wikipedia and the Creative Commons, open content has gained increasing attention not only in the popular media, but also in scholarly research. It is important to investigate the workings of the open source process in these new media of expression. This paper introduces the scope of emerging research on the open content phenomenon beyond open source software. We develop a framework for categorizing copyrightable works as utilitarian, factual, aesthetic or opinioned works. Based on these categories, we review some key theory-driven findings from open source software research and assess the applicability of extending their implications to open content. We present a research agenda that integrates the findings and proposes a list of research topics that can help lay a solid foundation for open content research. 0 0
BlueFinder: Recommending wikipedia links using DBpedia properties Torres D.
Hala Skaf-Molli
Pascal Molli
Diaz A.
Proceedings of the 3rd Annual ACM Web Science Conference, WebSci 2013 English 2013 The DBpedia knowledge base has been built from data extracted from Wikipedia. However, many existing relations among resources in DBpedia are missing as links among the corresponding Wikipedia articles. In some cases, adding these links to Wikipedia will enrich Wikipedia content and therefore enable better navigation. In previous work, we proposed the PIA algorithm, which predicts the best link to connect two Wikipedia articles corresponding to resources related by a semantic property in DBpedia while respecting Wikipedia conventions. PIA calculates this link as a path query. After introducing PIA results into Wikipedia, most of them were accepted by the Wikipedia community; however, some were rejected because PIA predicts path queries that are too general. In this paper, we report on the BlueFinder collaborative filtering algorithm, which fixes PIA's miscalculations and is sensitive to the specificity of the resource types. According to the conducted experimentation, BlueFinder is a better solution than PIA because it solves more cases with better recall. 0 0
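The starting point for BlueFinder is a semantic property in DBpedia that lacks a corresponding navigation path in Wikipedia; a minimal way to inspect such properties is DBpedia's public SPARQL endpoint, sketched below with SPARQLWrapper. The example resources (Paris, France) are illustrative and not taken from the paper, and the path-query prediction itself is not shown.

# List DBpedia properties that directly link two example resources.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?p WHERE {
        <http://dbpedia.org/resource/Paris> ?p <http://dbpedia.org/resource/France> .
    }
""")
sparql.setReturnFormat(JSON)

for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["p"]["value"])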
Bookmark recommendation in social bookmarking services using Wikipedia Yoshida T.
Inoue U.
2013 IEEE/ACIS 12th International Conference on Computer and Information Science, ICIS 2013 - Proceedings English 2013 Social bookmarking systems allow users to attach freely chosen keywords as tags to bookmarks of web pages. These tags are used to recommend relevant bookmarks to other users. However, because of the diversity of tags, there is no guarantee that every user gets enough bookmarks recommended. In this paper, we propose a personalized recommender system using Wikipedia. Our system extends a tag set to find similar users and relevant bookmarks by using the Wikipedia category database. The experimental results show a significant increase in relevant bookmarks recommended without a notable increase in noise. 0 0
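A minimal sketch of the tag-set extension idea, assuming the Wikipedia category data is reached through the public MediaWiki API rather than a local category database as in the paper; the user-similarity and recommendation steps are omitted.

# Expand a tag with the categories of its matching Wikipedia article.
import requests

API = "https://en.wikipedia.org/w/api.php"

def expand_tag(tag):
    params = {
        "action": "query",
        "titles": tag,
        "prop": "categories",
        "clshow": "!hidden",
        "cllimit": 20,
        "format": "json",
    }
    pages = requests.get(API, params=params, timeout=10).json()["query"]["pages"]
    expanded = []
    for page in pages.values():
        for cat in page.get("categories", []):
            expanded.append(cat["title"].removeprefix("Category:"))
    return expanded

print(expand_tag("Python (programming language)"))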
Boosting cross-lingual knowledge linking via concept annotation Zhe Wang
Jing-Woei Li
Tang J.
IJCAI International Joint Conference on Artificial Intelligence English 2013 Automatically discovering cross-lingual links (CLs) between wikis can largely enrich the cross-lingual knowledge and facilitate knowledge sharing across different languages. In most existing approaches for cross-lingual knowledge linking, the seed CLs and the inner link structures are two important factors for finding new CLs. When there are insufficient seed CLs and inner links, discovering new CLs becomes a challenging problem. In this paper, we propose an approach that boosts cross-lingual knowledge linking by concept annotation. Given a small number of seed CLs and inner links, our approach first enriches the inner links in wikis by using concept annotation method, and then predicts new CLs with a regression-based learning model. These two steps mutually reinforce each other, and are executed iteratively to find as many CLs as possible. Experimental results on the English and Chinese Wikipedia data show that the concept annotation can effectively improve the quantity and quality of predicted CLs. With 50,000 seed CLs and 30% of the original inner links in Wikipedia, our approach discovered 171,393 more CLs in four runs when using concept annotation. 0 0
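A toy sketch of the regression-based learning step under heavy simplification: candidate cross-language article pairs are scored from a few made-up link-overlap features with logistic regression. The features, the numbers, and the use of logistic regression in place of the paper's model are illustrative assumptions.

# Score candidate cross-lingual pairs from toy link-overlap features.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [shared inner links, outlink overlap ratio, title similarity]
X_train = np.array([[12, 0.6, 0.9], [1, 0.05, 0.2], [8, 0.4, 0.7], [0, 0.0, 0.1]])
y_train = np.array([1, 0, 1, 0])              # 1 = known cross-lingual link (seed CL)

model = LogisticRegression().fit(X_train, y_train)

candidates = np.array([[10, 0.5, 0.8], [2, 0.1, 0.3]])
print(model.predict_proba(candidates)[:, 1])  # probability each pair is a new CL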
Boot-strapping language identifiers for short colloquial postings Goldszmidt M.
Najork M.
Paparizos S.
Lecture Notes in Computer Science English 2013 There is tremendous interest in mining the abundant user generated content on the web. Many analysis techniques are language dependent and rely on accurate language identification as a building block. Even though there is already research on language identification, it has focused on very 'clean' editorially managed corpora, on a limited number of languages, and on relatively large-sized documents. These are not the characteristics of the content to be found in, say, Twitter or Facebook postings, which are short and riddled with vernacular. In this paper, we propose an automated, unsupervised, scalable solution based on publicly available data. To this end we thoroughly evaluate the use of Wikipedia to build language identifiers for a large number of languages (52) and a large corpus, and conduct a large scale study of the best-known algorithms for automated language identification, quantifying how accuracy varies with document size, language (model) profile size and number of languages tested. Then, we show the value of using Wikipedia to train a language identifier directly applicable to Twitter. Finally, we augment the language models and customize them to Twitter by combining our Wikipedia models with location information from tweets. This method provides a massive amount of automatically labeled data that acts as a bootstrapping mechanism, which we empirically show boosts the accuracy of the models. With this work we provide a guide and a publicly available tool [1] to the mining community for language identification on web and social data. 0 0
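A minimal character n-gram identifier of the kind that could be trained on per-language Wikipedia text; the three tiny training snippets stand in for real Wikipedia corpora, and the Twitter-specific location priors described above are not included.

# Character n-gram Naive Bayes language identifier on a toy training set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "the quick brown fox jumps over the lazy dog",
    "der schnelle braune fuchs springt ueber den faulen hund",
    "le rapide renard brun saute par dessus le chien paresseux",
]
train_langs = ["en", "de", "fr"]

clf = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),   # character n-grams
    MultinomialNB(),
)
clf.fit(train_texts, train_langs)

print(clf.predict(["u up? c u l8r at the match"]))             # short, noisy input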
Building, maintaining, and using knowledge bases: A report from the trenches Deshpande O.
Lamba D.S.
Tourn M.
Sanmay Das
Subramaniam S.
Rajaraman A.
Harinarayan V.
Doan A.
Proceedings of the ACM SIGMOD International Conference on Management of Data English 2013 A knowledge base (KB) contains a set of concepts, instances, and relationships. Over the past decade, numerous KBs have been built, and used to power a growing array of applications. Despite this flurry of activity, however, surprisingly little has been published about the end-to-end process of building, maintaining, and using such KBs in industry. In this paper we describe such a process. In particular, we describe how we build, update, and curate a large KB at Kosmix, a Bay Area startup, and later at WalmartLabs, a development and research lab of Walmart. We discuss how we use this KB to power a range of applications, including query understanding, Deep Web search, in-context advertising, event monitoring in social media, product search, social gifting, and social mining. Finally, we discuss how the KB team is organized, and the lessons learned. Our goal with this paper is to provide a real-world case study, and to contribute to the emerging direction of building, maintaining, and using knowledge bases for data management applications. 0 0
C# based media center Arsan T.
Sen R.
Ersoy B.
Devri K.K.
Lecture Notes in Electrical Engineering English 2013 In this paper, we design and implement a novel all-in-one Media Center that can be directly connected to a high-definition television (HDTV). C# programming is used to develop a modular, structured media center for home entertainment, so it is possible and easy to add a limitless number of new modules and software components. Most importantly, the user interface is designed around two factors: simplicity and tidiness. The proposed media center lets users listen to music/radio, watch TV, connect to the Internet, view online videos, edit videos, reach the pharmacy on duty over the Internet, check weather conditions and song lyrics, burn CDs/DVDs, and connect to Wikipedia. All modules and design steps are explained in detail for a user-friendly, cost-effective all-in-one media center. 0 0
Can a Wiki be used as a knowledge service platform? Lin F.-R.
Wang C.-R.
Huang H.-Y.
Advances in Intelligent Systems and Computing English 2013 Many knowledge services have been developed as a matching platform for knowledge demanders and providers. However, most of these knowledge services share a common drawback: they cannot provide a list of experts corresponding to the knowledge demanders' needs. Knowledge demanders have to post their questions in a public area and then wait patiently until corresponding knowledge providers appear. To help knowledge demanders acquire knowledge, this study proposes a knowledge service system based on Wikipedia that actively informs potential knowledge providers on behalf of knowledge demanders. This study also developed a knowledge activity map system, used by the knowledge service system to identify Wikipedians' knowledge domains. The experimental evaluation results show that the knowledge service system is acceptable to lead users on Wikipedia, whose domain knowledge can be identified and represented on their knowledge activity maps. 0 0
Can the Web turn into a digital library? Maurer H.
Mueller H.
International Journal on Digital Libraries English 2013 There is no doubt that the enormous amounts of information on the WWW are influencing how we work, live, learn and think. However, information on the WWW is in general too chaotic and not reliable enough, and specific material is often too difficult to locate, for the WWW to be considered a serious digital library. In this paper we concentrate on the question of how we can retrieve reliable information from the Web, a task that is fraught with problems but essential if the WWW is to be used as a serious digital library. It turns out that the use of search engines has many dangers. We point out some of the ways those dangers can be reduced and dangerous traps avoided. Another approach to finding useful information on the Web is to use "classical" resources of information like specialized dictionaries, lexica or encyclopaedias in electronic form, such as the Britannica. Although it seemed for a while that such resources might more or less disappear from the Web due to attempts such as Wikipedia, some of the classical encyclopaedias and specialized offerings have picked up steam again and should not be ignored. They do sometimes suffer from what we call the "wishy-washy" syndrome explained in this paper. It is interesting to note that Wikipedia, which is also larger than all other encyclopaedias (at least the English version), is less afflicted by this syndrome, yet has some other serious drawbacks. We discuss how those could be avoided and present a system, halfway between prototype and production system, that takes care of many of the aforementioned problems and hence may be a model for further undertakings in turning (part of) the Web into a usable digital library. 0 0
Characterizing and curating conversation threads: Expansion, focus, volume, re-entry Backstrom L.
Kleinberg J.
Lena Lee
Cristian Danescu-Niculescu-Mizil
WSDM 2013 - Proceedings of the 6th ACM International Conference on Web Search and Data Mining English 2013 Discussion threads form a central part of the experience on many Web sites, including social networking sites such as Facebook and Google Plus and knowledge creation sites such as Wikipedia. To help users manage the challenge of allocating their attention among the discussions that are relevant to them, there has been a growing need for the algorithmic curation of on-line conversations: the development of automated methods to select a subset of discussions to present to a user. Here we consider two key sub-problems inherent in conversational curation: length prediction (predicting the number of comments a discussion thread will receive) and the novel task of re-entry prediction (predicting whether a user who has participated in a thread will later contribute another comment to it). The first of these sub-problems arises in estimating how interesting a thread is, in the sense of generating a lot of conversation; the second can help determine whether users should be kept notified of the progress of a thread to which they have already contributed. We develop and evaluate a range of approaches for these tasks, based on an analysis of the network structure and arrival pattern among the participants, as well as a novel dichotomy in the structure of long threads. We find that, for both tasks, learning-based approaches that use these sources of information perform well. 0 0
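As a toy illustration of the length-prediction sub-problem only, the sketch below treats it as regression over a few invented early-thread features; the features, values and model choice are assumptions for illustration, not the paper's approach.

# Predict final thread length from toy early-thread features.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Each row: [comments in first hour, distinct participants, mean reply gap (minutes)]
X = np.array([[5, 4, 10], [1, 1, 55], [8, 6, 6], [2, 2, 40], [12, 9, 4]])
y = np.array([40, 3, 75, 8, 120])             # observed final comment counts

model = GradientBoostingRegressor(random_state=0).fit(X, y)
print(model.predict([[6, 5, 9]]))             # estimated final length of a new thread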
Chinese text filtering based on domain keywords extracted from Wikipedia Xiaolong Wang
Hua Li
Jia Y.
Jin S.
Lecture Notes in Electrical Engineering English 2013 Several machine learning and information retrieval algorithms have been used for text filtering. All these methods share a common requirement: they need positive and negative examples to build a user profile. However, not all applications can obtain good training documents. In this paper, we present a Wikipedia-based method to build a user profile without any other training documents. The proposed method extracts keywords of a specific category from the Wikipedia taxonomy and computes the weights of the extracted keywords based on Wikipedia pages. Experimental results on the Chinese news text dataset SogouC show that the proposed method achieves good performance. 0 0
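A minimal sketch of the filtering step only: documents are scored against a keyword-weight profile and kept above a threshold. The keywords and weights below are placeholders for those the paper extracts from the Wikipedia taxonomy and pages.

# Score documents against a keyword profile with a simple normalized overlap.
import math

profile = {"football": 0.9, "league": 0.6, "goal": 0.5, "coach": 0.4}   # placeholder weights

def score(document, profile):
    words = document.lower().split()
    overlap = sum(profile.get(w, 0.0) for w in words)
    norm = math.sqrt(sum(v * v for v in profile.values())) * math.sqrt(len(words) or 1)
    return overlap / norm

documents = [
    "the coach praised the league goal record",
    "the senate passed a new budget bill",
]
for doc in documents:
    print(round(score(doc, profile), 3), doc)      # keep documents above a chosen threshold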
Classification of scientific publications according to library controlled vocabularies: A new concept matching-based approach Joorabchi A.
Mahdi A.E.
Library Hi Tech English 2013 Purpose: This paper aims to report on the design and development of a new approach for automatic classification and subject indexing of research documents in scientific digital libraries and repositories (DLR) according to library controlled vocabularies such as DDC and FAST. Design/methodology/approach: The proposed concept matching-based approach (CMA) detects key Wikipedia concepts occurring in a document and searches the OPACs of conventional libraries via querying the WorldCat database to retrieve a set of MARC records which share one or more of the detected key concepts. Then the semantic similarity of each retrieved MARC record to the document is measured and, using an inference algorithm, the DDC classes and FAST subjects of those MARC records which have the highest similarity to the document are assigned to it. Findings: The performance of the proposed method in terms of the accuracy of the DDC classes and FAST subjects automatically assigned to a set of research documents is evaluated using standard information retrieval measures of precision, recall, and F1. The authors demonstrate the superiority of the proposed approach in terms of accuracy performance in comparison to a similar system currently deployed in a large scale scientific search engine. Originality/value: The proposed approach enables the development of a new type of subject classification system for DLR, and addresses some of the problems similar systems suffer from, such as the problem of imbalanced training data encountered by machine learning-based systems, and the problem of word-sense ambiguity encountered by string matching-based systems. 0 0
Clustering editors of wikipedia by editor's biases Nakamura A.
Yu Suzuki
Ishikawa Y.
Proceedings - 2013 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2013 English 2013 Wikipedia is an Internet encyclopedia where any user can edit articles. Because editors act on their own judgment, editors' biases are reflected in edit actions. When editors' biases are reflected in articles, the articles have low credibility; however, it is difficult for users to judge which parts of an article are biased. In this paper, we propose a method for clustering editors by their biases, so that texts' biases can be identified from the biases of their editors and users can be aided in judging the credibility of each description. If each text is marked, for example by color, users can use this when judging its credibility. Our system makes use of the relationships between editors: agreement and disagreement. We assume that editors keep texts written by editors they agree with and delete texts written by editors they disagree with. In addition, we can consider that editors who agree with each other have similar biases, and editors who disagree with each other have different biases. Hence, the relationships between editors make it possible to classify editors by bias. In an experimental evaluation, we verify that our proposed method is useful for clustering editors by bias. Additionally, we validate that considering the dependency between editors improves the clustering performance. 0 0
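A toy sketch of the clustering idea: pairwise agreement scores between editors (positive when one keeps the other's text, negative when one deletes it) are rescaled into an affinity matrix and clustered. The editors, the scores and the use of spectral clustering are illustrative assumptions, not the paper's exact method.

# Cluster editors from an invented signed agreement matrix.
import numpy as np
from sklearn.cluster import SpectralClustering

editors = ["A", "B", "C", "D"]
agreement = np.array([
    [ 1.0,  0.8, -0.6, -0.7],
    [ 0.8,  1.0, -0.5, -0.6],
    [-0.6, -0.5,  1.0,  0.9],
    [-0.7, -0.6,  0.9,  1.0],
])

affinity = (agreement + 1.0) / 2.0            # shift signed scores into [0, 1]
labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(affinity)

print(dict(zip(editors, labels)))             # editors grouped by similar bias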
Collective action towards enhanced knowledge management of neglected and underutilised species: Making use of internet opportunities Hermann M.
Kwek M.J.
Khoo T.K.
Amaya K.
Acta Horticulturae English 2013 The disproportionate use of crops - with a few species accounting for most of global food production - is being reinforced by the considerable research, breeding and development efforts that make global crops so competitive vis-à-vis "neglected and underutilised species" (NUS). NUS promotional rhetoric, preaching to the converted, complaints about the discrimination against the "food of the poor" and the loss of traditional dietary habits are unlikely to reverse the neglect of the vast majority of crop species. We need to lessen the supply and demand constraints that affect the production and consumption of NUS. NUS attributes relevant to consumers, nutrition and climate change need to be substantiated, demand for NUS stimulated, discriminating agricultural and trade policies amended, and donors convinced to make greater investments in NUS research and development. Much fascinating NUS research and development is underway, but much of this is dissipated amongst countries, institutions and taxa. Researchers operate in unsupportive environments and are often unaware of each other's work. Their efforts remain unrecognised as addressing global concerns. We suggest that the much-needed enhancement of NUS knowledge management should be at the centre of collective efforts of the NUS community. This will underpin future research and development advances as well as inform the formulation and advocacy of policies. This paper recommends that the NUS community make greater use of Internet knowledge repositories to deposit research results, publications and images into the public domain. As examples of such a low-cost approach, we assess the usefulness of Wikipedia, Google Books and Wikimedia Commons for the documentation and dissemination of NUS knowledge. We urge donors and administrators to promote and encourage the use of these and other public and electronically accessible repositories as sources of verification for the achievement of project and research outputs. 0 0
Combining lexical and semantic features for short text classification Yang L.
Chenliang Li
Ding Q.
Li L.
Procedia Computer Science English 2013 In this paper, we propose a novel approach to classifying short texts by combining their lexical and semantic features. We present an improved measurement method for lexical feature selection and obtain the semantic features from a background knowledge repository that covers the target category domains. The combination of lexical and semantic features is achieved by mapping words to topics with different weights. In this way, the dimensionality of the feature space is reduced to the number of topics. Here we use Wikipedia as the background knowledge and employ a Support Vector Machine (SVM) as the classifier. The experimental results show that our approach is more effective than existing methods for classifying short texts. 0 0
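A simplified sketch of combining lexical and semantic features: tf-idf vectors are concatenated with LDA topic proportions and fed to an SVM. Here the topics are fit on the toy corpus itself rather than on Wikipedia, and the corpus and labels are invented for illustration.

# Combine tf-idf (lexical) and LDA topic (semantic) features for short texts.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC

texts = ["win the football match", "stock prices fell sharply",
         "the team scored a late goal", "markets rallied on strong earnings"]
labels = ["sports", "finance", "sports", "finance"]

tfidf = TfidfVectorizer()
counts = CountVectorizer()
lda = LatentDirichletAllocation(n_components=2, random_state=0)

X_lex = tfidf.fit_transform(texts).toarray()
X_sem = lda.fit_transform(counts.fit_transform(texts))
X = np.hstack([X_lex, X_sem])                  # lexical + semantic feature matrix

clf = LinearSVC().fit(X, labels)

test = ["goal in the final minute"]
X_test = np.hstack([tfidf.transform(test).toarray(),
                    lda.transform(counts.transform(test))])
print(clf.predict(X_test))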
… further results

Jacso 2002

Wikipedia is a 2002 journal article by Peter Jacso, published in Online (Wilton, Connecticut).

Abstract

This section requires expansion. Please, help!

References

This section requires expansion. Please, help!

Cited by

This publication has 1 citation. Only those publications available in WikiPapers are shown here:

