Lecture Notes in Computer Science



Only those publications related to wikis already available at WikiPapers are shown here.
Title Author(s) Keyword(s) Language Date Abstract R C
A correlation-based semantic model for text search Sun J.
Bin Wang
Yang X.
Semantic correlation
Text search
English 2014 With the exponential growth of texts on the Internet, text search is considered a crucial problem in many fields. Most of the traditional text search approaches are based on a "bag of words" text representation built on frequency statistics. However, these approaches ignore the semantic correlation of words in the text, which may lead to inaccurate ranking of the search results. In this paper, we propose a new Wikipedia-based similar text search approach in which the words in the texts and the query text can be semantically correlated through Wikipedia. We propose a new text representation model and a new text similarity metric. Finally, experiments on a real dataset demonstrate the high precision, recall and efficiency of our approach. 0 0
A cross-cultural comparison on contributors' motivations to online knowledge sharing: Chinese vs. Germans Zhu B.
Gao Q.
Nohdurft E.
Cross-cultural differences
Knowledge sharing
English 2014 Wikipedia is the most popular online knowledge sharing platform in western countries. However, it is not widely accepted in eastern countries. This indicates that culture plays a key role in determining users' acceptance of online knowledge sharing platforms. The purpose of this study is to investigate the cultural differences between Chinese and Germans in motivations for sharing knowledge, and further examine the impacts of these motives on the actual behavior across two cultures. A questionnaire was developed to explore the motivation factors and actual behavior of contributors. 100 valid responses were received from Chinese and 34 responses from the Germans. The results showed that the motivations were significantly different between Chinese and Germans. The Chinese had more consideration for others and cared more about receiving reward and strengthening the relationship, whereas Germans had more concerns about losing competitiveness. The impact of the motives on the actual behavior was also different between Chinese and Germans. 0 0
A scalable gibbs sampler for probabilistic entity linking Houlsby N.
Massimiliano Ciaramita
English 2014 Entity linking involves labeling phrases in text with their referent entities, such as Wikipedia or Freebase entries. This task is challenging due to the large number of possible entities, in the millions, and heavy-tailed mention ambiguity. We formulate the problem in terms of probabilistic inference within a topic model, where each topic is associated with a Wikipedia article. To deal with the large number of topics we propose a novel efficient Gibbs sampling scheme which can also incorporate side information, such as the Wikipedia graph. This conceptually simple probabilistic approach achieves state-of-the-art performance in entity-linking on the Aida-CoNLL dataset. 0 0
A seed based method for dictionary translation Krajewski R.
Rybinski H.
Kozlowski M.
Dictionary translation
Machine translation
Multilingual corpus
Semantic similarity
English 2014 The paper refers to the topic of automatic machine translation. The proposed method enables translating a dictionary by means of mining repositories in the source and target repository, without any directly given relationships connecting two languages. It consists of two stages: (1) translation by lexical similarity, where words are compared graphically, and (2) translation by semantic similarity, where contexts are compared. The Polish and English versions of Wikipedia were used as multilingual corpora. The method and its stages are thoroughly analyzed. The results allow implementing this method in human-in-the-middle systems. 0 0
A student perception related to the implementation of virtual courses Chilian A.
Bancuta O.-R.
Bancuta C.
Access to information
Virtual course
English 2014 This paper aims to characterize the point of view of the students regarding virtual courses in education, but in particular the study is based on the experience gained by the students in the Designing TEL course, organized in the frame of the CoCreat project. It was noticed that a very important role in the development of virtual courses was played by the use of Wiki and Moodle platforms. Even though there are still some problems in implementing virtual courses using those platforms, the Designing TEL course can be considered a successful one. 0 0
An automatic sameAs link discovery from Wikipedia Kagawa K.
Susumu Tamagawa
Takahira Yamaguchi
SameAs link
Spelling variants
English 2014 Spelling variants of words and word sense ambiguity incur high costs in processes such as Data Integration, Information Searching, data pre-processing for Data Mining, and so on. To meet these demands, it is useful to construct relations between a word or phrase and a representative name of the entity. To reduce the costs, this paper discusses how to automatically discover "sameAs" and "meaningOf" links from Japanese Wikipedia. To do so, we gathered relevant features such as IDF, string similarity, number of hypernyms, and so on. We have identified the link-based score on salient features based on SVM results with 960,000 anchor link pairs. These case studies show that our link discovery method performs well, with precision and recall rates above 70%. 0 0
Apply wiki for improving intellectual capital and effectiveness of project management at Cideco company Misra S.
Pham Q.T.
Tran T.N.
Cideco Company
Intellectual capital
Knowledge management
Project management
English 2014 Today, knowledge is considered the only source for creating the competitive advantages of modern organizations. However, managing intellectual capital is challenging, especially for SMEs in developing countries like Vietnam. In order to help SMEs to build KMS and to stimulate their intellectual capital, a suitable technical platform for collaboration is needed. Wiki is a cheap technology for improving both intellectual capital and effectiveness of project management. However, there is a lack of proof about the real benefit of applying wiki in Vietnamese SMEs. Cideco Company, a Vietnamese SME in the construction design & consulting industry, is seeking a solution to manage its intellectual capital for improving the effectiveness of project management. In this research, wiki is applied and tested to check whether it can be a suitable technology for Cideco to stimulate its intellectual capital and to improve the effectiveness of project management activities. Besides, a demo wiki is also implemented for 2 pilot projects to evaluate its real benefit. Analysis results showed that wiki can help to increase both intellectual capital and effectiveness of project management at Cideco. 0 0
Collaborative tools in the primary classroom: Teachers' thoughts on wikis Agesilaou A.
Vassiliou C.
Irakleous S.
Zenios M.
Collaborative learning
Primary education
English 2014 The purpose of this work-in-progress study is to examine the attitudes of primary school teachers in Cyprus on the use of wikis as a means to promote collaborative learning in the classroom. A survey investigation was undertaken using 20 questionnaires and 3 semi-structured interviews. The survey results indicate a positive attitude of teachers in Cyprus to integrate wikis in primary education for the promotion of cooperation. As such, collaborative learning activities among pupils are being encouraged. 0 0
Collective memory in Poland: A reflection in street names Radoslaw Nielek
Wawer A.
Adam Wierzbicki
Collective memory
Street names
English 2014 Our article starts with an observation that street names fall into two general types: generic and historically inspired. We analyse street names distributions (of the second type) as a window to nation-level collective memory in Poland. The process of selecting street names is determined socially, as the selections reflect the symbols considered important to the nation-level society, but has strong historical motivations and determinants. In the article, we seek for these relationships in the available data sources. We use Wikipedia articles to match street names with their textual descriptions and assign them to the time points. We then apply selected text mining and statistical techniques to reach quantitative conclusions. We also present a case study: the geographical distribution of two particular street names in Poland to demonstrate the binding between history and political orientation of regions. 0 0
Continuous temporal Top-K query over versioned documents Lan C.
YanChun Zhang
Chunxiao Xing
Chenliang Li
English 2014 The management of versioned documents has attracted researchers' attention in recent years. Based on the observation that decision-makers are often interested in finding the set of objects that have continuous behavior over time, we study the problem of continuous temporal top-k query. Given a query, continuous temporal top-k search finds the documents that frequently rank in the top-k during a time period and takes the weights of different time intervals into account. Existing works on querying versioned documents have focused on adding the constraint of time, but have not considered the continuous ranking of objects and the weights of time intervals. We propose a new interval window-based method to address this problem. Our method can get the continuous temporal top-k results while using interval windows to support time and weight constraints simultaneously. We use data from Wikipedia to evaluate our method. 0 0
Development of a semantic and syntactic model of natural language by means of non-negative matrix and tensor factorization Anisimov A.
Marchenko O.
Taranukha V.
Vozniuk T.
Information extraction
Knowledge representation
English 2014 A method for developing a structural model of natural language syntax and semantics is proposed. Syntactic and semantic relations between parts of a sentence are presented in the form of a recursive structure called a control space. Numerical characteristics of these data are stored in multidimensional arrays. After factorization, the arrays serve as the basis for the development of procedures for analyses of natural language semantics and syntax. 0 0
Encoding document semantic into binary codes space Yu Z.
Xuan Zhao
Lei Wang
English 2014 We develop a deep neural network model to encode document semantic into compact binary codes with the elegant property that semantically similar documents have similar embedding codes. The deep learning model is constructed with three stacked auto-encoders. The input of the lowest auto-encoder is the word-count vector representation of a document, while the learned hidden features of the deepest auto-encoder are thresholded to be binary codes to represent the document semantic. Retrieving similar documents is very efficient by simply returning the documents whose codes have small Hamming distances to that of the query document. We illustrate the effectiveness of our model on two public real datasets - 20NewsGroup and Wikipedia, and the experiments demonstrate that the compact binary codes sufficiently embed the semantic of documents and bring improvement in retrieval accuracy. 0 0
Entity recognition in information extraction Hanafiah N.
Quix C.
English 2014 Detecting and resolving entities is an important step in information retrieval applications. Humans are able to recognize entities by context, but information extraction systems (IES) need to apply sophisticated algorithms to recognize an entity. The development and implementation of an entity recognition algorithm is described in this paper. The implemented system is integrated with an IES that derives triples from unstructured text. By doing so, the triples are more valuable in query answering because they refer to identified entities. By extracting the information from Wikipedia encyclopedia, a dictionary of entities and their contexts is built. The entity recognition computes a score for context similarity which is based on cosine similarity with a tf-idf weighting scheme and the string similarity. The implemented system shows a good accuracy on Wikipedia articles, is domain independent, and recognizes entities of arbitrary types. 0 0
Experimental comparison of semantic word clouds Barth L.
Kobourov S.G.
Pupyrev S.
English 2014 We study the problem of computing semantics-preserving word clouds in which semantically related words are close to each other. We implement three earlier algorithms for creating word clouds and three new ones. We define several metrics for quantitative evaluation of the resulting layouts. Then the algorithms are compared according to these metrics, using two data sets of documents from Wikipedia and research papers. We show that two of our new algorithms outperform all the others by placing many more pairs of related words so that their bounding boxes are adjacent. Moreover, this improvement is not achieved at the expense of significantly worsened measurements for the other metrics. 0 0
Fostering collaborative learning with wikis: Extending MediaWiki with educational features Popescu E.
Maria C.
Udristoiu A.L.
Collaborative learning
Educational wiki
Learner tracking
MediaWiki extensions
English 2014 Wikis are increasingly popular Web 2.0 tools in educational settings, being used successfully for collaborative learning. However, since they were not originally conceived as educational tools, they lack some of the functionalities useful in the instructional process (such as learner monitoring, evaluation support, student group management etc.). Therefore in this paper we propose a solution to add these educational support features, as an extension to the popular MediaWiki platform. CoLearn, as it is called, is aimed at increasing the collaboration level between students, investigating also the collaborative versus cooperative learner actions. Its functionalities and pedagogical rationale are presented, together with some technical details. A set of practical guidelines for promoting collaborative learning with wikis is also included. 0 0
Graph-based domain-specific semantic relatedness from Wikipedia Sajadi A. Biomedical Domain
Semantic relatedness
Data mining
English 2014 Human made ontologies and lexicons are promising resources for many text mining tasks in domain specific applications, but they do not exist for most domains. We study the suitability of Wikipedia as an alternative resource for ontologies regarding the Semantic Relatedness problem. We focus on the biomedical domain because (1) high quality manually curated ontologies are available and (2) successful graph based methods have been proposed for semantic relatedness in this domain. Because Wikipedia is not hierarchical and links do not convey defined semantic relationships, the same methods used on lexical resources (such as WordNet) cannot be applied here straightforwardly. Our contributions are (1) demonstrating that Wikipedia based methods outperform state of the art ontology based methods on most of the existing ontologies in the biomedical domain; (2) adapting and evaluating the effectiveness of a group of bibliometric methods of various degrees of sophistication on Wikipedia for the first time; and (3) proposing a new graph-based method that outperforms existing methods by considering some specific features of Wikipedia structure. 0 0
Inferring attitude in online social networks based on quadratic correlation Chao Wang
Bulatov A.A.
Machine learning
Quadratic optimization
Signed Networks
English 2014 The structure of an online social network in most cases cannot be described just by links between its members. We study online social networks in which members may have a certain attitude, positive or negative, toward each other, and so the network consists of a mixture of both positive and negative relationships. Our goal is to predict the sign of a given relationship based on the evidence provided in the current snapshot of the network. More precisely, using machine learning techniques we develop a model that, after being trained on a particular network, predicts the sign of an unknown or hidden link. The model uses relationships and influences from peers as evidence for the guess; however, the set of peers used is not predefined but rather learned during the training process. We use quadratic correlation between peer members to train the predictor. The model is tested on popular online datasets such as Epinions, Slashdot, and Wikipedia. In many cases it shows almost perfect prediction accuracy. Moreover, our model can also be efficiently updated as the underlying social network evolves. 0 0
Learning to compute semantic relatedness using knowledge from wikipedia Zheng C.
Zhe Wang
Bie R.
Zhou M.
Semantic relatedness
Supervised Learning
English 2014 Recently, Wikipedia has become a very important resource for computing semantic relatedness (SR) between entities. Several approaches have already been proposed to compute SR based on Wikipedia. Most of the existing approaches use certain kinds of information in Wikipedia (e.g. links, categories, and texts) and compute the SR by empirically designed measures. We have observed that these approaches produce very different results for the same entity pair in some cases. Therefore, how to select appropriate features and measures to best approximate the human judgment on SR becomes a challenging problem. In this paper, we propose a supervised learning approach for computing SR between entities based on Wikipedia. Given two entities, our approach first maps entities to articles in Wikipedia; then different kinds of features of the mapped articles are extracted from Wikipedia, which are then combined with different relatedness measures to produce nine raw SR values of the entity pair. A supervised learning algorithm is proposed to learn the optimal weights of different raw SR values. The final SR is computed as the weighted average of raw SRs. Experiments on benchmark datasets show that our approach outperforms baseline methods. 0 0
MIGSOM: A SOM algorithm for large scale hyperlinked documents inspired by neuronal migration Kotaro Nakayama
Yutaka Matsuo
Link analysis
English 2014 The SOM (Self Organizing Map), one of the most popular unsupervised machine learning algorithms, maps high-dimensional vectors into low-dimensional data (usually a 2-dimensional map). The SOM is widely known as a "scalable" algorithm because of its capability to handle large numbers of records. However, it is effective only when the vectors are small and dense. Although a number of studies on making the SOM scalable have been conducted, technical issues on scalability and performance for sparse high-dimensional data such as hyperlinked documents still remain. In this paper, we introduce MIGSOM, an SOM algorithm inspired by new discovery on neuronal migration. The two major advantages of MIGSOM are its scalability for sparse high-dimensional data and its clustering visualization functionality. In this paper, we describe the algorithm and implementation in detail, and show the practicality of the algorithm in several experiments. We applied MIGSOM to not only experimental data sets but also a large scale real data set: Wikipedia's hyperlink data. 0 0
Mining the personal interests of microbloggers via exploiting wikipedia knowledge Fan M.
Zhou Q.
Zheng T.F.
Social tagging
English 2014 This paper focuses on an emerging research topic: mining microbloggers' personalized interest tags from the microblogs they have posted. It is based on the intuition that microblogs indicate the daily interests and concerns of microbloggers. Previous studies regarded the microblogs posted by one microblogger as a whole document and adopted traditional keyword extraction approaches to select high weighting nouns without considering the characteristics of microblogs. Given the limited textual information of microblogs and the implicit interest expression of microbloggers, we suggest a new research framework for mining microbloggers' interests via exploiting Wikipedia, a huge online word knowledge encyclopedia, to take up those challenges. Based on the semantic graph constructed via Wikipedia, the proposed semantic spreading model (SSM) can discover and leverage the semantically related interest tags which do not occur in one's microblogs. Based on SSM, an interest mining system has been implemented and deployed on the biggest microblogging platform (Sina Weibo) in China. We have also specified a suite of new evaluation metrics to make up for the shortage of evaluation functions in this research topic. Experiments conducted on a real-time dataset demonstrate that our approach outperforms the state-of-the-art methods in identifying microbloggers' interests. 0 0
Monitoring teachers' complex thinking while engaging in philosophical inquiry with web 2.0 Agni Stylianou-Georgiou
Petrou A.
Andri Ioannou
Caring thinking
Complex thinking
Creative thinking
Critical thinking
Philosophical inquiry
Philosophy for children
Technology integration
English 2014 The purpose of this study was to examine how we can exploit new technologies to scaffold and monitor the development of teachers' complex thinking while engaging in philosophical inquiry. We set up an online learning environment using wiki and forum technologies and we organized the activity in four major steps to scaffold complex thinking for the teacher participants. In this article, we present the evolution of complex thinking of one group of teachers by studying their interactions in depth. 0 0
Motivating Wiki-based collaborative learning by increasing awareness of task conflict: A design science approach Wu K.
Vassileva J.
Xiaohua Sun
Fang J.
Collaborative learning
Task conflict
English 2014 Wiki systems have been deployed in many collaborative learning projects. However, lack of motivation is a serious problem in the collaboration process. The wiki system was originally designed to hide authorship information. Such a design may hinder users from being aware of task conflict, resulting in undesired outcomes (e.g. reduced motivation, suppressed knowledge exchange activities). We propose to incorporate two different tools in wiki systems to motivate learners by increasing awareness of task conflict. A field test was executed in two collaborative writing projects. The results from a wide-scale survey and a focus group study confirmed the utility of the new tools and suggested that these tools can help learners develop both extrinsic and intrinsic motivations to contribute. This study has several theoretical and practical implications: it enriched the knowledge of task conflict, proposed a new way to motivate collaborative learning, and provided a low-cost resolution to manage task conflict. 0 0
Myths to burst about hybrid learning Li K.C. Hybrid learning
Learner readiness
Learning effectiveness
Teacher readiness
English 2014 Given the snowballing attention to and growing popularity of hybrid learning, some take for granted that the learning mode means more effective education delivery while some who hold a skeptical view expect researchers to inform them whether hybrid learning leads to better learning effectiveness. Though diversified, both beliefs are like myths about the hybrid mode. By reporting findings concerning the use of wikis in a major project on hybrid courses piloted at a university in Hong Kong, this paper highlights the complexity concerning the effectiveness of a hybrid learning mode and the problems of a reductionistic view of its effectiveness. Means for elearning were blended with conventional distance learning components into four undergraduate courses. Findings show that a broad variety of factors, including subject matters, instructors' pedagogical knowledge of the teaching means, students' readiness for the new learning mode and the implementation methods, play a key role in deciding learning effectiveness, rather than just the delivery mode per se. 0 0
Sentence similarity by combining explicit semantic analysis and overlapping n-grams Vu H.H.
Villaneau J.
Said F.
Marteau P.-F.
English 2014 We propose a similarity measure between sentences which combines a knowledge-based measure, that is a lighter version of ESA (Explicit Semantic Analysis), and a distributional measure, Rouge. We used this hybrid measure with two French domain-orientated corpora collected from the Web and we compared its similarity scores to those of human judges. In both domains, ESA and Rouge perform better when they are mixed than they do individually. Besides, using the whole Wikipedia base in ESA did not prove necessary since the best results were obtained with a low number of well selected concepts. 0 0
Shades: Expediting Kademlia's lookup process Einziger G.
Friedman R.
Kantor Y.
English 2014 Kademlia is considered to be one of the most effective key based routing protocols. It is nowadays implemented in many file sharing peer-to-peer networks such as BitTorrent, KAD, and Gnutella. This paper introduces Shades, a combined routing/caching scheme that significantly shortens the average lookup process in Kademlia and improves its load handling. The paper also includes an extensive performance study demonstrating the benefits of Shades and compares it to other suggested alternatives using both synthetic workloads and traces from YouTube and Wikipedia. 0 0
The impact of semantic document expansion on cluster-based fusion for microblog search Liang S.
Ren Z.
Maarten de Rijke
English 2014 Searching microblog posts, with their limited length and creative language usage, is challenging. We frame the microblog search problem as a data fusion problem. We examine the effectiveness of a recent cluster-based fusion method on the task of retrieving microblog posts. We find that in the optimal setting the contribution of the clustering information is very limited, which we hypothesize to be due to the limited length of microblog posts. To increase the contribution of the clustering information in cluster-based fusion, we integrate semantic document expansion as a preprocessing step. We enrich the content of microblog posts appearing in the lists to be fused by Wikipedia articles, based on which clusters are created. We verify the effectiveness of our combined document expansion plus fusion method by making comparisons with microblog search algorithms and other fusion methods. 0 0
Tracking topics on revision graphs of wikipedia edit history Li B.
Wu J.
Mizuho Iwaihara
Edit history
Topic summarization
English 2014 Wikipedia is known as the largest online encyclopedia, in which articles are constantly contributed and edited by users. Past revisions of articles after edits are also accessible to the public for confirming the edit process. However, the degree of similarity between revisions is very high, making it difficult to generate summaries for these small changes from revision graphs of Wikipedia edit history. In this paper, we propose an approach to give a concise summary of a given scope of revisions by utilizing supergrams, which are consecutive unchanged term sequences. 0 0
TripBuilder: A tool for recommending sightseeing tours Brilhante I.
MacEdo J.A.
Nardini F.M.
Perego R.
Renso C.
English 2014 We propose TripBuilder, a user-friendly and interactive system for planning a time-budgeted sightseeing tour of a city on the basis of the points of interest and the patterns of movements of tourists mined from user-contributed data. The knowledge needed to build the recommendation model is entirely extracted in an unsupervised way from two popular collaborative platforms: Wikipedia and Flickr. TripBuilder interacts with the user by means of a friendly Web interface that allows her to easily specify personal interests and time budget. The sightseeing tour proposed can be then explored and modified. We present the main components composing the system. 0 0
User interests identification on Twitter using a hierarchical knowledge base Kapanipathi P.
Jain P.
Venkataramani C.
Sheth A.
Hierarchical Interest Graph
Social Web
User Profiles
English 2014 Twitter, due to its massive growth as a social networking platform, has been in focus for the analysis of its user generated content for personalization and recommendation tasks. A common challenge across these tasks is identifying user interests from tweets. Semantic enrichment of Twitter posts, to determine user interests, has been an active area of research in the recent past. These approaches typically use available public knowledge-bases (such as Wikipedia) to spot entities and create entity-based user profiles. However, exploitation of such knowledge-bases to create richer user profiles is yet to be explored. In this work, we leverage hierarchical relationships present in knowledge-bases to infer user interests expressed as a Hierarchical Interest Graph. We argue that the hierarchical semantics of concepts can enhance existing systems to personalize or recommend items based on a varied level of conceptual abstractness. We demonstrate the effectiveness of our approach through a user study which shows an average of approximately eight of the top ten weighted hierarchical interests in the graph being relevant to a user's interests. 0 0
What makes a good team of Wikipedia editors? A preliminary statistical analysis Bukowski L.
Jankowski-Lorek M.
Jaroszewicz S.
Sydow M.
Statistical data mining
Team quality
English 2014 The paper concerns studying the quality of teams of Wikipedia authors with a statistical approach. We report the preparation of a dataset containing numerous behavioural and structural attributes and its subsequent analysis and use to predict team quality. We have performed exploratory analysis using partial regression to remove the influence of attributes not related to the team itself. The analysis confirmed that the key factor significantly influencing an article's quality is discussion between team members. The second part of the paper successfully uses machine learning models to predict good articles based on features of the teams that created them. 0 0
Wiki tools in teaching English for Specific (Academic) Purposes - Improving students' participation Felea C.
Stanca L.
Blended Learning
English for Specific (Academic) Purposes
Higher Education
Web 2.0
English 2014 This study is based on an on-going investigation on the impact of Web 2.0 technologies, namely a wiki-based learning environment, part of a blended approach to teaching English for Specific (Academic) Purposes for EFL undergraduate students in a Romanian university. The research aims to determine whether there are statistically significant differences between the degrees of wiki participation recorded in the first semester of two consecutive academic years, starting from the assumption that modifications in the learning environment, namely the change of location for face-to-face meetings from class to computer lab setting and the introduction of more complex individual page templates may lead to increased wiki participation. Due to the project's multiple dimensions, out of which participation and response to the new online environment are particularly important, the results provide information necessary for further decisions regarding specific instructional design needs and wiki components, and changes affecting the teaching/learning process. 0 0
WikiReviz: An edit history visualization for wiki systems Wu J.
Mizuho Iwaihara
Mass Collaboration
English 2014 Wikipedia maintains a linear record of edit history with article content and meta-information for each article, which conceals precious information on how each article has evolved. This demo describes the motivation and features of WikiReviz, a visualization system for analyzing edit history in Wikipedia and other Wiki systems. From the official exported edit history of a single Wikipedia article, WikiReviz reconstructs the derivation relationships among revisions precisely and efficiently by revision graph extraction and indicates meaningful article evolution progress by edit summarization. 0 0
LaTeXML 2012 - A year of LaTeXML Ginev D.
Miller B.R.
English 2013 LaTeXML, a TeX to XML converter, is being used in a wide range of MKM applications. In this paper, we present a progress report for the 2012 calendar year. Noteworthy enhancements include: increased coverage such as Wikipedia syntax; enhanced capabilities such as embeddable JavaScript and CSS resources and RDFa support; a web service for remote processing via web-sockets; along with general accuracy and reliability improvements. The outlook for a 0.8.0 release in mid-2013 is also discussed. 0 0
A Wikipedia based hybrid ranking method for taxonomic relation extraction Zhong X. Hybrid ranking method
Select best position
Taxonomic relation extraction
English 2013 This paper proposes a hybrid ranking method for taxonomic relation extraction (or selecting the best position) in an existing taxonomy. This method is capable of effectively combining two resources, an existing taxonomy and Wikipedia, in order to select the most appropriate position for a term candidate in the existing taxonomy. Previous methods mainly focus on complex inference methods to select the best position among all the possible positions in the taxonomy. In contrast, our algorithm, a simple but effective one, leverages two kinds of information, the expression of and the ranking information of a term candidate, to select the best position for the term candidate (the hypernym of the term candidate in the existing taxonomy). We apply our approach to the agricultural domain, and the experimental results indicate that performance is significantly improved. 0 0
A collaborative multi-source intelligence working environment: A systems approach Eachus P.
Short B.
Stedmon A.W.
Brown J.
Wilson M.
Lemanski L.
Collaborative working
Intelligence analysis
English 2013 This research applies a systems approach to aid the understanding of collaborative working during intelligence analysis using a dedicated (Wiki) environment. The extent to which social interaction and problem solving were facilitated by the use of the wiki was investigated using an intelligence problem derived from the Vast 2010 challenge. This challenge requires "intelligence analysts" to work with a number of different intelligence sources in order to predict a possible terrorist attack. The study compared three types of collaborative working: face-to-face without a wiki, face-to-face with a wiki, and use of a wiki without face-to-face contact. The findings revealed that, in terms of task performance, the group using the wiki without face-to-face contact performed best and the wiki group with face-to-face contact performed worst. Measures of interpersonal and psychological satisfaction were highest in the face-to-face group not using a wiki and lowest in the face-to-face group using a wiki. Overall it was concluded that the use of wikis in collaborative working is best for task completion, whereas face-to-face collaborative working without a wiki is best for interpersonal and psychological satisfaction. 0 0
A game theoretic analysis of collaboration in Wikipedia Anand S.
Ofer Arazy
Mandayam N.B.
Oded Nov
Non-cooperative game
Peer production
Trustworthy collaboration
English 2013 Peer production projects such as Wikipedia or open-source software development allow volunteers to collectively create knowledge-based products. The inclusive nature of such projects poses difficult challenges for ensuring trustworthiness and combating vandalism. Prior studies in the area deal with descriptive aspects of peer production, failing to capture the idea that while contributors collaborate, they also compete for status in the community and for imposing their views on the product. In this paper, we investigate collaborative authoring in Wikipedia, where contributors append and overwrite previous contributions to a page. We assume that a contributor's goal is to maximize ownership of content sections, such that content owned (i.e. originated) by her survives the most recent revision of the page. We model contributors' interactions to increase their content ownership as a non-cooperative game, where a player's utility is associated with content owned and cost is a function of effort expended. Our results capture several real-life aspects of contributors' interactions within peer-production projects. Namely, we show that at the Nash equilibrium there is an inverse relationship between the effort required to make a contribution and the survival of a contributor's content. In other words, the majority of the content that survives is necessarily contributed by experts who expend relatively less effort than non-experts. An empirical analysis of Wikipedia articles provides support for our model's predictions. Implications for research and practice are discussed in the context of trustworthy collaboration as well as vandalism. 0 0
A multilingual semantic wiki based on Attempto Controlled English and Grammatical Framework Kaljurand K.
Kuhn T.
Attempto Controlled English
Controlled natural language
Grammatical Framework
Semantic wiki
English 2013 We describe a semantic wiki system with an underlying controlled natural language grammar implemented in Grammatical Framework (GF). The grammar restricts the wiki content to a well-defined subset of Attempto Controlled English (ACE), and facilitates a precise bidirectional automatic translation between ACE and language fragments of a number of other natural languages, making the wiki content accessible multilingually. Additionally, our approach allows for automatic translation into the Web Ontology Language (OWL), which enables automatic reasoning over the wiki content. The developed wiki environment thus allows users to build, query and view OWL knowledge bases via a user-friendly multilingual natural language interface. As a further feature, the underlying multilingual grammar is integrated into the wiki and can be collaboratively edited to extend the vocabulary of the wiki or even customize its sentence structures. This work demonstrates the combination of the existing technologies of Attempto Controlled English and Grammatical Framework, and is implemented as an extension of the existing semantic wiki engine AceWiki. 0 0
A quick tour of BabelNet 1.1 Roberto Navigli BabelNet
Knowledge acquisition
Multilingual ontologies
Semantic networks
English 2013 In this paper we present BabelNet 1.1, a brand-new release of the largest "encyclopedic dictionary", obtained from the automatic integration of the most popular computational lexicon of English, i.e. WordNet, and the largest multilingual Web encyclopedia, i.e. Wikipedia. BabelNet 1.1 covers 6 languages and comes with a renewed Web interface, graph explorer and programmatic API. BabelNet is available online at http://www.babelnet.org. 0 0
A support framework for argumentative discussions management in the web Cabrio E.
Villata S.
Fabien Gandon
English 2013 On the Web, wiki-like platforms allow users to provide arguments in favor of or against issues proposed by other users. The increasing content of these platforms, as well as the high number of revisions of the content through pro and con arguments, make it difficult for community managers to understand and manage these discussions. In this paper, we propose an automatic framework to support the management of argumentative discussions in wiki-like platforms. Our framework is composed of (i) a natural language module, which automatically detects the arguments in natural language and returns the relations among them, and (ii) an argumentation module, which provides the overall view of the argumentative discussion in the form of a directed graph highlighting the accepted arguments. Experiments on the history of Wikipedia show the feasibility of our approach. 0 0
A virtual player for "Who Wants to Be a Millionaire?" based on Question Answering Molino P.
Pierpaolo Basile
Santoro C.
Pasquale Lops
De Gemmis M.
Giovanni Semeraro
English 2013 This work presents a virtual player for the quiz game "Who Wants to Be a Millionaire?". The virtual player demands linguistic and common sense knowledge and adopts state-of-the-art Natural Language Processing and Question Answering technologies to answer the questions. Wikipedia articles and DBpedia triples are used as knowledge sources and the answers are ranked according to several lexical, syntactic and semantic criteria. Preliminary experiments carried out on the Italian version of the board game prove that the virtual player is able to challenge human players. 0 0
Analysis of cluster structure in large-scale English Wikipedia category networks Klaysri T.
Fenner T.
Lachish O.
Mark Levene
Papapetrou P.
Connected component
Graph structure analysis
Large-scale social network analysis
Wikipedia category network
English 2013 In this paper we propose a framework for analysing the structure of a large-scale social media network, a topic of significant recent interest. Our study is focused on the Wikipedia category network, where nodes correspond to Wikipedia categories and edges connect two nodes if the nodes share at least one common page within the Wikipedia network. Moreover, each edge is given a weight that corresponds to the number of pages shared between the two categories that it connects. We study the structure of category clusters within the three complete English Wikipedia category networks from 2010 to 2012. We observe that category clusters appear in the form of well-connected components that are naturally clustered together. For each dataset we obtain a graph, which we call the t-filtered category graph, by retaining just a single edge linking each pair of categories for which the weight of the edge exceeds some specified threshold t. Our framework exploits this graph structure and identifies connected components within the t-filtered category graph. We studied the large-scale structural properties of the three Wikipedia category networks using the proposed approach. We found that the number of categories, the number of clusters of size two, and the size of the largest cluster within the graph all appear to follow power laws in the threshold t. Furthermore, for each network we found the value of the threshold t for which increasing the threshold to t + 1 caused the "giant" largest cluster to diffuse into two or more smaller clusters of significant size and studied the semantics behind this diffusion. 0 0
Automatic extraction of Polish language errors from text edition history Grundkiewicz R. Error corpora
Language errors detection
Data mining
English 2013 There are no large error corpora for a number of languages, despite the fact that they have multiple applications in natural language processing. The main reason underlying this situation is the high cost of manual corpora creation. In this paper we present methods for the automatic extraction of various kinds of errors, such as spelling, typographical, grammatical, syntactic, semantic, and stylistic ones, from text edition histories. By applying these methods to Wikipedia's article revision history, we created a large and publicly available corpus of naturally-occurring language errors for Polish, called PlEWi. Finally, we analyse and evaluate the detected error categories in our corpus. 0 0
Boot-strapping language identifiers for short colloquial postings Goldszmidt M.
Najork M.
Paparizos S.
Language Identification
English 2013 There is tremendous interest in mining the abundant user generated content on the web. Many analysis techniques are language dependent and rely on accurate language identification as a building block. Even though there is already research on language identification, it has focused on very 'clean' editorially managed corpora, on a limited number of languages, and on relatively large-sized documents. These are not the characteristics of the content to be found in, say, Twitter or Facebook postings, which are short and riddled with vernacular. In this paper, we propose an automated, unsupervised, scalable solution based on publicly available data. To this end we thoroughly evaluate the use of Wikipedia to build language identifiers for a large number of languages (52) and a large corpus, and conduct a large scale study of the best-known algorithms for automated language identification, quantifying how accuracy varies with document size, language (model) profile size and number of languages tested. Then, we show the value of using Wikipedia to train a language identifier directly applicable to Twitter. Finally, we augment the language models and customize them to Twitter by combining our Wikipedia models with location information from tweets. This method provides a massive amount of automatically labeled data that acts as a bootstrapping mechanism, which we empirically show boosts the accuracy of the models. With this work we provide a guide and a publicly available tool [1] to the mining community for language identification on web and social data. 0 0
COLLEAP - COntextual Language LEArning Pipeline Wloka B.
Werner Winiwarter
Language learning
Natural Language Processing
Web crawling
English 2013 In this paper we present a concept as well as a prototype of a tool pipeline to utilize the abundant information available on the World Wide Web for contextual, user driven creation and display of language learning material. The approach is to capture Wikipedia articles of the user's choice by crawling, to analyze the linguistic aspects of the text via natural language processing and to compile the gathered information into a visually appealing presentation of enriched language information. The tool is designed to address the Japanese language, with a focus on kanji, the pictographic characters used in Japanese writing. 0 0
Communities, artifacts, interaction and contribution on the web Eleni Stroulia Computer-supported collaboration
Social network
Virtual worlds
Web-based collaborative platforms
English 2013 Today, most of us are members of multiple online communities, in the context of which we engage in a multitude of personal and professional activities. These communities are supported by different web-based platforms and enable different types of collaborative interactions. Through our experience with the development of and experimentation with three different such platforms in support of collaborative communities, we recognized a few core research problems relevant across all such tools, and we developed SociQL, a language, and a corresponding software framework, to study them. 0 0
Comparing expert and non-expert conceptualisations of the land: An analysis of crowdsourced land cover data Comber A.
Brunsdon C.
Linda See
Steffen Fritz
Ian McCallum
Geographically Weighted Kernel
Land Cover
Volunteered Geographical Information (VGI)
English 2013 This research compares expert and non-expert conceptualisations of land cover data collected through a Google Earth web-based interface. In so doing it seeks to determine the impacts of varying landscape conceptualisations held by different groups of VGI contributors on decisions that may be made using crowdsourced data, in this case to select the best global land cover dataset in each location. Whilst much other work has considered the quality of VGI, as yet little research has considered the impact of varying semantics and conceptualisations on the use of VGI in formal scientific analyses. This study found that conceptualisation of cropland varies between experts and non-experts. A number of areas for further research are outlined. 0 0
Complementary information for Wikipedia by comparing multilingual articles Fujiwara Y.
Yu Suzuki
Konishi Y.
Akiyo Nadamoto
English 2013 Many Wikipedia articles lack information because users can create and edit content freely. We specifically examined the multilinguality of Wikipedia and proposed a method to complement articles that lack information by comparing articles in different languages that have similar contents. However, the results include much non-complementary information that is unrelated to the article the user is browsing. Herein, we propose an improvement of the comparison area based on the classified complementary target. 0 0
Constructing a focused taxonomy from a document collection Olena Medelyan
Manion S.
Broekstra J.
Divoli A.
Huang A.-L.
Witten I.H.
English 2013 We describe a new method for constructing custom taxonomies from document collections. It involves identifying relevant concepts and entities in text; linking them to knowledge sources like Wikipedia, DBpedia, Freebase, and any supplied taxonomies from related domains; disambiguating conflicting concept mappings; and selecting semantic relations that best group them hierarchically. An RDF model supports interoperability of these steps, and also provides a flexible way of including existing NLP tools and further knowledge sources. From 2000 news articles we construct a custom taxonomy with 10,000 concepts and 12,700 relations, similar in structure to manually created counterparts. Evaluation by 15 human judges shows the precision to be 89% and 90% for concepts and relations respectively; recall was 75% with respect to a manually generated taxonomy for the same domain. 0 0
Cross language prediction of vandalism on Wikipedia using article views and revisions Tran K.-N.
Christen P.
English 2013 Vandalism is a major issue on Wikipedia, accounting for about 2% (350,000+) of edits in the first 5 months of 2012. The majority of vandalism is caused by humans, who can leave traces of their malicious behaviour through access and edit logs. We propose detecting vandalism using a range of classifiers in a monolingual setting, and evaluate their performance when using them across languages on two data sets: the relatively unexplored hourly count of views of each Wikipedia article, and the commonly used edit history of articles. Within the same language (English and German), these classifiers achieve up to 87% precision, 87% recall, and an F1-score of 87%. Applying these classifiers across languages achieves similarly high results of up to 83% precision, recall, and F1-score. These results show that characteristic vandal traits can be learned from view and edit patterns, and that models built in one language can be applied to other languages. 0 0
Detection of article qualities in the Chinese Wikipedia based on C4.5 decision tree Xiao K.
Li B.
He P.
Yang X.-H.
Application of supervised learning
Article quality
Data mining
Decision tree
English 2013 The number of articles in Wikipedia is growing rapidly. It is important for Wikipedia to provide users with high quality and reliable articles. However, the quality assessment metrics provided by Wikipedia are inefficient, and other mainstream quality detection methods focus only on the quality of English Wikipedia articles and usually analyze the text contents of articles, which is also a time-consuming process. In this paper, we propose a method for detecting the article qualities of the Chinese Wikipedia based on the C4.5 decision tree. The problem of quality detection is transformed into a classification problem of high-quality and low-quality articles. By using the fields from the tables in the Chinese Wikipedia database, we built decision trees to distinguish high-quality articles from low-quality ones. 0 0
Disambiguation to Wikipedia: A language and domain independent approach Nguyen T.-V.T. English 2013 Disambiguation to Wikipedia (D2W) is the task of linking mentions of concepts in text to their corresponding Wikipedia articles. Traditional approaches to D2W have focused either on only one language (e.g. English) or on formal texts (e.g. news articles). In this paper, we present a multilingual framework with a set of new features that can be obtained purely from the online encyclopedia, without the need for any natural language specific tool. We analyze these features with different languages and different domains. The approach proves to be fully language-independent and has been applied successfully to English, Italian, and Polish, with a consistent improvement. We show that only a sufficient number of Wikipedia articles is needed for training. When trained on real-world data sets for English, our new features yield substantial improvement compared to current local and global disambiguation algorithms. Finally, the adaptation to the Bridgeman query logs in digital libraries shows the robustness of our approach even in the absence of disambiguation context. Also, as no natural language specific tool is needed, the method can be applied to other languages in a similar manner with little adaptation. 0 0
Discovering missing semantic relations between entities in Wikipedia Xu M.
Zhe Wang
Bie R.
Jing-Woei Li
Zheng C.
Ke W.
Zhou M.
Linked data
English 2013 Wikipedia's infoboxes contain rich structured information about various entities, which has been explored by the DBpedia project to generate large scale Linked Data sets. Among all the infobox attributes, those having hyperlinks in their values identify semantic relations between entities, which are important for creating RDF links between DBpedia's instances. However, quite a few hyperlinks have not been annotated by editors in infoboxes, which causes many relations between entities to be missing in Wikipedia. In this paper, we propose an approach for automatically discovering the missing entity links in Wikipedia's infoboxes, so that the missing semantic relations between entities can be established. Our approach first identifies entity mentions in the given infoboxes, and then computes several features to estimate the possibilities that a given attribute value might link to a candidate entity. A learning model is used to obtain the weights of different features, and predict the destination entity for each attribute value. We evaluated our approach on the English Wikipedia data; the experimental results show that our approach can effectively find the missing relations between entities, and that it significantly outperforms the baseline methods in terms of both precision and recall. 0 0
Distant supervision learning of DBPedia relations Zajac M.
Przepiorkowski A.
Distant supervision learning
Information extraction
Ontology construction
Semantic web
English 2013 This paper presents DBPediaExtender, an information extraction system that aims at extending an existing ontology of geographical entities by extracting information from text. The system uses distant supervision learning - the training data is constructed on the basis of matches between values from infoboxes (taken from the Polish DBPedia) and Wikipedia articles. For every relevant relation, a sentence classifier and a value extractor are trained; the sentence classifier selects sentences expressing a given relation and the value extractor extracts values from selected sentences. The results of manual evaluation for several selected relations are reported. 0 0
Document analytics through entity resolution Santos J.
Martins B.
Batista D.S.
Entity Resolution
Information extraction
Text Mining
English 2013 We present a prototype system for resolving named entities, mentioned in textual documents, into the corresponding Wikipedia entities. This prototype can aid in document analysis, by using the disambiguated references to provide useful information in context. 0 0
Document listing on versioned documents Claude F.
Munro J.I.
English 2013 Representing versioned documents, such as Wikipedia history, web archives, genome databases, and backups, is challenging when we want to support searching for an exact substring and retrieve the documents that contain the substring. This problem is called document listing. We present an index for the document listing problem on versioned documents. Our index is the first one based on grammar-compression. This allows for good results on repetitive collections, whereas standard techniques cannot achieve competitive space for solving the same problem. Our index can also be adapted to work in a more standard way, allowing users to search for word-based phrase queries and conjunctive queries at the same time. Finally, we discuss extensions that may be possible in the future, for example, supporting ranking capabilities within the index itself. 0 0
English nominal compound detection with Wikipedia-based methods Nagy T. I.
Veronika Vincze
Multiword expressions
MWE detection
Nominal compounds
Silver standard corpus
English 2013 Nominal compounds (NCs) are lexical units that consist of two or more elements that exist on their own, function as a noun and have a special added meaning. Here, we present the results of our experiments on how the growth of Wikipedia added to the performance of our dictionary labeling methods for detecting NCs. We also investigated how the size of an automatically generated silver standard corpus can affect the performance of our machine learning-based method. The results we obtained demonstrate that the bigger the dataset, the better the performance will be. 0 0
Entityclassifier.eu: Real-time classification of entities in text with Wikipedia Dojchinovski M.
Kliegr T.
English 2013 Targeted Hypernym Discovery (THD) performs unsupervised classification of entities appearing in text. A hypernym mined from the free-text of the Wikipedia article describing the entity is used as a class. The type as well as the entity are cross-linked with their representation in DBpedia, and enriched with additional types from the DBpedia and YAGO knowledge bases, providing semantic web interoperability. The system, available as a web application and web service at entityclassifier.eu, currently supports English, German and Dutch. 0 0
Escaping the trap of too precise topic queries Libbrecht P. Learning resources
Mathematical documents search
Mathematics classifications
Mathematics subjects
Search user interface
Topics search
Web mathematics library
English 2013 At the very center of digital mathematics libraries lie controlled vocabularies which qualify the topic of the documents. These topics are used when submitting a document to a digital mathematics library and to perform searches in a library. The latter are refined by the use of these topics as they allow a precise classification of the mathematics area this document addresses. However, there is a major risk that users employ too precise topics to specify their queries: they may be employing a topic that is only "close-by" but fails to match the right resource. We call this the topic trap. Indeed, since 2009, this issue has appeared frequently on the i2geo.net platform. Other mathematics portals experience the same phenomenon. An approach to solve this issue is to introduce tolerance in the way queries are understood by the user. One such approach is to include fuzzy matches, but this introduces noise, which may prevent the user from understanding the function of the search engine. In this paper, we propose a way to escape the topic trap by employing the navigation between related topics and the count of search results for each topic. This supports the user in that a search for close-by topics is a click away from a previous search. This approach was realized with the i2geo search engine and is described in detail; the relation of being related is computed by employing textual analysis of the definitions of the concepts fetched from the Wikipedia encyclopedia. 0 0
Evaluation of ILP-based approaches for partitioning into colorful components Bruckner S.
Huffner F.
Komusiewicz C.
Niedermeier R.
English 2013 The NP-hard Colorful Components problem is a graph partitioning problem on vertex-colored graphs. We identify a new application of Colorful Components in the correction of Wikipedia interlanguage links, and describe and compare three exact and two heuristic approaches. In particular, we devise two ILP formulations, one based on Hitting Set and one based on Clique Partition. Furthermore, we use the recently proposed implicit hitting set framework [Karp, JCSS 2011; Chandrasekaran et al., SODA 2011] to solve Colorful Components. Finally, we study a move-based and a merge-based heuristic for Colorful Components. We can optimally solve Colorful Components for Wikipedia link correction data; while the Clique Partition-based ILP outperforms the other two exact approaches, the implicit hitting set is a simple and competitive alternative. The merge-based heuristic is very accurate and outperforms the move-based one. The above results for Wikipedia data are confirmed by experiments with synthetic instances. 0 0
Evaluation of WikiTalk - User studies of human-robot interaction Anastasiou D.
Kristiina Jokinen
Graham Wilcock
Multimodal human-robot interaction
English 2013 The paper concerns the evaluation of Nao WikiTalk, an application that enables a Nao robot to serve as a spoken open-domain knowledge access system. With Nao WikiTalk the robot can talk about any topic the user is interested in, using Wikipedia as its knowledge source. The robot suggests some topics to start with, and the user shifts to related topics by speaking their names after the robot mentions them. The user can also switch to a totally new topic by spelling the first few letters. As well as speaking, the robot uses gestures, nods and other multimodal signals to enable clear and rich interaction. The paper describes the setup of the user studies and reports on the evaluation of the application, based on various factors reported by the 12 users who participated. The study compared the users' expectations of the robot interaction with their actual experience of the interaction. We found that the users were impressed by the lively appearance and natural gesturing of the robot, although in many respects they had higher expectations regarding the robot's presentation capabilities. However, the results are positive enough to encourage research along these lines. 0 0
Extracting event-related information from article updates in Wikipedia Georgescu M.
Kanhabua N.
Krause D.
Wolfgang Nejdl
Siersdorfer S.
English 2013 Wikipedia is widely considered the largest and most up-to-date online encyclopedia, with its content being continuously maintained by a supporting community. In many cases, real-life events like new scientific findings, resignations, deaths, or catastrophes serve as triggers for collaborative editing of articles about affected entities such as persons or countries. In this paper, we conduct an in-depth analysis of event-related updates in Wikipedia by examining different indicators for events including language, meta annotations, and update bursts. We then study how these indicators can be employed for automatically detecting event-related updates. Our experiments on event extraction, clustering, and summarization show promising results towards generating entity-specific news tickers and timelines. 0 0
ISICIL: Semantics and social networks for business intelligence Michel Buffa
Delaforge N.
Ereteo G.
Fabien Gandon
Giboin A.
Limpens F.
Business intelligence
Semantic wiki
Social network
Social network analysis
Social semantic web
English 2013 The ISICIL initiative (Information Semantic Integration through Communities of Intelligence onLine) mixes viral new web applications with formal semantic web representations and processes to integrate them into corporate practices for technological watch, business intelligence and scientific monitoring. The resulting open source platform proposes three functionalities: (1) a semantic social bookmarking platform monitored by semantic social network analysis tools, (2) a system for semantically enriching folksonomies and linking them to corporate terminologies and (3) semantically augmented user interfaces, activity monitoring and reporting tools for business intelligence. 0 0
Improving semi-supervised text classification by using Wikipedia knowledge Zhang Z.
Hong Lin
Li P.
Haofen Wang
Lu D.
Clustering Based Classification
Semi-supervised Text Classification
English 2013 Semi-supervised text classification uses both labeled and unlabeled data to construct classifiers. The key issue is how to utilize the unlabeled data. Clustering based classification method outperforms other semi-supervised text classification algorithms. However, its achievements are still limited because the vector space model representation largely ignores the semantic relationships between words. In this paper, we propose a new approach to address this problem by using Wikipedia knowledge. We enrich document representation with Wikipedia semantic features (concepts and categories), propose a new similarity measure based on the semantic relevance between Wikipedia features, and apply this similarity measure to clustering based classification. Experiment results on several corpora show that our proposed method can effectively improve semi-supervised text classification performance. 0 0
Issues for linking geographical open data of GeoNames and Wikipedia Yoshioka M.
Kando N.
English 2013 It is now possible to use various geographical open data sources such as GeoNames and Wikipedia to construct geographic information systems. In addition, these open data sources are integrated by the concept of Linked Open Data. There have been several attempts to identify links between existing data, but few studies have focused on the quality of such links. In this paper, we introduce an automatic link discovery method for identifying the correspondences between GeoNames entries and Wikipedia pages, based on Wikipedia category information. This method finds not only appropriate links but also inconsistencies between the two databases. Based on these integration results, we discuss the types of inconsistencies for making consistent Linked Open Data. 0 0
Leveraging encyclopedic knowledge for transparent and serendipitous user profiles Narducci F.
Musto C.
Giovanni Semeraro
Pasquale Lops
De Gemmis M.
English 2013 The main contribution of this work is the comparison of different techniques for representing user preferences extracted by analyzing data gathered from social networks, with the aim of constructing more transparent (human-readable) and serendipitous user profiles. We compared two different user model representations: one based on keywords and one exploiting encyclopedic knowledge extracted from Wikipedia. A preliminary evaluation involving 51 Facebook and Twitter users has shown that the use of an encyclopedic-based representation better reflects user preferences, and helps to introduce new interesting topics. 0 0
Lo mejor de dos idiomas - Cross-lingual linkage of geotagged Wikipedia articles Ahlers D. Cross-lingual Information Retrieval
Data fusion
Entity Resolution
Geospatial Web Search
Record Linkage
English 2013 Different language versions of Wikipedia contain articles referencing the same place. However, the existence of an article in one language does not necessarily mean that it is also available in, and linked from, another language. This paper examines geotagged articles describing places in Honduras in both the Spanish and the English language versions. It demonstrates that a method based on simple features can reliably identify article pairs describing the same semantic place concept and evaluates it against the existing interlinks as well as a manual assessment. 0 0
MDL-based models for transliteration generation Nouri J.
Pivovarova L.
Yangarber R.
English 2013 This paper presents models for automatic transliteration of proper names between languages that use different alphabets. The models are an extension of our work on automatic discovery of patterns of etymological sound change, based on the Minimum Description Length Principle. The models for pairwise alignment are extended with algorithms for prediction that produce transliterated names. We present results on 13 parallel corpora for 7 languages, including English, Russian, and Farsi, extracted from Wikipedia headlines. The transliteration corpora are released for public use. The models achieve up to 88% on word-level accuracy and up to 99% on symbol-level F-score. We discuss the results from several perspectives, and analyze how corpus size, the language pair, the type of names (persons, locations), and noise in the data affect the performance. 0 0
Making collective wisdom wiser Milo T. English 2013 Many popular sites, such as Wikipedia and Tripadvisor, rely on public participation to gather information - a process known as crowd data sourcing. While this kind of collective intelligence is extremely valuable, it is also fallible, and policing such sites for inaccuracies or missing material is a costly undertaking. In this talk we will overview the MoDaS project that investigates how database technology can be put to work to effectively gather information from the public, efficiently moderate the process, and identify questionable input with minimal human interaction [1-4, 7]. We will consider the logical, algorithmic, and methodological foundations for the management of large scale crowd-sourced data as well as the development of applications over such information. 0 0
Making sense of open data statistics with information from Wikipedia Hienert D.
Wegener D.
Schomisch S.
English 2013 Today, more and more open data statistics are published by governments, statistical offices and organizations like the United Nations, The World Bank or Eurostat. This data is freely available and can be consumed by end users in interactive visualizations. However, additional information is needed to enable laymen to interpret these statistics in order to make sense of the raw data. In this paper, we present an approach to combine open data statistics with historical events. In a user interface we have integrated interactive visualizations of open data statistics with a timeline of thematically appropriate historical events from Wikipedia. This can help users to explore statistical data in several views and to get related events for certain trends in the timeline. Events include links to Wikipedia articles, where details can be found and the search process can be continued. We have conducted a user study to evaluate if users can use the interface intuitively, if relations between trends in statistics and historical events can be found and if users like this approach for their exploration process. 0 0
Method and tool support for classifying software languages with Wikipedia Lammel R.
Mosen D.
Varanovich A.
English 2013 Wikipedia provides useful input for efforts on mining taxonomies or ontologies in specific domains. In particular, Wikipedia's categories serve classification. In this paper, we describe a method and a corresponding tool, WikiTax, for exploring Wikipedia's category graph with the objective of supporting the development of a classification of software languages. The category graph is extracted level by level. The extracted graph is visualized in a tree-like manner. Category attributes (i.e., metrics) such as depth are visualized. Irrelevant edges and nodes may be excluded. These exclusions are documented while using a manageable and well-defined set of 'exclusion types' as comments. 0 0
PATHSenrich: A web service prototype for automatic cultural heritage item enrichment Eneko Agirre
Barrena A.
Fernandez K.
Miranda E.
Otegi A.
Aitor Soroa
English 2013 Large amounts of cultural heritage material are nowadays available through online digital library portals. Most of these cultural items have short descriptions and lack rich contextual information. The PATHS project has developed experimental enrichment services. As a proof of concept, this paper presents a web service prototype which allows independent content providers to enrich cultural heritage items with a subset of the full functionality: links to related items in the collection and links to related Wikipedia articles. In the future we plan to provide more advanced functionality, as available offline for PATHS. 0 0
Parsit at Evalita 2011 dependency parsing task Grella M.
Nicola M.
Constraints grammar
Dependency parsing
English 2013 This article describes the Constraint-based Dependency Parser architecture used at the Evalita 2011 Dependency Parsing Task, giving a detailed analysis of the results obtained at the official evaluation. The Italian grammar has been expressed for the first time as a set of constraints, and an ad-hoc constraint solver has then been applied to restrict the possible analyses. Multiple solutions of a given sentence have been reduced to one by means of an evidence scoring system that makes use of an indexed version of Italian Wikipedia created for the purpose. The attachment score obtained is 96.16%, giving the best result so far for a dependency parser for the Italian language. 0 0
Probabilistic explicit topic modeling using Wikipedia Hansen J.A.
Ringger E.K.
Seppi K.D.
English 2013 Despite popular use of Latent Dirichlet Allocation (LDA) for automatic discovery of latent topics in document corpora, such topics lack connections with relevant knowledge sources such as Wikipedia, and they can be difficult to interpret due to the lack of meaningful topic labels. Furthermore, the topic analysis suffers from a lack of identifiability between topics across independently analyzed corpora but also across distinct runs of the algorithm on the same corpus. This paper introduces two methods for probabilistic explicit topic modeling that address these issues: Latent Dirichlet Allocation with Static Topic-Word Distributions (LDA-STWD), and Explicit Dirichlet Allocation (EDA). Both of these methods estimate topic-word distributions a priori from Wikipedia articles, with each article corresponding to one topic and the article title serving as a topic label. LDA-STWD and EDA overcome the nonidentifiability, isolation, and uninterpretability of LDA output. We assess their effectiveness by means of crowd-sourced user studies on two tasks: topic label generation and document label generation. We find that LDA-STWD improves substantially upon the performance of the state-of-the-art on the document labeling task, and that both methods otherwise perform on par with a state-of-the-art post hoc method. 0 0
Querying multilingual DBpedia with QAKiS Cabrio E.
Cojan J.
Fabien Gandon
Hallili A.
English 2013 We present an extension of QAKiS, a system for open domain Question Answering over linked data, that allows querying DBpedia multilingual chapters. Such chapters can contain different information with respect to the English version, e.g. they provide more specificity on certain topics, or fill information gaps. QAKiS exploits the alignment between properties carried out by DBpedia contributors as a mapping from Wikipedia terms to a common ontology, to exploit information coming from DBpedia multilingual chapters, thereby broadening its coverage. For the demo, English, French and German DBpedia chapters are the RDF data sets to be queried using a natural language interface. 0 0
Recompilation of broadcast videos based on real-world scenarios Ichiro Ide English 2013 In order to effectively make use of videos stored in a broadcast video archive, we have been working on their recompilation. To realize this, we take an approach that considers the videos in the archive as video materials and recompiles them by considering various kinds of social media information as "scenarios". In this paper, we introduce our work in the news, sports, and cooking domains, which makes use of Wikipedia articles, demoscopic polls, Twitter tweets, and cooking recipes in order to recompile video clips from corresponding TV shows. 0 0
Related entity finding using semantic clustering based on wikipedia categories Stratogiannis G.
Georgios Siolas
Andreas Stafylopatis
Related Entity Finding
Semantic clustering
Wikipedia category vector representation
English 2013 We present a system that performs Related Entity Finding, that is, Question Answering that exploits Semantic Information from the WWW and returns URIs as answers. Our system uses a search engine to gather all candidate answer entities and then a linear combination of Information Retrieval measures to choose the most relevant. For each one we look up its Wikipedia page and construct a novel vector representation based on the tokenization of the Wikipedia category names. This novel representation gives our system the ability to compute a measure of semantic relatedness between entities, even if the entities do not share any common category. We use this property to perform a semantic clustering of the candidate entities and show that the biggest cluster contains entities that are closely related semantically and can be considered as answers to the query. Performance measured on 20 topics from the 2009 TREC Related Entity Finding task shows competitive results. 0 0
Representation and verification of attribute knowledge Zhang C.
Niu Z.
Shi C.
Tan M.
Fu H.
Xu S.
Attribute values
Knowledge verification
Ontology verification
Taxonomy of attribute relations
English 2013 With the increasing growth and popularization of the Internet, knowledge extraction from the web is an important issue in the fields of web mining, ontology engineering and intelligent information processing. The availability of real big corpora and the development of Internet and machine learning technologies make it feasible to acquire massive knowledge from the web. In addition, many web-based encyclopedias such as Wikipedia and Baidu Baike include much structured knowledge. However, knowledge quality problems, including incorrectness, inconsistency, and incompleteness, have become a serious obstacle to the wide practical application of such extracted and structured knowledge. In this paper, we build a taxonomy of relations between attributes of concepts, and propose an approach, driven by this taxonomy of attribute relations, to evaluating the knowledge about attribute values of attributes of entities. We also address an application of our approach to building and verifying attribute knowledge of entities in different domains. 0 0
Selecting features with SVM Rzeniewicz J.
Szymanski J.
Documents categorization
Feature selection
English 2013 A common problem with feature selection is to establish the minimum number of features that should be retained so that important information is not lost. We describe a method for choosing this number that makes use of Support Vector Machines. The method is based on controlling the angle by which the decision hyperplane is tilted due to feature selection. Experiments were performed on three text datasets generated from a Wikipedia dump. The amount of retained information was estimated by classification accuracy. Even though the method is parametric, we show that, as opposed to other methods, once its parameter is chosen it can be applied to a number of similar problems (e.g. one value can be used for various datasets originating from Wikipedia). For a constant value of the parameter, dimensionality was reduced by 78% to 90%, depending on the data set. The relative accuracy drop due to feature removal was less than 0.5% in those experiments. 0 0
Semantic message passing for generating linked data from tables Mulwad V.
Tim Finin
Joshi A.
Graphical Models
Linked data
Semantic web
English 2013 We describe work on automatically inferring the intended meaning of tables and representing it as RDF linked data, making it available for improving search, interoperability and integration. We present implementation details of a joint inference module that uses knowledge from the linked open data (LOD) cloud to jointly infer the semantics of column headers, table cell values (e.g., strings and numbers) and relations between columns. We also implement a novel Semantic Message Passing algorithm which uses LOD knowledge to improve existing message passing schemes. We evaluate our implemented techniques on tables from the Web and Wikipedia. 0 0
Social relation extraction based on Chinese Wikipedia articles Liu M.
Xiao Y.
Lei C.
Xiaofeng Zhou
Chinese Wikipedia Article
Social Relation Extraction
Social Relation Network
English 2013 Our work in this paper focuses on extracting information about social relations from Chinese Wikipedia articles and constructing a social relation network. After obtaining the Chinese Wikipedia articles according to the provided person name, locating the relationship description sentences in the Chinese Wikipedia articles and extracting the social relation information based on the sentence semantic parser, we can construct the social network centered on the provided person name, using the social relation information. The relation set can also be iteratively expanded based on the person names associated with the provided person name in the related Chinese Wikipedia articles. 0 0
Temporal, cultural and thematic aspects of web credibility Radoslaw Nielek
Wawer A.
Jankowski-Lorek M.
Adam Wierzbicki
English 2013 Is trust in web pages related to nation-level factors? Do trust levels change over time, and how? What categories (topics) of pages tend to be evaluated as not trustworthy, and what categories of pages tend to be trustworthy? What could be the reasons for such evaluations? The goal of this paper is to answer these questions using large scale data of trustworthiness of web pages, two sets of websites, Wikipedia and an international survey. 0 0
The Tanl lemmatizer enriched with a sequence of cascading filters Giuseppe Attardi
Dei Rossi S.
Simi M.
Deep Search
Part-of-Speech tagging
English 2013 We have extended an existing lemmatizer, which relies on a lexicon of about 1.2 million forms, where lemmas are indexed by rich PoS tags, with a sequence of cascading filters, each one in charge of dealing with specific issues related to out-of-dictionary words. The last two filters are devoted to resolving semantic ambiguities between words of the same syntactic category, by querying external resources: an enriched index built on the Italian Wikipedia and the Google index. 0 0
The category structure in Wikipedia: To analyze and know how it grows Wang Q.
Xiaolong Wang
Zheng Chen
Wang R.
Category structure
Complex network
English 2013 Wikipedia is a well-known encyclopedia that has been applied in many fields, such as natural language processing, for many years. The category structure is used and analyzed in this paper. We take the important topological properties into account, such as the connectivity distribution. Most importantly, we analyze the growth of the structure from 2004 to 2012 in detail. To describe the growth, the basic properties and the small-worldness are brought in. Some different edge attachment models based on the properties of nodes are tested in order to study how the properties of nodes influence the creation of edges. We are particularly interested in the unusual data from 2011 and 2012 and study the reason closely. Our results offer useful insights into the structure and the growth of the category structure. 0 0
The category structure in wikipedia: To analyze and know its quality using k-core decomposition Wang Q.
Xiaolong Wang
Zheng Chen
Complex network
Overall topology
English 2013 Wikipedia is a famous and free encyclopedia. A network based on its category structure is built and then analyzed from various aspects, such as the connectivity distribution and the evolution of the overall topology. As an innovative point of our paper, a model based on the k-core decomposition is used to analyze the evolution of the overall topology and test the quality (that is, the error and attack tolerance) of the structure when nodes are removed. The model based on removal of edges is compared. Our results offer useful insights into the growth and the quality of the category structure, and into how to better organize the category structure. 0 0
The impact of temporal intent variability on diversity evaluation Zhou K.
Whiting S.
Jose J.M.
Lalmas M.
English 2013 To cope with the uncertainty involved with ambiguous or underspecified queries, search engines often diversify results to return documents that cover multiple interpretations, e.g. the car brand, animal or operating system for the query 'jaguar'. Current diversity evaluation measures take the popularity of the subtopics into account and aim to favour systems that promote most popular subtopics earliest in the result ranking. However, this subtopic popularity is assumed to be static over time. In this paper, we hypothesise that temporal subtopic popularity change is common for many topics and argue this characteristic should be considered when evaluating diversity. Firstly, to support our hypothesis we analyse temporal subtopic popularity changes for ambiguous queries through historic Wikipedia article viewing statistics. Further, by simulation, we demonstrate the impact of this temporal intent variability on diversity evaluation. 0 0
Towards an automatic creation of localized versions of DBpedia Palmero Aprosio A.
Claudio Giuliano
Lavelli A.
English 2013 DBpedia is a large-scale knowledge base that exploits Wikipedia as its primary data source. The extraction procedure requires manually mapping Wikipedia infoboxes into the DBpedia ontology. Thanks to crowdsourcing, a large number of infoboxes have been mapped in the English DBpedia. Consequently, the same procedure has been applied to other languages to create the localized versions of DBpedia. However, the number of accomplished mappings is still small and limited to the most frequent infoboxes. Furthermore, mappings need maintenance due to the constant and quick changes of Wikipedia articles. In this paper, we focus on the problem of automatically mapping infobox attributes to properties in the DBpedia ontology, for extending the coverage of the existing localized versions or building versions from scratch for languages not covered in the current version. The evaluation has been performed on the Italian mappings. We compared our results with the current mappings on a random sample re-annotated by the authors. We report results comparable to the ones obtained by a human annotator in terms of precision, but our approach leads to a significant improvement in recall and speed. Specifically, we mapped 45,978 Wikipedia infobox attributes to DBpedia properties in 14 different languages for which mappings were not yet available. The resource is made available in an open format. 0 0
TransWiki: Supporting translation teaching Biuk-Aghai R.P.
Hari Venkatesan
Collaborative learning
English 2013 Web-based learning systems have become common in recent years and wikis, websites whose pages anyone can edit, have enabled online collaborative text production. When applied to education, wikis have the potential to facilitate collaborative learning. We have developed a customized wiki system which we have used at our university in teaching translation in collaborative student groups. We report on the design and implementation of our wiki system and an evaluation of its use. 0 0
Tìpalo: A tool for automatic typing of DBpedia entities Nuzzolese A.G.
Aldo Gangemi
Valentina Presutti
Draicchio F.
Alberto Musetti
Paolo Ciancarini
English 2013 In this paper we demonstrate the potential of Tìpalo, a tool for automatically typing DBpedia entities. Tìpalo identifies the most appropriate types for an entity in DBpedia by interpreting its definition extracted from its corresponding Wikipedia abstract. Tìpalo relies on FRED, a tool for ontology learning from natural language text, and on a set of graph-pattern-based heuristics which work on the output returned by FRED in order to select the most appropriate types for a DBpedia entity. The tool returns an RDF graph composed of rdf:type, rdfs:subClassOf, owl:sameAs, and owl:equivalentTo statements providing typing information about the entity. Additionally, the types are aligned to two lists of top-level concepts, i.e., Wordnet supersenses and a subset of DOLCE Ultra Lite classes. Tìpalo is available as a Web-based tool and exposes its API as HTTP REST services. 0 0
Ukrainian WordNet: Creation and filling Anisimov A.
Marchenko O.
Nikonenko A.
Porkhun E.
Taranukha V.
Information extraction
Knowledge representation
English 2013 This paper deals with the process of developing a lexical semantic database for the Ukrainian language, UkrWordNet. The architecture of the developed system is described in detail. The data storing structure and mechanisms of access to knowledge are reviewed along with the internal logic of the system and some key software modules. The article is also concerned with the research and development of automated techniques for replenishing and extending the UkrWordNet Semantic Network. 0 0
Unsupervised gazette creation using information distance Patil S.
Pawar S.
Palshikar G.K.
Bhat S.
Srivastava R.
Information distance
Information extraction
Named entity extraction
Unsupervised learning
English 2013 The Named Entity extraction (NEX) problem consists of automatically constructing a gazette containing instances for each NE of interest. NEX is important for domains which lack a corpus with tagged NEs. In this paper, we propose a new unsupervised (bootstrapping) NEX technique, based on a new variant of the Multiword Expression Distance (MED) [1] and information distance [2]. The efficacy of our method is shown by comparison with BASILISK and PMI in the agriculture domain. Our method discovered 8 new diseases which are not found in Wikipedia. 0 0
Utilizing annotated wikipedia article titles to improve a rule-based named entity recognizer for Turkish Kucuk D. English 2013 Named entity recognition is one of the information extraction tasks which aims to identify named entities such as person/location/organization names along with some numeric and temporal expressions in free natural language texts. In this study, we target named entity recognition from Turkish texts, on which information extraction research is considerably rare compared to other well-studied languages. The effects of utilizing annotated Wikipedia article titles to enrich the lexical resources of a rule-based named entity recognizer for Turkish are discussed after evaluating the enriched named entity recognizer against its initial version. The evaluation results demonstrate that the presented extension improves the recognition performance on different text genres, particularly on historical and financial news text sets for which the initial recognizer has not been engineered. The current study is significant as it is the first study to address the utilization of Wikipedia articles as an information source to improve named entity recognition on Turkish texts. 0 0
WikiDetect: Automatic vandalism detection for Wikipedia using linguistic features Cioiu D.
Rebedea T.
Natural Language Processing
Vandalism detection
English 2013 Vandalism of the content has always been one of the greatest problems for Wikipedia, yet only a few completely automatic solutions for solving it have been developed so far. Volunteers still spend large amounts of time correcting vandalized page edits, instead of using this time to improve the quality of the content of articles. The purpose of this paper is to introduce a new vandalism detection system that uses only natural language processing and machine learning techniques. The system has been evaluated on a corpus of real vandalized data in order to test its performance and justify the design choices. The same expert annotated wikitext, extracted from the encyclopedia's database, is used to evaluate different vandalism detection algorithms. The paper presents a critical analysis of the obtained results, comparing them to existing solutions, and suggests different statistical classification methods that bring several improvements to the task at hand. 0 0
Wikipedia articles representation with matrix'u Szymanski J. Documents classification
Text representation
English 2013 In the article we evaluate different text representation methods used for the task of Wikipedia article categorization. We present the Matrix'u application used for creating computational datasets of Wikipedia articles. The representations have been evaluated with SVM classifiers used to reconstruct human-made categories. 0 0
A case study on scaffolding design for wiki-based collaborative knowledge building Li S.
Shi P.
Tang Q.
Collaborative knowledge building
English 2012 Social software, particularly Wiki, is providing new opportunities for computer-based collaborative learning by supporting more flexible sharing, communication, co-writing, collaborative knowledge building and learning community building. This paper presents a case study on how to scaffold wiki-based collaborative knowledge building in a tertiary education environment, which is expected to be a useful exploration of pedagogy with wikis. The paper proposes a theoretical scaffolding framework for wiki-based collaborative knowledge building, in which cognitive process, motivation and skills are considered the backbone of scaffolding design. Then, implementation results of this framework in a bachelor degree course are reported in the paper. Results of the implementation were positive. As predicted, both participation rate and quality of social construction were improved. The paper concludes with a discussion and reflection on issues relevant to implementation of the scaffolding framework, including designing scaffolding strategies, the role of instructors, improvement of wiki systems and further research. 0 0
A conceptual framework and experimental workbench for architectures Konersmann M.
Goedicke M.
English 2012 When developing the architecture of a software system, inconsistent architecture representations and missing specifications or documentations are often a problem. We present a conceptual framework for software architecture that can help to avoid inconsistencies between the specification and the implementation, and thus helps during the maintenance and evolution of software systems. For experimenting with the framework, we present an experimental workbench. Within this workbench, architecture information is described in an intermediate language in a semantic wiki. The semantic information is used as an experimental representation of the architecture and provides a basis for bidirectional transformations between implemented and specified architecture. A systematic integration of model information in the source code of component models allows for maintaining only one representation of the architecture: the source code. The workbench can be easily extended to experiment with other Architecture Description Languages, Component Models, and analysis languages. 0 0
A graph-based summarization system at QA@INEX track 2011 Laureano-Cruces A.L.
Ramirez-Rodriguez J.
Automatic summarization system
Question-answering system
English 2012 In this paper we use REG, a graph-based system, to study a fundamental problem of Natural Language Processing: the automatic summarization of documents. The algorithm models a document as a graph, to obtain weighted sentences. We applied this approach to the INEX@QA 2011 task (question-answering). We have extracted the title and some key or related words according to two people from the queries, in order to recover 50 documents from English Wikipedia. Using this strategy, REG obtained good results with the automatic evaluation system FRESA. 0 0
A hybrid QA system with focused IR and automatic summarization for INEX 2011 Bhaskar P.
Somnath Banerjee
Neogi S.
Bandyopadhyay S.
Automatic summarization
INEX 2011
Information extraction
Information retrieval
Question answering
English 2012 The article presents the experiments carried out as part of the participation in the QA track of INEX 2011. We have submitted two runs. The INEX QA task has two main sub tasks, Focused IR and Automatic Summarization. In the Focused IR system, we first preprocess the Wikipedia documents and then index them using Nutch. Stop words are removed from each query tweet and all the remaining tweet words are stemmed using Porter stemmer. The stemmed tweet words form the query for retrieving the most relevant document using the index. The automatic summarization system takes as input the query tweet along with the tweet's text and the title from the most relevant text document. Most relevant sentences are retrieved from the associated document based on the TF-IDF of the matching query tweet, tweet's text and title words. Each retrieved sentence is assigned a ranking score in the Automatic Summarization system. The answer passage includes the top ranked retrieved sentences with a limit of 500 words. The two unique runs differ in the way in which the relevant sentences are retrieved from the associated document. Our first run got the highest score of 432.2 in Relaxed metric of Readability evaluation among all the participants. 0 0
A method for automatically extracting domain semantic networks from Wikipedia Xavier C.C.
De Lima V.L.S.
Knowledge acquisition
Semantic networks
English 2012 This paper describes a method for automatically extracting domain semantic networks of concepts connected by non-specific relations from Wikipedia. We propose an approach based on category and link structure analysis. The method consists of two main tasks: concepts extraction and relations acquisition. For each task we developed two different implementation strategies. Aiming to identify what strategies have the best performances we conducted different extractions for two domains and we analyze their results. From this evaluation we discuss the best approach to implement the extraction method. 0 0
A semantic-based social network of academic researchers Davoodi E.
Kianmehr K.
Clustering Analysis
Information retrieval
Semantic-based Similarity
Social Network Analysis
English 2012 We propose a framework to construct a semantic-based social network of academic researchers to discover hidden social relationships among the researchers in a particular domain. The challenging task in the process is to detect accurate relationships that exist among researchers according to their expertise and academic experience. In this paper, we first construct content-based profiles of researchers by crawling online resources. Then background knowledge derived from Wikipedia, represented in a semantic kernel, is employed to enrich the researchers' profiles. The researchers' social network is then constructed based on the similarities among semantic-based profiles. Social communities are then detected by applying social network analysis and using factors such as experience, background, knowledge level, and personal preferences. Representative members of a community are identified using the eigenvector centrality measure. An interesting application of the constructed social network in academic conferences, when there is a need to assign papers to relevant researchers for the review process, is investigated. 0 0
A supervised method for lexical annotation of schema labels based on wikipedia Sorrentino S.
Bergamaschi S.
Parmiggiani E.
English 2012 Lexical annotation is the process of explicit assignment of one or more meanings to a term w.r.t. a sense inventory (e.g., a thesaurus or an ontology). We propose an automatic supervised lexical annotation method, called ALA TK (Automatic Lexical Annotation - Topic Kernel), based on the Topic Kernel function for the annotation of schema labels extracted from structured and semi-structured data sources. It exploits Wikipedia as a sense inventory and as a resource of training data. 0 0
An ontology evolution-based framework for semantic information retrieval Rodriguez-Garcia M.A.
Valencia-Garcia R.
Garcia-Sanchez F.
English 2012 Ontologies evolve continuously during their life cycle to adapt to new requirements and necessities. Ontology-based information retrieval systems use semantic annotations that are also regularly updated to reflect new points of view. In order to provide a general solution and to minimize the users' effort in the ontology enriching process, a methodology for extracting terms and evolving the domain ontology from Wikipedia is proposed in this work. The framework presented here combines an ontology-based information retrieval system with an ontology evolution approach in such a way that it simplifies the tasks of updating concepts and relations in domain ontologies. This framework has been validated in a scenario where ICT-related cloud services matching the user needs are to be found. 0 0
Analyzing design tradeoffs in large-scale socio-technical systems through simulation of dynamic collaboration patterns Dorn C.
Edwards G.
Medvidovic N.
Collaboration Patterns
Design Tools and Techniques
Large-scale Socio-Technical Systems
System Simulation
English 2012 Emerging online collaboration platforms such as Wikipedia, Twitter, or Facebook provide the foundation for socio-technical systems where humans have become both content consumer and provider. Existing software engineering tools and techniques support the system engineer in designing and assessing the technical infrastructure. Little research, however, addresses the engineer's need for understanding the overall socio-technical system behavior. The effect of fundamental design decisions becomes quickly unpredictable as multiple collaboration patterns become integrated into a single system. We propose the simulation of human and software elements at the collaboration level. We aim for detecting and evaluating undesirable system behavior such as users experiencing repeated update conflicts or software components becoming overloaded. To this end, this paper contributes (i) a language and (ii) methodology for specifying and simulating large-scale collaboration structures, (iii) example individual and aggregated pattern simulations, and (iv) evaluation of the overall approach. 0 0
Annotating words using wordnet semantic glosses Szymanski J.
Duch W.
Natural Language Processing
Word Sense Disambiguation
English 2012 An approach to word sense disambiguation (WSD) relying on the WordNet synsets is proposed. The method uses semantically tagged glosses to perform a process similar to spreading activation in a semantic network, creating a ranking of the most probable meanings for word annotation. Preliminary evaluation shows quite promising results. Comparison with the state-of-the-art WSD methods indicates that the use of WordNet relations and semantically tagged glosses should enhance the accuracy of word disambiguation methods. 0 0
Architecture-driven modeling of adaptive collaboration structures in large-scale social web applications Dorn C.
Taylor R.N.
Adaptation Flexibility
Collaboration Patterns
Design Tools and Techniques
English 2012 Internet-based, large-scale systems provide the technical foundation for massive online collaboration forms such as social networks, crowdsourcing, content sharing, or source code generation. Such systems are typically designed to adapt at the software level to achieve availability and scalability. They, however, remain mostly unaware of the changing requirements of the various ongoing collaborations. As a consequence, cooperative efforts cannot grow and evolve as easily nor efficiently as they need to. An adaptation mechanism needs to become aware of a collaboration's structure and flexibility to consider changing collaboration requirements during system reconfiguration. To this end, this paper presents the human Architecture Description Language (hADL) for describing the envisioned collaboration dynamics. Inspired by software architecture concepts, hADL introduces human components and collaboration connectors for describing the underlying human coordination dependencies. We further outline a methodology for designing collaboration patterns based on a set of fundamental principles that facilitate runtime adaptation. An exemplary model transformation demonstrates hADL's feasibility. It produces the group permission configuration for MediaWiki in reaction to changing collaboration conditions. 0 0
Are human-input seeds good enough for entity set expansion? Seeds rewriting by leveraging Wikipedia semantic knowledge Qi Z.
Kang Liu
Jun Zhao
Information extraction
Seed rewrite
Semantic knowledge
English 2012 Entity Set Expansion is an important task for open information extraction, which refers to expanding a given partial seed set to a more complete set that belongs to the same semantic class. Many previous studies have shown that the quality of seeds can strongly influence expansion performance, since human-input seeds may be ambiguous, sparse, etc. In this paper, we propose a novel method which can generate new, high-quality seeds and replace original, poor-quality ones. In our method, we leverage Wikipedia as a semantic knowledge source to measure the semantic relatedness and ambiguity of each seed. Moreover, to avoid the sparseness of the seed, we use web resources to measure its population. Then new seeds are generated to replace original, poor-quality seeds. Experimental results show that new seed sets generated by our method can improve entity expansion performance by up to 9.1% on average over original seed sets. 0 0
Assessing quality values of Wikipedia articles using implicit positive and negative ratings Yu Suzuki Edit history
English 2012 In this paper, we propose a method to identify high-quality Wikipedia articles by mutually evaluating editors and text using implicit positive and negative ratings. One of the major approaches for assessing Wikipedia articles is the text survival ratio based approach. However, the problem with this approach is that many low-quality articles are misjudged as high quality because of two issues. This is because not every editor reads the whole article. Therefore, if there is low-quality text at the bottom of a long article, and the text has not been seen by other editors, then the text survives many edits, and the survival ratio of the text is high. To solve this problem, we use a section or a paragraph as the unit of remaining instead of a whole page. This means that if an editor edits an article, the system treats the editor as giving positive ratings to the section or paragraph that the editor edits. This is because we believe that if editors edit articles, the editors may not read the whole page, but they should read the whole sections or paragraphs and delete low-quality text. From experimental evaluation, we confirmed that the proposed method could improve the accuracy of quality values for articles. 0 0
Author disambiguation using wikipedia-based explicit semantic analysis Kang I.-S. Author Disambiguation
Explicit Semantic Analysis
Topical Representation
English 2012 Author disambiguation suffers from the shortage of topical terms to identify authors. This study attempts to augment term-based topical representation of authors with the concept-based one obtained from Wikipedia-based explicit semantic analysis (ESA). Experiments showed that the use of additional ESA concepts improves author-resolving performance by 13.5%. 0 0
Automatic subject metadata generation for scientific documents using wikipedia and genetic algorithms Joorabchi A.
Mahdi A.E.
Genetic algorithms
Keyphrase annotation
Keyphrase indexing
Scientific digital libraries
Subject metadata
Text mining
English 2012 Topical annotation of documents with keyphrases is a proven method for revealing the subject of scientific and research documents. However, scientific documents that are manually annotated with keyphrases are in the minority. This paper describes a machine learning-based automatic keyphrase annotation method for scientific documents, which utilizes Wikipedia as a thesaurus for candidate selection from documents' content and deploys genetic algorithms to learn a model for ranking and filtering the most probable keyphrases. Reported experimental results show that the performance of our method, evaluated in terms of inter-consistency with human annotators, is on a par with that achieved by humans and outperforms rival supervised methods. 0 0
Automatic taxonomy extraction in different languages using wikipedia and minimal language-specific information Dominguez Garcia R.
Schmidt S.
Rensing C.
Steinmetz R.
Hyponymy Detection
Multilingual large-scale taxonomies
Natural Language Processing
Data mining
English 2012 Knowledge bases extracted from Wikipedia are particularly useful for various NLP and Semantic Web applications due to their coverage, actuality and multilingualism. This has led to many approaches for automatic knowledge base extraction from Wikipedia. Most of these approaches rely on the English Wikipedia as it is the largest Wikipedia version. However, each Wikipedia version contains socio-cultural knowledge, i.e. knowledge with relevance for a specific culture or language. In this work, we describe a method for extracting a large set of hyponymy relations from the Wikipedia category system that can be used to acquire taxonomies in multiple languages. More specifically, we describe a set of 20 features that can be used for Hyponymy Detection without using additional language-specific corpora. Finally, we evaluate our approach on Wikipedia in five different languages and compare the results with the WordNet taxonomy and a multilingual approach based on interwiki links of the Wikipedia. 0 0
Automatic typing of DBpedia entities Aldo Gangemi
Nuzzolese A.G.
Valentina Presutti
Draicchio F.
Alberto Musetti
Paolo Ciancarini
English 2012 We present Tìpalo, an algorithm and tool for automatically typing DBpedia entities. Tìpalo identifies the most appropriate types for an entity by interpreting its natural language definition, which is extracted from its corresponding Wikipedia page abstract. Types are identified by means of a set of heuristics based on graph patterns, disambiguated to WordNet, and aligned to two top-level ontologies: WordNet supersenses and a subset of DOLCE+DnS Ultra Lite classes. The algorithm has been tuned against a golden standard that has been built online by a group of selected users, and further evaluated in a user study. 0 0
Automatic vandalism detection in Wikipedia with active associative classification Maria Sumbana
Goncalves M.A.
Rodrigo Silva
Jussara Almeida
Adriano Veloso
English 2012 Wikipedia and other free editing services for collaboratively generated content have quickly grown in popularity. However, the lack of editing control has made these services vulnerable to various types of malicious actions such as vandalism. State-of-the-art vandalism detection methods are based on supervised techniques, thus relying on the availability of large and representative training collections. Building such collections, often with the help of crowdsourcing, is very costly due to a natural skew towards very few vandalism examples in the available data as well as dynamic patterns. Aiming at reducing the cost of building such collections, we present a new active sampling technique coupled with an on-demand associative classification algorithm for Wikipedia vandalism detection. We show that our classifier enhanced with a simple undersampling technique for building the training set outperforms state-of-the-art classifiers such as SVMs and kNNs. Furthermore, by applying active sampling, we are able to reduce the need for training by almost 96% with only a small impact on detection results. 0 0
BiCWS: Mining cognitive differences from bilingual web search results Xiangji Huang
Wan X.
Jie Xiao
Comparative Text Mining
Cross Lingual Text Mining
Information retrieval
English 2012 In this paper we propose a novel comparative web search system - BiCWS, which can mine cognitive differences from web search results in a multi-language setting. Given a topic represented by two queries (they are the translations of each other) in two languages, the corresponding web search results for the two queries are firstly retrieved by using a general web search engine, and then the bilingual facets for the topic are mined by using a bilingual search results clustering algorithm. The semantics in Wikipedia are leveraged to improve the bilingual clustering performance. After that, the semantic distributions of the search results over the mined facets are visually presented, which can reflect the cognitive differences in the bilingual communities. Experimental results show the effectiveness of our proposed system. 0 0
Building a large scale knowledge base from Chinese Wiki Encyclopedia Zhe Wang
Jing-Woei Li
Pan J.Z.
Knowledge base
Linked data
Semantic web
English 2012 DBpedia has proved to be a successful structured knowledge base, and large scale Semantic Web data has been built by using DBpedia as the central interlinking hub of the Web of Data in English. But in Chinese, due to the heavy imbalance in size (no more than one tenth) between English and Chinese in Wikipedia, few Chinese linked data are published and linked to DBpedia, which hinders structured knowledge sharing both within Chinese resources and across cross-lingual resources. This paper aims at building a large scale Chinese structured knowledge base from Hudong, which is one of the largest Chinese Wiki Encyclopedia websites. In this paper, an upper-level ontology schema in Chinese is first learned based on the category system and Infobox information in Hudong. In total, 19542 concepts are inferred, which are organized in a hierarchy with at most 20 levels. 2381 properties with domain and range information are learned according to the attributes in the Hudong Infoboxes. Then, 802593 instances are extracted and described using the concepts and properties in the learned ontology. These extracted instances cover a wide range of things, including persons, organizations, places and so on. Among all the instances, 62679 are linked to identical instances in DBpedia. Moreover, the paper provides an RDF dump or SPARQL to access the established Chinese knowledge base. The general upper-level ontology and wide coverage make the knowledge base a valuable Chinese semantic resource. It can not only be used in Chinese linked data building, the fundamental work for building a multilingual knowledge base across heterogeneous resources of different languages, but can also largely facilitate many useful applications of large-scale knowledge bases such as knowledge question-answering and semantic search. 0 0
Catching the drift - Indexing implicit knowledge in chemical digital libraries Kohncke B.
Tonnies S.
Balke W.-T.
Chemical digital collections
Document ranking
English 2012 In the domain of chemistry the information gathering process is highly focused on chemical entities. But due to synonyms and different entity representations the indexing of chemical documents is a challenging process. Considering the field of drug design, the task is even more complex. Domain experts from this field are usually not interested in any chemical entity itself, but in representatives of some chemical class showing a specific reaction behavior. For describing such a reaction behavior of chemical entities the most interesting parts are their functional groups. The restriction of each chemical class is somehow also related to the entities' reaction behavior, but further based on the chemist's implicit knowledge. In this paper we present an approach dealing with this implicit knowledge by clustering chemical entities based on their functional groups. However, since such clusters are generally too unspecific, containing chemical entities from different chemical classes, we further divide them into sub-clusters using fingerprint based similarity measures. We analyze several uncorrelated fingerprint/similarity measure combinations and show that the most similar entities with respect to a query entity can be found in the respective sub-cluster. Furthermore, we use our approach for document retrieval introducing a new similarity measure based on Wikipedia categories. Our evaluation shows that the sub-clustering leads to suitable results enabling sophisticated document retrieval in chemical digital libraries. 0 0
Categorizing search results using WordNet and Wikipedia Hemayati R.T.
Meng W.
Yu C.
Search Engine
Search Result Clustering and Categorization
English 2012 Terms used in search queries often have multiple meanings and usages. Consequently, search results corresponding to different meanings or usages may be retrieved, making identifying relevant results inconvenient and time-consuming. In this paper, we study the problem of grouping the search results based on the different meanings and usages of a query. We build on a previous work that identifies and ranks possible categories of any user query based on the meanings and common usages of the terms and phrases within the query. We use these categories to group search results. In this paper, we study different methods, including several new methods, to assign search result records (SRRs) to the categories. Our SRR grouping framework supports a combination of categorization, clustering and query rewriting techniques. Our experimental results show that some of our grouping methods can achieve high accuracy. 0 0
Classification of short texts by deploying topical annotations Vitale D.
Paolo Ferragina
Ugo Scaiella
English 2012 We propose a novel approach to the classification of short texts based on two factors: the use of Wikipedia-based annotators that have been recently introduced to detect the main topics present in an input text, represented via Wikipedia pages, and the design of a novel classification algorithm that measures the similarity between the input text and each output category by deploying only their annotated topics and the Wikipedia link-structure. Our approach waives the common practice of expanding the feature-space with new dimensions derived either from explicit or from latent semantic analysis. As a consequence it is simple and maintains a compact intelligible representation of the output categories. Our experiments show that it is efficient in construction and query time, as accurate as state-of-the-art classifiers (see e.g. Phan et al. WWW '08), and robust with respect to concept drifts and input sources. 0 0
Classifying image galleries into a taxonomy using metadata and wikipedia Kramer G.
Gosse Bouma
Hendriksen D.
Homminga M.
Hierarchical classification
Image gallery
English 2012 This paper presents a method for the hierarchical classification of image galleries into a taxonomy. The proposed method links textual gallery metadata to Wikipedia pages and categories. Entity extraction from metadata, entity ranking, and selection of categories is based on Wikipedia and does not require labeled training data. The resulting system performs well above a random baseline, and achieves a (micro-averaged) F-score of 0.59 on the 9 top categories of the taxonomy and 0.40 when using all 57 categories. 0 0
Cross-lingual knowledge discovery: Chinese-to-English article linking in wikipedia Tang L.-X.
Andrew Trotman
Shlomo Geva
Xu Y.
Anchor identification
Chinese segmentation
Cross-lingual link discovery
Link mining
Link recommendation
English 2012 In this paper we examine automated Chinese to English link discovery in Wikipedia and the effects of Chinese segmentation and Chinese to English translation on the hyperlink recommendation. Our experimental results show that the implemented link discovery framework can effectively recommend Chinese-to-English cross-lingual links. The techniques described here can assist bi-lingual users where a particular topic is not covered in Chinese, is not equally covered in both languages, or is biased in one language; as well as for language learning. 0 0
Cross-modal information retrieval - A case study on Chinese wikipedia Cong Y.
Qin Z.
Jian Yu
Wan T.
Character-based topics
Cross-modal information retrieval
Topic correlation model (TCM)
Word-based topics
English 2012 Probability models have been used in cross-modal multimedia information retrieval recently by building conjunctive models bridging the text and image components. Previous studies have shown that a cross-modal information retrieval system using the topic correlation model (TCM) outperforms state-of-the-art models on English corpora. In this paper, we will focus on the Chinese language, which is different from western languages composed of alphabets. Words and characters will be chosen as the basic structural units of Chinese, respectively. We also set up a test database, named Ch-Wikipedia, in which documents with paired image and text are extracted from the Chinese website of Wikipedia. We investigate the problems of retrieving texts (ranked by semantic closeness) given an image query, and vice versa. The capabilities of the TCM model are verified by experiments across the Ch-Wikipedia dataset. 0 0
DAnIEL: Language independent character-based news surveillance Lejeune G.
Brixtel R.
Antoine Doucet
Lucas N.
English 2012 This study aims at developing a news surveillance system able to address multilingual web corpora. As an example of a domain where multilingual capacity is crucial, we focus on Epidemic Surveillance. This task necessitates worldwide coverage of news in order to detect new events as quickly as possible, anywhere, whatever the language it is first reported in. In this study, text-genre is used rather than sentence analysis. The news-genre properties allow us to assess the thematic relevance of news, filtered with the help of a specialised lexicon that is automatically collected on Wikipedia. Afterwards, a more detailed analysis of text specific properties is applied to relevant documents to better characterize the epidemic event (i.e., which disease spreads where?). Results from 400 documents in each language demonstrate the interest of this multilingual approach with light resources. DAnIEL achieves an F1-measure score around 85%. Two issues are addressed: the first is morphologically rich languages, e.g. Greek, Polish and Russian as compared to English. The second is event location detection as related to disease detection. This system provides a reliable alternative to the generic IE architecture that is constrained by the lack of numerous components in many languages. 0 0
Detecting Korean hedge sentences in Wikipedia documents Kang S.-J.
Jeong J.-S.
Kang I.-S.
Korean Hedge Detection
Machine learning
Uncertainty Detection
English 2012 In this paper we propose automatic hedge detection methods for Korean. We selected sentential contextual features adjusted for Korean, and used supervised machine-learning algorithms to train models to detect hedges in Wikipedia documents. Our SVM-based model achieved an F1-score of 90.8% for Korean. 0 0
Detecting Wikipedia vandalism with a contributing efficiency-based approach Tang X.
Guangyou Zhou
Fu Y.
Gan L.
Yu W.
Li S.
Vandalism detection
English 2012 The collaborative nature of wikis has distinguished Wikipedia as an online encyclopedia but also makes the open contents vulnerable to vandalism. The current vandalism detection methods relying on basic statistical language features work well for explicitly offensive edits that perform massive changes. However, these techniques can be evaded by elusive vandal edits which make only a few unproductive or dishonest modifications. In this paper we propose a contributing efficiency-based approach to detect vandalism in Wikipedia and implement it with machine learning-based classifiers that incorporate the contributing efficiency along with other language features. The results of an extensional experiment show that the contributing efficiency can improve the recall of machine learning-based vandalism detection algorithms significantly. 0 0
Discovery of novel term associations in a document collection Hynonen T.
Mahler S.
Toivonen H.
English 2012 We propose a method to mine novel, document-specific associations between terms in a collection of unstructured documents. We believe that documents are often best described by the relationships they establish. This is also evidenced by the popularity of conceptual maps, mind maps, and other similar methodologies to organize and summarize information. Our goal is to discover term relationships that can be used to construct conceptual maps or so called BisoNets. The model we propose, tpf-idf-tpu, looks for pairs of terms that are associated in an individual document. It considers three aspects, two of which have been generalized from tf-idf to term pairs: term pair frequency (tpf; importance for the document), inverse document frequency (idf; uniqueness in the collection), and term pair uncorrelation (tpu; independence of the terms). The last component is needed to filter out statistically dependent pairs that are not likely to be considered novel or interesting by the user. We present experimental results on two collections of documents: one extracted from Wikipedia, and one containing text mining articles with manually assigned term associations. The results indicate that the tpf-idf-tpu method can discover novel associations, that they are different from just taking pairs of tf-idf keywords, and that they match better the subjective associations of a reader. 0 0
Dynamic PageRank using evolving teleportation Rossi R.A.
Gleich D.F.
English 2012 The importance of nodes in a network constantly fluctuates based on changes in the network structure as well as changes in external interest. We propose an evolving teleportation adaptation of the PageRank method to capture how changes in external interest influence the importance of a node. This framework seamlessly generalizes PageRank because the importance of a node will converge to the PageRank values if the external influence stops changing. We demonstrate the effectiveness of the evolving teleportation on the Wikipedia graph and the Twitter social network. The external interest is given by the number of hourly visitors to each page and the number of monthly tweets for each user. 0 0
Engineering a controlled natural language into semantic MediaWiki Dantuluri P.
Davis B.
Ludwick P.
Handschuh S.
English 2012 The Semantic Web is yet to gain mainstream recognition. In part this is caused by the relative complexity of the various semantic web formalisms, which act as a major barrier of entry to naive web users. In addition, in order for the Semantic Web to become a reality, we need semantic metadata. While controlled natural language research has sought to address these challenges, in the context of user friendly ontology authoring for domain experts, there has been little focus on how to adapt controlled languages for novice social web users. The paper describes an approach to using controlled languages for fact creation and management as opposed to ontology authoring, focusing on the domain of meeting minutes. For demonstration purposes, we developed a plug-in to the Semantic MediaWiki, which adds a controlled language editor extension. This editor aids the user while authoring or annotating in a controlled language in a user friendly manner. Controlled content is sent to a parsing service which generates semantic metadata from the sentences, which are subsequently displayed and stored in the Semantic MediaWiki. The semantic metadata generated by the parser is grounded against a project documents ontology. The controlled language modeled covers a wide variety of sentences and topics used in the context of meeting minutes. Finally, this paper provides an architectural overview of the annotation system. 0 0
Exploiting a web-based encyclopedia as a knowledge base for the extraction of multilingual terminology Sadat F. Comparable corpora
Cross-Language Information Retrieval
Linguistics-based information
English 2012 Multilingual linguistic resources are usually constructed from parallel corpora, but since these corpora are available only for selected text domains and language pairs, the potential of other resources is being explored as well. This article seeks to explore and to exploit the idea of using multilingual web-based encyclopaedias such as Wikipedia as comparable corpora for bilingual terminology extraction. We propose an approach to extract terms and their translations from different types of Wikipedia link information and data. The next step will be using linguistic-based information to re-rank and filter the extracted term candidates in the target language. Preliminary evaluations using the combined statistics-based and linguistic-based approaches were applied on different pairs of languages including Japanese, French and English. These evaluations showed a real open improvement and a good quality of the extracted term candidates for building or enriching multilingual anthologies, dictionaries or feeding a cross-language information retrieval system with the related expansion terms of the source query. 0 0
Exploration and visualization of administrator network in wikipedia Yousaf J.
Jing-Woei Li
Haisu Zhang
Hou L.
Human factors
Social network analysis
English 2012 Wikipedia has become one of the most widely used knowledge systems on the Web. It contains resources and information of varying quality contributed by different sets of authors. A special group of authors named administrators plays an important role for content quality in Wikipedia. Understanding the behaviors of administrators in Wikipedia can facilitate the management of the Wikipedia system, and empower some applications such as article recommendation and expert administrator finding for given articles. This paper addresses the exploration and visualization of the administrator network in Wikipedia. The administrator network is first constructed by using the co-editing relationship, and six characteristics for administrators are proposed to describe the behaviors of administrators in Wikipedia from different perspectives. Quantified calculation of these characteristics is then put forward by using social network analysis techniques. A topic model is used to relate the content of Wikipedia to the interest diversity of administrators. Based on the MediaWiki history records from January 2010 to January 2011, we develop an administrator exploration prototype system which can rank the selected characteristics for administrators and can be used as a decision support system. Furthermore, some meaningful observations show that the administrator network is a healthy small world community, and a strong centralization of the network around some hubs/stars indicates a considerable nucleus of very active administrators that seems to be omnipresent. The ranking of these top-ranked administrators is found to be consistent with the number of barnstars awarded to them. 0 0
Extracting difference information from multilingual wikipedia Fujiwara Y.
Yu Suzuki
Konishi Y.
Akiyo Nadamoto
English 2012 Wikipedia articles for a particular topic are written in many languages. When we select two articles which are about a single topic but which are written in different languages, the contents of these two articles are expected to be identical because of the Wikipedia policy. However, these contents are actually different, especially topics related to culture. In this paper, we propose a system to extract different Wikipedia information between that shown for Japan and that of other countries. An important technical problem is how to extract comparison target articles of Wikipedia. A Wikipedia article is written in different languages, with their respective linguistic structures. For example, "Cricket" is an important part of English culture, but the Japanese Wikipedia article related to cricket is too simple. Actually, it is only a single page. In contrast, the English version is substantial. It includes multiple pages. For that reason, we must consider which articles can be reasonably compared. Subsequently, we extract comparison target articles of Wikipedia based on a link graph and article structure. We implement our proposed method, and confirm the accuracy of difference extraction methods. 0 0
Extracting property semantics from Japanese Wikipedia Susumu Tamagawa
Takeshi Morita
Takahira Yamaguchi
Ontology learning
Property definition
English 2012 This paper discusses how to build an ontology with many properties from Japanese Wikipedia. The ontology includes the is-a relationship (rdfs:subClassOf), the class-instance relationship (rdf:type) and the synonym relation (skos:altLabel); moreover, it includes property relations and types. Property relations are triples, property domain (rdfs:domain) and property range (rdfs:range). Property types are object (owl:ObjectProperty), data (owl:DatatypeProperty), symmetric (owl:SymmetricProperty), transitive (owl:TransitiveProperty), functional (owl:FunctionalProperty) and inverse functional (owl:InverseFunctionalProperty). 0 0
Extraction of bilingual cognates from Wikipedia Gamallo P.
Garcia M.
English 2012 In this article, we propose a method to extract translation equivalents with similar spelling from comparable corpora. The method was applied on Wikipedia to extract a large amount of Portuguese-Spanish bilingual terminological pairs that were not found in existing dictionaries. The resulting bilingual lexicon consists of more than 27,000 new pairs of lemmas and multiwords, with about 92% accuracy. 0 0
Feature transformation method enhanced vandalism detection in wikipedia Chang T.
Hong Lin
Yi-Sheng Lin
English 2012 A typical example of a Web 2.0 application is Wikipedia, an online encyclopedia where anyone can edit and share information. However, blatantly unproductive edits greatly undermine the quality of Wikipedia. Such irresponsible acts force editors to waste time undoing vandalism. For the purpose of improving information quality on Wikipedia and freeing maintainers from such repetitive tasks, machine learning methods have been proposed to detect vandalism automatically. However, most of them focused on mining new features, of which there seems to be an inexhaustible supply. Therefore, the question of how to make the best use of these features needs to be tackled. In this paper, we leverage feature transformation techniques to analyze the features and propose a framework using these methods to enhance detection. Experimental results on the public dataset PAN-WVC-10 show that our method is effective and that it provides another useful way to help detect vandalism in Wikipedia. 0 0
Focused elements and snippets Crouch C.J.
Crouch D.B.
Acquilla N.
Banhatti R.
Chittilla S.
Nagalla S.
Narenvarapu R.
English 2012 This paper reports briefly on the final results of experiments to produce competitive (i.e., highly ranked) focused elements in response to the various tasks of the INEX 2010 Ad Hoc Track. These experiments are based on an entirely new analysis and indexing of the INEX 2009 Wikipedia collection. Using this indexing and our basic methodology for dynamic element retrieval [5, 6], described herein, yields highly competitive results for all the tasks involved. This is important because our approach to snippet retrieval is based on the conviction that good snippets can be generated from good focused elements. Our work to date in snippet generation is described; this early work ranked 9th in the official results. 0 0
From web 1.0 to social semantic web: Lessons learnt from a migration to a medical semantic wiki Meilender T.
Lieber J.
Palomares F.
Jay N.
Decision knowledge
Medical information systems
Semantic wiki
English 2012 Oncolor is an association whose mission is to publish and share medical guidelines in oncology. Like many scientific information websites built in the early times of the Internet, its website deals with unstructured data that cannot be automatically queried and is getting more and more difficult to maintain over time. Access to the online content and the editing process can be improved by using web 2.0 and semantic web technologies, which allow structured information bases to be built collaboratively in semantic portals. The work described in this paper reports on a migration from a static HTML website to a semantic wiki in the medical domain. This approach has raised various issues that had to be addressed, such as the introduction of structured data in the unstructured imported guidelines or the linkage of content to external medical resources. An evaluation of the result by final users is also provided, and proposed solutions are discussed. 0 0
General architecture of a controlled natural language based multilingual semantic wiki Kaljurand K. AceWiki
Attempto Controlled English
Controlled natural language
Grammatical Framework
Semantic wiki
English 2012 In this paper we propose the components, the general architecture and application areas for a controlled natural language based multilingual semantic wiki. Such a wiki is a collaborative knowledge engineering environment that makes its content available via multiple languages, both natural and formal, all of them synchronized via their abstract form that is assigned by a shared grammar. We also describe a preliminary implementation of such a system based on the existing technologies of Grammatical Framework, Attempto Controlled English, and AceWiki. 0 0
Good quality complementary information for multilingual Wikipedia Yu Suzuki
Fujiwara Y.
Konishi Y.
Akiyo Nadamoto
English 2012 Many Wikipedia articles lack information, because not all users submit truly complete information to Wikipedia. However, Wikipedia has many language versions that have been developed independently. Therefore, if we supply this complementary information from many language versions, users can be satisfied with the amount of information in Wikipedia articles together with the complementary information, instead of relying on only one language version of Wikipedia articles. In this study, we specifically examine multilingual Wikipedia and propose a method of extracting good quality complementary information from Wikipedia of other languages. Specifically, we compare Wikipedia articles with less information to those with more information. From Wikipedia articles which have the same theme but different languages, we extract different information as complementary information. As described herein, we extract comparison target articles of Wikipedia based on a link graph, because cases exist in which information included in an article is written in multiple pages of different languages. Furthermore, some low-quality information is extracted as complementary information because Wikipedia articles are written by not only good editors but also bad editors such as vandals. We propose a method to calculate the quality of information based on the editors, and we extract good quality complementary information. 0 0
Harnessing Wikipedia semantics for computing contextual relatedness Jabeen S.
Gao X.
Andreae P.
Contextual relatedness
Relatedness measures
Wikipedia hyperlinks
English 2012 This paper proposes a new method of automatically measuring semantic relatedness by exploiting Wikipedia as an external knowledge source. The main contribution of our research is to propose a relatedness measure based on Wikipedia senses and hyperlink structure for computing contextual relatedness of any two terms. We have evaluated the effectiveness of our approach using three datasets and have shown that our approach competes well with other well known existing methods. 0 0
Heuristics- and statistics-based wikification Nguyen H.T.
Cao T.H.
Nguyen T.T.
Vo-Thi T.-L.
English 2012 With the wide usage of Wikipedia in research and applications, disambiguation of concepts and entities to Wikipedia is an essential component in natural language processing. This paper addresses the task of identifying and linking specific words or phrases in a text to their referents described by Wikipedia articles. In this work, we propose a method that combines some heuristics with a statistical model for disambiguation. The method exploits disambiguated entities to disambiguate the others in an incremental process. Experiments are conducted to evaluate and show the advantages of the proposed method. 0 0
How random walks can help tourism Lucchese C.
Perego R.
Silvestri F.
Vahabi H.
Venturini R.
English 2012 On-line photo sharing services allow users to share their touristic experiences. Tourists can publish photos of interesting locations or monuments visited, and they can also share comments, annotations, and even the GPS traces of their visits. By analyzing such data, it is possible to turn colorful photos into metadata-rich trajectories through the points of interest present in a city. In this paper we propose a novel algorithm for the interactive generation of personalized recommendations of touristic places of interest based on the knowledge mined from photo albums and Wikipedia. The distinguishing features of our approach are multiple. First, the underlying recommendation model is built fully automatically in an unsupervised way and it can be easily extended with heterogeneous sources of information. Moreover, recommendations are personalized according to the places previously visited by the user. Finally, such personalized recommendations can be generated very efficiently even on-line from a mobile device. 0 0
How to get around with wikis in teaching Kubincova Z.
Homola M.
English 2012 Wikis were shown to be an interesting and powerful tool in education, supporting tasks starting from project management, collaborative data management, etc., up to more elaborate tasks such as collaborative production of lecture notes, reports, and essays. On the other hand, most wiki software was not created as an educational tool in the first place, hence its application in curricula with groups of students faces some obstacles which need to be dealt with. These include motivating the students to engage with the tool, boosting collaboration between students, supervising and tracking students' activity, and evaluation. A number of tools were developed to enable or ease these tasks for the teacher. This paper takes a look at selected tools developed with this aim with two main goals: to produce a concise list of functionalities that are needed, and to compare and evaluate the tools that are available. 0 0
Implementing an automated ventilation guideline using the semantic wiki KnowWE Hatko R.
Schadler D.
Mersmann S.
Joachim Baumeister
Weiler N.
Frank Puppe
English 2012 In this paper, we report on the experiences made during the implementation of a therapeutic process, i.e. a guideline, for automated mechanical ventilation of patients in intensive care units. The semantic wiki KnowWE was used as a collaborative development platform for domain specialists, knowledge and software engineers, and reviewers. We applied the graphical guideline language DiaFlux to represent medical expertise about mechanical ventilation in a flowchart-oriented manner. Finally, the computerized guideline was embedded seamlessly into a mechanical ventilator for autonomous execution. 0 0
Improving cross-document knowledge discovery using explicit semantic analysis Yan P.
Jin W.
Cross-Document Knowledge Discovery
Document Representation
Knowledge Discovery
Semantic relatedness
English 2012 Cross-document knowledge discovery is dedicated to exploring meaningful (but maybe unapparent) information from a large volume of textual data. The sparsity and high dimensionality of text data present great challenges for representing the semantics of natural language. Our previously introduced Concept Chain Queries (CCQ) was specifically designed to discover semantic relationships between two concepts across documents where relationships found reveal semantic paths linking two concepts across multiple text units. However, answering such queries only employed the Bag of Words (BOW) representation in our previous solution, and therefore terms not appearing in the text literally are not taken into consideration. Explicit Semantic Analysis (ESA) is a novel method proposed to represent the meaning of texts in a higher dimensional space of concepts which are derived from large-scale human built repositories such as Wikipedia. In this paper, we propose to integrate the ESA technique into our query processing, which is capable of using vast knowledge from Wikipedia to complement existing information from text corpus and alleviate the limitations resulted from the BOW representation. The experiments demonstrate the search quality has been greatly improved when incorporating ESA into answering CCQ, compared with using a BOW-based approach. 0 0
Indian school of mines at INEX 2011 snippet retrieval task Pal S.
Tamrakar P.
English 2012 This paper describes the work that we did at Indian School of Mines, Dhanbad towards Snippet Retrieval for INEX 2011. During official submissions, we pre-processed the XML-ified Wikipedia collection to a simplified txt-only version. This collection and the reference document run were used as inputs to a simple Snippet Retrieval system that we developed. We submitted 3 runs to INEX 2011. Post submission we apply a passage retrieval technique based on a Language Modelling approach for snippet retrieval. The performance of our submissions at the INEX SR task was moderate, but promising enough for further exploration. 0 0
Interactive information retrieval algorithm for wikipedia articles Szymanski J. Documents clustering
Information retrieval
Text processing
English 2012 The article presents an algorithm for retrieving textual information in a document collection. The algorithm employs a category system that organizes the repository and, using interaction with the user, improves search precision. The algorithm was implemented for Simple English Wikipedia and the first evaluation results indicate that the proposed method can help to retrieve information from large document repositories. 0 0
Kernel-based logical and relational learning with klog for hedge cue detection Verbeke M.
Frasconi P.
Van Asch V.
Morante R.
Daelemans W.
De Raedt L.
Kernel methods
Natural language learning
Statistical relational learning
English 2012 Hedge cue detection is a Natural Language Processing (NLP) task that consists of determining whether sentences contain hedges. These linguistic devices indicate that authors do not or cannot back up their opinions or statements with facts. This binary classification problem, i.e. distinguishing factual versus uncertain sentences, only recently received attention in the NLP community. We use kLog, a new logical and relational language for kernel-based learning, to tackle this problem. We present results on the CoNLL 2010 benchmark dataset that consists of a set of paragraphs from Wikipedia, one of the domains in which uncertainty detection has become important. Our approach shows competitive results compared to state-of-the-art systems. 0 0
Knowledge pattern extraction and their usage in exploratory search Nuzzolese A.G. English 2012 Knowledge interaction in Web context is a challenging problem. For instance, it requires dealing with complex structures able to filter knowledge by drawing a meaningful context boundary around data. We assume that these complex structures can be formalized as Knowledge Patterns (KPs), aka frames. This Ph.D. work is aimed at developing methods for extracting KPs from the Web and at applying KPs to exploratory search tasks. We want to extract KPs by analyzing the structure of Web links from rich resources, such as Wikipedia. 0 0
LDA-based topic modeling in labeling blog posts with wikipedia entries Daisuke Yokomoto
Makita K.
Suzuki H.
Koike D.
Takehito Utsuro
Kawada Y.
Tomohiro Fukuhara
Topic Analysis
Topic Model
English 2012 Given a search query, most existing search engines simply return a ranked list of search results. However, it is often the case that those search result documents consist of a mixture of documents that are closely related to various contents. In order to address the issue of quickly overviewing the distribution of contents, this paper proposes a framework of labeling blog posts with Wikipedia entries through LDA (latent Dirichlet allocation) based topic modeling. More specifically, this paper applies an LDA-based document model to the task of labelling blog posts with Wikipedia entries. One of the most important advantages of this LDA-based document model is that the collected Wikipedia entries and their LDA parameters heavily depend on the distribution of keywords across all the search result of blog posts. This tendency actually contributes to quickly overviewing the search result of blog posts through the LDA-based topic distribution. In the evaluation of the paper, we also show that the LDA-based document retrieval scheme outperforms our previous approach. 0 0
Leave or stay: The departure dynamics of wikipedia editors Dell Zhang
Karl Prior
Mark Levene
Mao R.
Van Liere D.
English 2012 In this paper, we investigate how Wikipedia editors leave the community, i.e., become inactive, from the following three aspects: (1) how long Wikipedia editors will stay active in editing; (2) which Wikipedia editors are likely to leave; and (3) what reasons would make Wikipedia editors leave. The statistical models built on Wikipedia edit log datasets provide insights about the sustainable growth of Wikipedia. 0 0
Location-based services for technology enhanced learning and teaching Rensing C.
Tittel S.
Steinmetz R.
Location-based learning
Mobile learning
Technology enhanced learning
English 2012 Learning does not only take place in a conventional classroom setting but also during everyday activities such as field trips. The increasing availability of mobile devices and network access opens up new possibilities for providing location-based services which support such learning scenarios. In this paper, we argue the need for providing context aware services for authors of learning content as well as for learners. We present two scenarios and new services which fit these scenarios. The first is an extension of docendo, an open learning content authoring and management platform, to support teachers while creating location-based learning material for field trips. The second service is a mobile application which allows learners to participate in the creation of learning resources by writing a wiki article and to retrieve learning modules from a Semantic MediaWiki using a facetted search. Learner location is one search parameter within the search. 0 0
MOTIF-RE: Motif-based hypernym/hyponym relation extraction from wikipedia links Wei B.
Liu J.
Jun Ma
Zheng Q.
Weinan Zhang
Feng B.
Hypernym/hyponym relation
Network motif
Wikipedia link
English 2012 Hypernym/hyponym relation extraction plays an essential role in taxonomy learning. The conventional methods based on lexico-syntactic patterns or machine learning usually make use of content-related features. In this paper, we find that the proportions of hyperlinks with different semantic types vary markedly in different network motifs. Based on this observation, we propose MOTIF-RE, an algorithm for extracting hypernym/hyponym relations from Wikipedia hyperlinks. The extraction process consists of three steps: 1) Build a directed graph from a set of domain-specific Wikipedia articles. 2) Count the occurrences of hyperlinks in every three-node network motif and create a feature vector for every hyperlink. 3) Train a classifier to identify the semantic relation of hyperlinks. We created three domain-specific Wikipedia article sets to test MOTIF-RE. Experiments on individual datasets show that MOTIF-RE outperforms the baseline algorithm by about 30% in terms of F1-measure. Cross-domain experiments show similar results, which indicates that MOTIF-RE has fairly good domain adaptation ability. 0 0
Malleable finding aids Anderson S.R.
Allen R.B.
Finding Aid
User Engagement
English 2012 We show a prototype implementation of a Wiki-based Malleable Finding Aid that provides features to support user engagement and we discuss the contribution of individual features such as graphical representations, a table of contents, interactive sorting of entries, and the possibility for user tagging. Finally, we explore the implications of Malleable Finding Aids for collections which are richly inter-linked and which support a fully social Archival Commons. 0 0
Mining semantic relations between research areas Osborne F.
Motta E.
Bibliographic Data
Data mining
Empirical Evaluation
Ontology Population
Research Data
Scholarly Ontologies
English 2012 For a number of years now we have seen the emergence of repositories of research data specified using OWL/RDF as representation languages, and conceptualized according to a variety of ontologies. This class of solutions promises both to facilitate the integration of research data with other relevant sources of information and also to support more intelligent forms of querying and exploration. However, an issue which has only been partially addressed is that of generating and characterizing semantically the relations that exist between research areas. This problem has been traditionally addressed by manually creating taxonomies, such as the ACM classification of research topics. However, this manual approach is inadequate for a number of reasons: these taxonomies are very coarse-grained and they do not cater for the fine-grained research topics, which define the level at which typically researchers (and even more so, PhD students) operate. Moreover, they evolve slowly, and therefore they tend not to cover the most recent research trends. In addition, as we move towards a semantic characterization of these relations, there is arguably a need for a more sophisticated characterization than a homogeneous taxonomy, to reflect the different ways in which research areas can be related. In this paper we propose Klink, a new approach to i) automatically generating relations between research areas and ii) populating a bibliographic ontology, which combines both machine learning methods and external knowledge, which is drawn from a number of resources, including Google Scholar and Wikipedia. We have tested a number of alternative algorithms and our evaluation shows that a method relying on both external knowledge and the ability to detect temporal relations between research areas performs best with respect to a manually constructed standard. 0 0
Named entity disambiguation based on explicit semantics Jacala M.
Tvarozek J.
English 2012 In our work we present an approach to the Named Entity Disambiguation based on semantic similarity measure. We employ existing explicit semantics present in datasets such as Wikipedia to construct a disambiguation dictionary and vector-based word model. The analysed documents are transformed into semantic vectors using explicit semantic analysis. The relatedness is computed as cosine similarity between the vectors. The experimental evaluation shows that the proposed approach outperforms traditional approaches such as latent semantic analysis. 0 0
Ontology-based identification of research gaps and immature research areas Beckers K.
Eicker S.
Fassbender S.
Heisel M.
Schmidt H.
Schwittek W.
Facetted search
Knowledge management
Research gaps
English 2012 Researchers often have to understand new knowledge areas, and identify research gaps and immature areas in them. They have to understand and link numerous publications to achieve this goal. This is difficult, because natural language has to be analyzed in the publications, and implicit relations between them have to be discovered. We propose to utilize the structuring possibilities of ontologies to make the relations between publications, knowledge objects (e.g., methods, tools, notations), and knowledge areas explicit. Furthermore, we use Kitchenham's work on structured literature reviews and apply it to the ontology. We formalize relations between objects in the ontology using Codd's relational algebra to support different kinds of literature research. These formal expressions are implemented as ontology queries. Thus, we implement an immature research area analysis and research gap identification mechanism. The ontology and its relations are implemented based on the Semantic MediaWiki+ platform. 0 0
Overview of the INEX 2011 question answering track (QA@INEX) SanJuan E.
Moriceau V.
Tannier X.
Bellot P.
Mothe J.
Automatic summarization
Focus information retrieval
Natural Language Processing
Question answering
Text informativeness
Text readability
English 2012 The INEX QA track aimed to evaluate complex question-answering tasks where answers are short texts generated from the Wikipedia by extraction of relevant short passages and aggregation into a coherent summary. In such a task, question answering, XML/passage retrieval and automatic summarization are combined in order to get closer to real information needs. Based on the groundwork carried out in the 2009-2010 edition to determine the sub-tasks and a novel evaluation methodology, the 2011 edition experimented with contextualizing tweets using a recent cleaned dump of the Wikipedia. Participants had to contextualize 132 tweets from the New York Times (NYT). Informativeness of answers has been evaluated, as well as their readability. 13 teams from 6 countries actively participated in this track. This tweet contextualization task will continue in 2012 as part of the CLEF INEX lab with the same methodology and baseline but on a much wider range of tweet types. 0 0
Probabilistically ranking web article quality based on evolution patterns Jangwhan Han
Chen K.
Jiang D.
English 2012 User-generated content (UGC) is created, updated, and maintained by various web users, and its data quality is a major concern to all users. We observe that each Wikipedia page usually goes through a series of revision stages, gradually approaching a relatively steady quality state and that articles of different quality classes exhibit specific evolution patterns. We propose to assess the quality of a number of web articles using Learning Evolution Patterns (LEP). First, each article's revision history is mapped into a state sequence using the Hidden Markov Model (HMM). Second, evolution patterns are mined for each quality class, and each quality class is characterized by a set of quality corpora. Finally, an article's quality is determined probabilistically by comparing the article with the quality corpora. Our experimental results demonstrate that the LEP approach can capture a web article's quality precisely. 0 0
QA@INEX track 2011: Question expansion and reformulation using the REG summarization system Vivaldi J.
Da Cunha I.
Automatic summarization
Named entities
English 2012 In this paper, our strategy and results for the INEX@QA 2011 question-answering task are presented. In this task, a set of 50 documents is provided by the search engine Indri, using some queries. The initial queries are titles associated with tweets. Reformulation of these queries is carried out using terminological and named entities information. To design the queries, the full process is divided into 2 steps: a) both titles and tweets are POS tagged, and b) queries are expanded or reformulated, using: terms and named entities included in the title, terms and named entities found in the tweet related to those ones, and Wikipedia redirected terms and named entities from those ones included in the title. In our work, the automatic summarization system REG is used to summarize the 50 documents obtained with these queries. The algorithm models a document as a graph to obtain weighted sentences. A single document is generated and it is considered the answer of the query. This strategy, combining summarization and question reformulation, obtains good results regarding informativeness and readability. 0 0
Query directed web page clustering using suffix tree and wikipedia links Jonghun Park
Gao X.
Andreae P.
Document clustering
Semantic distance
Semantic relatedness
English 2012 Recent research on Web page clustering has shown that the user query plays a critical role in guiding the categorisation of web search results. This paper combines our Query Directed Clustering algorithm (QDC) with another existing algorithm, Suffix Tree Clustering (STC), to identify common phrases shared by documents for base cluster identification. One main contribution is the utilising of a new Wikipedia link based measure to estimate the semantic relatedness between query and the base cluster labels, which has shown great promise in identifying the good base clusters. Our experimental results show that the performance is improved by utilising suffix trees and Wikipedia links. 0 0
Query expansion powered by wikipedia hyperlinks Bruce C.
Gao X.
Andreae P.
Jabeen S.
English 2012 This research introduces a new query expansion method that uses Wikipedia and its hyperlink structure to find related terms for reformulating a query. Queries are first understood better by splitting into query aspects. Further understanding is gained through measuring how well each aspect is represented in the original search results. Poorly represented aspects are found to be an excellent source of query improvement. Our main contribution is the way of using Wikipedia to identify aspects and underrepresented aspects, and to weight the expansion terms. Results have shown that our approach improves the original query and search results, and outperforms two existing query expansion methods. 0 0
Query-oriented keyphrase extraction Qiu M.
Yanyan Li
Jian Jiang
Keyphrase extraction
Language model
English 2012 People often issue informational queries to search engines to find out more about some entities or events. While a Wikipedia-like summary would be an ideal answer to such queries, not all queries have a corresponding Wikipedia entry. In this work we propose to study query-oriented keyphrase extraction, which can be used to assist search results summarization. We propose a general method for keyphrase extraction for our task, where we consider both phraseness and informativeness. We discuss three criteria for phraseness and four ways to compute informativeness scores. Using a large Wikipedia corpus and 40 queries, our empirical evaluation shows that using a named entity-based phraseness criterion and a language model-based informativeness score gives the best performance on our task. This method also outperforms two state-of-the-art baseline methods. 0 0
Research on the construction of open education resources based on semantic wiki Mu S.
Xiaodan Zhang
Zuo P.
Open education resources
Resource co-construction
Semantic wiki
English 2012 Since MIT's OpenCourseWare project in 2001, the open education resources movement has been under way for more than ten years. Alongside its fruitful results, some problems of resource construction have also been exposed. Some open education resources projects could not be carried out, or were even forced to stop, for a shortage of personnel or funds. A lack of uniform norms or standards leads to the duplication of resource construction and low resource utilization. Semantic MediaWiki combines the openness, self-organization and collaboration of wikis with the structured knowledge of the Semantic Web, which meets the needs of resource co-construction and sharing in the open education resources movement. In this study, based on the online course Education Information Processing, we explore the application of Semantic MediaWiki to the construction of open education resources. 0 0
SIGA, a system to manage information retrieval evaluations Costa L.
Mota C.
Diana Santos
Information extraction
Information retrieval
Question answering
English 2012 This paper provides an overview of the current version of SIGA, a system that supports the organization of information retrieval (IR) evaluations. SIGA was recently used in Págico, an evaluation contest where both automatic and human participants competed to find answers to 150 topics in the Portuguese Wikipedia, and we describe its new capabilities in this context as well as provide preliminary results from Págico. 0 0
Search for minority information from wikipedia based on similarity of majority information Hattori Y.
Akiyo Nadamoto
English 2012 In this research, we propose a method of searching for minority information, which is less acknowledged and less popular, on the internet. We propose two methods to extract minority information. One is that of calculating relevance of content. The other is based on analogy expression. In this paper, we propose such a minority search system. At this time, we consider it necessary to search for minority information in which a user is interested. Using our proposed system, the user inputs a query which represents their interest in majority information. Then the system searches for minority information that is similar to the majority information provided. Consequently, users can obtain the new information that users do not know and can discover new knowledge and new interests. 0 0
Self organizing maps for visualization of categories Szymanski J.
Duch W.
Categories visualization
Documents categorization
Self organizing maps
English 2012 Visualization of Wikipedia categories using Self Organizing Maps shows an overview of categories and their relations, helping to narrow down search domains. By selecting particular neurons, this approach enables retrieval of conceptually similar categories. Evaluation of neural activations indicates that they form coherent patterns that may be useful for building user interfaces for navigation over category structures. 0 0
Semantic Wikis: Approaches, applications, and perspectives Bry F.
Sebastian Schaffert
Vrandecic D.
Weiand K.
English 2012 In the decade (2001-2011) that has passed since Semantic Wikis were first proposed, systems have been conceived, developed and used for various purposes. This article aims at giving a comprehensive state-of-the-art overview of the research on Semantic Wikis, stressing what makes them easy to use by a wide and possibly inexperienced audience. This article further describes applications and use cases that have driven the research on Semantic Wikis, software techniques, and architectures that have been proposed for Semantic Wikis. Finally, this article suggests possible ways ahead for further research. 0 0
Smart documents for home automation Albanese A. Applescript
Home Automation
Smart Door
Smart Home
System Integration
Wiki Server
English 2012 This work describes the use of well known computer applications to enable smart home users to monitor and control their homes using customized documents. Middleware written in applescript and perl-cgi was used to integrate the computer applications with the OpenWebNet protocol used in home automation. The events triggered by the applications are easily logged by web and mail servers to facilitate diagnostic operations and their archival. This software was tested on the implementation of "The Smart Door Project" to remotely manage door access, monitor the door and archive the events. One of the features is that the door opens after the user receives an e-mail with the magic words "Apriti Sesamo" in the subject field, and "Alibaba" in the text. 0 0
Snip! Andrew Trotman
Crane M.
Snippet generation
English 2012 The University of Otago submitted runs to the Snippet Retrieval Track and the Relevance Feedback tracks at INEX 2011. Snippets were generated using vector space ranking functions, taking into account or ignoring structural hints, and using word clouds. We found that using passages made better snippets than XML elements and that word clouds make bad snippets. In our runs in the Relevance Feedback track we were testing the INEX gateway to C/C++ and blind relevance feedback (with and without stemming). We found that blind relevance feedback with stemming does improve precision in the INEX framework. 0 0
Social recommendation and external resources for book search Deveaud R.
SanJuan E.
Bellot P.
English 2012 In this paper we describe our participation in the INEX 2011 Book Track and present our contributions. This year a brand new collection of documents issued from Amazon was introduced. It is composed of Amazon entries for real books, and their associated user reviews, ratings and tags. We tried a traditional approach for retrieval with two query expansion approaches involving Wikipedia as an external source of information. We also took advantage of the social data with recommendation runs that use user ratings and reviews. Our query expansion approaches did not perform well this year, but modeling the popularity and the interestingness of books based on user opinion achieved encouraging results. We also provide in this paper an insight into the combination of several external resources for contextualizing tweets, as part of the Tweet Contextualization track (former QA track). 0 0
Spamming for science: Active measurement in web 2.0 abuse research West A.G.
Pedram Hayati
Vidyasagar Potdar
Insup Lee
English 2012 Spam and other electronic abuses have long been a focus of computer security research. However, recent work in the domain has emphasized an economic analysis of these operations in the hope of understanding and disrupting the profit model of attackers. Such studies do not lend themselves to passive measurement techniques. Instead, researchers have become middle-men or active participants in spam behaviors; methodologies that lie at an interesting juncture of legal, ethical, and human subject (e.g., IRB) guidelines. In this work two such experiments serve as case studies: One testing a novel link spam model on Wikipedia and another using blackhat software to target blog comments and forums. Discussion concentrates on the experimental design process, especially as influenced by human-subject policy. Case studies are used to frame related work in the area, and scrutiny reveals the computer science community requires greater consistency in evaluating research of this nature. 0 0
The co-creation machine: Managing co-creative processes for the crowd Debenham J.
Simoff S.
English 2012 Co-creative processes have spawned successes such as Wikipedia. They are also used to draw innovative ideas from consumers to producers, and from voters to government. This paper describes the initial stages of a collaboration between two Sydney-based universities to build a customisable co-creative process management system. The system has embedded intelligence that will make it easy and enjoyable to use. It will enable these powerful systems to be quickly deployed on the Internet to the benefit of the universities as well as industry and government. The innovation in the design of this project is that it is founded on normative multiagent systems that are an established technology for (business) process management but have yet to be deployed to support the co-creative process. 0 0
The dicta-sign Wiki: Enabling web communication for the deaf Efthimiou E.
Fotinea S.-E.
Hanke T.
Glauert J.
Bowden R.
Braffort A.
Collet C.
Maragos P.
Lefebvre-Albaret F.
Deaf communication
Deaf user-centred HCI
Multilingual sign language resources
Sign language technologies
English 2012 The paper provides a report on the user-centred showcase prototypes of the DICTA-SIGN project (http://www.dictasign.eu/), an FP7-ICT project which ended in January 2012. DICTA-SIGN researched ways to enable communication between Deaf individuals through the development of human-computer interfaces (HCI) for Deaf users, by means of Sign Language. Emphasis is placed on the Sign-Wiki prototype that demonstrates the potential of sign languages to participate in contemporary Web 2.0 applications where user contributions are editable by an entire community and sign language users can benefit from collaborative editing facilities. 0 0
The semantic web linker: A multilingual and multisource framework La Polla M.N.
Lo Duca A.
Andrea Marchetti
English 2012 In this demonstration we present the Semantic Web Linker (SWL), a framework for helping Named Entity Recognition (NER) procedures. The strength of the SWL is the integration of data coming from different Web sources, such as Wikipedia and DBpedia. The SWL also provides a multilingual repository, in the sense that every entity is associated with its synonyms and translations in many languages. Furthermore, the SWL manages a classification of entities through their hierarchical categorization. The SWL can be browsed through a Web interface. 0 0
TopicExplorer: Exploring document collections with topic models Hinneburg A.
Preiss R.
Schroder R.
Document browser
Topic model
English 2012 The demo presents a prototype, called TopicExplorer, that combines topic modeling, keyword search and visualization techniques to explore a large collection of Wikipedia documents. Topics derived by Latent Dirichlet Allocation are presented by top words. In addition, topics are accompanied by image thumbnails extracted from related Wikipedia documents to aid sense making of derived topics during browsing. Topics are shown in a linear order such that similar topics are close. Topics are mapped to color using that order. The auto-completion of search terms suggests words together with their color coded topics, which allows users to explore the relation between search terms and topics. Retrieved documents are shown with color coded topics as well. Relevant documents and topics found during browsing can be put onto a shortlist. The tool can recommend further documents with respect to the average topic mixture of the shortlist. 0 0
Towards a two-way participatory process Silva A.
Rocha J.G.
Social Media
English 2012 In less than a decade, several millions of articles have been written in Wikipedia and several million roads have been traced out on Open Street Map (OSM). In the meantime, the authorities have still not been able to understand and use the power of crowd sourcing. In this paper, we present the design principles of a new Public Participation Geographic Information System (PPGIS). We aim to eliminate the typical limitations of previous unsuccessful platforms, that have mostly failed due to conceptual design issues. We argue that two fundamental changes must exist in new PPGIS platforms: there is a shift from hierarchies to increased equal rights platforms; improved communication, more transparency, and bi-directionality. The role of the authority in former platforms was really an authoritarian role: having all the power and only partly knowing and controlling the entire platform. This is completely different from the crowd source platforms we know to be successful. So, one fundamental change is to diminish hierarchies and prevent people from hiding themselves behind the institution. The second major conceptual design issue is related to transparency and communication. While former platforms use mechanisms to prevent citizens from seeing each other's participation, we aim to enable people to see the participation of others. That's a fundamental feature in social networks. We will also design it to be a two-way communication platform. If citizens are requested to participate, the administration must use the same platform to communicate with them. Not only to provide feedback, but also to publish useful information for the citizen. In this paper we describe how social media meets our design principles. We decide to implement our case study, the "Fix my Street" application, on top of a social engine, to take advantage of all social media features. Two necessary extensions to the social engine are briefly described, to capture the core logic of our application. 0 0
Using information extraction to generate trigger questions for academic writing support Liu M.
Calvo R.A.
Academic Writing Support
Information extraction
Question Generation
English 2012 Automated question generation approaches have been proposed to support reading comprehension. However, these approaches are not suitable for supporting writing activities. We present a novel approach to generate different forms of trigger questions (directive and facilitative) aimed at supporting deep learning. Useful semantic information from Wikipedia articles is extracted and linked to the key phrases in a student's literature review, particularly focusing on extracting information containing 3 types of relations (Kind of, Similar-to and Different-to) by using syntactic pattern matching rules. We collected literature reviews from 23 Engineering research students, and evaluated the quality of 306 computer generated questions and 115 generic questions. Facilitative questions are more useful when it comes to deep learning about the topic, while directive questions are clearer and useful for improving the composition. 0 0
Using lexical and thematic knowledge for name disambiguation Wang J.
Zhao W.X.
Yan R.
Wei H.
Nie J.-Y.
Li X.
Lexical and thematic knowledge
Name disambiguation
English 2012 In this paper we present a novel approach to disambiguate names based on two different types of semantic information: lexical and thematic. We propose to use translation-based language models to resolve the synonymy problem in every word match, and to use topic-based ranking function to capture rich thematic contexts for names. We test three ranking functions that combine lexical relatedness and thematic relatedness. The experiments on Wikipedia data set and TAC-KBP 2010 data set show that our proposed method is very effective for name disambiguation. 0 0
Using web mining for discovering spatial patterns and hot spots for spatial generalization Burdziej J.
Piotr Gawrysiak
Spatial generalization
Spatial patterns
Web mining
English 2012 In this paper we propose a novel approach to spatial data generalization, in which web user behavior information influences the generalization and mapping process. Our approach relies on combining usage information from web resources such as Wikipedia with search engines index statistics in order to determine an importance score for geographical objects that is used during map preparation. 0 0
Using wikipedia anchor text and weighted clustering coefficient to enhance the traditional multi-document summarization Kumar N.
Srinathan K.
Vasudeva Varma
Multi-document summarization
Page rank
Sentence clusters
Weighted clustering coefficient
Wikipedia anchor text
English 2012 Similar to the traditional approach, we consider the task of summarization as selection of top ranked sentences from ranked sentence-clusters. To achieve this goal, we rank the sentence clusters by using the importance of words calculated by using page rank algorithm on reverse directed word graph of sentences. Next, to rank the sentences in every cluster we introduce the use of weighted clustering coefficient. We use page rank score of words for calculation of weighted clustering coefficient. Finally the most important issue is the presence of a lot of noisy entries in the text, which downgrades the performance of most of the text mining algorithms. To solve this problem, we introduce the use of Wikipedia anchor text based phrase mapping scheme. Our experimental results on DUC-2002 and DUC-2004 dataset show that our system performs better than unsupervised systems and better than/comparable with novel supervised systems of this area. 0 0
What makes corporate wikis work? wiki affordances and their suitability for corporate knowledge work Yeo M.L.
Ofer Arazy
Knowledge management
English 2012 Wikis were originally intended for knowledge work in the open Internet environment, and there seems to be an inherent tension between wikis' affordances and the nature of knowledge work in organizations. The objective of this paper is to investigate how tailoring wikis to corporate settings would impact users' wiki activity. We begin by synthesizing prior works on wikis' design principles, identifying several areas where we anticipate high tension between wikis' affordances and organizational work practices. We put forward five propositions regarding how changes in corporate wikis' deployment procedures may impact users' wiki activity. An empirical study in one multi-national organization tested users' perceptions towards these propositions, revealing that in some cases there may be a need for modifying wikis' design, while in other cases corporations may wish to change their knowledge work practices to align with wikis' affordances. 0 0
Wiki - A useful tool to fight classroom cheating? Putnik Z.
Ivanovic M.
Budimac Z.
Samuelis L.
English 2012 As a part of the activities of the Chair of Computer Science, Department of Mathematics and Informatics, various types of eLearning activities have been applied for the last eight years. Using the open source LMS Moodle, we started with a simple repository of learning resources, went on to the creation of eLessons, quizzes, and glossaries, and recently also started using elements of Web 2.0 in teaching. Usage of forums, blogs, and wikis as simulation of classroom activities proved to be very successful and our students welcomed these trends. What we didn't expect, but what we gladly embraced, was the fact that usage of Wikis helped us fight cheating in teamwork assignment solving. Namely, the practice of applying teamwork in several courses was spoiled by students who didn't do their part of the task. Yet, their teammates covered for them and only later, within a survey about their satisfaction with the course, complained about the fact. Usage of Wikis for assignment solving, combined with the ability of LMS Moodle to reveal all of the activities and history of changes, enabled us to separate actual doers and non-doers for each of the assignments, to the satisfaction of both teachers and students. 0 0
Wiki refactoring as mind map reshaping Gorka Puente
Diaz O.
Mind maps
English 2012 Wikis' organic growth inevitably leads to wiki degradation and the need for regular wiki refactoring. So far, wiki refactoring is a manual, time-consuming and error-prone activity. We strive to ease wiki refactoring by using mind maps as a graphical representation of the wiki structure, and mind map manipulations as a way to express refactoring. This paper (i) defines the semantics of common refactoring operations based on Wikipedia best practices, (ii) advocates for the use of mind maps as a visualization of wikis for refactoring, and (iii) introduces a DSL for wiki refactoring built on top of FreeMind, a mind mapping tool. Thus, wikis are depicted as FreeMind maps, and map manipulations are interpreted as refactoring operations over the wiki. The rationales for the use of a DSL are based not only on reliability grounds but also on facilitating end-user participation. 0 0
WikiSent: Weakly supervised sentiment analysis through extractive summarization with Wikipedia Saswati Mukherjee
Prantik Bhattacharyya
Information extraction
Sentiment Analysis
Text mining
Weakly Supervised System
English 2012 This paper describes a weakly supervised system for sentiment analysis in the movie review domain. The objective is to classify a movie review into a polarity class, positive or negative, based on those sentences bearing opinion on the movie alone, leaving out other irrelevant text. Wikipedia incorporates the world knowledge of movie-specific features in the system which is used to obtain an extractive summary of the review, consisting of the reviewer's opinions about the specific aspects of the movie. This filters out the concepts which are irrelevant or objective with respect to the given movie. The proposed system, WikiSent, does not require any labeled data for training. It achieves a better or comparable accuracy to the existing semi-supervised and unsupervised systems in the domain, on the same dataset. We also perform a general movie review trend analysis using WikiSent. 0 0
Wikimantic: Disambiguation for short queries Boston C.
Carberry S.
Fang H.
Short queries
English 2012 This paper presents an implemented and evaluated methodology for disambiguating terms in search queries. By exploiting Wikipedia articles and their reference relations, our method is able to disambiguate terms in particularly short queries with few context words. This work is part of a larger project to retrieve information graphics in response to user queries. 0 0
Wikipedia revision graph extraction based on n-gram cover Wu J.
Mizuho Iwaihara
Mass collaboration
Wikipedia revision graph
English 2012 During the past decade, mass collaboration systems have emerged and thrived on the World-Wide Web, with numerous user contents generated. As one such system, Wikipedia allows users to add and edit articles in this encyclopedic knowledge base, and piles of revisions have been contributed. Wikipedia maintains a linear record of edit history with a timestamp for each article, which includes precious information on how each article has evolved. However, meaningful revision evolution features like branching and revert are implicit and need to be reconstructed. Also, the existence of merges from multiple ancestors indicates that the edit history should be modeled as a directed acyclic graph. To address these issues, we propose a revision graph extraction method based on n-gram cover that effectively finds branching and revert. We evaluate the accuracy of our method by comparing with manually constructed revision graphs. 0 0
Wikipedia-based document categorization Ciesielski K.
Borkowski P.
Klopotek M.A.
Trojanowski K.
Wysocki K.
English 2012 A novel method of text categorization for Polish language documents, based on Polish Wikipedia resources, is presented. The distinctive feature of the approach is that document labelling can be performed with no additional categorized corpora. Experiments with two different types of document semantic disambiguation have been performed and evaluated according to several quality metrics. 0 0
Words context analysis for improvement of information retrieval Szymanski J. Information retrieval
Semantic indexes
Text indexing
English 2012 In the article we present an approach to improving information retrieval from large text collections using word context vectors. The vectors have been created by analyzing English Wikipedia with the Hyperspace Analogue to Language model of word similarity. For test phrases we evaluate retrieval with direct user queries as well as retrieval with context vectors of these queries. The results indicate that the proposed method cannot replace retrieval based on direct user queries but it can be used for refining the search results. 0 0
Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features B. Thomas Adler
Luca de Alfaro
Santiago M. Mola Velasco
Paolo Rosso
Andrew G. West
Machine learning
Natural Language Processing
English February 2011 Wikipedia is an online encyclopedia which anyone can edit. While most edits are constructive, about 7% are acts of vandalism. Such behavior is characterized by modifications made in bad faith; introducing spam and other inappropriate content. In this work, we present the results of an effort to integrate three of the leading approaches to Wikipedia vandalism detection: a spatio-temporal analysis of metadata (STiki), a reputation-based system (WikiTrust), and natural language processing features. The performance of the resulting joint system improves the state-of-the-art from all previous methods and establishes a new baseline for Wikipedia vandalism detection. We examine in detail the contribution of the three approaches, both for the task of discovering fresh vandalism, and for the task of locating vandalism in the complete set of Wikipedia revisions. 0 1
A DSL for corporate wiki initialization English 2011 0 0
A bounded confidence approach to understanding user participation in peer production systems Ciampaglia G.L. English 2011 Commons-based peer production does seem to rest upon a paradox. Although users produce all contents, at the same time participation is commonly on a voluntary basis, and largely incentivized by achievement of project's goals. This means that users have to coordinate their actions and goals, in order to keep themselves from leaving. While this situation is easily explainable for small groups of highly committed, like-minded individuals, little is known about large-scale, heterogeneous projects, such as Wikipedia. In this contribution we present a model of peer production in a large online community. The model features a dynamic population of bounded confidence users, and an endogenous process of user departure. Using global sensitivity analysis, we identify the most important parameters affecting the lifespan of user participation. We find that the model presents two distinct regimes, and that the shift between them is governed by the bounded confidence parameter. For low values of this parameter, users depart almost immediately. For high values, however, the model produces a bimodal distribution of user lifespan. These results suggest that user participation to online communities could be explained in terms of group consensus, and provide a novel connection between models of opinion dynamics and commons-based peer production. 0 0
A resource-based method for named entity extraction and classification Gamallo P.
Garcia M.
English 2011 We propose a resource-based Named Entity Classification (NEC) system, which combines named entity extraction with simple language-independent heuristics. Large lists (gazetteers) of named entities are automatically extracted making use of semi-structured information from Wikipedia, namely infoboxes and category trees. Language-independent heuristics are used to disambiguate and classify entities that have been already identified (or recognized) in text. We compare the performance of our resource-based system with that of a supervised NEC module implemented for the FreeLing suite, which was the winning system in the CoNLL-2002 competition. Experiments were performed over Portuguese text corpora taking into account several domains and genres. 0 0
A survey on web archiving initiatives Gomes D.
Miranda J.
Costa M.
English 2011 Web archiving has been gaining interest and recognized importance for modern societies around the world. However, for web archivists it is frequently difficult to demonstrate this fact, for instance, to funders. This study provides an updated and global overview of web archiving. The obtained results showed that the number of web archiving initiatives significantly grew after 2003 and they are concentrated on developed countries. We statistically analyzed metrics, such as, the volume of archived data, archive file formats or number of people engaged. Web archives all together must process more data than any web search engine. Considering the complexity and large amounts of data involved in web archiving, the results showed that the assigned resources are scarce. A Wikipedia page was created to complement the presented work and be collaboratively kept up-to-date by the community. 3 0
A web 2.0 approach for organizing search results using Wikipedia Darvish Morshedi Hosseini M.
Shakery A.
Moshiri B.
Search result Organization
English 2011 Most current search engines return a ranked list of results in response to the user's query. This simple approach may require the user to go through a long list of results to find the documents related to his information need. A common alternative is to cluster the search results and allow the user to browse the clusters, but this also imposes two challenges: 'how to define the clusters' and 'how to label the clusters in an informative way'. In this study, we propose an approach which uses Wikipedia as the source of information to organize the search results and addresses these two challenges. In response to a query, our method extracts a hierarchy of categories from Wikipedia pages and trains classifiers using web pages related to these categories. The search results are organized in the extracted hierarchy using the learned classifiers. Experiment results confirm the effectiveness of the proposed approach. 0 0
An exploratory study of navigating Wikipedia semantically: Model and application Wu I.-C.
Lin Y.-S.
Liu C.-H.
Normalized Google Distance
SNA-based summary
English 2011 Due to the popularity of link-based applications like Wikipedia, one of the most important issues in online research is how to alleviate information overload on the World Wide Web (WWW) and facilitate effective information-seeking. To address the problem, we propose a semantically-based navigation application that is based on the theories and techniques of link mining, semantic relatedness analysis and text summarization. Our goal is to develop an application that assists users in efficiently finding the related subtopics for a seed query and then quickly checking the content of articles. We establish a topic network by analyzing the internal links of Wikipedia and applying the Normalized Google Distance algorithm in order to quantify the strength of the semantic relationships between articles via key terms. To help users explore and read topic-related articles, we propose a SNA-based summarization approach to summarize articles. To visualize the topic network more efficiently, we develop a semantically-based WikiMap to help users navigate Wikipedia effectively. 0 0
An iterative clustering method for the XML-mining task of the INEX 2010 Tovar M.
Cruz A.
Vazquez B.
Pinto D.
Vilarino D.
Montes A.
English 2011 In this paper we propose two iterative clustering methods for grouping Wikipedia documents of a given huge collection into clusters. The recursive method iteratively clusters subsets of the complete collection. In each iteration, we select representative items for each group, which are then used for the next stage of clustering. The presented approaches are scalable algorithms which may be used with huge collections that would otherwise (for instance, using the classic clustering methods) be computationally expensive to cluster. The obtained results outperformed the random baseline presented in the INEX 2010 clustering task of the XML-Mining track. 0 0
Assessments in large- and small-scale wiki collaborative learning environments: Recommendations for educators and wiki designers Portia Pusey
Gabriele Meiselwitz
Wiki Learning
Wiki Learning Environment
English 2011 This paper discusses assessment practice when wikis are used as learning environments in higher education. Wikis are simple online information systems which often serve user communities. In higher education, wikis have been used in a supporting function to traditional courses; however, there is little research on wikis taking on a larger role as learning environments and even less research on assessment practice for these learning environments. This paper reports on the assessment techniques for large- and small-scale learning environments. It explores the barriers to assessment described in the studies. The paper concludes with a proposal of five improvements to the wiki engine which could facilitate assessment when wikis are used as learning environments in higher education. 0 0
Automated construction of domain ontology taxonomies from wikipedia Juric D.
Banek M.
Skocir Z.
English 2011 The key step for implementing the idea of the Semantic Web into a feasible system is providing a variety of domain ontologies that are constructed on demand, in an automated manner and in a very short time. In this paper we introduce an unsupervised method for constructing domain ontology taxonomies from Wikipedia. The benefit of using Wikipedia as the source is twofold: first, the Wikipedia articles are concise and have a particularly high "density" of domain knowledge; second, the articles represent a consensus of a large community, thus avoiding term disagreements and misinterpretations. The taxonomy construction algorithm, aimed at finding the subsumption relation, is based on two different techniques, which both apply linguistic parsing: analyzing the first sentence of each Wikipedia article and processing the categories associated with the article. The method has been evaluated against human judgment for two independent domains and the experimental results have proven its robustness and high precision. 0 0
Automatic semantic web annotation of named entities Charton E.
Marie-Pierre Gagnon
Ozell B.
English 2011 This paper describes a method to perform automated semantic annotation of named entities contained in large corpora. The semantic annotation is made in the context of the Semantic Web. The method is based on an algorithm that compares the set of words that appear before and after the named entity with the content of Wikipedia articles, and identifies the most relevant one by means of a similarity measure. It then uses the link that exists between the selected Wikipedia entry and the corresponding RDF description in the Linked Data project to establish a connection between the named entity and some URI in the Semantic Web. We present our system, discuss its architecture, and describe an algorithm dedicated to ontological disambiguation of named entities contained in large-scale corpora. We evaluate the algorithm, and present our results. 0 0
Capability modeling of knowledge-based agents for commonsense knowledge integration Kuo Y.-L.
Hsu J.Y.-J.
Agent description
Capability model
Common sense
Commonsense knowledge integration
Multi-agent system
English 2011 Robust intelligent systems require commonsense knowledge. While significant progress has been made in building large commonsense knowledge bases, they are intrinsically incomplete. It is difficult to combine multiple knowledge bases due to their different choices of representation and inference mechanisms, thereby limiting users to one knowledge base and its reasoning methods for any specific task. This paper presents a multi-agent framework for commonsense knowledge integration, and proposes an approach to capability modeling of knowledge bases without a common ontology. The proposed capability model provides a general description of large heterogeneous knowledge bases, such that contents accessible by the knowledge-based agents may be matched up against specific requests. The concept correlation matrix of a knowledge base is transformed into a k-dimensional vector space using low-rank approximation for dimensionality reduction. Experiments are performed with the matchmaking mechanism for commonsense knowledge integration framework using the capability models of ConceptNet, WordNet, and Wikipedia. In the user study, the matchmaking results are compared with the ranked lists produced by online users to show that over 85% of them are accurate and have positive correlation with the user-produced ranked lists. 0 0
Categorization of wikipedia articles with spectral clustering Szymanski J. English 2011 The article reports the application of clustering algorithms for creating hierarchical groups within Wikipedia articles. We evaluate three spectral clustering algorithms based on datasets constructed using Wikipedia categories. The selected algorithm has been implemented in a system that categorizes Wikipedia search results on the fly. 0 0
Citizens as database: Conscious ubiquity in data collection Richter K.-F.
Winter S.
English 2011 Crowd sourcing [1], citizens as sensors [2], user-generated content [3,4], or volunteered geographic information [5] describe a relatively recent phenomenon that points to dramatic changes in our information economy. Users of a system, who often are not trained in the matter at hand, contribute data that they collected without a central authority managing or supervising the data collection process. The individual approaches vary and cover a spectrum from conscious user actions ('volunteered') to passive modes ('citizens as sensors'). Volunteered user-generated content is often used to replace existing commercial or authoritative datasets, for example, Wikipedia as an open encyclopaedia, or OpenStreetMap as an open topographic dataset of the world. Other volunteered content exploits the rapid update cycles of such mechanisms to provide improved services. For example, fixmystreet.com reports damages related to streets; Google, TomTom and other dataset providers encourage their users to report updates of their spatial data. In some cases, the database itself is the service; for example, Flickr allows users to upload and share photos. At the passive end of the spectrum, data mining methods can be used to further elicit hidden information out of the data. Researchers identified, for example, landmarks defining a town from Flickr photo collections [6], and commercial services track anonymized mobile phone locations to estimate traffic flow and enable real-time route planning. 0 0
Coherence progress: A measure of interestingness based on fixed compressors Schaul T.
Pape L.
Glasmachers T.
Graziano V.
Schmidhuber J.
English 2011 The ability to identify novel patterns in observations is an essential aspect of intelligence. In a computational framework, the notion of a pattern can be formalized as a program that uses regularities in observations to store them in a compact form, called a compressor. The search for interesting patterns can then be stated as a search to better compress the history of observations. This paper introduces coherence progress, a novel, general measure of interestingness that is independent of its use in a particular agent and the ability of the compressor to learn from observations. Coherence progress considers the increase in coherence obtained by any compressor when adding an observation to the history of observations thus far. Because of its applicability to any type of compressor, the measure allows for an easy, quick, and domain-specific implementation. We demonstrate the capability of coherence progress to satisfy the requirements for qualitatively measuring interestingness on a Wikipedia dataset. 0 0
Collaborative sensemaking during admin permission granting in Wikipedia Katie Derthick
Patrick Tsao
Travis Kriplean
Alan Borning
Mark Zachry
David W. McDonald
Collaboration software
Contributor systems
System administration
English 2011 A self-governed, open contributor system such as Wikipedia depends upon those who are invested in the system to participate as administrators. Processes for selecting which system contributors will be allowed to assume administrative roles in such communities have developed in the last few years as these systems mature. However, little is yet known about such processes, which are becoming increasingly important for the health and maintenance of contributor systems that are becoming increasingly important in the knowledge economy. This paper reports the results of an exploratory study of how members of the Wikipedia community engage in collaborative sensemaking when deciding which members to advance to admin status. 0 0
Combining heterogeneous knowledge resources for improved distributional semantic models Szarvas G.
Torsten Zesch
Iryna Gurevych
English 2011 The Explicit Semantic Analysis (ESA) model based on term cooccurrences in Wikipedia has been regarded as a state-of-the-art semantic relatedness measure in recent years. We provide an analysis of the important parameters of ESA using datasets in five different languages. Additionally, we propose the use of ESA with multiple lexical semantic resources, thus exploiting multiple evidence of term cooccurrence to improve over the Wikipedia-based measure. Exploiting the improved robustness and coverage of the proposed combination, we report improved performance over single resources in word semantic relatedness, solving word choice problems, classification of semantic relations between nominals, and text similarity. 0 0
Cross-language information retrieval with latent topic models trained on a comparable corpus Vulic I.
De Smet W.
Moens M.-F.
Comparable corpora
Cross-language retrieval
Document models
Multilingual retrieval
Topic models
English 2011 In this paper we study cross-language information retrieval using a bilingual topic model trained on comparable corpora such as Wikipedia articles. The bilingual Latent Dirichlet Allocation model (BiLDA) creates an interlingual representation, which can be used as a translation resource in many different multilingual settings as comparable corpora are available for many language pairs. The probabilistic interlingual representation is incorporated in a statistical language model for information retrieval. Experiments performed on the English and Dutch test datasets of the CLEF 2001-2003 CLIR campaigns show the competitive performance of our approach compared to cross-language retrieval methods that rely on pre-existing translation dictionaries that are hand-built or constructed based on parallel corpora. 0 0
Cross-lingual recommendations in a resource-based learning scenario Schmidt S.
Scholl P.
Rensing C.
Steinmetz R.
Cross-Language Semantic Relatedness
Explicit Semantic Analysis
Reference Corpus
English 2011 CROKODIL is a platform supporting resource-based learning scenarios for self-directed, on-task learning with web resources. As CROKODIL enables the forming of possibly large learning communities, the stored data is growing on a large scale. Thus, an appropriate recommendation of tags and learning resources becomes increasingly important for supporting learners. We propose semantic relatedness between tags and resources as a basis of recommendation and identify Explicit Semantic Analysis (ESA) using Wikipedia as reference corpus as a viable option. However, data from CROKODIL shows that tags and resources are often composed in different languages, so a monolingual approach to providing recommendations is not applicable in CROKODIL. We therefore examine strategies for providing mappings between different languages, extending ESA to provide cross-lingual capabilities. Specifically, we present mapping strategies that utilize additional semantic information contained in Wikipedia. Based on CROKODIL's application scenario, we present an evaluation design and show results of cross-lingual ESA. 0 0
Crowd-based data sourcing (Abstract) Milo T. English 2011 Harnessing a crowd of Web users for the collection of mass data has recently become a wide-spread phenomenon [9]. Wikipedia [20] is probably the earliest and best known example of crowd-sourced data and an illustration of what can be achieved with a crowd-based data sourcing model. Other examples include social tagging systems for images, which harness millions of Web users to build searchable databases of tagged images; traffic information aggregators like Waze [17]; and hotel and movie ratings like TripAdvisor [19] and IMDb [18]. 0 0
Discovering context: Classifying tweets through a semantic transform based on wikipedia Yegin Genc
Yasuaki Sakamoto
Nickerson J.V.
Latent semantic analysis
Text classification
English 2011 By mapping messages into a large context, we can compute the distances between them, and then classify them. We test this conjecture on Twitter messages: Messages are mapped onto their most similar Wikipedia pages, and the distances between pages are used as a proxy for the distances between messages. This technique yields more accurate classification of a set of Twitter messages than alternative techniques using string edit distance and latent semantic analysis. 0 0
Enhancing document snippets using temporal information Alonso O.
Gertz M.
Baeza-Yates R.
English 2011 In this paper we propose an algorithm to enhance the quality of document snippets shown in a search engine by using temporal expressions. We evaluate our proposal in a subset of the Wikipedia corpus using crowdsourcing, showing that snippets that have temporal information are preferred by the users. 0 0
Exploiting unlabeled data for question classification Tomas D.
Claudio Giuliano
Kernel methods
Question classification
Semi-supervised learning
English 2011 In this paper, we introduce a kernel-based approach to question classification. We employed a kernel function based on latent semantic information acquired from Wikipedia. This kernel allows including external semantic knowledge into the supervised learning process. We obtained a highly effective question classifier combining this knowledge with a bag-of-words approach by means of composite kernels. As the semantic information is acquired from unlabeled text, our system can be easily adapted to different languages and domains. We tested it on a parallel corpus of English and Spanish questions. 0 0
External query reformulation for text-based image retrieval Min J.
Jones G.J.F.
English 2011 In text-based image retrieval, the Incomplete Annotation Problem (IAP) can greatly degrade retrieval effectiveness. A standard method used to address this problem is pseudo relevance feedback (PRF) which updates user queries by adding feedback terms selected automatically from top ranked documents in a prior retrieval run. PRF assumes that the target collection provides enough feedback information to select effective expansion terms. This is often not the case in image retrieval since images often only have short metadata annotations leading to the IAP. Our work proposes the use of an external knowledge resource (Wikipedia) in the process of refining user queries. In our method, Wikipedia documents strongly related to the terms in user query ("definition documents") are first identified by title matching between the query and titles of Wikipedia articles. These definition documents are used as indicators to re-weight the feedback documents from an initial search run on a Wikipedia abstract collection using the Jaccard coefficient. The new weights of the feedback documents are combined with the scores rated by different indicators. Query-expansion terms are then selected based on these new weights for the feedback documents. Our method is evaluated on the ImageCLEF WikipediaMM image retrieval task using text-based retrieval on the document metadata fields. The results show significant improvement compared to standard PRF methods. 0 0
Extracting events from Wikipedia as RDF triples linked to widespread semantic web datasets Carlo Aliprandi
Francesco Ronzano
Andrea Marchetti
Maurizio Tesconi
Salvatore Minutoli
Knowledge Extraction
Knowledge representation
Natural Language Processing
Semantic web
English 2011 Many attempts have been made to extract structured data from Web resources, exposing them as RDF triples and interlinking them with other RDF datasets: in this way it is possible to create clouds of highly integrated Semantic Web data collections. In this paper we describe an approach to enhance the extraction of semantic contents from unstructured textual documents, in particular considering Wikipedia articles and focusing on event mining. Starting from the deep parsing of a set of English Wikipedia articles, we produce a semantic annotation compliant with the Knowledge Annotation Format (KAF). We extract events from the KAF semantic annotation and then we structure each event as a set of RDF triples linked to both DBpedia and WordNet. We point out examples of automatically mined events, providing some general evaluation of how our approach may discover new events and link them to existing contents. 0 0
Focus and element length for book and Wikipedia retrieval Jaap Kamps
Marijn Koolen
English 2011 In this paper we describe our participation in INEX 2010 in the Ad Hoc Track and the Book Track. In the Ad Hoc Track we investigate the impact of propagated anchor-text on article level precision and the impact of an element length prior on the within-document precision and recall. Using the article ranking of a document level run for both document and focused retrieval techniques, we find that focused retrieval techniques clearly outperform document retrieval, especially for the Focused and Restricted Relevant in Context Tasks, which limit the amount of text that can be returned per topic and per article respectively. Somewhat surprisingly, an element length prior increases within-document precision even when we restrict the amount of retrieved text to only 1000 characters per topic. The query-independent evidence of the length prior can help locate elements with a large fraction of relevant text. For the Book Track we look at the relative impact of retrieval units based on whole books, individual pages and multiple pages. 0 0
Graph-based named entity linking with Wikipedia Ben Hachey
Will Radford
Curran J.R.
Entity resolution
Text mining
Web intelligence
English 2011 Named entity linking (NEL) grounds entity mentions to their corresponding Wikipedia article. State-of-the-art supervised NEL systems use features over the rich Wikipedia document and link-graph structure. Graph-based measures have been effective over WordNet for word sense disambiguation (WSD). We draw parallels between NEL and WSD, motivating our unsupervised NEL approach that exploits the Wikipedia article and category link graphs. Our system achieves 85.5% accuracy on the TAC 2010 shared task - competitive with the best supervised and unsupervised systems. 0 0
High-order co-clustering text data on semantics-based representation model Liping Jing
Jiali Yun
Jian Yu
Jiao-Sheng Huang
High-order co-clustering
Representation Model
Text mining
English 2011 The language modeling approach has been widely used to improve the performance of text mining in recent years because of its solid theoretical foundation and empirical effectiveness. In essence, this approach centers on the issue of estimating an accurate model by choosing appropriate language models as well as smoothing techniques. Semantic smoothing, which incorporates semantic and contextual information into the language models, is effective and potentially significant to improve the performance of text mining. In this paper, we propose a high-order structure to represent text data by incorporating background knowledge from Wikipedia. The proposed structure consists of three types of objects: term, document and concept. Moreover, we first combine the high-order co-clustering algorithm with the proposed model to simultaneously cluster documents, terms and concepts. Experimental results on benchmark data sets (20Newsgroups and Reuters-21578) have shown that our proposed high-order co-clustering on high-order structure outperforms the general co-clustering algorithm on bipartite text data, such as document-term, document-concept and document-(term+concept). 0 0
Hybrid and interactive domain-specific translation for multilingual access to digital libraries Jones G.J.F.
Fuller M.
Newman E.
YanChun Zhang
English 2011 Accurate high-coverage translation is a vital component of reliable cross language information retrieval (CLIR) systems. This is particularly true for retrieval from archives such as Digital Libraries which are often specific to certain domains. While general machine translation (MT) has been shown to be effective for CLIR tasks in laboratory information retrieval evaluation tasks, it is generally not well suited to specialized situations where domain-specific translations are required. We demonstrate that effective query translation in the domain of cultural heritage (CH) can be achieved using a hybrid translation method which augments a standard MT system with domain-specific phrase dictionaries automatically mined from Wikipedia. We further describe the use of these components in a domain-specific interactive query translation service. The interactive system selects the hybrid translation by default, with other possible translations being offered to the user interactively to enable them to select alternative or additional translation(s). The objective of this interactive service is to provide user control of translation while maximising translation accuracy and minimizing the translation effort of the user. Experiments using our hybrid translation system with sample query logs from users of CH websites demonstrate a large improvement in the accuracy of domain-specific phrase detection and translation. 0 0
ITEM: Extract and integrate entities from tabular data to RDF knowledge base Guo X.
Yirong Chen
Jilin Chen
Du X.
Entity Extraction
RDF Knowledge Base
Schema Mapping
English 2011 Many RDF Knowledge Bases are created and enlarged by mining and extracting web data. Hence their data sources are limited to social tagging networks, such as Wikipedia, WordNet, IMDB, etc., and their precision is not guaranteed. In this paper, we propose a new system, ITEM, for extracting and integrating entities from tabular data into an RDF knowledge base. ITEM can efficiently compute the schema mapping between a table and a KB, and inject novel entities into the KB. Therefore, ITEM can enlarge and improve an RDF KB by employing tabular data, which is assumed to be of high quality. ITEM detects the schema mapping between a table and an RDF KB only by tuples, rather than the table's schema information. Experimental results show that our system has high precision and good performance. 0 0
Improving query expansion for image retrieval via saliency and picturability Leong C.W.
Hassan S.
Ruiz M.E.
Rada Mihalcea
English 2011 In this paper, we present a Wikipedia-based approach to query expansion for the task of image retrieval, by combining salient encyclopaedic concepts with the picturability of words. Our model generates the expanded query terms in a definite two-stage process instead of multiple iterative passes, requires no manual feedback, and is completely unsupervised. Preliminary results show that our proposed model is effective in a comparative study on the ImageCLEF 2010 Wikipedia dataset. 0 0
Informative sentence retrieval for domain specific terminologies Koh J.-L.
Cho C.-W.
Definitional question answering
Information retrieval
Sentence retrieval
English 2011 Domain specific terminologies represent important concepts when students study a subject. If the sentences which describe important concepts related to a terminology can be accessed easily, students will understand the semantics represented in the sentences which contain the terminology in depth. In this paper, an effective sentence retrieval system is provided to search informative sentences of a domain-specific terminology from the electrical books. A term weighting model is constructed in the proposed system by using web resources, including Wikipedia and FOLDOC, to measure the degree of a word relative to the query terminology. Then the relevance score of a sentence is estimated by summing the weights of the words in the sentence, which is used to rank the candidate answer sentences. By adopting the proposed method, the obtained answer sentences are not limited to certain sentence patterns. The experimental results show that the ranked list of answer sentences retrieved by our proposed system has higher NDCG values than the typical IR approach and the pattern-matching based approach. 0 0
Knowledge transfer across multilingual corpora via latent topics De Smet W.
Tang J.
Moens M.-F.
Cross-lingual knowledge transfer
Latent topic models
Text categorization
English 2011 This paper explores bridging the content of two different languages via latent topics. Specifically, we propose a unified probabilistic model to simultaneously model latent topics from bilingual corpora that discuss comparable content and use the topics as features in a cross-lingual, dictionary-less text categorization task. Experimental results on multilingual Wikipedia data show that the proposed topic model effectively discovers the topic information from the bilingual corpora, and the learned topics successfully transfer classification knowledge to other languages, for which no labeled training data are available. 0 0
LIA at INEX 2010 book track Deveaud R.
Boudin F.
Bellot P.
English 2011 In this paper we describe our participation and present our contributions in the INEX 2010 Book Track. Digitized books are now a common source of information on the Web; however, OCR sometimes introduces errors that can penalize Information Retrieval. We propose a method for correcting hyphenations in the books and we analyse its impact on the Best Books for Reference task. The observed improvement is around 1%. This year we also experimented with different query expansion techniques. The first one consists of selecting informative words from a Wikipedia page related to the topic. The second one uses a dependency parser to enrich the query with the detected phrases using a Markov Random Field model. We show that there is a significant improvement over the state-of-the-art when using a large weighted list of Wikipedia words, while hyphenation correction has an impact on their distribution over the book corpus. 0 0
Larger residuals, less work: Active document scheduling for latent dirichlet allocation Wahabzada M.
Kersting K.
English 2011 Recently, there have been considerable advances in fast inference for latent Dirichlet allocation (LDA). In particular, stochastic optimization of the variational Bayes (VB) objective function with a natural gradient step was proved to converge and able to process massive document collections. To reduce noise in the gradient estimation, it considers multiple documents chosen uniformly at random. While it is widely recognized that the scheduling of documents in stochastic optimization may have significant consequences, this issue remains largely unexplored. In this work, we address this issue. Specifically, we propose residual LDA, a novel, easy-to-implement, LDA approach that schedules documents in an informed way. Intuitively, in each iteration, residual LDA actively selects documents that exert a disproportionately large influence on the current residual to compute the next update. On several real-world datasets, including 3M articles from Wikipedia, we demonstrate that residual LDA can handily analyze massive document collections and find topic models as good or better than those found with batch VB and randomly scheduled VB, and significantly faster. 0 0
Learning from partially annotated sequences Fernandes E.R.
Brefeld U.
English 2011 We study sequential prediction models in cases where only fragments of the sequences are annotated with the ground-truth. The task does not match the standard semi-supervised setting and is highly relevant in areas such as natural language processing, where completely labeled instances are expensive and require editorial data. We propose to generalize the semi-supervised setting and devise a simple transductive loss-augmented perceptron to learn from inexpensive partially annotated sequences that could for instance be provided by laymen, the wisdom of the crowd, or even automatically. Experiments on mono- and cross-lingual named entity recognition tasks with automatically generated partially annotated sentences from Wikipedia demonstrate the effectiveness of the proposed approach. Our results show that learning from partially labeled data is never worse than standard supervised and semi-supervised approaches trained on data with the same ratio of labeled and unlabeled tokens. 0 0
Leveraging community-built knowledge for type coercion in question answering Kalyanpur A.
Murdock J.W.
Fan J.
Welty C.
Linked data
Question answering
Type Checking
English 2011 Watson, the winner of the Jeopardy! challenge, is a state-of-the-art open-domain Question Answering system that tackles the fundamental issue of answer typing by using a novel type coercion (TyCor) framework, where candidate answers are initially produced without considering type information, and subsequent stages check whether the candidate can be coerced into the expected answer type. In this paper, we provide a high-level overview of the TyCor framework and discuss how it is integrated in Watson, focusing on and evaluating three TyCor components that leverage the community built semi-structured and structured knowledge resources - DBpedia (in conjunction with the YAGO ontology), Wikipedia Categories and Lists. These resources complement each other well in terms of precision and granularity of type information, and through links to Wikipedia, provide coverage for a large set of instances. 0 0
Linguistically informed mining lexical semantic relations from Wikipedia structure Maciej Piasecki
Agnieszka Indyka-Piasecka
Roman Kurc
English 2011 A method for the extraction of wordnet lexico-semantic relations from Polish Wikipedia articles was proposed. The method is based on a hand-written set of lexico-morphosyntactic extraction patterns that were developed in less than one man-week of workload. Two kinds of patterns were proposed: processing encyclopaedia articles as text documents, and utilising the information about the structure of the Wikipedia article (including links). Two types of evaluation were applied: manual assessment of the extracted data, and evaluation based on the application of the extracted data as an additional knowledge source in automatic plWordNet expansion. 0 0
ListOPT: Learning to optimize for XML ranking Gao N.
Deng Z.-H.
Yu H.
Jiang J.-J.
English 2011 Many machine learning classification technologies such as boosting, support vector machines or neural networks have been applied to the ranking problem in information retrieval. However, since the purpose of these learning-to-rank methods is to directly acquire the sorted results based on the features of documents, they are unable to combine and utilize the existing ranking methods proven to be effective such as BM25 and PageRank. To address this defect, we conducted a study on learning-to-optimize, which is to construct a learning model or method for optimizing the free parameters in ranking functions. This paper proposes a listwise learning-to-optimize process ListOPT and introduces three alternative differentiable query-level loss functions. The experimental results on the XML dataset of English Wikipedia show that these approaches can be successfully applied to tuning the parameters used in an existing highly cited ranking function BM25. Furthermore, we found that the formulas with optimized parameters indeed improve the effectiveness compared with the original ones. 0 0
Loki-Semantic wiki with logical knowledge representation English 2011 0 0
Metadata enrichment via topic models for author name disambiguation Bernardi R.
Le D.-T.
Author Name Disambiguation
Digital libraries
Topic Models
English 2011 This paper tackles the well-known problem of Author Name Disambiguation (AND) in Digital Libraries (DL). Following [14,13], we assume that an individual tends to create a distinctively coherent body of work that can hence form a single cluster containing all of his/her articles yet distinguishing them from those of everyone else with the same name. Still, we believe the information contained in a DL may not be sufficient to allow an automatic detection of such clusters; this lack of information becomes even more evident in federated digital libraries, where the labels assigned by librarians may belong to different controlled vocabularies or different classification systems, and in digital libraries on the web, where records may be assigned neither subject headings nor classification numbers. Hence, we exploit Topic Models, extracted from Wikipedia, to enhance record metadata and use Agglomerative Clustering to disambiguate ambiguous author names by clustering together similar records; records in different clusters are supposed to have been written by different people. We investigate the following two research questions: (a) Are the Classification Systems and Subject Heading labels manually assigned by librarians general and informative enough to disambiguate Author Names via clustering techniques? (b) Do Topic Models induce from large corpora the conceptual information necessary for automatically labelling DL metadata and grasping topic similarities of the records? To answer these questions, we use the Library Catalogue of the Bolzano University Library as a case study. 0 0
MikiWiki: A meta wiki architecture and prototype based on the hive-mind space model Li Zhu
Ivan Vaghi
Barricelli B.R.
Boundary Objects
End User Development
Habitable Environment
HMS model
English 2011 This paper presents MikiWiki, a meta-wiki developed to prototype key aspects of the Hive-Mind Space (HMS) model. The HMS model has been proposed to share the visions of End-User Development and meta-design in collaborative online environment development. It aims to support cultures of participation and to tackle the co-evolution of users and systems. The model provides localized habitable environments for diverse stakeholders and tools for them to tailor the system under design, allowing the co-evolution of systems and practices. MikiWiki is aimed at supporting the exploration of opportunities to enable software tailoring at use time. Such an open-ended collaborative design process is realized by providing basic building blocks as boundary object prototypes, allowing end users to remix, modify, and create their own boundary objects. Moreover, MikiWiki minimizes essential services at the server-side, while putting the main functionalities on the client-side, opening the whole system to its users for further tailoring. 0 0
Mining fault-tolerant item sets using subset size occurrence distributions Borgelt C.
Kotter T.
English 2011 Mining fault-tolerant (or approximate or fuzzy) item sets means to allow for errors in the underlying transaction data in the sense that actually present items may not be recorded due to noise or measurement errors. In order to cope with such missing items, transactions that do not contain all items of a given set are still allowed to support it. However, either the number of missing items must be limited, or the transaction's contribution to the item set's support is reduced in proportion to the number of missing items, or both. In this paper we present an algorithm that efficiently computes the subset size occurrence distribution of item sets, evaluates this distribution to find fault-tolerant item sets, and exploits intermediate data to remove pseudo (or spurious) item sets. We demonstrate the usefulness of our algorithm by applying it to a concept detection task on the 2008/2009 Wikipedia Selection for schools. 0 0
Mining for reengineering: An application to semantic wikis using formal and relational concept analysis Shi L.
Toussaint Y.
Napoli A.
Blansche A.
English 2011 Semantic wikis enable collaboration between human agents for creating knowledge systems. In this way, data embedded in semantic wikis can be mined and the resulting knowledge patterns can be reused to extend and improve the structure of wikis. This paper proposes a method for guiding the reengineering and improving the structure of a semantic wiki. This method suggests the creation of categories and relations between categories using Formal Concept Analysis (FCA) and Relational Concept Analysis (RCA). FCA allows the design of a concept lattice while RCA provides relational attributes completing the content of formal concepts. The originality of the approach is to consider the wiki content from FCA and RCA points of view and to extract knowledge units from this content allowing a factorization and a reengineering of the wiki structure. This method is general and does not depend on any domain and can be generalized to every kind of semantic wiki. Examples are studied throughout the paper and experiments show the substantial results. 0 0
Mobile wikipedia: A case study of information service design for Chinese teenagers Jia Zhou
Rau P.L.P.
Christoph Rohmer
Christophe Ghalayini
Felix Roerig
Chinese teenagers
Information service
Mobile phone
User Centered Design
English 2011 This study applied User Centered Design in mobile service design. First, an interview was conducted to analyze the needs of teenagers. Chinese teenagers desire more information about daily life and more interaction between users. Second, based on the results of the interview, a low fidelity prototype was developed. To evaluate the design, teenagers participated in a second interview and described its pros and cons. Finally, refinements were made and a high fidelity prototype was produced. This prototype combined both Wikipedia and query-based interaction. The results of this study have reference value for practitioners seeking to involve target users in the development process of information services. 0 0
Motivation and its mechanisms in virtual communities De Melo Bezerra J.
Hirata C.M.
Virtual community
English 2011 Participation is a key aspect of success of virtual communities. Participation is dependent on the members' motivation that is driven by individual and environmental characteristics. This article investigates the individual and environmental factors that contribute to motivation and discusses mechanisms to improve motivation in virtual communities. The study is based on the Hersey and Blanchard's motivation model, the Maslow's hierarchy of needs, and the virtual community model. For the discussion of motivation mechanisms, we reviewed the literature and made qualitative interviews with members of the Wikipedia community. 0 0
Overview of the INEX 2010 XML mining track: Clustering and classification of XML documents De Vries C.M.
Nayak R.
Kutty S.
Shlomo Geva
Tagarelli A.
XML document mining
English 2011 This report explains the objectives, datasets and evaluation criteria of both the clustering and classification tasks set in the INEX 2010 XML Mining track. The report also describes the approaches and results obtained by participants. 0 0
Overview of the INEX 2010 ad hoc track Arvola P.
Shlomo Geva
Jaap Kamps
Ralf Schenkel
Andrew Trotman
Vainio J.
English 2011 This paper gives an overview of the INEX 2010 Ad Hoc Track. The main goals of the Ad Hoc Track were three-fold. The first goal was to study focused retrieval under resource restricted conditions such as a small screen mobile device or a document summary on a hit-list. This leads to variants of the focused retrieval tasks that address the impact of result length/reading effort, thinking of focused retrieval as a form of "snippet" retrieval. The second goal was to extend the ad hoc retrieval test collection on the INEX 2009 Wikipedia Collection with additional topics and judgments. For this reason the Ad Hoc track topics and assessments stayed unchanged. The third goal was to examine the trade-off between effectiveness and efficiency by continuing the Efficiency Track as a task in the Ad Hoc Track. The INEX 2010 Ad Hoc Track featured four tasks: the Relevant in Context Task, the Restricted Relevant in Context Task, the Restrict Focused Task, and the Efficiency Task. We discuss the setup of the track, and the results for the four tasks. 0 0
Overview of the INEX 2010 question answering track (QA@INEX) SanJuan E.
Bellot P.
Moriceau V.
Tannier X.
English 2011 The INEX Question Answering track (QA@INEX) aims to evaluate a complex question-answering task using Wikipedia. The set of questions is composed of factoid, precise questions that expect short answers, as well as more complex questions that can be answered by several sentences or by an aggregation of texts from different documents. Long answers have been evaluated based on Kullback-Leibler (KL) divergence between n-gram distributions. This allowed summarization systems to participate. Most of them generated a readable extract of sentences from documents top-ranked by a state-of-the-art document retrieval engine. Participants also tested several methods of question disambiguation. Evaluation has been carried out on a pool of real questions from OverBlog and Yahoo! Answers. Results tend to show that the baseline-restricted focused IR system minimizes KL divergence but lacks readability, whereas summarization systems tend to use longer, stand-alone sentences, thus improving readability but increasing KL divergence. 0 0
Probabilistic quality assessment based on article's revision history Jangwhan Han
Chao Wang
Jiang D.
English 2011 The collaborative efforts of users in social media services such as Wikipedia have led to an explosion in user-generated content and how to automatically tag the quality of the content is an eminent concern now. Actually each article is usually undergoing a series of revision phases and the articles of different quality classes exhibit specific revision cycle patterns. We propose to Assess Quality based on Revision History (AQRH) for a specific domain as follows. First, we borrow Hidden Markov Model (HMM) to turn each article's revision history into a revision state sequence. Then, for each quality class its revision cycle patterns are extracted and are clustered into quality corpora. Finally, article's quality is thereby gauged by comparing the article's state sequence with the patterns of pre-classified documents in probabilistic sense. We conduct experiments on a set of Wikipedia articles and the results demonstrate that our method can accurately and objectively capture web article's quality. 0 0
Query phrase expansion using Wikipedia in patent class search Al-Shboul B.
Myaeng S.-H.
Patent Information Retrieval
Phrase-based Query Expansion
Pseudo-Relevance Feedback
Query Expansion
Wikipedia Categories
English 2011 Relevance Feedback methods generally suffer from topic drift caused by word ambiguity and synonymous uses of words. As a way to alleviate this inherent problem, we propose a novel query phrase expansion approach utilizing semantic annotations in Wikipedia pages, trying to enrich queries with context-disambiguating phrases. Focusing on the patent domain, especially on patent search where patents are classified into a hierarchy of categories, we attempt to understand the roles of phrases and words in query expansion in determining the relevance of documents and examine their contributions to alleviating the query drift problem. Our approach is compared against the Relevance Model, a state-of-the-art approach, to show its superiority in terms of MAP on all levels of the classification hierarchy. 0 0
Query relaxation for entity-relationship search Elbassuoni S.
Maya Ramanath
Gerhard Weikum
English 2011 Entity-relationship-structured data is becoming more important on the Web. For example, large knowledge bases have been automatically constructed by information extraction from Wikipedia and other Web sources. Entities and relationships can be represented by subject-property-object triples in the RDF model, and can then be precisely searched by structured query languages like SPARQL. Because of their Boolean-match semantics, such queries often return too few or even no results. To improve recall, it is thus desirable to support users by automatically relaxing or reformulating queries in such a way that the intention of the original user query is preserved while returning a sufficient number of ranked results. In this paper we describe comprehensive methods to relax SPARQL-like triple-pattern queries in a fully automated manner. Our framework produces a set of relaxations by means of statistical language models for structured RDF data and queries. The query processing algorithms merge the results of different relaxations into a unified result list, with ranking based on any ranking function for structured queries over RDF-data. Our experimental evaluation, with two different datasets about movies and books, shows the effectiveness of the automatically generated relaxations and the improved quality of query results based on assessments collected on the Amazon Mechanical Turk platform. 0 0
Quick detection of top-k personalized PageRank lists Avrachenkov K.
Litvak N.
Nemirovsky D.
Smirnova E.
Sokol M.
English 2011 We study a problem of quick detection of top-k Personalized PageRank (PPR) lists. This problem has a number of important applications such as finding local cuts in large graphs, estimation of similarity distance and person name disambiguation. We argue that two observations are important when finding top-k PPR lists. Firstly, it is crucial that we detect fast the top-k most important neighbors of a node, while the exact order in the top-k list and the exact values of PPR are by far not so crucial. Secondly, by allowing a small number of "wrong" elements in top-k lists, we achieve great computational savings, in fact, without degrading the quality of the results. Based on these ideas, we propose Monte Carlo methods for quick detection of top-k PPR lists. We demonstrate the effectiveness of these methods on the Web and Wikipedia graphs, provide performance evaluation and supply stopping criteria. 0 0
Ranking multilingual documents using minimal language dependent resources Santosh G.S.K.
Kiran Kumar N.
Vasudeva Varma
Feature Engineering
Levenshtein Edit Distance
Multilingual Document Ranking
English 2011 This paper proposes an approach of extracting simple and effective features that enhances multilingual document ranking (MLDR). There is limited prior research on capturing the concept of multilingual document similarity in determining the ranking of documents. However, the literature available has worked heavily with language specific tools, making them hard to reimplement for other languages. Our approach extracts various multilingual and monolingual similarity features using a basic language resource (bilingual dictionary). No language-specific tools are used, hence making this approach extensible for other languages. We used the datasets provided by Forum for Information Retrieval Evaluation (FIRE) for their 2010 Adhoc Cross-Lingual document retrieval task on Indian languages. Experiments have been performed with different ranking algorithms and their results are compared. The results obtained showcase the effectiveness of the features considered in enhancing multilingual document ranking. 0 0
Selective integration of background knowledge in TCBR systems Patelia A.
Chakraborti S.
Wiratunga N.
English 2011 This paper explores how background knowledge from freely available web resources can be utilised for Textual Case Based Reasoning. The work reported here extends the existing Explicit Semantic Analysis approach to representation, where textual content is represented using concepts with correspondence to Wikipedia articles. We present approaches to identify Wikipedia pages that are likely to contribute to the effectiveness of text classification tasks. We also study the effect of modelling semantic similarity between concepts (amounting to Wikipedia articles) empirically. We conclude with the observation that integrating background knowledge from resources like Wikipedia into TCBR tasks holds a lot of promise as it can improve system effectiveness even without elaborate manual knowledge engineering. Significant performance gains are obtained using a very small number of features that have very strong correspondence to how humans describe the domain. 0 0
Self-organizing map representation for clustering Wikipedia search results Szymanski J. English 2011 The article presents an approach to automated organization of textual data. The experiments have been performed on selected sub-set of Wikipedia. The Vector Space Model representation based on terms has been used to build groups of similar articles extracted from Kohonen Self-Organizing Maps with DBSCAN clustering. To warrant efficiency of the data processing, we performed linear dimensionality reduction of raw data using Principal Component Analysis. We introduce hierarchical organization of the categorized articles changing the granularity of SOM network. The categorization method has been used in implementation of the system that clusters results of keyword-based search in Polish Wikipedia. 0 0
Semantic processing of database textual attributes using Wikipedia Campana J.R.
Medina J.M.
Vila M.A.
Similarity metrics
Text processing
Wikipedia category graph
English 2011 Text attributes in databases contain rich semantic information that is seldom processed or used. This paper proposes a method to extract and semantically represent concepts from texts stored in databases. This process relies on tools such as WordNet and Wikipedia to identify concepts extracted from texts and represent them as a basic ontology whose concepts are annotated with search terms. This ontology can play diverse roles. It can be seen as a conceptual summary of the content of an attribute, which can be used as a means to navigate through the textual content of an attribute. It can also be used as a profile for text search using the terms associated to the ontology concepts. The ontology is built as a subset of Wikipedia category graph, selected using diverse metrics. Category selection using these metrics is discussed and an example application is presented and evaluated. 0 0
Sentiment analysis of news titles: The role of entities and a new affective lexicon Loureiro D.
Marreiros G.
Neves J.
English 2011 The growth of content on the web has been followed by increasing interest in opinion mining. This field of research relies on accurate recognition of emotion from textual data. There's been much research in sentiment analysis lately, but it always focuses on the same elements. Sentiment analysis traditionally depends on linguistic corpora, or common sense knowledge bases, to provide extra dimensions of information to the text being analyzed. Previous research hasn't yet explored a fully automatic method to evaluate how events associated to certain entities may impact each individual's sentiment perception. This project presents a method to assign valence ratings to entities, using information from their Wikipedia page, and considering user preferences gathered from the user's Facebook profile. Furthermore, a new affective lexicon is compiled entirely from existing corpora, without any intervention from the coders. 0 0
Supporting resource-based learning on the web using automatically extracted large-scale taxonomies from multiple wikipedia versions Garcia R.D.
Scholl P.
Rensing C.
Hyponymy Detection
Resource-based Learning
TEL Recommender
Data mining
English 2011 CROKODIL is a platform for the support of collaborative resource-based learning with Web resources. It enables the building of learning communities in which learners annotate their relevant resources using tags. In this paper, we propose the use of automatically generated large-scale taxonomies in different languages to cope with two challenges in CROKODIL: The multilingualism of the resources, i.e. web resources are in different languages and the connectivity of the semantic network, i.e. learners do not tag resources on the same topic with identical tags. More specifically, we describe a set of features that can be used for detecting hyponymy relations from the category system of Wikipedia. 0 0
System description: EgoMath2 as a tool for mathematical searching on wikipedia.org Misutka J.
Galambos L.
English 2011 EgoMath is a full text search engine focused on digital mathematical content with little semantic information available. Recently, we have decided that another step towards making mathematics in digital form more accessible was to enable mathematical searching in one of the world's largest digital libraries - Wikipedia. The library is an excellent candidate for our mathematical search engine because the mathematical notation is represented by fragments which do not contain semantic information. 0 0
Technology-mediated social participation: The next 25 years of HCI challenges Shneiderman B. Blogs
Collective action
Collective intelligence
Community design
Discussion groups
Open Government
Reader-to-leader framework
Social media
Social network analysis
User generated content
English 2011 The dramatic success of social media such as Facebook, Twitter, YouTube, blogs, and traditional discussion groups empowers individuals to become active in local and global communities. Some enthusiasts believe that with modest redesign, these technologies can be harnessed to support national priorities such as healthcare/wellness, disaster response, community safety, energy sustainability, etc. However, accomplishing these ambitious goals will require long-term research to develop validated scientific theories and reliable, secure, and scalable technology strategies. The enduring questions of how to motivate participation, increase social trust, and promote collaboration remain grand challenges even as the technology rapidly evolves. This talk invites researchers across multiple disciplines to participate in redefining our discipline of Human-Computer Interaction (HCI) along more social lines to answer vital research questions while creating inspirational prototypes, conducting innovative evaluations, and developing robust technologies. By placing greater emphasis on social media, the HCI community could constructively influence these historic changes. 0 0
Text clustering based on granular computing and Wikipedia Liping Jing
Jian Yu
Granular computing
Text clustering
English 2011 Text clustering plays an important role in many real-world applications, but it faces various challenges, such as the curse of dimensionality, complex semantics and large volume. Much research has addressed such problems by designing new text representation models and clustering algorithms. However, text clustering still remains a research problem due to the complicated properties of text data. In this paper, a text clustering procedure is proposed based on the principle of granular computing with the aid of Wikipedia. The proposed clustering method first identifies the text granules, focusing especially on concepts and words, with the aid of Wikipedia. It then mines the latent patterns based on the computation of such granules. Experimental results on benchmark data sets (20Newsgroups and Reuters-21578) show that the proposed method improves the performance of text clustering compared with the existing clustering algorithm combined with the existing representation models. 0 0
Topic mining based on graph local clustering Garza Villarreal S.E.
Brena R.F.
Graph clustering
Topic mining
English 2011 This paper introduces an approach for discovering thematically related document groups (a topic mining task) in massive document collections with the aid of graph local clustering. This can be achieved by viewing a document collection as a directed graph where vertices represent documents and arcs represent connections among these (e.g. hyperlinks). Because a document is likely to have more connections to documents of the same theme, we have assumed that topics have the structure of a graph cluster, i.e. a group of vertices with more arcs to the inside of the group and fewer arcs to the outside of it. So, topics could be discovered by clustering the document graph; we use a local approach to cope with scalability. We also extract properties (keywords and most representative documents) from clusters to provide a summary of the topic. This approach was tested over the Wikipedia collection and we observed that the resulting clusters in fact correspond to topics, which shows that topic mining can be treated as a graph clustering problem. 0 0
Unsupervised feature weighting based on local feature relatedness Jiali Yun
Liping Jing
Jian Yu
Houkuan Huang
Feature Relatedness
Feature Weighting
Text Clustering
English 2011 Feature weighting plays an important role in text clustering. Traditional feature weighting is determined by the syntactic relationship between feature and document (e.g. TF-IDF). In this paper, a semantically enriched feature weighting approach is proposed by introducing the semantic relationship between feature and document, which is implemented by taking account of the local feature relatedness - the relatedness between a feature and its contextual features within each individual document. Feature relatedness is measured by two methods: a document collection-based implicit relatedness measure and a Wikipedia link-based explicit relatedness measure. Experimental results on benchmark data sets show that the new feature weighting approach surpasses traditional syntactic feature weighting. Moreover, clustering quality can be further improved by linearly combining the syntactic and semantic factors. The new feature weighting approach is also compared with two existing feature relatedness-based approaches which consider the global feature relatedness (feature relatedness in the entire feature space) and the inter-document feature relatedness (feature relatedness between different documents) respectively. In the experiments, the new feature weighting approach outperforms these two related approaches in clustering quality at a much lower computational cost. 0 0
Using a lexical dictionary and a folksonomy to automatically construct domain ontologies Macias-Galindo D.
Wong W.
Cavedon L.
Thangarajah J.
English 2011 We present and evaluate MKBUILD, a tool for creating domain-specific ontologies. These ontologies, which we call Modular Knowledge Bases (MKBs), contain concepts and associations imported from existing large-scale knowledge resources, in particular WordNet and Wikipedia. The combination of WordNet's human-crafted taxonomy and Wikipedia's semantic associations between articles produces a highly connected resource. Our MKBs are used by a conversational agent operating in a small computational environment. We constructed several domains with our technique, and then conducted an evaluation by asking human subjects to rate the domain-relevance of the concepts included in each MKB on a 3-point scale. The proposed methodology achieved precision values between 71% and 88% and recall between 37% and 95% in the evaluation, depending on how the middle-score judgements are interpreted. The results are encouraging considering the cross-domain nature of the construction process and the difficulty of representing concepts as opposed to terms. 0 0