From WikiPapers

statistics is included as a keyword or extra keyword in 1 dataset, 6 tools and 33 publications.


Dataset | Size | Language | Description
Domas visits logs | ~40GB/month | | Domas visits logs are page view statistics for Wikimedia projects.


Tool Operating System(s) Language(s) Programming language(s) License Description Image
HistoryFlow Windows English HistoryFlow is a tool for visualizing dynamic, evolving documents and the interactions of multiple collaborating authors. In its current implementation, history flow is being used to visualize the evolutionary history of wiki pages on Wikipedia. English Wikipedia Treaty of Trianon History Flow.png
StatMediaWiki GNU/Linux English Python GPLv3 StatMediaWiki is a project that aims to create a tool to collect and aggregate information available in a MediaWiki installation. Results are static HTML pages including tables and graphics that can help to analyze the wiki status and development, or a CSV file for custom processing. General hour activity-wikihaskell.png
Wikichron Cross-platform English Python Affero GPL (code) WikiChron is a web tool for the analysis and visualization of the evolution of wiki online communities. It uses processed data from the history dumps of MediaWiki wikis, computes different metrics on this data and plots them in interactive graphs. It allows different wikis to be compared in the same graphs.

This tool will serve investigators in the task of inspecting the behavior of collaborative online communities, in particular wikis, and in generating research hypotheses for further and deeper studies. WikiChron has been designed to be very easy to use and highly interactive from the very beginning. It comes with a set of already downloaded and processed wikis from Wikia (but any MediaWiki wiki is supported), and with more than thirty metrics to visualize and compare between wikis.

Moreover, it can be useful for wiki administrators who want to see, analyze and compare how the activity on their wikis is going.

WikiChron is available online here:
WikiEvidens Cross-platform English Python GPLv3 WikiEvidens is a visualization and statistical tool for wikis. Wikievidens0.0.6.png
WikiXRay Python WikiXRay is a robust and extensible software tool for an in-depth quantitative analysis of the whole Wikipedia project. English
PHP For two subjects (Wikidata items), compares the pageviews of their articles in every language version of Wikipedia in which the article exists. Upcoming development will allow comparison of more than two subjects.


Title Author(s) Published in Language Date Abstract R C
A framework for automated construction of resource space based on background knowledge Yu X.
Peng L.
Huang Z.
Zhuge H.
Future Generation Computer Systems English 2014 Resource Space Model is a kind of data model which can effectively and flexibly manage the digital resources in cyber-physical system from multidimensional and hierarchical perspectives. This paper focuses on constructing resource space automatically. We propose a framework that organizes a set of digital resources according to different semantic dimensions combining human background knowledge in WordNet and Wikipedia. The construction process includes four steps: extracting candidate keywords, building semantic graphs, detecting semantic communities and generating resource space. An unsupervised statistical language topic model (i.e., Latent Dirichlet Allocation) is applied to extract candidate keywords of the facets. To better interpret meanings of the facets found by LDA, we map the keywords to Wikipedia concepts, calculate word relatedness using WordNet's noun synsets and construct corresponding semantic graphs. Moreover, semantic communities are identified by GN algorithm. After extracting candidate axes based on Wikipedia concept hierarchy, the final axes of resource space are sorted and picked out through three different ranking strategies. The experimental results demonstrate that the proposed framework can organize resources automatically and effectively. © 2013 Published by Elsevier Ltd. All rights reserved. 0 0
Collaborative development for setup, execution, sharing and analytics of complex NMR experiments Irvine A.G.
Slynko V.
Nikolaev Y.
Senthamarai R.R.P.
Pervushin K.
Journal of Magnetic Resonance English 2014 Factory settings of NMR pulse sequences are rarely ideal for every scenario in which they are utilised. The optimisation of NMR experiments has for many years been performed locally, with implementations often specific to an individual spectrometer. Furthermore, these optimised experiments are normally retained solely for the use of an individual laboratory, spectrometer or even single user. Here we introduce a web-based service that provides a database for the deposition, annotation and optimisation of NMR experiments. The application uses a Wiki environment to enable the collaborative development of pulse sequences. It also provides a flexible mechanism to automatically generate NMR experiments from deposited sequences. Multidimensional NMR experiments of proteins and other macromolecules consume significant resources, in terms of both spectrometer time and effort required to analyse the results. Systematic analysis of simulated experiments can enable optimal allocation of NMR resources for structural analysis of proteins. Our web-based application provides all the necessary information, including the auxiliaries (waveforms, decoupling sequences etc.), for analysis of experiments by accurate numerical simulation of multidimensional NMR experiments. The online database of the NMR experiments, together with a systematic evaluation of their sensitivity, provides a framework for selection of the most efficient pulse sequences. The development of such a framework provides a basis for the collaborative optimisation of pulse sequences by the NMR community, with the benefits of this collective effort being available to the whole community. © 2013 Elsevier Inc. All rights reserved. 0 0
Early Prediction of Movie Box Office Success Based on Wikipedia Activity Big Data Márton Mestyán
Taha Yasseri
János Kertész
PLoS ONE English 2013 Use of socially generated "big data" to access information about collective states of the minds in human societies has become a new paradigm in the emerging field of computational social science. A natural application of this would be the prediction of the society's reaction to a new product in the sense of popularity and adoption rate. However, bridging the gap between "real time monitoring" and "early predicting" remains a big challenge. Here we report on an endeavor to build a minimalistic predictive model for the financial success of movies based on collective activity data of online users. We show that the popularity of a movie can be predicted much before its release by measuring and analyzing the activity level of editors and viewers of the corresponding entry to the movie in Wikipedia, the well-known online encyclopedia. 0 0
Labeling blog posts with wikipedia entries through LDA-based topic modeling of wikipedia Makita K.
Suzuki H.
Koike D.
Takehito Utsuro
Kawada Y.
Tomohiro Fukuhara
Journal of Internet Technology English 2013 Given a search query, most existing search engines simply return a ranked list of search results. However, it is often the case that those search result documents consist of a mixture of documents that are closely related to various contents. In order to address the issue of quickly overviewing the distribution of contents, this paper proposes a framework of labeling blog posts with Wikipedia entries through LDA (latent Dirichlet allocation) based topic modeling of Wikipedia. One of the most important advantages of this LDA-based document model is that the collected Wikipedia entries and their LDA parameters heavily depend on the distribution of keywords across all the search result of blog posts. This tendency actually contributes to quickly overviewing the search result of blog posts through the LDA-based topic distribution. We show that the LDA-based document retrieval scheme outperforms our previous approach. Finally, we compare the proposed approach to the standard LDA-based topic modeling without Wikipedia knowledge source. Both LDA-based topic modeling results have quite different nature and contribute to quickly overviewing the search result of blog posts in a quite complementary fashion. 0 0
Making sense of open data statistics with information from Wikipedia Hienert D.
Wegener D.
Schomisch S.
Lecture Notes in Computer Science English 2013 Today, more and more open data statistics are published by governments, statistical offices and organizations like the United Nations, The World Bank or Eurostat. This data is freely available and can be consumed by end users in interactive visualizations. However, additional information is needed to enable laymen to interpret these statistics in order to make sense of the raw data. In this paper, we present an approach to combine open data statistics with historical events. In a user interface we have integrated interactive visualizations of open data statistics with a timeline of thematically appropriate historical events from Wikipedia. This can help users to explore statistical data in several views and to get related events for certain trends in the timeline. Events include links to Wikipedia articles, where details can be found and the search process can be continued. We have conducted a user study to evaluate if users can use the interface intuitively, if relations between trends in statistics and historical events can be found and if users like this approach for their exploration process. 0 0
On detecting Association-Based Clique Outliers in heterogeneous information networks Gupta M.
Gao J.
Yan X.
Cam H.
Jangwhan Han
Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2013 English 2013 In the real world, various systems can be modeled using heterogeneous networks which consist of entities of different types. People like to discover groups (or cliques) of entities linked to each other with rare and surprising associations from such networks. We define such anomalous cliques as Association-Based Clique Outliers (ABCOutliers) for heterogeneous information networks, and design effective approaches to detect them. The need to find such outlier cliques from networks can be formulated as a conjunctive select query consisting of a set of (type, predicate) pairs. Answering such conjunctive queries efficiently involves two main challenges: (1) computing all matching cliques which satisfy the query and (2) ranking such results based on the rarity and the interestingness of the associations among entities in the cliques. In this paper, we address these two challenges as follows. First, we introduce a new low-cost graph index to assist clique matching. Second, we define the outlierness of an association between two entities based on their attribute values and provide a methodology to efficiently compute such outliers given a conjunctive select query. Experimental results on several synthetic datasets and the Wikipedia dataset containing thousands of entities show the effectiveness of the proposed approach in computing interesting ABCOutliers. Copyright 2013 ACM. 0 0
Probabilistic explicit topic modeling using Wikipedia Hansen J.A.
Ringger E.K.
Seppi K.D.
Lecture Notes in Computer Science English 2013 Despite popular use of Latent Dirichlet Allocation (LDA) for automatic discovery of latent topics in document corpora, such topics lack connections with relevant knowledge sources such as Wikipedia, and they can be difficult to interpret due to the lack of meaningful topic labels. Furthermore, the topic analysis suffers from a lack of identifiability between topics across independently analyzed corpora but also across distinct runs of the algorithm on the same corpus. This paper introduces two methods for probabilistic explicit topic modeling that address these issues: Latent Dirichlet Allocation with Static Topic-Word Distributions (LDA-STWD), and Explicit Dirichlet Allocation (EDA). Both of these methods estimate topic-word distributions a priori from Wikipedia articles, with each article corresponding to one topic and the article title serving as a topic label. LDA-STWD and EDA overcome the nonidentifiability, isolation, and uninterpretability of LDA output. We assess their effectiveness by means of crowd-sourced user studies on two tasks: topic label generation and document label generation. We find that LDA-STWD improves substantially upon the performance of the state-of-the-art on the document labeling task, and that both methods otherwise perform on par with a state-of-the-art post hoc method. 0 0
Recommending tags with a model of human categorization Seitlinger P.
Kowald D.
Christoph Trattner
Tobias Ley
International Conference on Information and Knowledge Management, Proceedings English 2013 When interacting with social tagging systems, humans exercise complex processes of categorization that have been the topic of much research in cognitive science. In this paper we present a recommender approach for social tags derived from ALCOVE, a model of human category learning. The basic architecture is a simple three-layers connectionist model. The input layer encodes patterns of semantic features of a user-specific resource, such as latent topics elicited through Latent Dirichlet Allocation (LDA) or available external categories. The hidden layer categorizes the resource by matching the encoded pattern against already learned exemplar patterns. The latter are composed of unique feature patterns and associated tag distributions. Finally, the output layer samples tags from the associated tag distributions to verbalize the preceding categorization process. We have evaluated this approach on a real-world folksonomy gathered from Wikipedia bookmarks in Delicious. In the experiment our approach outperformed LDA, a well-established algorithm. We attribute this to the fact that our approach processes semantic information (either latent topics or external categories) across the three different layers. With this paper, we demonstrate that a theoretically guided design of algorithms not only holds potential for improving existing recommendation mechanisms, but it also allows us to derive more generalizable insights about how human information interaction on the Web is determined by both semantic and verbal processes. Copyright 2013 ACM. 0 0
Use and acceptance of Wiki systems for students of veterinary medicine. Kolski D.
Arlt S.
Birk S.
Heuwieser W.
GMS Zeitschrift für medizinische Ausbildung English 2013 Objective: Wiki systems are gaining importance concerning the use in education, especially among young users. The aim of our study was to examine how students of veterinary medicine commonly use wiki systems, whether they consider a veterinary wiki system useful and if they would participate in writing content. Methodology: For data collection a questionnaire was provided to students (n=210) of the faculty of Veterinary Medicine at the Freie Universität Berlin, Germany. It contained questions regarding the use of Wikipedia in general and concerning educational issues. Results: Most respondents, especially students in the first years, had comprehensive experience in the use of Wikipedia and veterinary wiki systems. In contrast, the experience in writing or editing of information was low (8.6% Wikipedia, 15.3% veterinary wiki systems). Male students had significantly more writing experience than females (p=0.008). In addition, students of the higher years were more experienced in writing and editing than students of the first year (7.4% in the 4th year). The familiarity with wiki systems offered by universities was low. The majority of students (96.2%) are willing to use veterinary wiki systems as an information tool in the future. Nevertheless, only a low percentage is willing to write or edit content. Many students, however, expect a better learning success when writing own texts. In general, students consider the quality of information in a wiki system as correct. Conclusion: In conclusion, wiki systems are considered a useful tool to gain information. This will lead to a successful implementation of wiki systems in veterinary education. A main challenge will be to develop concepts to activate students to participate not only in reading but in the writing and editing process. 0 0
Cross-modal topic correlations for multimedia retrieval Jian Yu
Cong Y.
Qin Z.
Wan T.
Proceedings - International Conference on Pattern Recognition English 2012 In this paper, we propose a novel approach for cross-modal multimedia retrieval by jointly modeling the text and image components of multimedia documents. In this model, the image component is represented by local SIFT descriptors based on the bag-of-feature model. The text component is represented by a topic distribution learned from latent topic models such as latent Dirichlet allocation (LDA). The latent semantic relations between texts and images can be reflected by correlations between the word topics and topics of image features. A statistical correlation model conditioned on category information is investigated. Experimental results on a benchmark Wikipedia dataset show that the newly proposed approach outperforms state-of-the-art cross-modal multimedia retrieval systems. 0 0
LDA-based topic modeling in labeling blog posts with wikipedia entries Daisuke Yokomoto
Makita K.
Suzuki H.
Koike D.
Takehito Utsuro
Kawada Y.
Tomohiro Fukuhara
Lecture Notes in Computer Science English 2012 Given a search query, most existing search engines simply return a ranked list of search results. However, it is often the case that those search result documents consist of a mixture of documents that are closely related to various contents. In order to address the issue of quickly overviewing the distribution of contents, this paper proposes a framework of labeling blog posts with Wikipedia entries through LDA (latent Dirichlet allocation) based topic modeling. More specifically, this paper applies an LDA-based document model to the task of labelling blog posts with Wikipedia entries. One of the most important advantages of this LDA-based document model is that the collected Wikipedia entries and their LDA parameters heavily depend on the distribution of keywords across all the search result of blog posts. This tendency actually contributes to quickly overviewing the search result of blog posts through the LDA-based topic distribution. In the evaluation of the paper, we also show that the LDA-based document retrieval scheme outperforms our previous approach. 0 0
Modeling topic hierarchies with the recursive Chinese restaurant process Kim J.H.
Kim D.
Soo-Hwan Kim
Oh A.
ACM International Conference Proceeding Series English 2012 Topic models such as latent Dirichlet allocation (LDA) and hierarchical Dirichlet processes (HDP) are simple solutions to discover topics from a set of unannotated documents. While they are simple and popular, a major shortcoming of LDA and HDP is that they do not organize the topics into a hierarchical structure which is naturally found in many datasets. We introduce the recursive Chinese restaurant process (rCRP) and a nonparametric topic model with rCRP as a prior for discovering a hierarchical topic structure with unbounded depth and width. Unlike previous models for discovering topic hierarchies, rCRP allows the documents to be generated from a mixture over the entire set of topics in the hierarchy. We apply rCRP to a corpus of New York Times articles, a dataset of MovieLens ratings, and a set of Wikipedia articles and show the discovered topic hierarchies. We compare the predictive power of rCRP with LDA, HDP, and nested Chinese restaurant process (nCRP) using heldout likelihood to show that rCRP outperforms the others. We suggest two metrics that quantify the characteristics of a topic hierarchy to compare the discovered topic hierarchies of rCRP and nCRP. The results show that rCRP discovers a hierarchy in which the topics become more specialized toward the leaves, and topics in the immediate family exhibit more affinity than topics beyond the immediate family. 0 0
Publishing statistical data on the web Salas P.E.R.
Marcel Martin
Mota F.M.D.
Sören Auer
Breitman K.
Casanova M.A.
Proceedings - IEEE 6th International Conference on Semantic Computing, ICSC 2012 English 2012 Statistical data is one of the most important sources of information, relevant for large numbers of stakeholders in the governmental, scientific and business domains alike. In this article, we overview how statistical data can be managed on the Web. With OLAP2 Data Cube and CSV2 Data Cube we present two complementary approaches on how to extract and publish statistical data. We also discuss the linking, repair and the visualization of statistical data. As a comprehensive use case, we report on the extraction and publishing on the Web of statistical data describing 10 years of life in Brazil. 0 0
Survey on statics of Wikipedia Deyi Li
Haisu Zhang
Se Wang
Wu J.
Wuhan Daxue Xuebao (Xinxi Kexue Ban)/Geomatics and Information Science of Wuhan University Chinese 2012 This paper mainly focuses on Wikipedia, a collaborative editing pattern in Web 2.0. The articles, the editors and the editing relationships between the two are three important components in Wikipedia statistical analysis. We collected different kinds of statistical tools, methods and results, further analyzed the problems in current statistical research and discussed possible resolutions. 0 0
TopicExplorer: Exploring document collections with topic models Hinneburg A.
Preiss R.
Schroder R.
Lecture Notes in Computer Science English 2012 The demo presents a prototype - called TopicExplorer - that combines topic modeling, key word search and visualization techniques to explore a large collection of Wikipedia documents. Topics derived by Latent Dirichlet Allocation are presented by top words. In addition, topics are accompanied by image thumbnails extracted from related Wikipedia documents to aid sense making of derived topics during browsing. Topics are shown in a linear order such that similar topics are close. Topics are mapped to color using that order. The auto-completion of search terms suggests words together with their color coded topics, which allows to explore the relation between search terms and topics. Retrieved documents are shown with color coded topics as well. Relevant documents and topics found during browsing can be put onto a shortlist. The tool can recommend further documents with respect to the average topic mixture of the shortlist. 0 0
Wikipedia-based efficient sampling approach for topic model Zhao T.
Chenliang Li
Li M.
Proceedings of the 9th International Network Conference, INC 2012 English 2012 In this paper, we propose a novel approach called Wikipedia-based Collapsed Gibbs sampling (Wikipedia-based CGS) to improve the efficiency of the collapsed Gibbs sampling (CGS), which has been widely used in latent Dirichlet Allocation (LDA) model. Conventional CGS method views each word in the documents as an equal status for the topic modeling. Moreover, sampling all the words in the documents always leads to high computational complexity. Considering this crucial drawback of LDA we propose the Wikipedia-based CGS approach that commits to extracting more meaningful topics and improving the efficiency of the sampling process in LDA by distinguishing different statuses of words in the documents for sampling topics with Wikipedia as the background knowledge. The experiments on real world datasets show that our Wikipedia-based approach for collapsed Gibbs sampling can significantly improve the efficiency and have a better perplexity compared to existing approaches. 0 0
Cross-language information retrieval with latent topic models trained on a comparable corpus Vulic I.
De Smet W.
Moens M.-F.
Lecture Notes in Computer Science English 2011 In this paper we study cross-language information retrieval using a bilingual topic model trained on comparable corpora such as Wikipedia articles. The bilingual Latent Dirichlet Allocation model (BiLDA) creates an interlingual representation, which can be used as a translation resource in many different multilingual settings as comparable corpora are available for many language pairs. The probabilistic interlingual representation is incorporated in a statistical language model for information retrieval. Experiments performed on the English and Dutch test datasets of the CLEF 2001-2003 CLIR campaigns show the competitive performance of our approach compared to cross-language retrieval methods that rely on pre-existing translation dictionaries that are hand-built or constructed based on parallel corpora. 0 0
From names to entities using thematic context distance Pilz A.
Paass G.
International Conference on Information and Knowledge Management, Proceedings English 2011 Name ambiguity arises from the polysemy of names and causes uncertainty about the true identity of entities referenced in unstructured text. This is a major problem in areas like information retrieval or knowledge management, for example when searching for a specific entity or updating an existing knowledge base. We approach this problem of named entity disambiguation (NED) using thematic information derived from Latent Dirichlet Allocation (LDA) to compare the entity mention's context with candidate entities in Wikipedia represented by their respective articles. We evaluate various distances over topic distributions in a supervised classification setting to find the best suited candidate entity, which is either covered in Wikipedia or unknown. We compare our approach to a state of the art method and show that it achieves significantly better results in predictive performance, regarding both entities covered in Wikipedia as well as uncovered entities. We show that our approach is in general language independent as we obtain equally good results for named entity disambiguation using the English, the German and the French Wikipedia. 0 0
Identifying word translations from comparable corpora using latent topic models Vulic I.
De Smet W.
Moens M.-F.
ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies English 2011 A topic model outputs a set of multinomial distributions over words for each topic. In this paper, we investigate the value of bilingual topic models, i.e., a bilingual Latent Dirichlet Allocation model for finding translations of terms in comparable corpora without using any linguistic resources. Experiments on a document-aligned English-Italian Wikipedia corpus confirm that the developed methods which only use knowledge from word-topic distributions outperform methods based on similarity measures in the original word-document space. The best results, obtained by combining knowledge from word-topic distributions with similarity measures in the original space, are also reported. 0 0
Information-seeking behaviors of first-semester veterinary students: A preliminary report Weiner S.A.
Stephens G.
Nour A.Y.M.
Journal of Veterinary Medical Education English 2011 Although emphasis in veterinary education is increasingly being placed on the ability to find, use, and communicate information, studies on the information behaviors of veterinary students or professionals are few. Improved knowledge in this area will provide valuable information for course and curriculum planning and the design of information resources. This article describes a survey of the information-seeking behaviors of first-semester veterinary students at Purdue University. A survey was administered as the first phase of a progressive semester-long assignment for a first semester DVM course in systemic mammalian physiology. The survey probed for understanding of the scientific literature and its use for course assignments and continuing learning. The survey results showed that students beginning the program tended to use Google for coursework, although some also used the resources found through the Purdue libraries' Web sites. On entering veterinary school, they became aware of specific information resources in veterinary medicine. They used a small number of accepted criteria to evaluate the Web site quality. This study confirms the findings of studies of information-seeking behaviors of undergraduate students. Further studies are needed to examine whether those behaviors change as students learn about specialized veterinary resources that are designed to address clinical needs as they progress through their training. 0 0
Learning-Oriented Assessment of Wiki Contributions: How to Assess Wiki Contributions in a Higher Education Learning Setting Emilio J. Rodríguez-Posada
Juan Manuel Dodero-Beardo
Manuel Palomo-Duarte
Inmaculada Medina-Bulo
International Conference on Computer Supported Education English 2011 Computer-Supported Collaborative Learning based on wikis offers new ways of collaboration and encourages participation. When the number of contributions from students increases, traditional assessment procedures of e-learning settings suffer from scalability problems. In a wiki-based learning experience, some automatic tools are required to support the assessment of such great amounts of data. We have studied readily available analysis tools for the MediaWiki platform that have complementary input, work modes and output. We comment on our experience in two Higher Education courses, one using HistoryFlow and another using StatMediaWiki, and discuss the advantages and drawbacks of each system. 0 0
More influence means less work: Fast latent Dirichlet allocation by influence scheduling Wahabzada M.
Kersting K.
Pilz A.
Bauckhage C.
International Conference on Information and Knowledge Management, Proceedings English 2011 There have recently been considerable advances in fast inference for (online) latent Dirichlet allocation (LDA). While it is widely recognized that the scheduling of documents in stochastic optimization and in turn in LDA may have significant consequences, this issue remains largely unexplored. Instead, practitioners schedule documents essentially uniformly at random, due perhaps to ease of implementation, and to the lack of clear guidelines on scheduling the documents. In this work, we address this issue and propose to schedule documents for an update that exert a disproportionately large influence on the topics of the corpus before less influential ones. More precisely, we justify to sample documents randomly biased towards those ones with higher norms to form mini-batches. On several real-world datasets, including 3M articles from Wikipedia and 8M from PubMed, we demonstrate that the resulting influence scheduled LDA can handily analyze massive document collections and find topic models as good or better than those found with online LDA, often at a fraction of time. 0 0
Multi-view LDA for semantics-based document representation Jiali Yun
Liping Jing
Houkuan Huang
Jian Yu
Journal of Computational Information Systems English 2011 Each document and word can be modeled as a mixture of topics by Latent Dirichlet Allocation (LDA), which does not contain any external semantic information. In this paper, we represent documents as two feature spaces consisting of words and Wikipedia categories respectively, and propose a new method called Multi-View LDA (M-LDA) by combining LDA with explicit human-defined concepts in Wikipedia. M-LDA improves document topic model by taking advantage of both two feature spaces and their mapping relationship. Experimental results on classification and clustering tasks show M-LDA outperforms traditional LDA. 0 0
Semantic relatedness measurement based on Wikipedia link co-occurrence analysis Masahiro Ito
Kotaro Nakayama
Takahiro Hara
Shojiro Nishio
International Journal of Web Information Systems English 2011 Purpose: Recently, the importance and effectiveness of Wikipedia Mining has been shown in several researches. One popular research area on Wikipedia Mining focuses on semantic relatedness measurement, and research in this area has shown that Wikipedia can be used for semantic relatedness measurement. However, previous methods are facing two problems; accuracy and scalability. To solve these problems, the purpose of this paper is to propose an efficient semantic relatedness measurement method that leverages global statistical information of Wikipedia. Furthermore, a new test collection is constructed based on Wikipedia concepts for evaluating semantic relatedness measurement methods. Design/methodology/approach: The authors' approach leverages global statistical information of the whole Wikipedia to compute semantic relatedness among concepts (disambiguated terms) by analyzing co-occurrences of link pairs in all Wikipedia articles. In Wikipedia, an article represents a concept and a link to another article represents a semantic relation between these two concepts. Thus, the co-occurrence of a link pair indicates the relatedness of a concept pair. Furthermore, the authors propose an integration method with tfidf as an improved method to additionally leverage local information in an article. Besides, for constructing a new test collection, the authors select a large number of concepts from Wikipedia. The relatedness of these concepts is judged by human test subjects. Findings: An experiment was conducted for evaluating calculation cost and accuracy of each method. The experimental results show that the calculation cost of this approach is very low compared to one of the previous methods and more accurate than all previous methods for computing semantic relatedness. Originality/value: This is the first proposal of co-occurrence analysis of Wikipedia links for semantic relatedness measurement. The authors show that this approach is effective to measure semantic relatedness among concepts regarding calculation cost and accuracy. The findings may be useful to researchers who are interested in knowledge extraction, as well as ontology researches. 0 0
Wikipedia as domain knowledge networks: Domain extraction and statistical measurement Fang Z.
Wang J.
Ben Liu
Gong W.
KDIR 2011 - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval English 2011 This paper investigates knowledge networks of specific domains extracted from Wikipedia and performs statistical measurements to selected domains. In particular, we first present an efficient method to extract a specific domain knowledge network from Wikipedia. We then extract four domain networks on, respectively, mathematics, physics, biology, and chemistry. We compare the mathematics domain network extracted from Wikipedia with MathWorld, the web's most extensive mathematical resource created and maintained by professional mathematicians, and show that they are statistically similar to each other. This indicates that MathWorld and Wikipedia's mathematics domain knowledge share a similar internal structure. Such information may be useful for investigating knowledge networks. 0 0
Analysis of structural relationships for hierarchical cluster labeling Muhr M.
Roman Kern
Michael Granitzer
SIGIR 2010 Proceedings - 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval English 2010 Cluster label quality is crucial for browsing topic hierarchies obtained via document clustering. Intuitively, the hierarchical structure should influence the labeling accuracy. However, most labeling algorithms ignore such structural properties and therefore, the impact of hierarchical structures on the labeling accuracy is yet unclear. In our work we integrate hierarchical information, i.e. sibling and parent-child relations, in the cluster labeling process. We adapt standard labeling approaches, namely Maximum Term Frequency, Jensen-Shannon Divergence, χ² Test, and Information Gain, to take use of those relationships and evaluate their impact on 4 different datasets, namely the Open Directory Project, Wikipedia, TREC Ohsumed and the CLEF IP European Patent dataset. We show, that hierarchical relationships can be exploited to increase labeling accuracy especially on high-level nodes. 0 0
C-Link: Concept linkage in knowledge repositories Cowling P.
Remde S.
Hartley P.
Stewart W.
Stock-Brooks J.
Woolley T.
AAAI Spring Symposium - Technical Report English 2010 When searching a knowledge repository such as Wikipedia or the Internet, the user doesn't always know what they are looking for. Indeed, it is often the case that a user wishes to find information about a concept that was completely unknown to them prior to the search. In this paper we describe C-Link, which provides the user with a method for searching for unknown concepts which lie between two known concepts. C-Link does this by modeling the knowledge repository as a weighted, directed graph where nodes are concepts and arc weights give the degree of "relatedness" between concepts. An experimental study was undertaken with 59 participants to investigate the performance of C-Link compared to standard search approaches. Statistical analysis of the results shows great potential for C-Link as a search tool. 0 0
Collaborative educational geoanalytics applied to large statistics temporal data Jern M. CSEDU 2010 - 2nd International Conference on Computer Supported Education, Proceedings English 2010 Recent advances in Web 2.0 graphics technologies have the potential to make a dramatic impact on developing collaborative geovisual analytics that analyse, visualize, communicate and present official statistics. In this paper, we introduce novel "storytelling" means for the experts to first explore large, temporal and multidimensional statistical data, then collaborate with colleagues and finally embed dynamic visualization into Web documents e.g. HTML, Blogs or MediaWiki to communicate essential gained insight and knowledge. The aim is to let the analyst (author) explore data and simultaneously save important discoveries and thus enable sharing of gained insights over the Internet. Through the story mechanism facilitating descriptive metatext, textual annotations hyperlinked through the snapshot mechanism and integrated with interactive visualization, the author can let the reader follow the analyst's way of logical reasoning. This emerging technology could in many ways change the terms and structures for learning. 0 0
Interactive statistics learning with RWikiStat Subianto M.
Sofyan H.
ICNIT 2010 - 2010 International Conference on Networking and Information Technology English 2010 RWikiStat is a web based statistics learning. It is built by using Rweb (a web based interface for R statistical software) and Wiki Technology (MediaWiki). Rweb can be seen as a bridge between content and user in learning process, while MediaWiki is to guarantee the sustainability of the program. The output of this research are a website which can be accessed through intranet or internet, a start up CD/DVD for interactive statistics learning, and statistics modules for learning process. 0 0
Online learning for Latent Dirichlet Allocation Hoffman M.D.
Blei D.M.
Bach F.
Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010, NIPS 2010 English 2010 We develop an online variational Bayes (VB) algorithm for Latent Dirichlet Allocation (LDA). Online LDA is based on online stochastic optimization with a natural gradient step, which we show converges to a local optimum of the VB objective function. It can handily analyze massive document collections, including those arriving in a stream. We study the performance of online LDA in several ways, including by fitting a 100-topic topic model to 3.3M articles from Wikipedia in a single pass. We demonstrate that online LDA finds topic models as good or better than those found with batch VB, and in a fraction of the time. 0 0
Extremal dependencies and rank correlations in power law networks Yana Volkovich
Litvak N.
Zwart B.
Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering English 2009 We analyze dependencies in complex networks characterized by power laws (Web sample, Wikipedia sample and a preferential attachment graph) using statistical techniques from the extreme value theory and the theory of multivariate regular variation. To the best of our knowledge, this is the first attempt to apply this well developed methodology to comprehensive graph data. The new insights this yields are striking: the three above-mentioned data sets are shown to have a totally different dependence structure between graph parameters, such as in-degree and PageRank. Based on the proposed approach, we suggest a new measure for rank correlations. Unlike most known methods, this measure is especially sensitive to rank permutations for top-ranked nodes. Using the new correlation measure, we demonstrate that the PageRank ranking is not sensitive to moderate changes in the damping factor. 0 0
Semantic Wikipedia - Checking the Premises Rainer Hammwohner The Social Semantic Web 2007 - Proceedings of the 1st Conference on Social Semantic Web, 2007. 2007 Enhancing Wikipedia by means of semantic representations seems to be a promising issue. From a formal or technical point of view there are no major obstacles in the way. Nevertheless, a close look at Wikipedia, its structure and contents reveals that some questions have to be answered in advance. This paper will deal with these questions and present some first results based on empirical findings. 0 0
Measuring Wikipedia Jakob Voss International Conference of the International Society for Scientometrics and Informetrics English 2005 Wikipedia, an international project that uses Wiki software to collaboratively create an encyclopaedia, is becoming more and more popular. Everyone can directly edit articles and every edit is recorded. The version history of all articles is freely available and allows a multitude of examinations. This paper gives an overview on Wikipedia research. Wikipedia's fundamental components, i.e. articles, authors, edits, and links, as well as content and quality are analysed. Possibilities of research are explored including examples and first results. Several characteristics that are found in Wikipedia, such as exponential growth and scale-free networks are already known in other context. However the Wiki architecture also possesses some intrinsic specialties. General trends are measured that are typical for all Wikipedias but vary between languages in detail. 12 16