Wikimedia dumps
Keyword(s) | mediawiki, xml dumps |
Size | From a few MB to several GB |
Language(s) | Multilingual |
Author(s) | wikimedians |
License(s) | Unknown |
Website | http://dumps.wikimedia.org/ |
Related material | |
Related dataset(s) | Wikimedia Foundation image dump, WikiTeam dumps, Wikia dumps, Domas visits logs, Picture of the Year archives |
Related tool(s) | Unknown |
Wikimedia dumps are complete copies of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML. A number of raw database tables in SQL form are also available.
- Main site for XML & SQL dumps: http://dumps.wikimedia.org/
- Archive with some historical dumps: Wikimedia Downloads historical Archives
- Timeline of dumps procedures: http://wikitech.wikimedia.org/view/Dumps/History
- Dumps at Internet Archive: http://www.archive.org/details/wikimediadownloads
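For programmatic access, per-wiki dump files are published under predictable paths on the sites listed above. The following is a minimal sketch in Python, assuming the conventional file name `<wiki>-latest-pages-articles.xml.bz2` for the combined article dump; check the directory listing (https://dumps.wikimedia.org/<wiki>/latest/) for the wiki you need, since large wikis are also split into numbered parts.

```python
# Minimal sketch: fetch the latest pages-articles dump for a wiki from
# dumps.wikimedia.org. Assumes the conventional file name
# <wiki>-latest-pages-articles.xml.bz2; verify it in the directory listing
# before relying on it.
import shutil
import urllib.request

def download_latest_articles_dump(wiki="simplewiki", dest=None):
    """Stream the dump to disk without holding it in memory."""
    filename = f"{wiki}-latest-pages-articles.xml.bz2"
    url = f"https://dumps.wikimedia.org/{wiki}/latest/{filename}"
    dest = dest or filename
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest

if __name__ == "__main__":
    # simplewiki is used here only because its dump is small enough for a quick test.
    print(download_latest_articles_dump("simplewiki"))
```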
Example
You can see an example of the XML schema for MediaWiki pages by exporting a page with the Special:Export tool.
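The Special:Export output and the pages-articles dumps use the same MediaWiki export XML format: a `<mediawiki>` root wrapping `<page>` elements, each with a `<title>`, `<ns>`, `<id>` and one or more `<revision>` elements whose `<text>` holds the wikitext. Below is a minimal sketch, assuming the 0.10 export namespace (the actual version is declared on the root element of the file you download, so adjust if yours differs), that streams page titles and wikitext out of a compressed dump using only the standard library.

```python
# Minimal sketch: stream pages from a pages-articles XML dump
# (e.g. enwiki-latest-pages-articles.xml.bz2) without loading it all in memory.
# The export namespace version (0.10 here) is an assumption; check the
# <mediawiki> root element of your file and adjust if needed.
import bz2
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(path):
    """Yield (title, wikitext) pairs from a bzip2-compressed dump."""
    with bz2.open(path, "rb") as f:
        for event, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext(f"{NS}revision/{NS}text") or ""
                yield title, text
                elem.clear()  # drop the page subtree to keep memory bounded

if __name__ == "__main__":
    for title, text in iter_pages("enwiki-latest-pages-articles.xml.bz2"):
        print(title, len(text))
```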
Publications
Title | Author(s) | Keyword(s) | Published in | Language | Date | Abstract | R | C |
---|---|---|---|---|---|---|---|---|
Analyzing and Predicting Quality Flaws in User-generated Content: The Case of Wikipedia | Maik Anderka | Information quality, Wikipedia, Quality Flaws, Quality Flaw Prediction | Bauhaus-Universität Weimar, Germany | English | 2013 | Web applications that are based on user-generated content are often criticized for containing low-quality information; a popular example is the online encyclopedia Wikipedia. The major points of criticism pertain to the accuracy, neutrality, and reliability of information. The identification of low-quality information is an important task since for a huge number of people around the world it has become a habit to first visit Wikipedia in case of an information need. Existing research on quality assessment in Wikipedia either investigates only small samples of articles, or else deals with the classification of content into high-quality or low-quality. This thesis goes further, it targets the investigation of quality flaws, thus providing specific indications of the respects in which low-quality content needs improvement. The original contributions of this thesis, which relate to the fields of user-generated content analysis, data mining, and machine learning, can be summarized as follows: (1) We propose the investigation of quality flaws in Wikipedia based on user-defined cleanup tags. Cleanup tags are commonly used in the Wikipedia community to tag content that has some shortcomings. Our approach is based on the hypothesis that each cleanup tag defines a particular quality flaw. (2) We provide the first comprehensive breakdown of Wikipedia's quality flaw structure. We present a flaw organization schema, and we conduct an extensive exploratory data analysis which reveals (a) the flaws that actually exist, (b) the distribution of flaws in Wikipedia, and, (c) the extent of flawed content. (3) We present the first breakdown of Wikipedia's quality flaw evolution. We consider the entire history of the English Wikipedia from 2001 to 2012, which comprises more than 508 million page revisions, summing up to 7.9 TB. Our analysis reveals (a) how the incidence and the extent of flaws have evolved, and, (b) how the handling and the perception of flaws have changed over time. (4) We are the first who operationalize an algorithmic prediction of quality flaws in Wikipedia. We cast quality flaw prediction as a one-class classification problem, develop a tailored quality flaw model, and employ a dedicated one-class machine learning approach. A comprehensive evaluation based on human-labeled Wikipedia articles underlines the practical applicability of our approach. | 0 | 0 |
Sustainability of Open Collaborative Communities: Analyzing Recruitment Efficiency | Kevin Crowston, Nicolas Jullien, Felipe Ortega | DEA modeling, Efficiency, Recruitment, Wikipedia | Technology Innovation Management Review | | January 2013 | | 0 | 0 |
Extraction of RDF Dataset from Wikipedia Infobox Data | Jimmy K. Chiu, Thomas Y. Lee, Sau Dan Lee, Hailey H. Zhu, David W. Cheung | | | English | 2010 | This paper outlines the cleansing and extraction process of infobox data from Wikipedia data dump into Resource Description Framework (RDF) triplets. The numbers of the extracted triplets, resources, and predicates are substantially large enough for many research purposes such as semantic web search. Our software tool will be open-sourced for researchers to produce up-to-date RDF datasets from routine Wikipedia data dumps. | 0 | 0 |
Wikipedia: A Quantitative Analysis | Felipe Ortega | | Universidad Rey Juan Carlos, Spain | English | 2009 | In this doctoral thesis, we undertake a quantitative analysis of the top-ten language editions of Wikipedia, from different perspectives. Our main goal has been to trace the evolution in time of key descriptive and organizational parameters of Wikipedia and its community of authors. The analysis has focused on logged authors (those editors who created a personal account to participate in the project). Among the distinct metrics included, we can find the monthly evolution of general metrics (number of revisions, active editors, active pages); the distribution of pages and its length, the evolution of participation in discussion pages. We also present a detailed analysis of the inner social structure and stratification of the Wikipedia community of logged authors, fitting appropriate distributions to the most relevant metrics. We also examine the inequality level of contributions from logged authors, showing that there exists a core of very active authors who undertake most of the editorial work. Regarding articles, the inequality analysis also shows that there exists a reduced group of popular articles, though the distribution of revisions is not as skewed as in the previous case. The analysis continues with an in-depth demographic study of the community of authors, focusing on the evolution of the core of very active contributors (applying a statistical technique known as survival analysis). We also explore some basic metrics to analyze the quality of Wikipedia articles and the trustworthiness level of individual authors. This work concludes with an extended analysis of the evolution of the most influential parameters and metrics previously presented. Based on these metrics, we infer important conclusions about the future sustainability of Wikipedia. According to these results, the Wikipedia community of authors has ceased to grow, remaining stable since Summer 2006 until the end of 2007. As a result, the monthly number of revisions has remained stable over the same period, restricting the number of articles that can be reviewed by the community. On the other side, whilst the number of revisions in talk pages has stabilized over the same period, as well, the number of active talk pages follows a steady growing rate, for all versions. This suggests that the community of authors is shifting its focus to broaden the coverage of discussion pages, which has a direct impact in the final quality of content, as previous research works has shown. Regarding the inner social structure of the Wikipedia community of logged authors, we find Pareto-like distributions that fit all relevant metrics pertaining authors (number of revisions per author, number of different articles edited per author), while measurements on articles (number of revisions per article, number of different authors per article) follow lognormal shapes. The analysis of the inequality level of revisions performed by authors, and revisions received by articles shows highly unequal distributions. The results of our survival analysis on Wikipedia authors presents very high mortality percentages on young authors, revealing an endemic problem of Wikipedias to keep young editors on collaborating with the project for a long period of time. In the same way, from our survival analysis we obtain that the mean lifetime of Wikipedia authors in the core (until they abandon the group of top editors) is situated between 200 and 400 days, for all versions, while the median value is lower than 120 days in all cases. Moreover the analysis of the monthly number of births and deaths in the community of logged authors reveals that the cause of the shift in the monthly trend of active authors is produced by a higher number of deaths from Summer 2006 in all versions, surpassing the monthly number of births from then on. The analysis of the inequality level of contributions over time, and the evolution of additional key features identified in this thesis, reveals a worrying trend towards progressive increase of the effort spent by core authors, as time elapses. This trend may eventually cause that these authors will reach their upper limit in the number of revisions they can perform each month, thus starting a decreasing trend in the number of monthly revisions, and an overall recession of the content creation and reviewing process in Wikipedia. To prevent this probable future scenario, the number of monthly new editors should be improved again, perhaps through the adoption of specific policies and campaigns for attracting new editors to Wikipedia, and recover older top-contributors again. Finally, another important contribution for the research community is WikiXRay, the software tool we have developed to perform the statistical analyses included in this thesis. This tool completely automates the process of retrieving the database dumps from the Wikimedia public repositories, process them to obtain key metrics and descriptive parameters, and load them in a local database, ready to be used in empirical analyses. As far as we know, this is the first research work implementing a comparative analysis, from a quantitative point of view, of the top-ten language editions of Wikipedia, presenting results from many different scientific perspectives. Therefore, we expect that this contribution will help the scientific community to enhance their understanding of the rich, complex and fascinating working mechanisms and behavioral patterns of the Wikipedia project and its community of authors. Likewise, we hope that WikiXRay will facilitate the hard task of developing empirical analyses on any language version of the encyclopedia, boosting in this way the number of comparative studies like this one in many other scientific disciplines. | 0 | 8 |
Measuring Wikipedia | Jakob Voss | Wikipedia, Wiki, Statistics, Informetrics, Webometrics, Cybermetrics | International Conference of the International Society for Scientometrics and Informetrics | English | 2005 | Wikipedia, an international project that uses Wiki software to collaboratively create an encyclopaedia, is becoming more and more popular. Everyone can directly edit articles and every edit is recorded. The version history of all articles is freely available and allows a multitude of examinations. This paper gives an overview on Wikipedia research. Wikipedia's fundamental components, i.e. articles, authors, edits, and links, as well as content and quality are analysed. Possibilities of research are explored including examples and first results. Several characteristics that are found in Wikipedia, such as exponential growth and scale-free networks are already known in other context. However the Wiki architecture also possesses some intrinsic specialties. General trends are measured that are typical for all Wikipedias but vary between languages in detail. | 12 | 16 |