List of datasets
From WikiPapers
- See also: List of tools.
List of all the datasets available in WikiPapers. Currently, there are about 45 datasets.
To create a new "dataset" go to Form:Dataset.
Datasets
| Dataset | Size | Language | Description |
|---|---|---|---|
| Citizendium dumps | < 100 MB | English | |
| Citizendium mailing list archives | | English | |
| CoCoBi | 5 MB | German | CoCoBi is a Corpus of Comparable Biographies in German and contains 400 annotated biographies of 141 famous people. Automatic annotation was done the same way and with the same tools as in WikiBiography. Biographies come from different sources, mainly, from Wikipedia and the Brockhaus Lexikon. |
| Coordinates in Wikipedia articles | From a few KB to 100 MB | Multilingual | Coordinates in Wikipedia articles is a compilation of all the coordinates added to Wikipedia, language-by-language. |
| DBpedia | | Catalan, German, Greek, Spanish, French, Galician, Hungarian, Italian, Dutch, Polish, Portuguese, Russian, Slovenian, Turkish | DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the web to Wikipedia data. |
| Deletionpedia | | English | Deletionpedia is an archive of about 62,471 pages which have been deleted from English Wikipedia. |
| Domas visits logs | ~40 GB/month | Multilingual | Domas visits logs are page view statistics for Wikimedia projects. |
| EPIC/Oxford Wikipedia quality assessment | | English | EPIC/Oxford Wikipedia quality assessment comprises the full, anonymized set of responses from the blind assessment of a sample of Wikipedia articles across languages and disciplines by academic experts. The study was conducted in 2012 by EPIC and the University of Oxford and sponsored by the Wikimedia Foundation. |
| Google dataset linking strings and concepts | ~10 GB | Multilingual | |
| Nupedia mailing list archives | 750 KB | English | Nupedia mailing list archives is a compilation of messages sent to the mailing list of this Wikipedia predecessor. |
| OmegaWiki dumps | | | |
| PAN Wikipedia quality flaw corpus 2012 | 324 MB | English | PAN Wikipedia quality flaw corpus 2012 is an evaluation corpus for the "Quality Flaw Prediction in Wikipedia" task of the PAN 2012 Lab, held in conjunction with the CLEF 2012 conference. |
| PAN Wikipedia vandalism corpus 2010 | 447 MB | English | PAN Wikipedia vandalism corpus 2010 (PAN-WVC-10) is a corpus for the evaluation of automatic vandalism detectors for Wikipedia. |
| PAN Wikipedia vandalism corpus 2011 | 370.8 MB | English, German, Spanish | PAN Wikipedia vandalism corpus 2011 (PAN-WVC-11) is a corpus for the evaluation of automatic vandalism detectors for Wikipedia. |
| Picture of the Year archives | | | |
| PlusPedia | 33 MB | German | PlusPedia is a German "deletionpedia". |
| Repos-2012-dataset | 5 MB | Spanish, Catalan | Repos-2012-dataset contains metadata about links to digital repositories from Spanish and Catalan Wikipedias. |
| SWEETpedia | | English | SWEETpedia is a periodic update of semantic web-related research using Wikipedia. |
| Social networks of Wikipedia dataset | 100 MB | Multilingual | Social networks of Wikipedia dataset is a talk pages analysis. |
| Tamil Wikipedia word list | | English, Tamil | |
| UBY | | | |
| Webis Wikipedia vandalism corpus | 10 KB | English | Webis Wikipedia vandalism corpus (Webis-WVC-07) is a corpus for the evaluation of automatic vandalism detection algorithms for Wikipedia. |
| WikiBiography | 11 MB | German | WikiBiography is a corpus of about 1200 annotated biographies from German Wikipedia. |
| WikiCorpus | 1 GB | Catalan, Spanish, English | WikiCorpus are datasets of Wikipedia enriched with linguistic info. |
| WikiIndex | Around 100 MB | English | WikiIndex is a wiki about wikis. Thousands of them. |
| WikiLit | A few MB | English | WikiLit is a comprehensive literature review of scholarly research on Wikipedia (it does not include literature nor studies about other wikis). |
| WikiNet | ~100 MB | English, Dutch, French, German, Italian | WikiNet is a multi-language ontology built by exploiting several aspects of Wikipedia. |
| WikiPapers | A few MB | English | WikiPapers is a compilation of resources (conference papers, journal articles, theses, books, datasets and tools) focused on the research of wikis. |
| WikiRelations | 676 MB | English | WikiRelations are binary relations obtained from processing Wikipedia category names and the category and page network. |
| WikiSym mailing lists archives | | English | |
| WikiTaxonomy | 2 MB | English | WikiTaxonomy is a taxonomy from Wikipedia. |
| WikiTeam dumps | From a few MB to several GB | Multilingual | |
| Wikia dumps | From a few MB to several GB | Multilingual | |
| Wikimedia Foundation image dump | 75.5 GB | | Wikimedia Foundation image dump is a 77 gigabyte November 2005 tape archive file (.tar) of all the images in use on English Wikipedia. This archive contains roughly 296,000 images. Since this time, the size of the collection of images appears to have prevented Wikimedia from making the full image collection available. |
| Wikimedia dumps | From a few MB to several GB | Multilingual | Wikimedia dumps are complete copies of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML. A number of raw database tables in SQL form are also available. |
| Wikimedia mailing lists archives | Several MB | Multilingual | Wikimedia mailing lists archives are a list of all the public mailing lists on http://lists.wikimedia.org |
| Wikipedia Historical Attributes Data | 5.5 GB | English | Wikipedia Historical Attributes Data contains all attribute-value pairs of infoboxes out of English Wikipedia articles since 2003. It holds more than 500 million attribute changes. |
| Wikipedia Vandalism Corpus (Andrew G. West) | 25.5 MB | English | Wikipedia Vandalism Corpus (Andrew G. West) is a corpus of 5.7 million automatically tagged and 5,000 manually-confirmed incidents of vandalism in English Wikipedia. |
| Wikipedia article ratings | 500 MB | English | Wikipedia article ratings is an anonymized dump of article ratings (aka AFTv4) collected over 1 year since the deployment of the tool on the entire English Wikipedia (July 22, 2011 - July 22, 2012). |
| Wikipedia page-to-page link database | < 400 MB | English | |
| Wikipedia search data | | Multilingual | Wikipedia search data are logs about search queries by visitors. |
| Wikipedia user preferences | | Multilingual | Wikipedia user preferences includes data on user preferences set by active Wikipedia editors. |
| WikipediaXML | From a few MB to several GB | English, German, French, Dutch, Spanish, Chinese, Arabic, Japanese | WikipediaXML can be used in a wide variety of XML IR tasks such as ad hoc retrieval, categorization, clustering, or structure mapping. |
| Wikipediadoc | 187 MB | English | Wikipediadoc contains 67,537 Wikipedia articles converted to Microsoft Word 2002 .doc format (Office XP). |
| Wlm-2011-dataset | 5 MB | Multilingual | wlm-2011-dataset is a dataset for Wiki Loves Monuments 2011. It contains metadata of all the images uploaded to Wikimedia Commons from September 1 to September 30, 2011, during the Wiki Loves Monuments 2011 photo contest. |

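The Wikimedia dumps entry describes wikitext source and metadata embedded in XML. As a rough illustration of reading such a file, here is a minimal Python sketch that streams pages one at a time. It assumes `<page>` elements containing a `<title>` and a `<revision><text>`; the helper names `local_name` and `iter_pages`, the inline sample, and its namespace URI are illustrative assumptions, and real dumps carry more fields and may differ by export version.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki-style XML export.

    Assumption: each <page> holds a <title> and <revision><text>.
    If a page has several revisions, the last <text> seen wins.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the processed page's children to limit memory use


# Tiny hand-written sample standing in for a real dump file (hypothetical data).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>Hello ''world''</text></revision>
  </page>
  <page>
    <title>Empty</title>
    <revision><text></text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, text in pages:
    print(title, len(text))
```

For an actual dump, the `io.BytesIO(SAMPLE)` argument would be replaced by an opened file (for example via `bz2.open` if the file is bzip2-compressed); streaming with `iterparse` avoids loading a multi-gigabyte file into memory at once.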