List of datasets

From WikiPapers
See also: List of tools.

This is a list of datasets available in WikiPapers. Currently, there are 45 datasets.

To create a new dataset, go to Form:Dataset.


Dataset Size Language Description
Citizendium dumps < 100 MB English
Citizendium mailing list archives English
CoCoBi 5 MB German CoCoBi is a Corpus of Comparable Biographies in German and contains 400 annotated biographies of 141 famous people. Automatic annotation was done the same way and with the same tools as in WikiBiography. Biographies come from different sources, mainly, from Wikipedia and the Brockhaus Lexikon.
Coordinates in Wikipedia articles From a few KB to 100 MB Coordinates in Wikipedia articles is a compilation of all the coordinates added to Wikipedia, language-by-language.
DBpedia Catalan DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the web to Wikipedia data.
Deletionpedia English Deletionpedia is an archive of 62,471 pages which have been deleted from the English Wikipedia.
Domas visits logs ~40GB/month Domas visits logs are page view statistics for Wikimedia projects.
EPIC/Oxford Wikipedia quality assessment English This dataset comprises the full, anonymized set of responses from the blind assessment of a sample of Wikipedia articles across languages and disciplines by academic experts. The study was conducted in 2012 by EPIC and the University of Oxford and sponsored by the Wikimedia Foundation.
Google dataset linking strings and concepts ~10 GB
Nupedia mailing list archives 750 KB English Nupedia mailing list archives is a compilation of messages sent to the mailing list of this Wikipedia predecessor.
OmegaWiki dumps
PAN Wikipedia quality flaw corpus 2012 324 MB English PAN Wikipedia quality flaw corpus 2012 is an evaluation corpus for the "Quality Flaw Prediction in Wikipedia" task of the PAN 2012 Lab, held in conjunction with the CLEF 2012 conference.
PAN Wikipedia vandalism corpus 2010 447 MB English PAN Wikipedia vandalism corpus 2010 (PAN-WVC-10) is a corpus for the evaluation of automatic vandalism detectors for Wikipedia.
PAN Wikipedia vandalism corpus 2011 370.8 MB English PAN Wikipedia vandalism corpus 2011 (PAN-WVC-11) is a corpus for the evaluation of automatic vandalism detectors for Wikipedia.
Picture of the Year archives
PlusPedia 33 MB German PlusPedia is a German "deletionpedia".
Repos-2012-dataset 5 MB Spanish Repos-2012-dataset contains metadata about links to digital repositories from the Spanish and Catalan Wikipedias.
SWEETpedia English SWEETpedia is a periodically updated listing of semantic-web research that uses Wikipedia.
Social networks of Wikipedia dataset 100 MB Social networks of Wikipedia dataset is a dataset of social networks extracted from Wikipedia talk pages.
Tamil Wikipedia word list Tamil
Webis Wikipedia vandalism corpus 10 KB English Webis Wikipedia vandalism corpus (Webis-WVC-07) is a corpus for the evaluation of automatic vandalism detection algorithms for Wikipedia.
WikiBiography 11 MB German WikiBiography is a corpus of about 1200 annotated biographies from German Wikipedia.
WikiCorpus 1 GB Catalan WikiCorpus is a set of Wikipedia datasets enriched with linguistic annotation.
WikiIndex Around 100 MB English WikiIndex is a wiki that indexes thousands of other wikis.
WikiLit A few MB English WikiLit is a comprehensive literature review of scholarly research on Wikipedia (it does not include literature or studies about other wikis).
WikiNet ~100 MB English WikiNet is a multilingual ontology built by exploiting several aspects of Wikipedia.
WikiPapers A few MB English WikiPapers is a compilation of resources (conference papers, journal articles, theses, books, datasets and tools) focused on the research of wikis.
WikiRelations 676 MB English WikiRelations are binary relations obtained from processing Wikipedia category names and the category and page network.
WikiSym mailing lists archives English
WikiTaxonomy 2 MB English WikiTaxonomy is a taxonomy derived from the Wikipedia category network.
WikiTeam dumps From a few MB to several GB
Wikia dumps From a few MB to several GB
Wikimedia Foundation image dump 75.5 GB Wikimedia Foundation image dump is a November 2005 tape archive file (.tar) of all the images in use on the English Wikipedia, roughly 296,000 images. Since then, the growth of the image collection appears to have prevented Wikimedia from making a full image dump available.
Wikimedia dumps From a few MB to several GB Wikimedia dumps are complete copies of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML. A number of raw database tables in SQL form are also available.
Wikimedia mailing lists archives Several MB Wikimedia mailing lists archives are the archives of all the public Wikimedia mailing lists.
Wikipedia Historical Attributes Data 5.5 GB English Wikipedia Historical Attributes Data contains all attribute-value pairs of infoboxes out of English Wikipedia articles since 2003. It holds more than 500 million attribute changes.
Wikipedia Vandalism Corpus (Andrew G. West) 25.5 MB English Wikipedia Vandalism Corpus (Andrew G. West) is a corpus of 5.7 million automatically tagged and 5,000 manually confirmed incidents of vandalism in English Wikipedia.
Wikipedia article ratings 500 MB English Wikipedia article ratings is an anonymized dump of article ratings (aka AFTv4) collected over 1 year since the deployment of the tool on the entire English Wikipedia (July 22, 2011 - July 22, 2012).
Wikipedia page-to-page link database < 400 MB English
Wikipedia search data Wikipedia search data are logs of search queries entered by visitors.
Wikipedia user preferences Wikipedia user preferences is a dataset of user preference settings by active Wikipedia editors.
WikipediaXML From a few MB to several GB English WikipediaXML can be used in a large variety of XML IR tasks such as ad-hoc retrieval, categorization, clustering, or the Structure Mapping task.
Wikipediadoc 187 MB English Wikipediadoc contains 67,537 Wikipedia articles converted to Microsoft Word 2002 .doc format (Office XP).
Wlm-2011-dataset 5 MB Wlm-2011-dataset is a dataset for the Wiki Loves Monuments 2011 photo contest. It contains metadata for all the images uploaded to Wikimedia Commons from September 1 to September 30, 2011, during the contest.
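Several of the datasets above (Wikimedia dumps, Wikia dumps, WikiTeam dumps) share the same XML export layout: pages with wikitext and metadata. As a minimal sketch of how such a dump can be streamed without loading it into memory, the snippet below parses a tiny inline sample that mimics that layout; real dumps are multi-gigabyte, bz2-compressed, and use an XML namespace, which the tag check below tolerates.

```python
# Minimal sketch: streaming pages out of a Wikimedia-style XML dump.
# The SAMPLE string only mimics the dump layout; real dumps are much
# larger and namespaced, so we use iterparse to keep memory use flat.
import io
import xml.etree.ElementTree as ET

SAMPLE = b"""<mediawiki>
  <page>
    <title>Example</title>
    <revision>
      <timestamp>2012-01-01T00:00:00Z</timestamp>
      <text>'''Example''' is a [[sample]] article.</text>
    </revision>
  </page>
</mediawiki>"""

def iter_pages(stream):
    """Yield (title, wikitext) pairs without loading the whole dump."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag.rsplit("}", 1)[-1] == "page":  # tolerate namespaced tags
            title = elem.findtext(".//title")
            text = elem.findtext(".//text")
            yield title, text
            elem.clear()                           # free memory as we go

for title, text in iter_pages(io.BytesIO(SAMPLE)):
    print(title, "->", text)
```

The same loop works on a real dump by passing a `bz2.open(...)` stream instead of the in-memory sample.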
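The Domas visits logs listed above are distributed as hourly plain-text files, one line per page, with four space-separated fields: project, page title, request count, and bytes served. A hedged sketch of parsing that format follows; the sample lines are illustrative, not taken from a real log file.

```python
# Hedged sketch: parsing lines in the hourly pagecounts format used by
# the Domas visits logs (project, page title, request count, bytes).
def parse_pagecounts(lines):
    """Return {(project, title): requests} from pagecounts-style lines."""
    counts = {}
    for line in lines:
        parts = line.strip().split(" ")
        if len(parts) != 4:
            continue                      # skip malformed lines
        project, title, requests, _bytes = parts
        counts[(project, title)] = int(requests)
    return counts

sample = [
    "en Main_Page 2417 99999999",   # illustrative values only
    "de Wikipedia 311 12345678",
]
print(parse_pagecounts(sample))
```

Aggregating many hourly files into daily or monthly counts is then a matter of summing the dictionaries.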
