Corpus

From WikiPapers
See also: List of datasets.

Corpus is included as a keyword or extra keyword in 0 datasets, 0 tools, and 6 publications.

Datasets

There are no datasets for this keyword.

Tools

There are no tools for this keyword.


Publications

A Wikipedia-based corpus reference tool
Author(s): Jason Ginsburg. Published in: HCCE. Language: English. Date: 2012. R: 0, C: 0.
Abstract: This paper describes a dictionary-like reference tool that is designed to help users find information that is similar to what one would find in a dictionary when looking up a word, except that this information is extracted automatically from large corpora. For a particular vocabulary item, a user can view frequency information, part-of-speech distribution, word-forms, definitions, example paragraphs and collocations. All of this information is extracted automatically from corpora and most of this information is extracted from Wikipedia. Since Wikipedia is a massive corpus covering a diverse range of general topics, this information is probably very representative of how target words are used in general. This project has applications for English language teachers and learners, as well as for language researchers.
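
The abstract above describes a per-word profile (frequency, word forms, collocations) extracted from a large corpus. A minimal sketch of that idea in Python, assuming a plain-text corpus already extracted from a Wikipedia dump; all function and variable names are illustrative and not taken from the paper:

import re
from collections import Counter

def profile_word(corpus_text, target, window=2, top_n=10):
    """Collect raw frequency, observed word forms, and window-based collocations."""
    tokens = re.findall(r"[a-z]+", corpus_text.lower())
    target = target.lower()

    frequency = sum(1 for t in tokens if t == target)

    # Word forms: a crude prefix match; a real tool would use a lemmatizer.
    word_forms = Counter(t for t in tokens if t.startswith(target))

    # Collocations: words co-occurring within +/- `window` tokens of the target.
    collocates = Counter()
    for i, t in enumerate(tokens):
        if t == target:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    collocates[tokens[j]] += 1

    return {
        "frequency": frequency,
        "word_forms": word_forms.most_common(top_n),
        "collocations": collocates.most_common(top_n),
    }

# Example: profile_word(wikipedia_text, "corpus") returns the raw frequency of
# "corpus", word forms sharing its prefix (a lemmatizer would be needed to also
# catch "corpora"), and the most frequent neighbouring words as collocation candidates.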

An English-translated parallel corpus for the CJK Wikipedia collections
Author(s): Tang L.-X., Shlomo Geva, Andrew Trotman. Published in: Proceedings of the 17th Australasian Document Computing Symposium, ADCS 2012. Language: English. Date: 2012. R: 0, C: 0.
Abstract: In this paper, we describe a machine-translated parallel English corpus for the NTCIR Chinese, Japanese and Korean (CJK) Wikipedia collections. This document collection is named CJK2E Wikipedia XML corpus. The corpus could be used by the information retrieval research community and for knowledge sharing in Wikipedia in many ways; for example, this corpus could be used for experimentation in cross-lingual information retrieval, cross-lingual link discovery, or omni-lingual information retrieval research. Furthermore, the translated CJK articles could be used to further expand the current coverage of the English Wikipedia.

A novel approach to sentence alignment from comparable corpora
Author(s): Li M.-H., Vitaly Klyuev, Wu S.-H. Published in: Proceedings of the 6th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, IDAACS'2011. Language: English. Date: 2011. R: 0, C: 0.
Abstract: This paper introduces a new technique to select candidate sentences for alignment from bilingual comparable corpora. Tests were done using Wikipedia as a source of bilingual data. Our test languages are English and Chinese. The high quality of the sentence alignment is illustrated by a machine translation application.

Query segmentation revisited
Author(s): Hagen M., Martin Potthast, Benno Stein, Brautigam C. Published in: Proceedings of the 20th International Conference on World Wide Web, WWW 2011. Language: English. Date: 2011. R: 0, C: 0.
Abstract: We address the problem of query segmentation: given a keyword query, the task is to group the keywords into phrases, if possible. Previous approaches to the problem achieve reasonable segmentation performance but are tested only against a small corpus of manually segmented queries. In addition, many of the previous approaches are fairly intricate, as they use expensive features and are difficult to reimplement. The main contribution of this paper is a new method for query segmentation that is easy to implement, fast, and that comes with a segmentation accuracy comparable to current state-of-the-art techniques. Our method uses only raw web n-gram frequencies and Wikipedia titles that are stored in a hash table. At the same time, we introduce a new evaluation corpus for query segmentation. With about 50 000 human-annotated queries, it is two orders of magnitude larger than the corpus used up to now.
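
As a rough illustration of the kind of frequency-based segmentation the abstract above sketches, the following toy scorer enumerates all segmentations of a short keyword query and scores multi-word segments by n-gram frequency, with a boost for known Wikipedia titles kept in a hash set. The scoring function and the toy data are assumptions for illustration and do not reproduce the paper's exact method:

NGRAM_FREQ = {"new york": 50_000, "new york times": 30_000, "times square": 20_000}
WIKI_TITLES = {"new york", "new york times", "times square"}  # title lookup via a hash set

def segment_score(segment):
    """Score a candidate segment; single words carry no phrase evidence."""
    phrase = " ".join(segment)
    if len(segment) == 1:
        return 0.0
    score = len(segment) ** len(segment) * NGRAM_FREQ.get(phrase, 0)
    if phrase in WIKI_TITLES:
        score *= 2  # illustrative boost for exact Wikipedia title matches
    return score

def best_segmentation(words):
    """Enumerate all segmentations of a short keyword query, keep the best-scoring one."""
    if not words:
        return 0.0, []
    best_score, best_segs = float("-inf"), None
    for cut in range(1, len(words) + 1):
        head = words[:cut]
        tail_score, tail_segs = best_segmentation(words[cut:])
        total = segment_score(head) + tail_score
        if total > best_score:
            best_score, best_segs = total, [head] + tail_segs
    return best_score, best_segs

# Example: best_segmentation(tuple("new york times square".split())) groups the
# query into the phrases "new york times" and "square" under this toy scoring.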

Aufbau eines linguistischen Korpus aus den Daten der englischen Wikipedia (Building a linguistic corpus from the data of the English Wikipedia)
Author(s): Markus Fuchs. Published in: Proceedings of the Conference on Natural Language Processing 2010 (KONVENS 10). Language: German. Date: 2010. R: 0, C: 0.

Crowdsourcing a Wikipedia Vandalism Corpus
Author(s): Martin Potthast. Published in: SIGIR. Language: English. Date: 2010. R: 6, C: 1.
Abstract: We report on the construction of the PAN Wikipedia vandalism corpus, PAN-WVC-10, using Amazon’s Mechanical Turk. The corpus compiles 32 452 edits on 28 468 Wikipedia articles, among which 2 391 vandalism edits have been identified. 753 human annotators cast a total of 193 022 votes on the edits, so that each edit was reviewed by at least 3 annotators, whereas the achieved level of agreement was analyzed in order to label an edit as “regular” or “vandalism.” The corpus is available free of charge.
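
The corpus construction described above rests on aggregating at least three crowdsourced votes per edit into a label. A minimal sketch of such vote aggregation, assuming a simple majority rule with an agreement threshold; this is an illustration, not the actual PAN-WVC-10 annotation pipeline:

from collections import Counter

def label_edit(votes, min_votes=3, min_agreement=2 / 3):
    """Aggregate annotator votes ("regular"/"vandalism") for one edit into a label."""
    if len(votes) < min_votes:
        return "undecided"  # in practice, the edit would be sent out for more votes
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    return label if top / len(votes) >= min_agreement else "undecided"

# Example: label_edit(["vandalism", "vandalism", "regular"]) -> "vandalism",
# while label_edit(["regular", "vandalism"]) -> "undecided" (too few votes).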