Automated Building of Error Corpora of Polish
|Automated Building of Error Corpora of Polish|
|Published in||Corpus Linguistics, Computer Tools, and Applications – State of the Art. PALC 2007, Peter Lang. Internationaler Verlag der Wissenschaften 2008, 631-639|
|Keyword(s)||error corpora, normativity, revision history, corpora building|
|Article||BASE, CiteSeerX, Google Scholar|
|Web||Ask, Bing, Google (PDF), Yahoo!|
|Download and mirrors|
|Local copy||Not available|
|Export and share|
|BibTeX, CSV, RDF, JSON|
|Browse properties · List of conference papers|
Automated Building of Error Corpora of Polish is a 2008 conference paper by Marcin Milkowski and published in Corpus Linguistics, Computer Tools, and Applications – State of the Art. PALC 2007, Peter Lang. Internationaler Verlag der Wissenschaften 2008, 631-639.
The paper shows how to automatically develop error corpora out of revision history of documents. The idea is based on a hypothesis that minor edits in documents represent correction of typos, slips of the tongue, grammar, usage and style mistakes. This hypothesis has been confirmed by frequency analysis of revision history of articles in the Polish Wikipedia. Resources such as revision history in Wikipedia, Wikia, and other collaborative editing systems, can be turned into corpora of errors, just by extracting the minor edits. The most theoretically interesting aspect is that the corrections will represent the average speaker's intuitions about usage, and this seems to be a promising way of researching normativity in claims about proper or improper Polish. By processing the revision history, one can gain pairs of segments in the corpus: first representing the error, and the other representing the correction. Moreover, it is relatively easy to tag parts of speech, compare subsequent versions, and prepare a text file containing the resulting corpus.
- This section requires expansion. Please, help!
Probably, this publication is cited by others, but there are no articles available for them in WikiPapers.