Automated Building of Error Corpora of Polish

Automated Building of Error Corpora of Polish is a 2008 conference paper by Marcin Milkowski, published in Corpus Linguistics, Computer Tools, and Applications – State of the Art. PALC 2007 (Peter Lang, Internationaler Verlag der Wissenschaften, 2008), pp. 631-639.

Abstract

The paper shows how to automatically build error corpora from the revision history of documents. The idea rests on the hypothesis that minor edits in documents represent corrections of typos, slips of the tongue, and mistakes in grammar, usage, and style. This hypothesis has been confirmed by a frequency analysis of the revision history of articles in the Polish Wikipedia. Resources such as the revision histories of Wikipedia, Wikia, and other collaborative editing systems can be turned into corpora of errors simply by extracting the minor edits. The most theoretically interesting aspect is that the corrections will represent the average speaker's intuitions about usage, which seems a promising way of researching normativity in claims about proper or improper Polish. By processing the revision history, one obtains pairs of segments in the corpus: the first representing the error and the second representing the correction. Moreover, it is relatively easy to tag parts of speech, compare subsequent versions, and prepare a text file containing the resulting corpus.
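
The pairing step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the tokenizer, and the `max_tokens` threshold used here as a stand-in for "minor edit" are all assumptions made for illustration, and the Polish sentence is an invented example. The sketch compares two subsequent versions of a text token by token with Python's standard `difflib` module and keeps only short replacements as (error, correction) pairs.

```python
import difflib
import re


def tokenize(text):
    # Split into word tokens and single punctuation marks.
    # \w is Unicode-aware for str in Python 3, so Polish diacritics stay inside words.
    return re.findall(r"\w+|[^\w\s]", text)


def extract_pairs(old, new, max_tokens=3):
    """Return (error, correction) pairs for short replacements between two revisions.

    max_tokens is a hypothetical threshold: replacements longer than this on
    either side are treated as rewrites rather than minor corrections.
    """
    a, b = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue  # ignore pure insertions, deletions, and unchanged runs
        if i2 - i1 > max_tokens or j2 - j1 > max_tokens:
            continue  # too large to count as a minor edit in this sketch
        pairs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return pairs


if __name__ == "__main__":
    old_revision = "Warszawa jest stolicom Polski."
    new_revision = "Warszawa jest stolicą Polski."
    for error, correction in extract_pairs(old_revision, new_revision):
        print(f"{error}\t{correction}")
```

Writing the pairs as tab-separated lines matches the abstract's mention of a plain text file as the output format. Part-of-speech tagging, which the abstract also mentions, would need an external tagger and is left out of this sketch.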
