A lexicon for processing archaic language: the case of XIXth century Slovene


A lexicon for processing archaic language: the case of XIXth century Slovene is a 2011 conference paper written in English by Tomaž Erjavec, Christoph Ringlstetter, Maja Žorga and Annette Gotscharek, published in WoLeR 2011: International Workshop on Lexical Resources.

Abstract

The paper presents a lexicon to support computational processing of historical Slovene texts. Historical Slovene texts are being increasingly digitised and made available on the internet but are still underutilised as no language technology support is offered for their processing. Appropriate tools and resources would enable full-text searching with modern-day lemmas, modernisation of archaic language to make it more accessible to today's readers, and automatic OCR correction. We discuss the lexicon needed to support tokenisation, modernisation, lemmatisation and part-of-speech tagging of historical texts. The process of lexicon acquisition relies on a proof-read corpus, a large lexicon of contemporary Slovene, and tools to map historical forms to their contemporary equivalents via a set of rewrite rules, and to provide an editing environment for lexicon construction. The lexicon, currently work in progress, will be made publicly available; it should help not only in making digital libraries more accessible but also provide a quantitative basis for linguistic explorations of historical Slovene texts and a prototype electronic dictionary of archaic Slovene.
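The idea of mapping historical forms to contemporary equivalents via rewrite rules can be illustrated with a minimal sketch. It is not the authors' tool: the rule list, the toy lexicon and the function names `candidates` and `modernise` are assumptions made for this example. The rules are a handful of Bohorič-alphabet spelling correspondences, each applied optionally at every position, and a candidate is kept only if it occurs in the contemporary lexicon.

```python
# Illustrative rewrite rules (assumption, not the paper's rule set):
# a few Bohorič-alphabet spellings and their modern counterparts.
RULES = [
    ("ſh", "š"),
    ("zh", "č"),
    ("sh", "ž"),
    ("ſ", "s"),
    ("z", "c"),
    ("s", "z"),
]

# Toy stand-in (assumption) for a large lexicon of contemporary Slovene.
MODERN_LEXICON = {"človek", "šola", "cesta"}


def candidates(form):
    """Return every string obtainable by optionally applying a rule at each position."""
    if not form:
        return {""}
    # Option 1: keep the first character unchanged.
    out = {form[0] + rest for rest in candidates(form[1:])}
    # Option 2: rewrite a prefix that matches the left-hand side of a rule.
    for old, new in RULES:
        if form.startswith(old):
            out |= {new + rest for rest in candidates(form[len(old):])}
    return out


def modernise(form):
    """Return the rewrite candidates that are attested in the contemporary lexicon."""
    return sorted(c for c in candidates(form.lower()) if c in MODERN_LEXICON)


print(modernise("zhlovek"))
print(modernise("ſhola"))
print(modernise("zeſta"))
```

Filtering candidates against a contemporary lexicon keeps the rules simple: they may overgenerate freely, because only attested modern forms survive. The exhaustive enumeration is exponential in the number of rule matches, which is acceptable for single word forms but would need pruning at corpus scale.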



