Automated non-content word list generation using hLDA

From WikiPapers
Jump to: navigation, search

Automated non-content word list generation using hLDA is a 2013 conference paper written in English by Krug W., Tomlinson M.T. and published in FLAIRS 2013 - Proceedings of the 26th International Florida Artificial Intelligence Research Society Conference.

[edit] Abstract

In this paper, we present a language-independent method for the automatic, unsupervised extraction of non-content words from a corpus of documents. This method permits the creation of word lists that may be used in place of traditional function word lists in various natural language processing tasks. As an example we generated lists of words from a corpus of English, Chinese, and Russian posts extracted from Wikipedia articles and Wikipedia Wikitalk discussion pages. We applied these lists to the task of authorship attribution on this corpus to compare the effectiveness of lists of words extracted with this method to expert-created function word lists and frequent word lists (a common alternative to function word lists). hLDA lists perform comparably to frequent word lists. The trials also show that corpus-derived lists tend to perform better than more generic lists, and both sets of generated lists significantly outperformed the expert lists. Additionally, we evaluated the performance of an English expert list on machine translations of our Chinese and Russian documents, showing that our method also outperforms this alternative. Copyright © 2013, Association for the Advancement of Artificial Intelligence. All rights reserved.

[edit] References

This section requires expansion. Please, help!

Cited by

Probably, this publication is cited by others, but there are no articles available for them in WikiPapers.