An N-gram-and-wikipedia joint approach to natural language identification
|Author(s)||Yang X., Liang W.|
|Published in||2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings|
|Keyword(s)||N-Gram, Natural language identification, TextTiling algorithm, Wikipedia (Extra: Computer scientists, Language processing, Language resources, Linguistic knowledge, Machine translations, Multi-lingual information retrieval, N-Gram, Natural languages, New approaches, TextTiling algorithm, Training sets, Wikipedia, Algorithms, Industrial research, Information retrieval, Information theory, Natural language processing systems, Linguistics)|
An N-gram-and-wikipedia joint approach to natural language identification is a 2010 conference paper written in English by Yang X. and Liang W., published in 2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings.
Natural Language Identification is the process of detecting and determining the language or languages in which a given piece of text is written. As one of the key steps in Computational Linguistics / Natural Language Processing (NLP) tasks such as Machine Translation, Multi-lingual Information Retrieval, and the Processing of Language Resources, Natural Language Identification has drawn widespread attention and extensive research, making it one of the few relatively well-studied sub-fields of NLP. However, various problems in this field remain far from resolved. Current non-computational approaches require that researchers possess sufficient prior linguistic knowledge about the languages to be identified, while current computational (statistical) approaches demand a large-scale training set for each language to be identified. The drawbacks on both sides are evident: few computer scientists are equipped with sufficient knowledge of Linguistics, and the training sets may grow endlessly in pursuit of higher accuracy and coverage of more languages. Moreover, faced with the multi-lingual documents found on the Internet, neither approach renders satisfactory results. To address these problems, this paper proposes a new approach to Natural Language Identification: it exploits N-Gram frequency statistics to segment a piece of text in a language-specific fashion, and then takes advantage of Wikipedia to determine the language used in each segment. Multiple experiments demonstrate that this approach renders satisfactory results, especially on multi-lingual documents.
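The paper's own segmentation method and Wikipedia lookup are not reproduced here, but the N-gram statistics underlying such approaches can be illustrated with a classic character-N-gram profile classifier (out-of-place rank comparison). The toy language samples and the `identify` helper below are illustrative assumptions for this sketch, not the authors' data or code.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Overlapping character n-grams, with space padding at the edges."""
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_profile(text, n=3, top_k=300):
    """Map each of the most frequent n-grams to its frequency rank."""
    counts = Counter(char_ngrams(text, n))
    return {g: rank for rank, (g, _) in enumerate(counts.most_common(top_k))}

def out_of_place(lang_profile, doc_profile, max_rank=300):
    """Sum of rank displacements between profiles; lower = more similar."""
    return sum(abs(lang_profile.get(g, max_rank) - r)
               for g, r in doc_profile.items())

def identify(text, profiles, n=3):
    """Pick the language whose profile is closest to the text's profile."""
    doc_profile = build_profile(text, n)
    return min(profiles, key=lambda lang: out_of_place(profiles[lang], doc_profile))

# Toy profiles built from tiny made-up samples; a real system would train
# on large corpora per language (e.g. Wikipedia dumps).
profiles = {
    "en": build_profile("this is a short sample of english text written "
                        "only for the purpose of building a small profile"),
    "de": build_profile("dies ist ein kurzes beispiel für deutschen text "
                        "der nur dem aufbau eines kleinen profils dient"),
}

result = identify("the weather is nice and the birds are singing", profiles)
print(result)  # expected: en
```

On multi-lingual documents, a pipeline in the spirit of the paper would first segment the text at language boundaries (the paper uses a TextTiling-style pass over N-gram statistics) and then identify each segment separately, rather than scoring the whole document at once.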
This publication has been cited 2 time(s), but no articles citing it are available in WikiPapers.