An N-gram-and-Wikipedia joint approach to natural language identification
Abstract Natural Language Identification is the process of detecting and determining in which language or languages a given piece of text is written. As one of the key steps in Computational Linguistics/Natural Language Processing (NLP) tasks, such as Machine Translation, Multi-lingual Information Retrieval and Processing of Language Resources, Natural Language Identification has drawn widespread attention and extensive research, making it one of the few relatively well-studied sub-fields in the whole NLP field. However, various problems in this field remain far from resolved. Current non-computational approaches require that researchers possess sufficient prior linguistic knowledge about the languages to be identified, while current computational (statistical) approaches demand a large-scale training set for each to-be-identified language. The drawbacks of both are apparent: few computer scientists are equipped with sufficient knowledge of Linguistics, and the training set may grow endlessly in pursuit of higher accuracy and the ability to process more languages. Moreover, faced with multi-lingual documents on the Internet, neither approach renders satisfactory results. To address these problems, this paper proposes a new approach to Natural Language Identification. It exploits N-Gram frequency statistics to segment a piece of text in a language-specific fashion, and then takes advantage of Wikipedia to determine the language used in each segment. Multiple experiments have demonstrated that this approach renders satisfactory results, especially with multi-lingual documents.
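The abstract describes a two-stage pipeline: N-gram frequency statistics drive a TextTiling-style segmentation of the text at points where the character distribution shifts (suggesting a language switch), and Wikipedia is then used to decide the language of each segment. The following Python sketch illustrates the general idea only; it is not the authors' implementation. The window size, similarity threshold, candidate-language list, and the hit-count scoring heuristic are all illustrative assumptions, while the HTTP calls use the standard public MediaWiki search API.

from collections import Counter
import math
import requests

def ngram_profile(text, n=3):
    """Character n-gram frequency profile of a text window."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    """Cosine similarity between two n-gram profiles."""
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def segment(words, window=20, threshold=0.3):
    """TextTiling-style segmentation: cut where the n-gram profiles of
    adjacent windows diverge (low similarity suggests a language switch)."""
    boundaries = [0]
    for i in range(window, len(words) - window, window):
        left = ngram_profile(" ".join(words[i - window:i]))
        right = ngram_profile(" ".join(words[i:i + window]))
        if cosine(left, right) < threshold:
            boundaries.append(i)
    boundaries.append(len(words))
    return [" ".join(words[a:b]) for a, b in zip(boundaries, boundaries[1:])]

def wikipedia_hits(word, lang):
    """Number of search hits for `word` in the given Wikipedia edition."""
    r = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={"action": "query", "list": "search",
                "srsearch": word, "format": "json"},
        timeout=10,
    )
    return r.json()["query"]["searchinfo"]["totalhits"]

def guess_language(segment_text, candidates=("en", "de", "fr")):
    """Score each candidate edition by how strongly it recognises a few
    of the segment's words; the highest-scoring edition wins."""
    words = segment_text.split()[:5]
    scores = {lang: sum(wikipedia_hits(w, lang) for w in words)
              for lang in candidates}
    return max(scores, key=scores.get)

Note that raw search-hit counts are a crude signal, since larger Wikipedia editions return more hits for almost any string; a more careful scorer would normalise by edition size or use exact title matches instead.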
Bibtextype inproceedings
Doi 10.1109/IUCS.2010.5666010
Has author Yang X., Liang W.
Has extra keyword Computer scientists, Language processing, Language resources, Linguistic knowledge, Machine translations, Multi-lingual information retrieval, N-Gram, Natural languages, New approaches, TextTiling algorithm, Training sets, Wikipedia, Algorithms, Industrial research, Information retrieval, Information theory, Natural language processing systems, Linguistics
Has keyword N-Gram, Natural language identification, TextTiling algorithm, Wikipedia
Isbn 9781424478200
Language English
Number of citations by publication 0
Number of references by publication 0
Pages 332–339
Published in 2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings
Title An N-gram-and-Wikipedia joint approach to natural language identification
Type conference paper
Year 2010
Creation date 6 November 2014 16:42:06
Categories Publications without license parameter, Publications without remote mirror parameter, Publications without archive mirror parameter, Publications without paywall mirror parameter, Conference papers, Publications without references parameter, Publications
Modification date 6 November 2014 16:42:06
Date 2010