Named entity normalization in user generated content
Abstract Named entity recognition is important for semantically oriented retrieval tasks, such as question answering, entity retrieval, biomedical retrieval, trend detection, and event and entity tracking. In many of these tasks it is important to be able to accurately normalize the recognized entities, i.e., to map surface forms to unambiguous references to real-world entities. Within the context of structured databases, this task (known as record linkage and data de-duplication) has been a topic of active research for more than five decades. For edited content, such as news articles, the named entity normalization (NEN) task has recently attracted considerable attention. We consider the task in the challenging context of user generated content (UGC), where it forms a key ingredient of tracking and media-analysis systems. A baseline NEN system from the literature (that normalizes surface forms to Wikipedia pages) performs considerably worse on UGC than on edited news: accuracy drops from 80% to 65% for a Dutch language data set and from 94% to 77% for English. We identify several sources of errors: entity recognition errors, multiple ways of referring to the same entity, and ambiguous references. To address these issues we propose five improvements to the baseline NEN algorithm, to arrive at a language-independent NEN system that achieves overall accuracy scores of 90% on the English data set and 89% on the Dutch data set. We show that each of the improvements contributes to the overall score of our improved NEN algorithm, and conclude with an error analysis on both Dutch and English language UGC. The NEN system is computationally efficient and runs with very modest computational requirements. Copyright 2008 ACM.
Bibtextype inproceedings  +
Doi 10.1145/1390749.1390755  +
Has author Jijkoun V. + , Khalid M.A. + , Marx M. + , Maarten de Rijke +
Has extra keyword Analysis systems + , Base-lines + , Computational requirements + , Computationally efficient + , Dataset + , English languages + , Entity recognitions + , Entity retrievals + , Named entities + , Named Entity recognitions + , News articles + , Question answering + , Real worlds + , Record linkages + , Structured databases + , Trend detections + , User generated content + , Wikipedia + , Error analysis + , Error detection + , Errors + , Linguistics + , Technical presentations + , Natural language processing systems +
Has keyword Named entities + , User generated content + , Wikipedia +
Isbn 9781605581965  +
Language English +
Number of citations by publication 0  +
Number of references by publication 0  +
Pages 23–30  +
Published in Proceedings of SIGIR 2008 Workshop on Analytics for Noisy Unstructured Text Data, AND'08 +
Title Named entity normalization in user generated content +
Type conference paper  +
Year 2008 +
Creation date 8 November 2014 02:38:50  +
Categories Publications without license parameter  + , Publications without remote mirror parameter  + , Publications without archive mirror parameter  + , Publications without paywall mirror parameter  + , Conference papers  + , Publications without references parameter  + , Publications  +
Modification date 8 November 2014 02:38:50  +
Date 2008  +