Predicting quality flaws in user-generated content: The case of Wikipedia
Abstract The detection and improvement of low-quality information is a key concern in Web applications that are based on user-generated content; a popular example is the online encyclopedia Wikipedia. Existing research on quality assessment of user-generated content deals with the classification as to whether the content is high-quality or low-quality. This paper goes one step further: it targets the prediction of quality flaws, this way providing specific indications in which respects low-quality content needs improvement. The prediction is based on user-defined cleanup tags, which are commonly used in many Web applications to tag content that has some shortcomings. We apply this approach to the English Wikipedia, which is the largest and most popular user-generated knowledge source on the Web. We present an automatic mining approach to identify the existing cleanup tags, which provides us with a training corpus of labeled Wikipedia articles. We argue that common binary or multiclass classification approaches are ineffective for the prediction of quality flaws and hence cast quality flaw prediction as a one-class classification problem. We develop a quality flaw model and employ a dedicated machine learning approach to predict Wikipedia's most important quality flaws. Since in the Wikipedia setting the acquisition of significant test data is intricate, we analyze the effects of a biased sample selection. In this regard we illustrate the classifier effectiveness as a function of the flaw distribution in order to cope with the unknown (real-world) flaw-specific class imbalances. The flaw prediction performance is evaluated with 10,000 Wikipedia articles that have been tagged with the ten most frequent quality flaws: provided test data with little noise, four flaws can be detected with a precision close to 1.
Bibtextype inproceedings  +
Doi 10.1145/2348283.2348413  +
Has author Maik Anderka + , Benno Stein + , Nedim Lipka +
Has extra keyword Class imbalance + , Flaw models + , High quality + , Information quality + , Knowledge sources + , Learning approach + , Low qualities + , Multi-class classification + , One-class Classification + , Online encyclopedia + , Prediction performance + , Quality assessment + , Sample selection + , Test data + , Training corpus + , User generated content + , WEB application + , Wikipedia + , Data mining + , Forecasting + , Information retrieval + , Research + , Websites +
Has keyword Information quality + , One-class classification + , Quality flaw prediction + , User-generated content analysis + , Wikipedia +
Isbn 9781450316583  +
Language English +
Number of citations by publication 0  +
Number of references by publication 0  +
Pages 981–990  +
Published in SIGIR'12 - Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval +
Title Predicting quality flaws in user-generated content: The case of Wikipedia +
Type conference paper  +
Year 2012 +
Creation date 8 November 2014 04:29:33  +
Categories Duplicate publication  + , Publications without license parameter  + , Publications without remote mirror parameter  + , Publications without archive mirror parameter  + , Publications without paywall mirror parameter  + , Conference papers  + , Publications without references parameter  + , Publications  +
Modification date 8 November 2014 04:29:33  +
Date 2012  +