Benno Stein
| Benno Stein | |
| Affiliation | Unknown |
| Country | Unknown |
| Co-authors | Maik Anderka, Martin Potthast, Matthias Busse, Nedim Lipka, Robert Gerling, Teresa Holfeld |
| Website | Unknown |
| Statistics | |
| Authorship | Publications (10), datasets (4), tools (0) |
| Citations | Total (4), average (0.4), median (0), max (4), min (0) |
Benno Stein is an author.
Datasets
| Dataset | Description |
|---|---|
| PAN Wikipedia quality flaw corpus 2012 | PAN Wikipedia quality flaw corpus 2012 is an evaluation corpus for the "Quality Flaw Prediction in Wikipedia" task of the PAN 2012 Lab, held in conjunction with the CLEF 2012 conference. |
| PAN Wikipedia vandalism corpus 2010 | PAN Wikipedia vandalism corpus 2010 (PAN-WVC-10) is a corpus for the evaluation of automatic vandalism detectors for Wikipedia. |
| PAN Wikipedia vandalism corpus 2011 | PAN Wikipedia vandalism corpus 2011 (PAN-WVC-11) is a corpus for the evaluation of automatic vandalism detectors for Wikipedia. |
| Webis Wikipedia vandalism corpus | Webis Wikipedia vandalism corpus (Webis-WVC-07) is a corpus for the evaluation of automatic vandalism detection algorithms for Wikipedia. |
Publications
Only those publications related to wikis are shown here.

| Title | Keyword(s) | Published in | Language | Date | Abstract | R | C |
|---|---|---|---|---|---|---|---|
| A Breakdown of Quality Flaws in Wikipedia | Quality Flaws, Information quality, Wikipedia, User-generated Content Analysis | 2nd Joint WICOW/AIRWeb Workshop on Web Quality (WebQuality 12) | English | 2012 | The online encyclopedia Wikipedia is a successful example of the increasing popularity of user generated content on the Web. Despite its success, Wikipedia is often criticized for containing low-quality information, which is mainly attributed to its core policy of being open for editing by everyone. The identification of low-quality information is an important task since Wikipedia has become the primary source of knowledge for a huge number of people around the world. Previous research on quality assessment in Wikipedia either investigates only small samples of articles, or else focuses on single quality aspects, like accuracy or formality. This paper targets the investigation of quality flaws, and presents the first complete breakdown of Wikipedia's quality flaw structure. We conduct an extensive exploratory analysis, which reveals (1) the quality flaws that actually exist, (2) the distribution of flaws in Wikipedia, and (3) the extent of flawed content. An important finding is that more than one in four English Wikipedia articles contains at least one quality flaw, 70% of which concern article verifiability. | 0 | 0 |
| On the Evolution of Quality Flaws and the Effectiveness of Cleanup Tags in the English Wikipedia | Wikipedia, Cleanup Tags, Quality Flaws, Information quality, Quality Flaw Evolution | Wikipedia Academy | English | 2012 | The improvement of information quality is a major task for the free online encyclopedia Wikipedia. Recent studies targeted the analysis and detection of specific quality flaws in Wikipedia articles. To date, quality flaws have been exclusively investigated in current Wikipedia articles, based on a snapshot representing the state of Wikipedia at a certain time. This paper goes further, and provides the first comprehensive breakdown of the evolution of quality flaws in Wikipedia. We utilize cleanup tags to analyze the quality flaws that have been tagged by the Wikipedia community in the English Wikipedia, from its launch in 2001 until 2011. This leads to interesting findings regarding (1) the development of Wikipedia's quality flaw structure and (2) the usage and the effectiveness of cleanup tags. Specifically, we show that inline tags are more effective than tag boxes, and provide statistics about the considerable volume of rare and non-specific cleanup tags. We expect that this work will support the Wikipedia community in making quality assurance activities more efficient. | 0 | 0 |
| Overview of the 1st International Competition on Quality Flaw Prediction in Wikipedia | Information quality, Wikipedia, Quality Flaw Prediction | CLEF | English | 2012 | The paper overviews the task "Quality Flaw Prediction in Wikipedia" of the PAN'12 competition. An evaluation corpus is introduced which comprises 1,592,226 English Wikipedia articles, of which 208,228 have been tagged to contain one of ten important quality flaws. Moreover, the performance of three quality flaw classifiers is evaluated. | 0 | 0 |
| Predicting Quality Flaws in User-generated Content: The Case of Wikipedia | User-generated Content Analysis, Information quality, Wikipedia, Quality Flaw Prediction, One-class Classification | 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012) | English | 2012 | The detection and improvement of low-quality information is a key concern in Web applications that are based on user-generated content; a popular example is the online encyclopedia Wikipedia. Existing research on quality assessment of user-generated content deals with the classification as to whether the content is high-quality or low-quality. This paper goes one step further: it targets the prediction of quality flaws, this way providing specific indications in which respects low-quality content needs improvement. The prediction is based on user-defined cleanup tags, which are commonly used in many Web applications to tag content that has some shortcomings. We apply this approach to the English Wikipedia, which is the largest and most popular user-generated knowledge source on the Web. We present an automatic mining approach to identify the existing cleanup tags, which provides us with a training corpus of labeled Wikipedia articles. We argue that common binary or multiclass classification approaches are ineffective for the prediction of quality flaws and hence cast quality flaw prediction as a one-class classification problem. We develop a quality flaw model and employ a dedicated machine learning approach to predict Wikipedia's most important quality flaws. Since in the Wikipedia setting the acquisition of significant test data is intricate, we analyze the effects of a biased sample selection. In this regard we illustrate the classifier effectiveness as a function of the flaw distribution in order to cope with the unknown (real-world) flaw-specific class imbalances. The flaw prediction performance is evaluated with 10,000 Wikipedia articles that have been tagged with the ten most frequent quality flaws: provided test data with little noise, four flaws can be detected with a precision close to 1. | 0 | 0 |
| Detection of Text Quality Flaws as a One-class Classification Problem | Information quality, Wikipedia, Quality Flaw Prediction, One-class Classification | 20th ACM Conference on Information and Knowledge Management (CIKM 11) | English | 2011 | For Web applications that are based on user-generated content the detection of text quality flaws is a key concern. Our research contributes to automatic quality flaw detection. In particular, we propose to cast the detection of text quality flaws as a one-class classification problem: we are given only positive examples (= texts containing a particular quality flaw) and decide whether or not an unseen text suffers from this flaw. We argue that common binary or multiclass classification approaches are ineffective here, and we underpin our approach by a real-world application: we employ a dedicated one-class learning approach to determine whether a given Wikipedia article suffers from certain quality flaws. Since in the Wikipedia setting the acquisition of sensible test data is quite intricate, we analyze the effects of a biased sample selection. In addition, we illustrate the classifier effectiveness as a function of the flaw distribution in order to cope with the unknown (real-world) flaw-specific class imbalances. Altogether, provided test data with little noise, four of ten important quality flaws in Wikipedia can be detected with a precision close to 1. | 0 | 0 |
| Towards automatic quality assurance in Wikipedia | Wikipedia, Information quality, Flaw Detection | 20th International Conference on World Wide Web (WWW 11) | English | 2011 | Featured articles in Wikipedia stand for high information quality, and it has been found interesting to researchers to analyze whether and how they can be distinguished from "ordinary" articles. Here we point out that article discrimination falls far short of writer support or automatic quality assurance: Featured articles are not identified, but are made. Following this motto we compile a comprehensive list of information quality flaws in Wikipedia, model them according to the latest state of the art, and devise one-class classification technology for their identification. | 0 | 0 |
| Identifying featured articles in Wikipedia: writing style matters | Domain transfer, Information quality, Wikipedia | World Wide Web | English | 2010 | | 0 | 0 |
| The ESA Retrieval Model Revisited | | 32nd International ACM SIGIR Conference (SIGIR 09) | English | 2009 | Among the retrieval models that have been proposed in the last years, the ESA model of Gabrilovich and Markovitch received much attention. The authors report on a significant improvement in the retrieval performance, which is explained with the semantic concepts introduced by the document collection underlying ESA. Their explanation appears plausible but our analysis shows that the connections are more involved and that the "concept hypothesis" does not hold. In our contribution we analyze several properties that in fact affect the retrieval performance. Moreover, we introduce a formalization of ESA, which reveals its close connection to existing retrieval models. | 0 | 0 |
| A Wikipedia-Based Multilingual Retrieval Model | | 30th European Conference on IR Research (ECIR 08) | English | 2008 | This paper introduces CL-ESA, a new multilingual retrieval model for the analysis of cross-language similarity. The retrieval model exploits the multilingual alignment of Wikipedia: given a document d written in language L we construct a concept vector **d** for d, where each dimension i in **d** quantifies the similarity of d with respect to a document chosen from the “L-subset” of Wikipedia. Likewise, for a second document d′ written in language L′, we construct a concept vector **d′**, using from the L′-subset of the Wikipedia the topic-aligned counterparts of our previously chosen documents. Since the two concept vectors **d** and **d′** are collection-relative representations of d and d′ they are language-independent. That is, their similarity can directly be computed with the cosine similarity measure, for instance. We present results of an extensive analysis that demonstrates the power of this new retrieval model: for a query document d the topically most similar documents from a corpus in another language are properly ranked. A salient property of the new retrieval model is its robustness with respect to both the size and the quality of the index document collection. | 0 | 0 |
| Automatic Vandalism Detection in Wikipedia | | | English | 2008 | We present results of a new approach to detect destructive article revisions, so-called vandalism, in Wikipedia. Vandalism detection is a one-class classification problem, where vandalism edits are the target to be identified among all revisions. Interestingly, vandalism detection has not been addressed in the Information Retrieval literature so far. In this paper we discuss the characteristics of vandalism as humans recognize it and develop features to render vandalism detection as a machine learning task. We compiled a large number of vandalism edits in a corpus, which allows for the comparison of existing and new detection approaches. Using logistic regression we achieve 83% precision at 77% recall with our model. Compared to the rule-based methods that are currently applied in Wikipedia, our approach increases the F-Measure performance by 49% while being faster at the same time. | 0 | 4 |
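
The CL-ESA abstract in the table above describes building a concept vector for a document against an index collection in its own language and then comparing concept vectors across languages with cosine similarity. The following is a minimal sketch of that idea, not the authors' implementation: the whitespace tokenizer, the term-frequency weighting, the two-document toy index, and all function names (`tf_vector`, `concept_vector`, `cl_esa_similarity`) are assumptions made for illustration only.

```python
import math
from collections import Counter


def tf_vector(text):
    """Sparse term-frequency vector (assumed weighting; whitespace tokens)."""
    return Counter(text.lower().split())


def cosine_sparse(a, b):
    """Cosine similarity of two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def concept_vector(doc, index_docs):
    """Dimension i holds the similarity of doc to the i-th index document."""
    d = tf_vector(doc)
    return [cosine_sparse(d, tf_vector(c)) for c in index_docs]


def cosine_dense(u, v):
    """Cosine similarity of two equally long dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cl_esa_similarity(doc_a, index_a, doc_b, index_b):
    """Compare two documents via concept vectors over topic-aligned indexes."""
    return cosine_dense(concept_vector(doc_a, index_a),
                        concept_vector(doc_b, index_b))


# Hypothetical topic-aligned index: entry i covers the same topic in both languages.
index_en = ["dog animal pet bark", "car engine road wheel"]
index_de = ["hund tier haustier bellen", "auto motor strasse rad"]

query_en = "my dog is a loyal pet"
sim_dog = cl_esa_similarity(query_en, index_en,
                            "der hund ist ein treues haustier", index_de)
sim_car = cl_esa_similarity(query_en, index_en,
                            "das auto hat einen starken motor", index_de)
print(sim_dog > sim_car)
```

In this toy setup the English query and the German text about dogs both load on the first index topic only, so their concept vectors point in the same direction even though the two texts share no words. A realistic index would contain many aligned Wikipedia articles and a stronger term weighting.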
