Martin Potthast

From WikiPapers

Martin Potthast is an author from Germany.

Datasets

PAN Wikipedia vandalism corpus 2010 (PAN-WVC-10): a corpus for the evaluation of automatic vandalism detectors for Wikipedia.
PAN Wikipedia vandalism corpus 2011 (PAN-WVC-11): a corpus for the evaluation of automatic vandalism detectors for Wikipedia.
Webis Wikipedia vandalism corpus (Webis-WVC-07): a corpus for the evaluation of automatic vandalism detection algorithms for Wikipedia.


Publications

Only those publications related to wikis are shown here.
Query segmentation revisited
Keywords: Corpus, Query segmentation, Web N-grams
Published in: Proceedings of the 20th International Conference on World Wide Web, WWW 2011
Language: English
Date: 2011
Abstract: We address the problem of query segmentation: given a keyword query, the task is to group the keywords into phrases, if possible. Previous approaches to the problem achieve reasonable segmentation performance but have been tested only against a small corpus of manually segmented queries. In addition, many of the previous approaches are fairly intricate, as they use expensive features and are difficult to reimplement. The main contribution of this paper is a new method for query segmentation that is easy to implement, fast, and comparable in segmentation accuracy to current state-of-the-art techniques. Our method uses only raw web n-gram frequencies and Wikipedia titles that are stored in a hash table. At the same time, we introduce a new evaluation corpus for query segmentation. With about 50 000 human-annotated queries, it is two orders of magnitude larger than the corpus used up to now. Copyright © 2011 by the Association for Computing Machinery, Inc. (ACM).
R: 0, C: 0

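
The approach summarized in the abstract lends itself to a compact sketch. The following toy implementation is an illustration only: the frequency table, title set, and scoring function are invented assumptions, not the paper's actual data or weighting. It enumerates all segmentations of a query and scores each phrase by a web n-gram frequency table, boosting phrases that are Wikipedia titles:

```python
# Hypothetical n-gram frequency table and Wikipedia title set; the paper
# stores the real data in a hash table for constant-time lookups.
NGRAM_FREQ = {
    "new york": 100_000, "times square": 50_000, "new york times": 80_000,
    "new": 500_000, "york": 300_000, "times": 400_000, "square": 200_000,
}
WIKI_TITLES = {"new york", "times square", "new york times"}

def segmentations(words):
    """Yield every split of the word list into contiguous phrases."""
    n = len(words)
    for cut in range(2 ** (n - 1)):  # each bit decides: break after word i?
        seg, start = [], 0
        for i in range(n - 1):
            if cut & (1 << i):
                seg.append(" ".join(words[start:i + 1]))
                start = i + 1
        seg.append(" ".join(words[start:]))
        yield seg

def score(segmentation):
    """Toy score: frequency-weighted phrases, with longer phrases and
    Wikipedia titles preferred (the paper's actual weighting differs)."""
    total = 0
    for phrase in segmentation:
        freq = NGRAM_FREQ.get(phrase, 0)
        if phrase in WIKI_TITLES:
            freq *= 2  # arbitrary boost for known titles
        total += freq * len(phrase.split()) ** 2
    return total

def segment(query):
    """Return the highest-scoring segmentation of a keyword query."""
    return max(segmentations(query.split()), key=score)
```

With the toy data above, `segment("new york times square")` groups the keywords into phrases instead of treating each word separately; only the two lookup tables need to be swapped for real n-gram counts and titles.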
Cross-language plagiarism detection
Published in: Language Resources and Evaluation
Date: 2010
R: 0, C: 0

Crowdsourcing a Wikipedia Vandalism Corpus
Keywords: Wikipedia, Vandalism detection, Evaluation, Corpus
Published in: SIGIR
Language: English
Date: 2010
Abstract: We report on the construction of the PAN Wikipedia vandalism corpus, PAN-WVC-10, using Amazon's Mechanical Turk. The corpus compiles 32 452 edits on 28 468 Wikipedia articles, among which 2 391 vandalism edits have been identified. 753 human annotators cast a total of 193 022 votes on the edits, so that each edit was reviewed by at least 3 annotators, and the achieved level of agreement was analyzed in order to label each edit as “regular” or “vandalism.” The corpus is available free of charge.
R: 6, C: 1

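
The vote-aggregation step described in the abstract can be illustrated with a minimal sketch. The threshold and tie handling here are assumptions for illustration; the corpus construction used a more careful agreement analysis:

```python
from collections import Counter

def label_edit(votes, min_votes=3):
    """Aggregate annotator votes ('regular' / 'vandalism') into a label.

    Returns None when fewer than min_votes were cast or when no strict
    majority emerges, signalling that more annotators are needed.
    (Illustrative policy only, not the corpus's actual scheme.)
    """
    if len(votes) < min_votes:
        return None
    winner, count = Counter(votes).most_common(1)[0]
    return winner if count * 2 > len(votes) else None
```

For example, `label_edit(["vandalism", "vandalism", "regular"])` yields "vandalism", while a 2–2 split yields None and would trigger further review.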
A Wikipedia-Based Multilingual Retrieval Model
Published in: 30th European Conference on IR Research (ECIR 08)
Language: English
Date: 2008
Abstract: This paper introduces CL-ESA, a new multilingual retrieval model for the analysis of cross-language similarity. The retrieval model exploits the multilingual alignment of Wikipedia: given a document d written in language L, we construct a concept vector d for d, where each dimension i in d quantifies the similarity of d with respect to a document chosen from the “L-subset” of Wikipedia. Likewise, for a second document d′ written in language L′, we construct a concept vector d′, using from the L′-subset of Wikipedia the topic-aligned counterparts of our previously chosen documents. Since the two concept vectors d and d′ are collection-relative representations of d and d′, they are language-independent, i.e., their similarity can be computed directly, for instance with the cosine similarity measure. We present results of an extensive analysis that demonstrates the power of this new retrieval model: for a query document d, the topically most similar documents from a corpus in another language are properly ranked. A salient property of the new retrieval model is its robustness with respect to both the size and the quality of the index document collection.
R: 0, C: 0

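
The idea of comparing collection-relative concept vectors across languages can be sketched as follows. The per-dimension similarity here is a simple Jaccard overlap over term sets, a crude stand-in for the real ESA-style similarity, and all documents and index subsets are invented for illustration:

```python
import math

def concept_vector(doc, index_docs):
    """One dimension per index document: similarity of `doc` (a term set)
    to that index document, using Jaccard overlap as a toy measure."""
    return [len(doc & d) / len(doc | d) for d in index_docs]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Topic-aligned index subsets: en_index[i] and de_index[i] cover the same topic,
# mirroring Wikipedia's cross-language article alignment.
en_index = [{"music", "opera", "aria"}, {"soccer", "goal", "league"}]
de_index = [{"musik", "oper", "arie"}, {"fussball", "tor", "liga"}]

d_en = {"opera", "music", "concert"}  # English document about music
d_de = {"oper", "musik", "konzert"}   # German document on the same topic

# Both vectors live in the same collection-relative space, so cosine
# similarity applies directly despite the different languages.
similarity = cosine(concept_vector(d_en, en_index),
                    concept_vector(d_de, de_index))
```

The two topic-matched documents map to identical concept vectors here, so their cosine similarity is 1.0, while a German document about soccer would be orthogonal to the English music document.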
Wikipedia in the pocket: Indexing technology for near-duplicate detection and high similarity search
Keywords: Fuzzy-fingerprinting, Hash-based indexing, Near-duplicate detection
Published in: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07
Language: English
Date: 2007
Abstract: We develop and implement a new indexing technology which allows us to use complete (and possibly very large) documents as queries, while achieving retrieval performance comparable to a standard term query. Our approach aims at retrieval tasks such as near-duplicate detection and high similarity search. To demonstrate the performance of our technology, we have compiled the search index “Wikipedia in the Pocket”, which contains about 2 million English and German Wikipedia articles. This index, along with a search interface, fits on a conventional CD (0.7 gigabyte). The ingredients of our indexing technology are similarity hashing and minimal perfect hashing.
R: 0, C: 0

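
The hash-based indexing idea can be sketched as follows. The fingerprint function here is a crude toy (tokens hashed into a few buckets, with above-mean bucket loads recorded as bits), not the paper's actual fuzzy-fingerprinting scheme, and the index is a plain dictionary rather than a minimal perfect hash:

```python
import hashlib

def fingerprint(text, num_buckets=8):
    """Toy similarity hash: distribute tokens over a few buckets and
    record which buckets are loaded above average. Similar documents
    tend to produce the same bit pattern."""
    loads = [0] * num_buckets
    for token in text.lower().split():
        digest = int(hashlib.md5(token.encode()).hexdigest(), 16)
        loads[digest % num_buckets] += 1
    mean = sum(loads) / num_buckets
    return sum(1 << i for i, load in enumerate(loads) if load > mean)

def build_index(docs):
    """Map each fingerprint to the documents that produced it; documents
    sharing a fingerprint are near-duplicate candidates."""
    index = {}
    for doc_id, text in docs.items():
        index.setdefault(fingerprint(text), []).append(doc_id)
    return index
```

Retrieval with a complete document as the query then reduces to a single lookup, `index.get(fingerprint(query_doc), [])`, which is what makes whole-document queries as cheap as term queries in this style of indexing.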