An analysis of systematic judging errors in information retrieval
|Title||An analysis of systematic judging errors in information retrieval|
|Author(s)||Kazai G., Craswell N., Yilmaz E., Tahaghoghi S.M.M.|
|Published in||ACM International Conference Proceeding Series|
|Keyword(s)||bias, noise, relevance (Extra: bias, Four-group, noise, relevance, Retrieval systems, Test Collection, Web searches, Wikipedia, Errors, Information retrieval, Information retrieval systems, Knowledge management, Management science, Search engines, World Wide Web, Systematic errors)|
An analysis of systematic judging errors in information retrieval is a 2012 conference paper written in English by Kazai G., Craswell N., Yilmaz E. and Tahaghoghi S.M.M., and published in the ACM International Conference Proceeding Series.
Test collections are powerful mechanisms for the evaluation and optimization of information retrieval systems. However, there is reported evidence that experiment outcomes can be affected by changes to the judging guidelines or changes in the judge population. This paper examines such effects in a web search setting, comparing the judgments of four groups of judges: NIST Web Track judges, untrained crowd workers, and two groups of trained judges of a commercial search engine. Our goal is to identify systematic judging errors by comparing the labels contributed by the different groups, working under the same or different judging guidelines. In particular, we focus on detecting systematic differences in judging that depend on specific characteristics of the queries and URLs. For example, we ask whether a given population of judges, working under a given set of judging guidelines, is more likely to consistently overrate Wikipedia pages than another group judging under the same instructions. Our approach is to identify judging errors with respect to a consensus set, a judged gold set and a set of user clicks. We further demonstrate how such biases can affect the training of retrieval systems.
This publication has been cited 1 time, but no citing articles are available in WikiPapers.