Hugo Zaragoza

From WikiPapers

Hugo Zaragoza is an author.

Publications

Only those publications related to wikis are shown here.
Caching search engine results over incremental indices
Keywords: Real-time indexing; Search engine caching
Published in: SIGIR 2010 Proceedings - 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Language: English
Date: 2010
Abstract: A Web search engine must update its index periodically to incorporate changes to the Web. We argue in this paper that index updates fundamentally impact the design of search engine result caches, a performance-critical component of modern search engines. Index updates lead to the problem of cache invalidation: invalidating cached entries of queries whose results have changed. Naïve approaches, such as flushing the entire cache upon every index update, lead to poor performance and in fact, render caching futile when the frequency of updates is high. Solving the invalidation problem efficiently corresponds to predicting accurately which queries will produce different results if re-evaluated, given the actual changes to the index. To obtain this property, we propose a framework for developing invalidation predictors and define metrics to evaluate invalidation schemes. We describe concrete predictors using this framework and compare them against a baseline that uses a cache invalidation scheme based on time-to-live (TTL). Evaluation over Wikipedia documents using a query log from the Yahoo search engine shows that selective invalidation of cached search results can lower the number of unnecessary query evaluations by as much as 30% compared to a baseline scheme, while returning results of similar freshness. In general, our predictors enable fewer unnecessary invalidations and fewer stale results compared to a TTL-only scheme for similar freshness of results.
R: 0, C: 0
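The abstract contrasts a time-to-live (TTL) baseline with selective invalidation of cached results after an index update. The following minimal Python sketch only illustrates that contrast; it is not the paper's predictors, and the class name, the document-overlap test, and the toy data are assumptions made for the example.

import time

class ResultCache:
    # Query-result cache keyed by query string.
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.entries = {}  # query -> (results, timestamp, set of doc ids in results)

    def put(self, query, results, doc_ids):
        self.entries[query] = (results, time.time(), set(doc_ids))

    def get_with_ttl(self, query):
        # TTL baseline: a hit is served only while the entry is young enough;
        # anything older is treated as stale whether or not its results changed.
        hit = self.entries.get(query)
        if hit and time.time() - hit[1] < self.ttl:
            return hit[0]
        return None  # miss or expired -> the query must be re-evaluated

    def invalidate_selectively(self, changed_doc_ids):
        # Selective invalidation: drop only the entries whose cached results
        # contain a document touched by the latest index update.
        changed = set(changed_doc_ids)
        stale = [q for q, (_, _, docs) in self.entries.items() if docs & changed]
        for q in stale:
            del self.entries[q]
        return stale  # queries that now need re-evaluation

# Usage: cache two queries, then apply an index update touching one document.
cache = ResultCache(ttl_seconds=600)
cache.put("wiki caching", ["d1", "d7"], ["d1", "d7"])
cache.put("semantic search", ["d3"], ["d3"])
print(cache.invalidate_selectively(["d7"]))  # only 'wiki caching' is dropped
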
Investigating the demand side of semantic search through query log analysis
Published in: CEUR Workshop Proceedings
Language: English
Date: 2009
Abstract: In this paper, we propose a method to create aggregated representations of the information needs of Web users when searching for particular types of objects. We suggest this method as a way to investigate the gap between what Web search users are expecting to find and the kind of information that is provided by Semantic Web datasets formatted according to a particular ontology. We evaluate our method quantitatively by measuring its power as a query completion mechanism. Last, we perform a qualitative evaluation comparing the information Web users search for with the information available in DBpedia, the structured data representation of Wikipedia.
R: 0, C: 0
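One way to read "aggregated representations of information needs" is to collect, over a query log, the context words that accompany known instances of an object type. The sketch below is only an illustration under that assumption, not the paper's method; the function name, the instance lists, and the toy log are hypothetical.

from collections import Counter

def aggregate_information_needs(query_log, instances_by_type):
    # For each object type, count the query words used alongside its instances.
    profiles = {t: Counter() for t in instances_by_type}
    for query in query_log:
        words = query.lower().split()
        for obj_type, instances in instances_by_type.items():
            for instance in instances:
                if instance in words:
                    # every non-instance word contributes to the demand profile
                    profiles[obj_type].update(w for w in words if w != instance)
    return profiles

# Usage: the most frequent context words can drive a simple query completion.
log = ["paris hotels", "paris map", "berlin hotels", "madrid weather"]
profiles = aggregate_information_needs(log, {"city": ["paris", "berlin", "madrid"]})
print(profiles["city"].most_common(3))  # e.g. [('hotels', 2), ('map', 1), ...]
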
Learning to tag and tagging to learn: A case study on Wikipedia
Published in: IEEE Intelligent Systems
Language: English
Date: 2008
Abstract: Information technology experts suggest that natural language technologies will play an important role in the Web's future. The latest Web developments, such as the huge success of Web 2.0, demonstrate annotated data's significant potential. The problem of semantically annotating Wikipedia inspires a novel method for dealing with domain and task adaptation of semantic taggers in cases where parallel text and metadata are available. One main approach to tagging for acquiring knowledge from Wikipedia involves self-training, which adds automatically annotated data from the target domain to the original training data. Another key approach involves structural correspondence learning, which tries to build a shared feature representation of the data.
R: 0, C: 0
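The self-training idea described in the abstract can be sketched as a loop that folds a tagger's confident automatic annotations of target-domain text back into the training data. The toy most-frequent-tag tagger and the tiny data below are stand-ins for illustration, not the system used in the paper.

from collections import Counter, defaultdict

class ToyTagger:
    # Assigns each word the tag it was most often seen with in training.
    def fit(self, tagged_sentences):
        counts = defaultdict(Counter)
        for words, tags in tagged_sentences:
            for w, t in zip(words, tags):
                counts[w][t] += 1
        self.best = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        return self

    def predict(self, words):
        tags = [self.best.get(w, "O") for w in words]
        confidence = sum(w in self.best for w in words) / max(len(words), 1)
        return tags, confidence

def self_train(labeled, unlabeled, rounds=3, threshold=0.6):
    # Self-training loop: add the tagger's confident automatic annotations
    # of target-domain sentences to the training data and retrain.
    train, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        tagger = ToyTagger().fit(train)
        confident = [(s, tagger.predict(s)[0]) for s in pool
                     if tagger.predict(s)[1] >= threshold]
        if not confident:
            break
        train.extend(confident)
        kept = {tuple(s) for s, _ in confident}
        pool = [s for s in pool if tuple(s) not in kept]
    return ToyTagger().fit(train)

# Usage with tiny illustrative data (source annotations, unlabeled target text).
source = [(["Paris", "is", "nice"], ["LOC", "O", "O"])]
target = [["Paris", "is", "big"], ["Berlin", "is", "far"]]
tagger = self_train(source, target)
print(tagger.predict(["Paris", "is", "far"]))
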
Semantically Annotated Snapshot of the English Wikipedia
Published in: LREC'08
Date: 2008
Abstract: This paper describes SW1, the first version of a semantically annotated snapshot of the English Wikipedia. In recent years Wikipedia has become a valuable resource for both the Natural Language Processing (NLP) community and the Information Retrieval (IR) community. Although NLP technology for processing Wikipedia already exists, not all researchers and developers have the computational resources to process such a volume of information. Moreover, the use of different versions of Wikipedia processed differently might make it difficult to compare results. The aim of this work is to provide easy access to syntactic and semantic annotations for researchers of both the NLP and IR communities by building a reference corpus to homogenize experiments and make results comparable. These resources, a semantically annotated corpus and an “entity containment” derived graph, are licensed under the GNU Free Documentation License and available from http://www.yr-bcn.es/semanticWikipedia
R: 0, C: 1
Ranking Very Many Typed Entities on Wikipedia
Published in: CIKM '07: Proceedings of the Sixteenth ACM International Conference on Information and Knowledge Management
Language: English
Date: 2007
Abstract: We discuss the problem of ranking very many entities of different types. In particular, we deal with a heterogeneous set of types, some being very generic and some very specific. We discuss two approaches for this problem: i) exploiting the entity containment graph and ii) using a Web search engine to compute entity relevance. We evaluate these approaches on the real task of ranking Wikipedia entities typed with a state-of-the-art named-entity tagger. Results show that both approaches can greatly increase the performance of methods based only on passage retrieval.
R: 0, C: 0
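One simple reading of "exploiting the entity containment graph" is to let each typed entity inherit the retrieval scores of the passages that contain it. The sketch below is an illustration under that assumption only; the scoring, data structures, and example data are hypothetical and are not the ranking functions evaluated in the paper.

from collections import defaultdict

def rank_entities(passage_scores, entities_in_passage, top_k=10):
    # passage_scores: {passage_id: retrieval score for the query}
    # entities_in_passage: {passage_id: [(entity, type), ...]}
    scores = defaultdict(float)
    for pid, score in passage_scores.items():
        for entity, etype in entities_in_passage.get(pid, []):
            scores[(entity, etype)] += score  # entity inherits passage evidence
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]

# Usage: two retrieved passages, entities tagged with coarse types.
passages = {"p1": 2.3, "p2": 1.1}
entities = {"p1": [("Barcelona", "LOCATION"), ("Yahoo", "ORGANIZATION")],
            "p2": [("Barcelona", "LOCATION")]}
print(rank_entities(passages, entities))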