Exploiting XML structure to improve information retrieval in peer-to-peer systems

From WikiPapers
Jump to: navigation, search

Exploiting XML structure to improve information retrieval in peer-to-peer systems is a 2008 conference paper written in English by Winter J. and published in ACM SIGIR 2008 - 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Proceedings.

[edit] Abstract

With the advent of XML as a standard for representation and exchange of structured documents, a growing amount of XML-documents are being stored in Peer-to-Peer (P2P) networks. Current research on P2P search engines proposes the use of Information Retrieval (IR) techniques to perform content-based search, but does not take into account structural features of documents. P2P systems typically have no central index, thus avoiding single-points-of-failures, but distribute all information among participating peers. Accordingly, a querying peer has only limited access to the index information and should select carefully which peers can help answering a given query by contributing resources such as local index information or CPU time for ranking computations. Bandwidth consumption is a major issue. To guarantee scalability, P2P systems have to reduce the number of peers involved in the retrieval process. As a result, the retrieval quality in terms of recall and precision may suffer substantially. In the proposed thesis, document structure is considered as an extra source of information to improve the retrieval quality of XML-documents in a P2P environment. The thesis centres on the following questions: how can structural information help to improve the retrieval of XML-documents in terms of result quality such as precision, recall, and specificity? Can XML structure support the routing of queries in distributed environments, especially the selection of promising peers? How can XML IR techniques be used in a P2P network while minimizing bandwidth consumption and considering performance aspects? To answer these questions and to analyze possible achievements, a search engine is proposed that exploits structural hints expressed explicitly by the user or implicitly by the self-describing structure of XML-documents. Additionally, more focused and specific results are obtained by providing ranked retrieval units that can be either XML-documents as a whole or the most relevant passages of theses documents. XML information retrieval techniques are applied in two ways: to select those peers participating in the retrieval process, and to compute the relevance of documents. The indexing approach includes both content and structural information of documents. To support efficient execution of multi term queries, index keys consist of rare combinations of (content, structure)-tuples. Performance is increased by using only fixed-sized posting lists: frequent index keys are combined with each other iteratively until the new combination is rare, with a posting list size under a pre-set threshold. All posting lists are sorted by taking into account classical IR measures such as term frequency and inverted term frequency as well as weights for potential retrieval units of a document, with a slight bias towards documents on peers with good collections regarding the current index key and with good peer characteristics such as online times, available bandwidth, and latency. When extracting the posting list for a specific query, a re-ordering on the posting list is performed that takes into account the structural similarity between key and query. According to this preranking, peers are selected that are expected to hold information about potentially relevant documents and retrieval units The final ranking is computed in parallel on those selected peers. The computation is based on an extension of the vector space model and distinguishes between weights for different structures of the same content. This allows weighting XML elements with respect to their discriminative power, e.g. a title will be weighted much higher than a footnote. Additionally, relevance is computed as a mixture of content relevance and structural similarity between a given query and a potential retrieval unit. Currently, a first prototype for P2P Information Retrieval of XML-documents called SPIRIX is being implemented. Experiments to evaluate the proposed techniques and use of structural hints will be performed on a distributed version of the INEX Wikipedia Collection.

[edit] References

This section requires expansion. Please, help!

Cited by

Probably, this publication is cited by others, but there are no articles available for them in WikiPapers.