Browse wiki

Jump to: navigation, search
Exploiting XML structure to improve information retrieval in peer-to-peer systems
Abstract With the advent of XML as a standard for rWith the advent of XML as a standard for representation and exchange of structured documents, a growing amount of XML-documents are being stored in Peer-to-Peer (P2P) networks. Current research on P2P search engines proposes the use of Information Retrieval (IR) techniques to perform content-based search, but does not take into account structural features of documents. P2P systems typically have no central index, thus avoiding single-points-of-failures, but distribute all information among participating peers. Accordingly, a querying peer has only limited access to the index information and should select carefully which peers can help answering a given query by contributing resources such as local index information or CPU time for ranking computations. Bandwidth consumption is a major issue. To guarantee scalability, P2P systems have to reduce the number of peers involved in the retrieval process. As a result, the retrieval quality in terms of recall and precision may suffer substantially. In the proposed thesis, document structure is considered as an extra source of information to improve the retrieval quality of XML-documents in a P2P environment. The thesis centres on the following questions: how can structural information help to improve the retrieval of XML-documents in terms of result quality such as precision, recall, and specificity? Can XML structure support the routing of queries in distributed environments, especially the selection of promising peers? How can XML IR techniques be used in a P2P network while minimizing bandwidth consumption and considering performance aspects? To answer these questions and to analyze possible achievements, a search engine is proposed that exploits structural hints expressed explicitly by the user or implicitly by the self-describing structure of XML-documents. Additionally, more focused and specific results are obtained by providing ranked retrieval units that can be either XML-documents as a whole or the most relevant passages of theses documents. XML information retrieval techniques are applied in two ways: to select those peers participating in the retrieval process, and to compute the relevance of documents. The indexing approach includes both content and structural information of documents. To support efficient execution of multi term queries, index keys consist of rare combinations of (content, structure)-tuples. Performance is increased by using only fixed-sized posting lists: frequent index keys are combined with each other iteratively until the new combination is rare, with a posting list size under a pre-set threshold. All posting lists are sorted by taking into account classical IR measures such as term frequency and inverted term frequency as well as weights for potential retrieval units of a document, with a slight bias towards documents on peers with good collections regarding the current index key and with good peer characteristics such as online times, available bandwidth, and latency. When extracting the posting list for a specific query, a re-ordering on the posting list is performed that takes into account the structural similarity between key and query. According to this preranking, peers are selected that are expected to hold information about potentially relevant documents and retrieval units The final ranking is computed in parallel on those selected peers. The computation is based on an extension of the vector space model and distinguishes between weights for different structures of the same content. This allows weighting XML elements with respect to their discriminative power, e.g. a title will be weighted much higher than a footnote. Additionally, relevance is computed as a mixture of content relevance and structural similarity between a given query and a potential retrieval unit. Currently, a first prototype for P2P Information Retrieval of XML-documents called SPIRIX is being implemented. Experiments to evaluate the proposed techniques and use of structural hints will be performed on a distributed version of the INEX Wikipedia Collection. version of the INEX Wikipedia Collection.
Abstractsub With the advent of XML as a standard for rWith the advent of XML as a standard for representation and exchange of structured documents, a growing amount of XML-documents are being stored in Peer-to-Peer (P2P) networks. Current research on P2P search engines proposes the use of Information Retrieval (IR) techniques to perform content-based search, but does not take into account structural features of documents. P2P systems typically have no central index, thus avoiding single-points-of-failures, but distribute all information among participating peers. Accordingly, a querying peer has only limited access to the index information and should select carefully which peers can help answering a given query by contributing resources such as local index information or CPU time for ranking computations. Bandwidth consumption is a major issue. To guarantee scalability, P2P systems have to reduce the number of peers involved in the retrieval process. As a result, the retrieval quality in terms of recall and precision may suffer substantially. In the proposed thesis, document structure is considered as an extra source of information to improve the retrieval quality of XML-documents in a P2P environment. The thesis centres on the following questions: how can structural information help to improve the retrieval of XML-documents in terms of result quality such as precision, recall, and specificity? Can XML structure support the routing of queries in distributed environments, especially the selection of promising peers? How can XML IR techniques be used in a P2P network while minimizing bandwidth consumption and considering performance aspects? To answer these questions and to analyze possible achievements, a search engine is proposed that exploits structural hints expressed explicitly by the user or implicitly by the self-describing structure of XML-documents. Additionally, more focused and specific results are obtained by providing ranked retrieval units that can be either XML-documents as a whole or the most relevant passages of theses documents. XML information retrieval techniques are applied in two ways: to select those peers participating in the retrieval process, and to compute the relevance of documents. The indexing approach includes both content and structural information of documents. To support efficient execution of multi term queries, index keys consist of rare combinations of (content, structure)-tuples. Performance is increased by using only fixed-sized posting lists: frequent index keys are combined with each other iteratively until the new combination is rare, with a posting list size under a pre-set threshold. All posting lists are sorted by taking into account classical IR measures such as term frequency and inverted term frequency as well as weights for potential retrieval units of a document, with a slight bias towards documents on peers with good collections regarding the current index key and with good peer characteristics such as online times, available bandwidth, and latency. When extracting the posting list for a specific query, a re-ordering on the posting list is performed that takes into account the structural similarity between key and query. According to this preranking, peers are selected that are expected to hold information about potentially relevant documents and retrieval units The final ranking is computed in parallel on those selected peers. The computation is based on an extension of the vector space model and distinguishes between weights for different structures of the same content. This allows weighting XML elements with respect to their discriminative power, e.g. a title will be weighted much higher than a footnote. Additionally, relevance is computed as a mixture of content relevance and structural similarity between a given query and a potential retrieval unit. Currently, a first prototype for P2P Information Retrieval of XML-documents called SPIRIX is being implemented. Experiments to evaluate the proposed techniques and use of structural hints will be performed on a distributed version of the INEX Wikipedia Collection. version of the INEX Wikipedia Collection.
Bibtextype inproceedings  +
Doi 10.1145/1390334.1390565  +
Has author Winter J. +
Has extra keyword Bandwidth + , Client server computer systems + , Computer networks + , Computer software + , Computer systems + , Distributed computer systems + , Encoding (symbols) + , Information retrieval + , Information retrieval systems + , Information services + , Markup languages + , Model structures + , Natural language processing systems + , Research and development management + , Search engine + , Telecommunication systems + , World Wide Web + , Available bandwidths + , Bandwidth consumptions + , Content-based XML-retrieval + , Cpu times + , Current indices + , Distributed environments + , Distributed search + , Document structures + , Frequent indices + , Index informations + , Information retrieval techniques + , Ir techniques + , Local indices + , Online times + , P2P environments + , P2p networks + , P2P systems + , Peer-to-Peer + , Peer-to-peer networks + , Peer-to-peer systems + , Performance aspects + , Ranked retrievals + , Recall and precisions + , Relevant documents + , Retrieval processes + , Retrieval qualities + , Structural features + , Structural informations + , Structural similarities + , Structured documents + , Term frequencies + , Two ways + , Vector Space models + , Wikipedia + , XML information retrieval + , XML information retrievals + , XML +
Has keyword Content-based XML-retrieval + , Distributed search + , Information retrieval + , Peer-to-Peer + , XML information retrieval +
Isbn 9781605581644  +
Language English +
Number of citations by publication 0  +
Number of references by publication 0  +
Pages 890–{{{page-end}}}  +
Published in ACM SIGIR 2008 - 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Proceedings +
Title Exploiting XML structure to improve information retrieval in peer-to-peer systems +
Type conference paper  +
Year 2008 +
Creation dateThis property is a special property in this wiki. 7 November 2014 15:21:23  +
Categories Publications without license parameter  + , Publications without remote mirror parameter  + , Publications without archive mirror parameter  + , Publications without paywall mirror parameter  + , Conference papers  + , Publications without references parameter  + , Publications  +
Modification dateThis property is a special property in this wiki. 7 November 2014 15:21:23  +
DateThis property is a special property in this wiki. 2008  +
hide properties that link here 
Exploiting XML structure to improve information retrieval in peer-to-peer systems + Title
 

 

Enter the name of the page to start browsing from.