Crawling deep web entity pages
|Crawling deep web entity pages|
|Author(s)||He Y., Xin D., Ganti V., Rajaraman S., Shah N.|
|Published in||WSDM 2013 - Proceedings of the 6th ACM International Conference on Web Search and Data Mining|
|Keyword(s)||deep-web crawl, entities, web data (Extra: Deduplication, Deep web, Online shopping sites, Prototype system, Query generation, Search interfaces, Sub-problems, Text document, Textual content, Wikipedia, Data mining, Electronic commerce, Information retrieval, Websites)|
Crawling deep web entity pages is a 2013 conference paper written in English by He Y., Xin D., Ganti V., Rajaraman S., and Shah N., and published in WSDM 2013 - Proceedings of the 6th ACM International Conference on Web Search and Data Mining.
Deep-web crawl is concerned with the problem of surfacing hidden content behind search interfaces on the Web. While many deep-web sites maintain document-oriented textual content (e.g., Wikipedia, PubMed, Twitter), which has traditionally been the focus of the deep-web literature, we observe that a significant portion of deep-web sites, including almost all online shopping sites, curate structured entities as opposed to text documents. Although crawling such entity-oriented content is clearly useful for a variety of purposes, existing crawling techniques optimized for document-oriented content are not best suited for entity-oriented sites. In this work, we describe a prototype system we have built that specializes in crawling entity-oriented deep-web sites. We propose techniques tailored to tackle important subproblems including query generation, empty page filtering and URL deduplication in the specific context of entity-oriented deep-web sites. These techniques are experimentally evaluated and shown to be effective.
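The abstract names URL deduplication as one of the subproblems but does not describe the technique. As a rough illustration of what the problem looks like, the sketch below collapses URLs that differ only in parameter order or in parameters assumed not to change the entity shown. It is a generic canonicalization approach, not the method from the paper; the `IGNORED_PARAMS` set, the function names, and the example URLs are all hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameters assumed not to affect which entity a page shows.
IGNORED_PARAMS = {"sessionid", "ref", "sort", "utm_source"}


def canonicalize(url, ignored=IGNORED_PARAMS):
    """Return a normalized form of url: lowercase scheme and host,
    ignored parameters dropped, remaining parameters sorted."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in ignored
    )
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def deduplicate(urls):
    """Keep the first URL seen for each canonical form, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = canonicalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


if __name__ == "__main__":
    crawled = [
        "http://shop.example.com/item?id=42&ref=home",
        "http://shop.example.com/item?ref=search&id=42",
        "http://shop.example.com/item?id=43",
    ]
    for url in deduplicate(crawled):
        print(url)
```

In this sketch the list of ignorable parameters is fixed by hand; deciding which parameters are actually irrelevant on a given site is the hard part of the problem and is not addressed here.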