Browse wiki

Jump to: navigation, search
Beyond the bag-of-words paradigm to enhance information retrieval applications
Abstract The typical IR-approach to indexing, clustThe typical IR-approach to indexing, clustering, classification and retrieval, just to name a few, is the one based on the bag-of-words paradigm. It eventually transforms a text into an array of terms, possibly weighted (with tf-idf scores or derivatives), and then represents that array via points in highly-dimensional space. It is therefore syntactical and unstructured, in the sense that different terms lead to different dimensions. Co-occurrence detection and other processing steps have been thus proposed (see e.g. LSI, Spectral analysis [7]) to identify the existence of those relations, but yet everyone is aware of the limitations of this approach especially in the expanding context of short (and thus poorly composed) texts, such as the snippets of search-engine results, the tweets of a Twitter channel, the items of a news feed, the posts of a blog, or the advertisement messages, etc.. A good deal of recent work is attempting to go beyond this paradigm by enriching the input text with additional structured annotations. This general idea has been declined in the literature in two distinct ways. One consists of extending the classic term-based vector-space model with additional dimensions corresponding to features (concepts) extracted from an external knowledge base, such as DMOZ, Wikipedia, or even the whole Web (see e.g. [4, 5, 12]). The pro of this approach is to extend the bag-of-words scheme with more concepts, thus possibly allowing the identification of related texts which are syntactically far apart. The cons resides in the contamination of these vectors by un-related (but common) concepts retrieved via the syntactic queries. The second way consists of identifying in the input text short-and-meaningful sequences of terms (aka spots) which are then connected to unambiguous concepts drawn from a catalog. The catalog can be formed by either a small set of specifically recognized types, most often People and Locations (aka Named Entities, see e.g. [13, 14]), or it can consists of millions of concepts drawn from a large knowledge base, such as Wikipedia. This latter catalog is ever-expanding and currently offers the best trade-off between a catalog with a rigorous structure but with low coverage (like WordNet, CYC, TAP), and a large text collection with wide coverage but unstructured and noised content (like the whole Web). To understand how this annotation works, let us consider the following short news: "Diego Maradona won against Mexico". The goal of the annotation is to detect "Diego Maradona" and"Mexico" as spots, and then hyper-link them with theWikipedia pages which deal with the ex Argentina's coach and the football team of Mexico. The annotator uses as spots the anchor texts which occur in Wikipedia pages, and as possible concepts for each spot the (possibly many) pages pointed in Wikipedia by that spot/anchors pointed in Wikipedia by that spot/anchor
Abstractsub The typical IR-approach to indexing, clustThe typical IR-approach to indexing, clustering, classification and retrieval, just to name a few, is the one based on the bag-of-words paradigm. It eventually transforms a text into an array of terms, possibly weighted (with tf-idf scores or derivatives), and then represents that array via points in highly-dimensional space. It is therefore syntactical and unstructured, in the sense that different terms lead to different dimensions. Co-occurrence detection and other processing steps have been thus proposed (see e.g. LSI, Spectral analysis [7]) to identify the existence of those relations, but yet everyone is aware of the limitations of this approach especially in the expanding context of short (and thus poorly composed) texts, such as the snippets of search-engine results, the tweets of a Twitter channel, the items of a news feed, the posts of a blog, or the advertisement messages, etc.. A good deal of recent work is attempting to go beyond this paradigm by enriching the input text with additional structured annotations. This general idea has been declined in the literature in two distinct ways. One consists of extending the classic term-based vector-space model with additional dimensions corresponding to features (concepts) extracted from an external knowledge base, such as DMOZ, Wikipedia, or even the whole Web (see e.g. [4, 5, 12]). The pro of this approach is to extend the bag-of-words scheme with more concepts, thus possibly allowing the identification of related texts which are syntactically far apart. The cons resides in the contamination of these vectors by un-related (but common) concepts retrieved via the syntactic queries. The second way consists of identifying in the input text short-and-meaningful sequences of terms (aka spots) which are then connected to unambiguous concepts drawn from a catalog. The catalog can be formed by either a small set of specifically recognized types, most often People and Locations (aka Named Entities, see e.g. [13, 14]), or it can consists of millions of concepts drawn from a large knowledge base, such as Wikipedia. This latter catalog is ever-expanding and currently offers the best trade-off between a catalog with a rigorous structure but with low coverage (like WordNet, CYC, TAP), and a large text collection with wide coverage but unstructured and noised content (like the whole Web). To understand how this annotation works, let us consider the following short news: "Diego Maradona won against Mexico". The goal of the annotation is to detect "Diego Maradona" and"Mexico" as spots, and then hyper-link them with theWikipedia pages which deal with the ex Argentina's coach and the football team of Mexico. The annotator uses as spots the anchor texts which occur in Wikipedia pages, and as possible concepts for each spot the (possibly many) pages pointed in Wikipedia by that spot/anchors pointed in Wikipedia by that spot/anchor
Bibtextype inproceedings  +
Doi 10.1145/1995412.1995414  +
Has author Paolo Ferragina +
Has extra keyword Algorithmic problems + , Annotation work + , Argentina + , Bag of words + , Clustering + , Co-occurrence + , Concept-based + , Conceptual annotation + , External knowledge + , F-measure + , Global score + , Invited talk + , Knowledge base + , Me-xico + , Named entities + , News feeds + , On-the-fly + , Other statistics + , Processing steps + , Query time + , Retrieval applications + , Scoring functions + , Semantic network + , Structured annotation + , Structured knowledge + , Text collection + , Vector-space models + , Via point + , Web advertising + , Wikipedia + , Wordnet + , Algorithms + , Chemical contamination + , Cluster analysis + , Information retrieval + , Knowledge based systems + , Libraries + , Search engine + , Semantics + , Spectrum analysis + , Syntactics + , User interfaces + , Vector spaces + , Teaching +
Has keyword Classification + , Clustering + , Conceptual annotation + , Search engine + , Structured annotation + , Wikipedia +
Isbn 9781450307956  +
Language English +
Number of citations by publication 0  +
Number of references by publication 0  +
Pages 3–4  +
Published in Proceedings - 4th International Conference on SImilarity Search and APplications, SISAP 2011 +
Title Beyond the bag-of-words paradigm to enhance information retrieval applications +
Type conference paper  +
Year 2011 +
Creation dateThis property is a special property in this wiki. 7 November 2014 06:17:51  +
Categories Publications without license parameter  + , Publications without remote mirror parameter  + , Publications without archive mirror parameter  + , Publications without paywall mirror parameter  + , Conference papers  + , Publications without references parameter  + , Publications  +
Modification dateThis property is a special property in this wiki. 7 November 2014 06:17:51  +
DateThis property is a special property in this wiki. 2011  +
hide properties that link here 
Beyond the bag-of-words paradigm to enhance information retrieval applications + Title
 

 

Enter the name of the page to start browsing from.