Paolo Ferragina

From WikiPapers

Paolo Ferragina is an author.

Publications

Only those publications related to wikis are shown here.
A framework for benchmarking entity-annotation systems
Keywords: Benchmark framework, Entity annotation, Wikipedia
Published in: WWW 2013 - Proceedings of the 22nd International Conference on World Wide Web (English, 2013)
Abstract: In this paper we design and implement a benchmarking framework for fair and exhaustive comparison of entity-annotation systems. The framework is based upon the definition of a set of problems related to the entity-annotation task, a set of measures to evaluate systems' performance, and a systematic comparative evaluation involving all publicly available datasets, containing texts of various types such as news, tweets and Web pages. Our framework is easily extensible with novel entity annotators, datasets and evaluation measures for comparing systems, and it has been released to the public as open source. We use this framework to perform the first extensive comparison among all available entity annotators over all available datasets, and draw many interesting conclusions upon their efficiency and effectiveness. We also draw conclusions about academic versus commercial annotators. Copyright is held by the International World Wide Web Conference Committee (IW3C2).

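The abstract above mentions "a set of measures to evaluate systems' performance". A minimal sketch (not the paper's actual framework; all names and the toy data are invented for illustration) of the kind of measure such a benchmark computes: micro-averaged precision, recall and F1 over entity annotations, where each annotation is a (mention, Wikipedia page) pair.

```python
# Toy sketch of micro-averaged P/R/F1 for entity annotations.
# gold and predicted are parallel lists: one set of (mention, entity)
# pairs per document.

def prf(gold, predicted):
    """Micro-averaged precision, recall, F1 over annotation pairs."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # annotations found in both gold and prediction
        fp += len(p - g)   # predicted annotations not in gold
        fn += len(g - p)   # gold annotations the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented example: one document, one correct link, one wrong disambiguation.
gold = [{("Maradona", "Diego_Maradona"), ("Mexico", "Mexico_national_football_team")}]
pred = [{("Maradona", "Diego_Maradona"), ("Mexico", "Mexico")}]
print(prf(gold, pred))  # (0.5, 0.5, 0.5)
```

Matching here is exact on both mention and entity; the actual framework defines several looser match notions as well.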
Classification of short texts by deploying topical annotations
Published in: Lecture Notes in Computer Science (English, 2012)
Abstract: We propose a novel approach to the classification of short texts based on two factors: the use of Wikipedia-based annotators that have been recently introduced to detect the main topics present in an input text, represented via Wikipedia pages, and the design of a novel classification algorithm that measures the similarity between the input text and each output category by deploying only their annotated topics and the Wikipedia link-structure. Our approach waives the common practice of expanding the feature space with new dimensions derived either from explicit or from latent semantic analysis. As a consequence it is simple and maintains a compact, intelligible representation of the output categories. Our experiments show that it is efficient in construction and query time, as accurate as state-of-the-art classifiers (see e.g. Phan et al., WWW '08), and robust with respect to concept drifts and input sources.

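The abstract above describes classifying a text by comparing its annotated topics against each category's topics using only the Wikipedia link-structure. A hypothetical sketch of that idea (the paper's actual relatedness measure and algorithm differ in detail; here we use simple Jaccard overlap of in-links, and all the data below is invented):

```python
# Toy sketch: link-based topical classification of a short text.

def relatedness(a, b, inlinks):
    """Jaccard similarity of the sets of pages linking to topics a and b."""
    la, lb = inlinks.get(a, set()), inlinks.get(b, set())
    return len(la & lb) / len(la | lb) if la | lb else 0.0

def classify(text_topics, categories, inlinks):
    """Pick the category whose topics are most related to the text's topics."""
    def score(cat_topics):
        return sum(relatedness(t, c, inlinks) for t in text_topics for c in cat_topics)
    return max(categories, key=lambda cat: score(categories[cat]))

# Invented in-link sets standing in for the Wikipedia link graph.
inlinks = {
    "Football": {"FIFA", "Sport", "Maradona"},
    "Diego_Maradona": {"FIFA", "Maradona", "Argentina"},
    "Stock_market": {"Finance", "NYSE"},
}
categories = {"sports": ["Football"], "finance": ["Stock_market"]}
print(classify(["Diego_Maradona"], categories, inlinks))  # sports
```

Note that no term-based feature space is ever built: both the text and the categories are represented only by their topics, which is the compact representation the abstract refers to.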
Fast and Accurate Annotation of Short Texts with Wikipedia Pages
Keywords: Content analysis and indexing, Intelligent Web services and Semantic Web, Knowledge management, Natural Language Processing
Published in: IEEE Softw. (English, 2012)

Beyond the bag-of-words paradigm to enhance information retrieval applications
Keywords: Classification, Clustering, Conceptual annotation, Search engine, Structured annotation, Wikipedia
Published in: Proceedings - 4th International Conference on SImilarity Search and APplications, SISAP 2011 (English, 2011)
Abstract: The typical IR approach to indexing, clustering, classification and retrieval, just to name a few, is the one based on the bag-of-words paradigm. It eventually transforms a text into an array of terms, possibly weighted (with tf-idf scores or derivatives), and then represents that array via points in a highly-dimensional space. It is therefore syntactical and unstructured, in the sense that different terms lead to different dimensions. Co-occurrence detection and other processing steps have thus been proposed (see e.g. LSI, spectral analysis [7]) to identify the existence of those relations, but everyone is aware of the limitations of this approach, especially in the expanding context of short (and thus poorly composed) texts, such as the snippets of search-engine results, the tweets of a Twitter channel, the items of a news feed, the posts of a blog, or advertisement messages. A good deal of recent work is attempting to go beyond this paradigm by enriching the input text with additional structured annotations. This general idea has been declined in the literature in two distinct ways. One consists of extending the classic term-based vector-space model with additional dimensions corresponding to features (concepts) extracted from an external knowledge base, such as DMOZ, Wikipedia, or even the whole Web (see e.g. [4, 5, 12]). The pro of this approach is to extend the bag-of-words scheme with more concepts, thus possibly allowing the identification of related texts which are syntactically far apart. The con resides in the contamination of these vectors by unrelated (but common) concepts retrieved via the syntactic queries. The second way consists of identifying in the input text short-and-meaningful sequences of terms (aka spots), which are then connected to unambiguous concepts drawn from a catalog. The catalog can be formed by either a small set of specifically recognized types, most often People and Locations (aka Named Entities, see e.g. [13, 14]), or it can consist of millions of concepts drawn from a large knowledge base, such as Wikipedia. This latter catalog is ever-expanding and currently offers the best trade-off between a catalog with a rigorous structure but low coverage (like WordNet, CYC, TAP), and a large text collection with wide coverage but unstructured and noisy content (like the whole Web). To understand how this annotation works, let us consider the following short news item: "Diego Maradona won against Mexico". The goal of the annotation is to detect "Diego Maradona" and "Mexico" as spots, and then hyper-link them to the Wikipedia pages dealing with Argentina's former coach and the football team of Mexico. The annotator uses as spots the anchor texts which occur in Wikipedia pages, and as possible concepts for each spot the (possibly many) pages pointed to in Wikipedia by that spot/anchor.

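The Maradona example in the abstract above can be sketched in a few lines. This is a toy illustration of the spot-to-concept step only, not the paper's system: spots are anchor texts seen in Wikipedia, and each spot's candidate concepts are the pages that anchor links to. The anchor dictionary and its prior probabilities below are invented.

```python
# Toy sketch of spotting and candidate lookup via a Wikipedia anchor index.
# Keys are anchor texts; values map candidate pages to the (invented)
# fraction of times the anchor links to that page.

anchor_index = {
    "Diego Maradona": {"Diego_Maradona": 0.98, "Maradona_(film)": 0.02},
    "Mexico": {"Mexico": 0.70, "Mexico_national_football_team": 0.30},
}

def spot(text):
    """Return the known anchors occurring in the text (naive substring scan)."""
    return [a for a in anchor_index if a in text]

def candidates(spot_text):
    """Candidate Wikipedia pages for a spot, with their prior probabilities."""
    return anchor_index.get(spot_text, {})

text = "Diego Maradona won against Mexico"
for s in spot(text):
    best = max(candidates(s), key=candidates(s).get)
    print(s, "->", best)
```

Picking the highest-prior candidate, as done here, is only a baseline; resolving "Mexico" to the football team rather than the country requires the disambiguation step that the annotation papers listed below address.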
First steps beyond the bag-of-words representation of short texts
Published in: CEUR Workshop Proceedings (English, 2011)
Abstract: We address the problem of enhancing the classical bag-of-words representation of texts by designing and engineering Tagme, the first system that performs an accurate and on-the-fly semantic annotation of short texts via Wikipedia as knowledge base. Several experiments show that Tagme outperforms state-of-the-art algorithms when they are adapted to work on short texts, and it proves fast and competitive on long ones. This leads us to argue favorably about Tagme's application to clustering, classification and retrieval systems in challenging scenarios like web snippets, tweets, news, ads, etc.

TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities)
Keywords: Algorithms, Experimentation, Performance
Published in: International Conference on Information and Knowledge Management, Proceedings (English, 2010)
Abstract: We designed and implemented TAGME, a system that is able to efficiently and judiciously augment a plain text with pertinent hyperlinks to Wikipedia pages. The specialty of TAGME with respect to known systems [5, 8] is that it may annotate texts which are short and poorly composed, such as snippets of search-engine results, tweets, news, etc. This annotation is extremely informative, so any task that is currently addressed using the bag-of-words paradigm could benefit from using this annotation to draw upon (the millions of) Wikipedia pages and their inter-relations.

TAGME: on-the-fly annotation of short text fragments (by wikipedia entities)
Keywords: Semantic annotation, Text mining, Wikipedia, Word sense disambiguation
Published in: CIKM (English, 2010)