Luca de Alfaro


Luca de Alfaro is an author affiliated with the Wiki Lab at the University of California, Santa Cruz (UCSC), USA.

Tools

Authorship Tracking

This code implements the algorithms for tracking the authorship of text in revisioned content that have been published in WWW 2013: http://www2013.wwwconference.org/proceedings/p343.pdf

The idea consists of attributing each portion of text to the earliest revision in which it appeared. For instance, if a revision contains the sentence "the cat ate the mouse", and the sentence is deleted and then reintroduced in a later revision (not necessarily as part of a revert), once reintroduced it is still attributed to its earliest author.

More precisely, the algorithm takes a parameter N: if a sequence of tokens of length equal to or greater than N has appeared before, it is attributed to its earliest occurrence. See the paper for details.
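As an illustration of this matching rule only, the following minimal Python sketch rescans every past revision by brute force (the actual WWW 2013 algorithm instead uses the compact trie-based summaries described below) and keeps, for each token of the newest revision, the earliest revision in which it appears inside a match of length at least N:

  # Minimal illustration of the attribution rule: a run of N or more tokens
  # that already appeared in some past revision keeps its earliest origin.
  # Brute-force version; the real algorithm uses a trie-based history summary.

  N = 4  # minimum match length, in tokens

  def attribute_latest(revisions):
      """revisions: list of token lists, oldest first.
      Returns (token, origin_revision_index) pairs for the newest revision."""
      new = revisions[-1]
      origin = [len(revisions) - 1] * len(new)  # default: attributed to the newest revision
      for i in range(len(new)):
          for r, old in enumerate(revisions[:-1]):
              for j in range(len(old)):
                  k = 0  # length of the common run starting at new[i] and old[j]
                  while (i + k < len(new) and j + k < len(old)
                         and new[i + k] == old[j + k]):
                      k += 1
                  if k >= N:
                      for t in range(i, i + k):
                          origin[t] = min(origin[t], r)
      return list(zip(new, origin))

  revisions = [
      "the cat ate the mouse".split(),                   # revision 0
      "the dog slept".split(),                           # revision 1: sentence deleted
      "yesterday the cat ate the mouse again".split(),   # revision 2: sentence reintroduced
  ]
  # "the cat ate the mouse" is attributed to revision 0; "yesterday" and "again" to revision 2.
  print(attribute_latest(revisions))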

The code works by building a trie-based representation of the whole revision history in an object of the class AuthorshipAttribution. Each time a new revision is passed to the object, it updates its internal state and computes the earliest attribution of the new revision, which can then be easily obtained. The object itself can be serialized (and de-serialized) using JSON-based methods.
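The exact API of the released code may differ; the self-contained sketch below only mirrors the workflow just described, with a plain N-gram dictionary standing in for the trie and no pruning: revisions are fed in one at a time, the earliest attribution of the newest revision is read back, and the whole state round-trips through JSON.

  import json

  class EarliestAttributionSketch:
      """Toy stand-in for the AuthorshipAttribution object described above:
      an N-gram dictionary replaces the trie and nothing is ever pruned,
      but the workflow is the same (incremental updates, JSON-serializable state)."""

      def __init__(self, n=4):
          self.n = n
          self.ngrams = {}       # N-gram (joined tokens) -> earliest revision index
          self.num_revisions = 0

      def add_revision(self, tokens):
          """Update the state with a new revision and return its attribution."""
          r = self.num_revisions
          origin = [r] * len(tokens)
          for i in range(len(tokens) - self.n + 1):
              key = " ".join(tokens[i:i + self.n])
              if key in self.ngrams:  # seen before: keep the earliest origin
                  for t in range(i, i + self.n):
                      origin[t] = min(origin[t], self.ngrams[key])
          for i in range(len(tokens) - self.n + 1):
              key = " ".join(tokens[i:i + self.n])
              self.ngrams.setdefault(key, r)  # record the first occurrence only
          self.num_revisions += 1
          return list(zip(tokens, origin))

      def to_json(self):
          return json.dumps({"n": self.n, "ngrams": self.ngrams,
                             "num_revisions": self.num_revisions})

      @classmethod
      def from_json(cls, data):
          d = json.loads(data)
          obj = cls(d["n"])
          obj.ngrams, obj.num_revisions = d["ngrams"], d["num_revisions"]
          return obj

  tracker = EarliestAttributionSketch(n=4)
  for text in ["the cat ate the mouse",
               "the dog slept",
               "yesterday the cat ate the mouse again"]:
      attribution = tracker.add_revision(text.split())
  print(attribution)  # the reintroduced sentence is attributed to revision 0
  restored = EarliestAttributionSketch.from_json(tracker.to_json())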

To keep the representation of the whole past history from growing too large, we remove from the object the information about content that has been absent from revisions (a) for at least 90 days, and (b) for at least 100 revisions. These parameters are configurable. With these choices, for Wikipedia, the serialization of the object is typically between 10 and 20 times the size of a typical revision, even for pages with very long revision histories. See the paper for detailed experimental results.
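A sketch of the pruning test implied by these defaults follows; the 90-day and 100-revision thresholds come from the description above, while the function and argument names are illustrative assumptions rather than the released code's API.

  from datetime import datetime, timedelta

  # Defaults taken from the description above; both are meant to be configurable.
  MAX_ABSENT_DAYS = 90
  MAX_ABSENT_REVISIONS = 100

  def should_discard(last_seen_time, last_seen_revision, now, current_revision):
      """Content is discarded only when BOTH hold: absent for at least 90 days
      AND absent for at least 100 revisions (names here are illustrative)."""
      absent_long = now - last_seen_time >= timedelta(days=MAX_ABSENT_DAYS)
      absent_many = current_revision - last_seen_revision >= MAX_ABSENT_REVISIONS
      return absent_long and absent_many

  # Absent for 120 days but only 50 revisions: kept, since both conditions are required.
  print(should_discard(datetime(2013, 1, 1), 950, datetime(2013, 5, 1), 1000))  # False
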
WikiTrust

WikiTrust is an open-source, on-line reputation system for Wikipedia authors and content.


Publications

Only those publications related to wikis are shown here.

Attributing authorship of revisioned content
Keywords: Authorship, Revisioned content, Wikipedia
Published in: WWW 2013 - Proceedings of the 22nd International Conference on World Wide Web
Language: English
Date: 2013
Abstract: A considerable portion of web content, from wikis to collaboratively edited documents, to code posted online, is revisioned. We consider the problem of attributing authorship to such revisioned content, and we develop scalable attribution algorithms that can be applied to very large bodies of revisioned content, such as the English Wikipedia. Since content can be deleted, only to be later re-inserted, we introduce a notion of authorship that requires comparing each new revision with the entire set of past revisions. For each portion of content in the newest revision, we search the entire history for content matches that are statistically unlikely to occur spontaneously, thus denoting common origin. We use these matches to compute the earliest possible attribution of each word (or each token) of the new content. We show that this "earliest plausible attribution" can be computed efficiently via compact summaries of the past revision history. This leads to an algorithm that runs in time proportional to the sum of the size of the most recent revision, and the total amount of change (edit work) in the revision history. This amount of change is typically much smaller than the total size of all past revisions. The resulting algorithm can scale to very large repositories of revisioned content, as we show via experimental data over the English Wikipedia. Copyright is held by the International World Wide Web Conference Committee (IW3C2).
R: 0, C: 0

The Gene Wiki in 2011: community intelligence applied to human gene annotation
Published in: Nucleic Acids Research
Language: English
Date: 2012
Abstract: The Gene Wiki is an open-access and openly editable collection of Wikipedia articles about human genes. Initiated in 2008, it has grown to include articles about more than 10,000 genes that, collectively, contain more than 1.4 million words of gene-centric text with extensive citations back to the primary scientific literature. This growing body of useful, gene-centric content is the result of the work of thousands of individuals throughout the scientific community. Here, we describe recent improvements to the automated system that keeps the structured data presented on Gene Wiki articles in sync with the data from trusted primary databases. We also describe the expanding contents, editors and users of the Gene Wiki. Finally, we introduce a new automated system, called WikiTrust, which can effectively compute the quality of Wikipedia articles, including Gene Wiki articles, at the word level. All articles in the Gene Wiki can be freely accessed and edited at Wikipedia, and additional links and information can be found at the project's Wikipedia portal page: http://en.wikipedia.org/wiki/Portal:Gene_Wiki.
R: 0, C: 0

Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features
Keywords: Wikipedia, Wiki, Collaboration, Vandalism, Machine learning, Metadata, Natural Language Processing, Reputation
Published in: Lecture Notes in Computer Science
Language: English
Date: February 2011
Abstract: Wikipedia is an online encyclopedia which anyone can edit. While most edits are constructive, about 7% are acts of vandalism. Such behavior is characterized by modifications made in bad faith, introducing spam and other inappropriate content. In this work, we present the results of an effort to integrate three of the leading approaches to Wikipedia vandalism detection: a spatio-temporal analysis of metadata (STiki), a reputation-based system (WikiTrust), and natural language processing features. The performance of the resulting joint system improves the state-of-the-art from all previous methods and establishes a new baseline for Wikipedia vandalism detection. We examine in detail the contribution of the three approaches, both for the task of discovering fresh vandalism, and for the task of locating vandalism in the complete set of Wikipedia revisions.
R: 0, C: 1

Reputation systems for open collaboration
Published in: Communications of the ACM
Language: English
Date: 2011
Abstract: Algorithmic-based user incentives ensure the trustworthiness of evaluations of Wikipedia entries and Google Maps business information.
R: 0, C: 0

Detecting Wikipedia Vandalism using WikiTrust - Lab Report for PAN at CLEF 2010
Language: English
Date: 2010
R: 0, C: 1

Measuring Wikipedia: A hands-on tutorial
Keywords: Data mining, Empirical research, Measurements, Wikipedia, WikiTrust, WikiXRay
Published in: WikiSym
Language: English
Date: 2009
Abstract: This tutorial is an introduction to the best methodologies, tools and practices for Wikipedia research. The tutorial will be led by Luca de Alfaro (Wiki Lab at UCSC, California, USA) and Felipe Ortega (Libresoft, URJC, Madrid, Spain). Both have accumulated several years of practical experience exploring and processing Wikipedia data [1], [2], [3]. In addition, their respective research groups have led the development of two cutting-edge software tools (WikiTrust and WikiXRay) for analyzing Wikipedia. WikiTrust implements an author reputation system, and a text trust system, for wikis. WikiXRay is a tool automating the quantitative analysis of any language version of Wikipedia (in general, any wiki based on MediaWiki).
R: 0, C: 0

Assigning Trust to Wikipedia Content
Published in: WikiSym
Language: English
Date: 2008
Abstract: The Wikipedia is a collaborative encyclopedia: anyone can contribute to its articles simply by clicking on an "edit" button. The open nature of the Wikipedia has been key to its success, but has also created a challenge: how can readers develop an informed opinion on its reliability? We propose a system that computes quantitative values of trust for the text in Wikipedia articles; these trust values provide an indication of text reliability. The system uses as input the revision history of each article, as well as information about the reputation of the contributing authors, as provided by a reputation system. The trust of a word in an article is computed on the basis of the reputation of the original author of the word, as well as the reputation of all authors who edited text near the word. The algorithm computes word trust values that vary smoothly across the text; the trust values can be visualized using varying text-background colors. The algorithm ensures that all changes to an article's text are reflected in the trust values, preventing surreptitious content changes. We have implemented the proposed system, and we have used it to compute and display the trust of the text of thousands of articles of the English Wikipedia. To validate our trust-computation algorithms, we show that text labeled as low-trust has a significantly higher probability of being edited in the future than text labeled as high-trust.
R: 0, C: 7

Measuring Author Contributions to the Wikipedia
Published in: WikiSym
Language: English
Date: 2008
R: 0, C: 5

Robust content-driven reputation
Keywords: Reputation, User generated content, Wikipedia
Published in: Proceedings of the ACM Conference on Computer and Communications Security
Language: English
Date: 2008
Abstract: In content-driven reputation systems for collaborative content, users gain or lose reputation according to how their contributions fare: authors of long-lived contributions gain reputation, while authors of reverted contributions lose reputation. Existing content-driven systems are prone to Sybil attacks, in which multiple identities, controlled by the same person, perform coordinated actions to increase their reputation. We show that content-driven reputation systems can be made resistant to such attacks by taking advantage of the fact that the reputation increments and decrements depend on content modifications, which are visible to all. We present an algorithm for content-driven reputation that prevents a set of identities from increasing their maximum reputation without doing any useful work. Here, work is considered useful if it causes content to evolve in a direction that is consistent with the actions of high-reputation users. We argue that the content modifications that require no effort, such as the insertion or deletion of arbitrary text, are invariably non-useful. We prove a truthfulness result for the resulting system, stating that users who wish to perform a contribution do not gain by employing complex contribution schemes, compared to simply performing the contribution at once. In particular, splitting the contribution into multiple portions, or employing the coordinated actions of multiple identities, does not yield additional reputation. Taken together, these results indicate that content-driven systems can be made robust with respect to Sybil attacks. Copyright 2008 ACM.
R: 0, C: 0

A content-driven reputation system for the Wikipedia
Language: English
Date: 2007
Abstract: We present a content-driven reputation system for Wikipedia authors. In our system, authors gain reputation when the edits they perform to Wikipedia articles are preserved by subsequent authors, and they lose reputation when their edits are rolled back or undone in short order. Thus, author reputation is computed solely on the basis of content evolution; user-to-user comments or ratings are not used. The author reputation we compute could be used to flag new contributions from low-reputation authors, or it could be used to allow only authors with high reputation to contribute to controversial or critical pages. A reputation system for the Wikipedia could also provide an incentive for high-quality contributions. We have implemented the proposed system, and we have used it to analyze the entire Italian and French Wikipedias, consisting of a total of 691,551 pages and 5,587,523 revisions. Our results show that our notion of reputation has good predictive value: changes performed by low-reputation authors have a significantly larger than average probability of having poor quality, as judged by human observers, and of being later undone, as measured by our algorithms.
R: 0, C: 10