Authorship Tracking
Authorship Tracking (Alternative names for this tool) | |
Keyword(s) | blame maps, attribution, authorship |
Operating system(s) | Cross-platform |
Language(s) | None |
Programming language(s) | Python |
Author(s) | Luca de Alfaro, Michael Shavlovsky |
License(s) | BSD License |
Website | https://github.com/lucadealfaro/authorship-tracking |
Related material | |
Related tool(s) | https://etherpad.wikimedia.org/p/mwpersistence, https://sites.google.com/a/ucsc.edu/luca/the-wikipedia-authorship-project |
Related dataset(s) | Unknown |
This code implements the algorithms for tracking the authorship of text in revisioned content that were published at WWW 2013: http://www2013.wwwconference.org/proceedings/p343.pdf
The idea is to attribute each portion of text to the earliest revision in which it appeared. For instance, if a revision contains the sentence "the cat ate the mouse", and the sentence is deleted and then reintroduced in a later revision (not necessarily as part of a revert), the reintroduced sentence is still attributed to its earliest author.
More precisely, the algorithm takes a parameter N. If a sequence of tokens of length N or greater has appeared before, it is attributed to its earliest occurrence. Shorter sequences are treated as new text. See the paper for details.
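As a rough illustration of the rule above, the following sketch attributes each token of a revision to the earliest revision in which an N-token window covering it was seen. It is a simplified stand-in, not the code from the repository: it uses a plain dictionary of N-grams instead of a trie, splits text on whitespace, and the names attribute_revisions and n are assumptions made for this example.

```python
def attribute_revisions(revisions, n=3):
    """Return, for each revision, a list with the origin revision id of every token.

    revisions: list of (revision_id, text) pairs in chronological order.
    n: minimum length of a token sequence that counts as "seen before".
    Simplified sketch: whitespace tokenization, dictionary of N-grams.
    """
    first_seen = {}  # N-gram (tuple of tokens) -> index of earliest revision
    ids = [rev_id for rev_id, _ in revisions]
    results = []
    for index, (_, text) in enumerate(revisions):
        tokens = text.split()
        # By default every token is new in the current revision.
        origin = [index] * len(tokens)
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        for i, gram in enumerate(grams):
            earliest = first_seen.get(gram)
            if earliest is not None:
                # Every token covered by a known N-gram inherits the
                # earliest revision in which that N-gram appeared.
                for j in range(i, i + n):
                    origin[j] = min(origin[j], earliest)
        # Record the N-grams of this revision only after attributing it.
        for gram in grams:
            first_seen.setdefault(gram, index)
        results.append([ids[k] for k in origin])
    return results


history = [
    (1, "the cat ate the mouse"),
    (2, "the dog slept"),
    (3, "today the cat ate the mouse again"),
]
print(attribute_revisions(history, n=3)[2])
```

In this toy history the sentence from revision 1 is absent from revision 2 and returns in revision 3, where its tokens are attributed to revision 1 while "today" and "again" are attributed to revision 3.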
The code works by building a trie-based representation of the whole revision history in an object of the class AuthorshipAttribution. Each time a new revision is passed to the object, the object updates its internal state and computes the earliest attribution of the new revision, which can then be read back from the object. The object itself can be serialized and deserialized using JSON-based methods.
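To make the idea of a serializable trie concrete, here is a minimal sketch of a token trie whose nodes remember the earliest revision that reached them, together with a JSON round trip. The class TokenTrie, the helper add_revision, and all method names are hypothetical and chosen for this example only; they are not the interface of AuthorshipAttribution, and the toy version stores every suffix of every revision, which a real implementation would avoid.

```python
import json


class TokenTrie:
    """Hypothetical toy trie: each node stores the earliest revision that reached it."""

    def __init__(self):
        self.root = {"rev": None, "children": {}}

    def insert(self, tokens, rev):
        """Insert a token sequence, keeping the smallest revision number per node."""
        node = self.root
        for tok in tokens:
            child = node["children"].get(tok)
            if child is None:
                child = {"rev": rev, "children": {}}
                node["children"][tok] = child
            else:
                child["rev"] = min(child["rev"], rev)
            node = child

    def earliest(self, tokens):
        """Earliest revision containing this contiguous token sequence, or None."""
        node = self.root
        for tok in tokens:
            node = node["children"].get(tok)
            if node is None:
                return None
        return node["rev"]

    def to_json(self):
        # Nested dicts with string keys and int/None values map directly to JSON.
        return json.dumps(self.root)

    @classmethod
    def from_json(cls, text):
        trie = cls()
        trie.root = json.loads(text)
        return trie


def add_revision(trie, text, rev):
    """Insert every suffix so that any contiguous sequence can be looked up."""
    tokens = text.split()
    for i in range(len(tokens)):
        trie.insert(tokens[i:], rev)


trie = TokenTrie()
add_revision(trie, "the cat ate the mouse", 1)
add_revision(trie, "the cat slept", 2)
restored = TokenTrie.from_json(trie.to_json())
print(restored.earliest(["the", "cat"]), restored.earliest(["cat", "slept"]))
```

Keeping the state as plain nested dictionaries is what makes the JSON round trip trivial here; whether the actual implementation stores its trie this way is not stated in this description.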
To keep the representation of the past history from growing too large, we remove from the object the information about content that has been absent from revisions both (a) for at least 90 days and (b) for at least 100 revisions. These thresholds are configurable parameters. With these choices, for Wikipedia, the serialized object is typically between 10 and 20 times the size of a typical revision, even for pages with very long revision lists. See the paper for detailed experimental results.
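The pruning rule can be sketched as a filter over bookkeeping records that note when each piece of content was last present. This is only an illustration of the two-condition rule described above; the function prune, its parameter names, and the flat dictionary layout are assumptions for this example, not the configuration interface of the actual code.

```python
from datetime import datetime, timedelta


def prune(last_seen, now, current_rev_index,
          max_age=timedelta(days=90), max_revisions=100):
    """Drop entries absent for at least max_age AND at least max_revisions.

    last_seen maps a content key to (time last present, revision index last present).
    An entry is removed only when both thresholds are met.
    """
    return {
        key: (seen_time, seen_index)
        for key, (seen_time, seen_index) in last_seen.items()
        if not (now - seen_time >= max_age
                and current_rev_index - seen_index >= max_revisions)
    }


now = datetime(2013, 6, 1)
last_seen = {
    "old": (datetime(2013, 1, 1), 10),                  # 151 days, 490 revisions ago
    "old_but_few_revisions": (datetime(2013, 1, 1), 450),  # 151 days, 50 revisions ago
    "fresh": (datetime(2013, 5, 20), 495),              # 12 days, 5 revisions ago
}
print(sorted(prune(last_seen, now, current_rev_index=500)))
```

Requiring both conditions means that content on a rarely edited page survives a long time in calendar terms, while content on a heavily edited page survives many revisions as long as it was present recently.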
Publications
There is no publication about this tool yet.