STiki

From WikiPapers
Jump to: navigation, search

STiki is an anti-vandalism tool that consists of server-side detection algorithms and a client-facing GUI.

The STiki page on English Wikipedia is the primary source for more information.



STiki screenshot.png


Publications

Title Author(s) Keyword(s) Published in Language DateThis property is a special property in this wiki. Abstract
Open Wikis and the Protection of Institutional Welfare Andrew G. West
Insup Lee
EDUCAUSE Center for Applied Research, Research Bulletins English 7 February 2012 Much has been written about wikis’ reliability and use in the classroom. This research bulletin addresses the negative impacts on institutional welfare that can arise from participating in and supporting wikis. The open nature of the platform, which is fundamental to wiki operation and success, enables these negative consequences. A finite user base that can be determined a priori (e.g., a course roster) minimizes the security implications, hence our discussion in this bulletin primarily concerns open or public wikis that accept contributions from a broad and unknown set of Internet users.
Autonomous Link Spam Detection in Purely Collaborative Environments Andrew G. West
Avantika Agrawal
Phillip Baker
Brittney Exline
Insup Lee
Wikipedia
Collaboration
Collaborative security
Information security
Spam
Spam mitigation
Reputation
Spatio- temporal features
Machine learning
Intelligent routing
WikiSym English October 2011 Collaborative models (e.g., wikis) are an increasingly prevalent Web technology. However, the open-access that defines such systems can also be utilized for nefarious purposes. In particular, this paper examines the use of collaborative functionality to add inappropriate hyperlinks to destinations outside the host environment (i.e., link spam). The collaborative encyclopedia, Wikipedia, is the basis for our analysis.

Recent research has exposed vulnerabilities in Wikipedia's link spam mitigation, finding that human editors are latent and dwindling in quantity. To this end, we propose and develop an autonomous classifier for link additions. Such a system presents unique challenges. For example, low barriers-to-entry invite a diversity of spam types, not just those with economic motivations. Moreover, issues can arise with how a link is presented (regardless of the destination).

In this work, a spam corpus is extracted from over 235,000 link additions to English Wikipedia. From this, 40+ features are codified and analyzed. These indicators are computed using "wiki" metadata, landing site analysis, and external data sources. The resulting classifier attains 64% recall at 0.5% false-positives (ROC-AUC=0.97). Such performance could enable egregious link additions to be blocked automatically with low false-positive rates, while prioritizing the remainder for human inspection. Finally, a live Wikipedia implementation of the technique has been developed.
Link Spamming Wikipedia for Profit Andrew G. West
Jian Chang
Krishna Venkatasubramanian
Oleg Sokolsky
Insup Lee
Web 2.0 spam
Spam
Wikipedia
Wiki
Collaborative security
Attack model
Measurement study
Spam economics
CEAS '11: Proc. of the 8th Annual Collaboration, Electronic Messaging, Anti-Abuse, and Spam Conference English September 2011 Collaborative functionality is an increasingly prevalent web technology. To encourage participation, these systems usually have low barriers-to-entry and permissive privileges. Unsurprisingly, ill-intentioned users try to leverage these characteristics for nefarious purposes. In this work, a particular abuse is examined -- link spamming -- the addition of promotional or otherwise inappropriate hyperlinks.

Our analysis focuses on the wiki model and the collaborative encyclopedia, Wikipedia, in particular. A principal goal of spammers is to maximize *exposure*, the quantity of people who view a link. Creating and analyzing the first Wikipedia link spam corpus, we find that existing spam strategies perform quite poorly in this regard. The status quo spamming model relies on link persistence to accumulate exposures, a strategy that fails given the diligence of the Wikipedia community. Instead, we propose a model that exploits the latency inherent in human anti-spam enforcement.

Statistical estimation suggests our novel model would produce significantly more link exposures than status quo techniques. More critically, the strategy could prove economically viable for perpetrators, incentivizing its exploitation. To this end, we address mitigation strategies.
Multilingual Vandalism Detection using Language-Independent & Ex Post Facto Evidence Andrew G. West
Insup Lee
PAN-CLEF English September 2011 There is much literature on Wikipedia vandalism detection. However, this writing addresses two facets given little treatment to date. First, prior efforts emphasize zero-delay detection, classifying edits the moment they are made. If classification can be delayed (e.g., compiling offline distributions), it is possible to leverage ex post facto evidence. This work describes/evaluates several features of this type, which we find to be overwhelmingly strong vandalism indicators.

Second, English Wikipedia has been the primary test-bed for research. Yet, Wikipedia has 200+ language editions and use of localized features impairs portability. This work implements an extensive set of language-independent indicators and evaluates them using three corpora (German, English, Spanish). The work then extends to include language-specific signals. Quantifying their performance benefit, we find that such features can moderately increase classifier accuracy, but significant effort and language fluency are required to capture this utility.

Aside from these novel aspects, this effort also broadly addresses the task, implementing 65 total features. Evaluation produces 0.840 PR-AUC on thezero-delay task and 0.906 PR-AUC with ex post facto evidence (averaging languages). Performance matches the state-of-the-art (English), sets novel baselines (German, Spanish), and is validated by a first-place finish over the 2011 PAN-CLEF test set.
Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features B. Thomas Adler
Luca de Alfaro
Santiago M. Mola Velasco
Paolo Rosso
Andrew G. West
Wikipedia
Wiki
Collaboration
Vandalism
Machine learning
Metadata
Natural Language Processing
Reputation
Lecture Notes in Computer Science English February 2011 Wikipedia is an online encyclopedia which anyone can edit. While most edits are constructive, about 7% are acts of vandalism. Such behavior is characterized by modifications made in bad faith; introducing spam and other inappropriate content. In this work, we present the results of an effort to integrate three of the leading approaches to Wikipedia vandalism detection: a spatio-temporal analysis of metadata (STiki), a reputation-based system (WikiTrust), and natural language processing features. The performance of the resulting joint system improves the state-of-the-art from all previous methods and establishes a new baseline for Wikipedia vandalism detection. We examine in detail the contribution of the three approaches, both for the task of discovering fresh vandalism, and for the task of locating vandalism in the complete set of Wikipedia revisions.
STiki: An Anti-Vandalism Tool for Wikipedia Using Spatio-Temporal Analysis of Revision Metadata Andrew G. West
Sampath Kannan
Insup Lee
Wikipedia
Collaboration software
Information security
Intelligent routing
Spatio-temporal processing
WikiSym English July 2010 STiki is an anti-vandalism tool for Wikipedia. Unlike similar tools, STiki does not rely on natural language processing (NLP) over the article or diff text to locate vandalism. Instead, STiki leverages spatio-temporal properties of revision metadata. The feasibility of utilizing such properties was demonstrated in our prior work, which found they perform comparably to NLP-efforts while being more efficient, robust to evasion, and language independent. STiki is a real-time, on-Wikipedia implementation based on these properties. It consists of, (1) a server-side processing engine that examines revisions, scoring the likelihood each is vandalism, and, (2) a client-side GUI that presents likely vandalism to end-users for definitive classiffcation (and if necessary, reversion on Wikipedia). Our demonstration will provide an introduction to spatio-temporal properties, demonstrate the STiki software, and discuss alternative research uses for the open-source code.
Detecting Wikipedia Vandalism via Spatio-Temporal Analysis of Revision Metadata Andrew G. West
Sampath Kannan
Insup Lee
Wikipedia
Spatio-temporal reputation
Vandalism
Collaboration software
Content-based access control
EUROSEC English April 2010 Blatantly unproductive edits undermine the quality of the collaboratively-edited encyclopedia, Wikipedia. They not only disseminate dishonest and offensive content, but force editors to waste time undoing such acts of vandalism. Language-processing has been applied to combat these malicious edits, but as with email spam, these filters are evadable and computationally complex. Meanwhile, recent research has shown spatial and temporal features effective in mitigating email spam, while being lightweight and robust. In this paper, we leverage the spatio-temporal properties of revision metadata to detect vandalism on Wikipedia. An administrative form of reversion called rollback enables the tagging of malicious edits, which are contrasted with nonoffending edits in numerous dimensions. Crucially, none of these features require inspection of the article or revision text. Ultimately, a classifier is produced which flags vandalism at performance comparable to the natural-language efforts we intend to complement (85% accuracy at 50% recall). The classifier is scalable (processing 100+ edits a second) and has been used to locate over 5,000 manually-confirmed incidents of vandalism outside our labeled set.