Re: Automatic tagged vandalism corpus

In response to the "Automatic tagged vandalism corpus" open question: I have done work on tagging vandalism based on the usage of the "rollback" functionality. The process is described in my EUROSEC paper and was used to build a corpus distributed on my homepage (see "Software & Datasets", though it is perhaps outdated by this point). To some extent, humans *are* involved, but it doesn't require any additional work beyond what people are already doing naturally on the encyclopedia (what I think I call "implicit feedback"). This process is at the core of the "metadata" algorithm delivered by the STiki tool. Thanks, West.andrew.g 22:10, February 18, 2012 (EST)

Thanks for the notice. I read about your corpus some time ago, but I had forgotten about it. Yes, your corpus is a good approach to the open question I wrote. Is the software publicly available and freely licensed? I would like to check it. What is the license of the vandalism corpus? Also, a question: you don't include regular edits, right? Vandalism corpora are good for machine learning, but a vandalism corpus without good edits is not suitable for training a bot. Furthermore, the literature says that about 6-7% of edits in English Wikipedia are vandalism, so you need 93-94% regular edits for a corpus that reflects that ratio. A little tip: you can append the parameter &diffonly=1 to avoid loading the entire page. I have started Wikipedia Vandalism Corpus (Andrew G. West) (I don't know how to name it, all vandalism corpora are vandalism corpora ; )). Regards. emijrp (talk) 17:24, February 19, 2012 (EST)
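
For anyone who wants to act on the two practical points in this thread (the 6-7% vandalism ratio and the &diffonly=1 tip), here is a rough Python sketch. The function names `regular_edits_needed` and `diff_url`, and the default base URL, are only assumptions for illustration; they are not part of STiki or of the corpus mentioned above.

```python
import math
from urllib.parse import urlencode


def regular_edits_needed(vandalism_count: int, vandalism_rate: float) -> int:
    """Regular edits to add so vandalism makes up `vandalism_rate` of the corpus.

    Hypothetical helper: solves v / (v + r) = rate for r and rounds up.
    """
    if not 0 < vandalism_rate < 1:
        raise ValueError("vandalism_rate must be strictly between 0 and 1")
    return math.ceil(vandalism_count * (1 - vandalism_rate) / vandalism_rate)


def diff_url(rev_id: int, base: str = "https://en.wikipedia.org/w/index.php") -> str:
    """Build a diff link that shows only the diff, not the full page below it.

    The base URL is an assumption (English Wikipedia); `diffonly=1` is the
    parameter mentioned in the comment above.
    """
    return base + "?" + urlencode({"diff": rev_id, "diffonly": 1})


if __name__ == "__main__":
    # With 1000 tagged vandalism edits and a 7% vandalism rate:
    print(regular_edits_needed(1000, 0.07))
    print(diff_url(12345))
```

At a 6% rate the same 1000 vandalism edits would call for more regular edits than at 7%, so the assumed rate matters quite a bit when sizing the corpus.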