|This user account is a bot operated by Chenzw (talk). It is used to make repetitive automated or semi-automated edits that would be extremely tedious to do manually, in accordance with the bot policy. The bot is approved and currently active. |
Administrators: if this bot is malfunctioning or causing harm, please block it.
|Emergency robot shutoff button|
|Administrators: Use this button if the bot is malfunctioning.|
Non-administrators can report misbehaving bots to Wikipedia:Administrators' noticeboard.
|ChenzwBot (Talk · Contribs)|
ChenzwBot patrols the sea of recent changes.
|Flagged?||Yes (11 April 2008)|
|Edit rate:||Variable (Anti-vandalism)|
|Automatic or manual?||Automatic|
|Programming language/s:||PHP and Python|
|Source code published?||https://gitlab.com/antivandalbot-ng (partial)|
- Since 2010: Anti-vandalism task begins, using Chris G Bot's code. Vandalism was detected by matching edits against regular expressions, an approach prone to low detection rates and high false positive rates.
- Approximately December 2015: Bot begins using the revscoring library (which powers ORES) to extract features (e.g. numbers of characters added/removed) from each edit. Vandalism probability is predicted by a random forest classifier.
- Mid-2016: Bot core rewritten in line with the reactor design pattern.
- 7 May 2018: Bot core rewritten (again) in Python, which is vastly more efficient than the original PHP implementation. Classifier changed to XGBoost.
- Mid-October 2019: Classifier changed to LightGBM, with substantial improvements to how each diff is evaluated. Words added by editors are transformed to tf–idf vectors, and fed into a separate Bayesian classifier. These words are also tagged as nouns/verbs/pronouns etc., with the counts of the various categories becoming new inputs for the main LightGBM classifier.
Summary of algorithm
Unlike previous-generation anti-vandalism bots (Chris G Bot, GoblinBot4, and previous codebases of ChenzwBot; a brief overview of how they worked can be found in History above), ChenzwBot learns what is considered vandalism from a list of pre-classified edits provided by a human. This is the training dataset.
The bot evaluates each edit according to the following:
- Extracting features from each edit using the revscoring library. These features are typically numeric metrics, including (but not limited to): total length of edit, number of nouns added/removed, number of special characters added/removed, and number of wikilinks added/removed. This step is similar to how the revision scoring tool on Recent Changes works.
- Part-of-speech tagging: the same word can have different meanings depending on context. For example, "film" can be used both as a noun and a verb. To further distinguish between both uses, the bot tags the word using the spaCy library.
- Understanding edit content: words added in the edit are then counted. Due to the processing in step 2, the same word used in different contexts is counted separately. To avoid giving commonly used words undue weight, the counts are transformed into tf–idf vectors. This prioritises words that are relatively uncommon in the dataset, and down-weights common ones, when they appear in the edit.
- The tf–idf vector is provided as an input to a Bayesian classifier. The classifier provides a score representing the probability that the words used in that edit are characteristic of vandalism.
- The features in step 1 and the probability score in step 4 are supplied as inputs to a gradient boosting classifier. The output of the classifier is the final calculated probability that the edit is vandalism. ChenzwBot uses the LightGBM library.
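The two-stage design above can be sketched in a few lines of Python. This is an illustrative toy, not the bot's actual code: the words, feature values, and labels are made up, and scikit-learn's GradientBoostingClassifier stands in for LightGBM (the real bot also uses revscoring features and spaCy POS tags, which are omitted here).

```python
# Hypothetical sketch of the pipeline: tf-idf -> Bayesian word score,
# then numeric features + word score -> gradient boosting classifier.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for LightGBM

# Toy training data: words added by each edit, numeric edit features
# (e.g. characters added, wikilinks removed), and human labels.
added_words = [
    "poop lol stupid", "added reference citation",
    "ugly dumb haha", "expanded section history",
    "stupid lol", "fixed typo grammar",
]
edit_features = np.array([
    [15, 2], [120, 0], [12, 3], [300, 0], [10, 1], [25, 0],
])
labels = np.array([1, 0, 1, 0, 1, 0])  # 1 = vandalism

# Steps 3-4: tf-idf vectors feed a Bayesian classifier, which scores
# how characteristic of vandalism the added words are.
vectoriser = TfidfVectorizer()
tfidf = vectoriser.fit_transform(added_words)
bayes = MultinomialNB().fit(tfidf, labels)
word_score = bayes.predict_proba(tfidf)[:, 1]

# Step 5: the numeric features and the Bayesian score together train
# the final gradient boosting classifier.
combined = np.column_stack([edit_features, word_score])
final = GradientBoostingClassifier().fit(combined, labels)

def score_edit(words, features):
    """Return the final vandalism probability for one edit."""
    w = bayes.predict_proba(vectoriser.transform([words]))[:, 1]
    x = np.column_stack([np.atleast_2d(features), w])
    return float(final.predict_proba(x)[0, 1])
```

On real data the Bayesian word score lets the final classifier weigh *what* was written alongside *how much* was changed, rather than relying on edit-size metrics alone.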
A note about probability thresholds
- The final probability given by the LightGBM classifier is compared against a defined threshold. Probabilities higher than the threshold are considered vandalism and reverted accordingly, while those lower than the threshold are considered constructive. The threshold is calculated against a separate dataset (not used in training) such that it will match a specific false positive rate. This false positive rate is set by a human.
- The bot will not revert local sysops and rollbackers.
- The bot will not revert any registered user with more than a certain number of edits.
- The bot will not revert any edit made to a page on the exclusion list. This currently includes Wikipedia:Sandbox, Wikipedia:Introduction, and Wikipedia:Student tutorial.
- The bot will not revert to itself if it last reverted the page less than 24 hours ago. Exceptions to this rule include Wikipedia, Simple English Wikipedia, WP:AN, and WP:ST.
- As of 16 December 2019, the bot will instead not revert the same user/page combination more than once every 24 hours. This means that the bot may revert the same page multiple times within a 24-hour window if the vandals are different.
- Due to issues with false positives, the bot will no longer revert any edit outside of article space.
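The threshold-selection step described above can be sketched as follows. This is a hypothetical illustration of the general technique (picking the score cutoff whose false positive rate on a held-out, non-training dataset matches a human-chosen target), not the bot's actual calibration code; the data here is synthetic.

```python
# Illustrative threshold calibration: choose the cutoff so that the
# false positive rate on held-out data matches a target set by a human.
import numpy as np

def threshold_for_fpr(scores, labels, target_fpr):
    """Return the threshold at which roughly target_fpr of the
    constructive (label 0) edits in the held-out set score above it."""
    neg_scores = scores[labels == 0]
    # The FPR of a threshold t is the fraction of constructive edits
    # scoring >= t, so take the (1 - target_fpr) quantile.
    return float(np.quantile(neg_scores, 1.0 - target_fpr))

# Synthetic held-out scores from the final classifier, with labels.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.uniform(0.0, 0.6, 900),   # constructive
                         rng.uniform(0.4, 1.0, 100)])  # vandalism
labels = np.concatenate([np.zeros(900), np.ones(100)])

threshold = threshold_for_fpr(scores, labels, target_fpr=0.01)
achieved_fpr = float(np.mean(scores[labels == 0] >= threshold))
```

Calibrating against a dataset that was not used in training matters: the classifier's scores on its own training edits are optimistically skewed, so a threshold chosen there would let through more false positives in production than intended.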