Wikipedia:Pruning article revisions

Like a busy Wikipedia editor pruning an article, this man is doing a serious trimming on a tree.

Although the Simple English Wikipedia today has fewer than 257 thousand articles (plus thousands of red-link articles), its total revisions number nearly 10 million. As of 10-September-2024, there were 256,553 articles and 9,736,592 total revisions, giving an average of 37.95 revisions per article. The footnoted formula,[1] which divides by all pages rather than by articles alone, gives the lower figure of 11.57 revisions per page.

Pruning (deleting) small revisions that are already stored might require changing Wikipedia policies or adjusting the underlying MediaWiki software. However, revisions can also be pruned in advance: each person can take many actions to avoid storing more revisions over the long term.

Problems with numerous revisions

You might wonder if it even matters that Wikipedia has such a high number of revisions. But there are some issues.

First, there is a heavy cost to the numerous revisions. The Wikimedia Foundation, which operates Wikipedia, is a non-profit organization. Its continued operation relies solely on private donations. Without these donations, Wikipedia could not exist.

There is a cost to each edit, each stored revision, and each page view, even without editing. Though each cost is minute, millions of them each day add up. This is no reason to discourage people from reading or editing Wikipedia: that is what Wikipedia is for, and funds have been allocated for this purpose, provided they are used responsibly.

Article quality

The quality of writing is also at stake. The goal of Wikipedia is to have a well-written encyclopedia. This cannot be accomplished in one day, or in a single edit for each article. But an edit history that is clogged with experimental or "junk" edits may become confusing, and versions that fail to meet encyclopedic standards may in the long run have a negative effect, particularly if they are not marked with tags to indicate their temporary status.

An editor who makes multiple edits to an article while working toward a final plan may view every edit before the final one as a temporary revision that will not remain very long. Sometimes it is not easy, or even possible, to make the planned revision in a single edit. This can be the case when the edit contains a large amount of text, or when it is difficult to write all the text at once.

User actions in pruning revisions

Some of the actions that could be performed, by individual users, to prune both the current and future revisions would be:

  1. Combine future edits together, as one "SAVE" operation.
  2. Avoid instant revert of hacking, but instead combine fixes with an update.
  3. Avoid editing an article for minor fixes, combine with enhancements.
  4. Explain how saving multiple revisions is not usually safer.
  5. Plan enhancements offline, or in a user-space test version to be deleted.
  6. Create new articles offline, then copy online only for previewing.
  7. Create new articles in user-space, then copy (not move) the finished article into article-space, allowing deletion of the early drafts.
  8. Copy (not move) old articles when renaming, but copy/list the old revision history as a talk-page topic.
  9. Avoid running fixer-bots too often, and beware small auto-updates, especially for minor words or auto-fixing grammar in vandalism humor.

Each of the above actions is elaborated below, to explain the particular ways of reducing revisions.

Combine future edits together

Many new users simply do not know they can combine their edits into one SAVE, using a longer edit-summary line. Users often just change a word and save, change a word and save, and so on. Some users do not realize how the extra revisions pile up, expanding into a long list of past edits under the History tab. Several simple steps could be taught:

  • State that the edit-summary can scroll larger than the input box, to note multiple changes described in a single update.
  • Explain that several changes, at once, can be accepted by other users, in a single edit, without defending each change.
  • Suggest that users can copy the wiki-edit buffer into their computer clipboard (such as with "^A^C", select-all then copy) to overcome web-transmission errors.
  • Warn that saving multiple revisions is not safer: others could still revert all changes back from multiple revisions.
  • Explain the use of article-diffs to see what other users have changed, to allow combining edit-conflict (collision) text into an ongoing long edit.
  • Describe how to copy the edit-buffer offline to allow combining long-term changes, or copy updated sections from an offline copy into a live edit.
  • Warn that text containing special-characters, such as the bottom interwiki language links, will get converted to "?????" in some text editors. So, updated text must be copied, from those text-editors, into the wiki-edit buffer, replacing around the special-character sections.

I've noticed several new users, while greatly expanding a low-traffic article, will keep saving every new phrase as though other users might pounce, at any minute, on the revised article before the next desperate SAVE is made. Many users don't understand how to watch the History listing to see if an article is quiet now, unchanged for weeks, and can be safely edited for hours with no one else making changes.

Avoid instant revert of hacking

Some kinds of vandalism go unnoticed for months, sometimes over 6 months, so there is no hurry to revert all hackings made in low-traffic articles. There are several issues to consider:

  • If the hacked text is extensive, then at least quickly fix any grammar issues or improve some wording before saving.
  • Preferably, take time to add another source, or investigate a "citation-needed" issue, while also fixing the hacked text.
  • In the edit-summary line, emphasize the improved change, rather than highlight that vandals have forced someone to "revert" under their power.
  • Perhaps volunteer to assist an admin who could post a request-list of several articles to fix, then respond with a later status list showing which articles have been improved from the request-list.
  • Realize that if a vandal causes someone to jump, such as reacting with rapid reverts, then they have their victims on the puppet-strings: do not quickly do anything to show trolls how they have power to cause other people to scramble. Make them wait.
  • Perhaps long-term, have a group devoted to enhancing a list of hacked articles, recommending update-protection for repeatedly hacked articles.

The tendency to rapidly revert vandalism, as though the whole world has stopped breathing, has made it difficult for other users to also enhance those articles while correcting hacked text. It is an utter myth that "Every article gets vandalized": many articles go years without ever being hacked. Depending on notability or libel concerns, many hacked articles could wait (a long time) to be fixed while being improved.

Saving multiple revisions is not safer

It would be a lot easier to warn new users, early on, that saving every change as a new revision does not make those changes safer or somehow more permanent. Other users can simply revert all those revisions, reversing days or weeks of edits. The only permanent change is the bloating of Wikipedia ("forever") with numerous edits, all of which got reverted. Some issues to consider:

  • If the article seems unstable, then perhaps create a sub-article to expand on issues that might get reverted in the main article.
  • If most users reject the revised text, then perhaps abandon that article, and focus on other articles instead.

Not every group of other editors is cooperative; sometimes, cliques of other users can act like inner-wiki ("inner-city") gangs that live by their code, while outsiders receive pre-calculated treatment. There might be no safe way to save revisions, or get talk-page consensus, under those circumstances. Either move on to other articles, or else, contact an admin or WikiProject that might help the balance of power.

Plan enhancements offline

Plan some enhancements to articles by using offline files of either notes or potential new text. It might help to explain to users that an offline version of an article allows more time for completing broad revisions; before saving, simply merge into the new text any changes that other users have made in the meantime. Some issues to note:

  • Long-term editing offline can allow broader wording to be completed, whereas partial SAVEs could alert other users to tinker with the half-finished wording and complicate the whole update.
  • Create an offline text file of source footnotes about the article; those sources might also be added to several related articles.
  • Create an offline text file of new paragraphs: keeping those paragraphs separate from the whole article might simplify the focus on the new wording over a long period, rather than get lost if expanded within the whole article.

For large updates, planning enhancements offline could reduce total revisions by a factor of 25 (or more), due to the focus on broad wording, while avoiding hacks to half-finished text by the tinkering of other users.

Create new articles offline

Beyond just planning enhancements, it can be easier to create entire articles offline. Some issues to note:

  • Remember to include all the basic parts: intro, history (if applicable), standard sections of WP:Layout (See also, Notes, References, Other websites), bottom categories, and interwiki links (if the topic has other-language articles).
  • The bottom categories are probably the most overlooked part of a new article.
  • Take time to get the first version of the article to be "ready for primetime" and avoid a lot of early revisions.
  • Also consider keeping a separate list of source footnotes that might also be inserted into other, older articles.
  • Creating articles offline, and giving them a solid set of references and content BEFORE they are put on Wikipedia, also makes it less likely that the articles will get nominated for speedy deletion.

Note that offline storage does not have the Wikipedia backup protection, so be sure to make periodic copies, if needed, for backup.

Create in user-space, then copy not move

Perhaps the easiest way to actually erase old revisions is to create a new article within the user-space for a particular user, then copy (not move) the article to become a brand-new entry in article-space. When the text is copied as a new article, then all those user-space revisions could be deleted. Some issues to note:

  • Also create a talk-page, if there will be some unusual aspects to note about the article's content or formatting.
  • If the early revision history was really important, then copy just the text of the history listing into the talk-page or a subpage "Talk:<article>/old_history", as a topic which can explain how the past revisions were important.
  • When inviting other editors to revise in that user-space, announce that the revisions will be pruned later, leaving just the final copy.
  • Reusing a temp article will not delete older revisions: keeping a page called "User:Xxxx/Temp" to be used again for another article does not allow deletion of past revisions.

The article revisions will not be removed from Wikipedia until the user-space article has been deleted. After 2007, it became possible to restore a deleted article, so in the event of a major misunderstanding, any deleted article can be restored by an admin.

Copy (not move) old articles to rename

An effective way to actually erase many old revisions is to rename an old article by copying: create the new name as a new article, then copy (not move) the text to become a brand-new entry in article-space. When the text is copied as a new article, then all those old revisions (under the old name) could be later deleted. Some issues to note:

  • Some people treasure the old stuff in revisions, so only particular articles could be re-created.
  • Only a relative few articles are scheduled to be renamed: the most common involve creating a disambiguation page, such as "Newton (disambiguation)".
  • Many articles could be renamed with an epithet such as "(scientist)", after which the original is deleted and recreated as a redirect to the new name: create "Albert Einstein (scientist)", then delete the article "Albert Einstein" but quickly recreate it as a redirect to "Albert...(scientist)".
  • Also copy the talk-page (and any archive subpages), putting a new topic that lists the old revision history, for the old name.
  • Such copying of articles, without the old revisions, has been done for months for the commons images and photos (in Wikimedia Commons), setting a long-term precedent.
  • For detailed comparisons, old revisions could be recreated by purposely editing the new page as a series of large edits, each storing successive updates to simulate the old major revisions, and logging the original user name in each edit-summary line.

Again, some people treasure the old stuff, like keeping decades of old, worn-out shoes, so not all articles can be pruned so easily by simply copy-renaming them into a clean, fresh name. The best compromise would be to keep an old history list in the talk-page or as a subpage "/old_history". However, in a fight to retain the detailed differences, it might be necessary to recreate the major old revisions by a series of repeated edits, filling the wiki-edit buffer with each successive major revision, and then saving each to allow comparisons of texts. Those recreated revisions should copy the edit-summary line from the actual old revisions, but also include text identifying each original user+date. Even though that re-creation, of major revisions, might seem extreme, many hundreds/thousands of minor revisions would be omitted, representing years of edits with just a few dozen recreated, major (non-trivial) revisions. Such a re-created article could retain the major detailed differences, showing how the article evolved, but omit the thousands of minor interim revisions that clutter a step-by-step viewing of each next revision.

Avoid running fixer-bots too often

Some of the Wikipedia bots, which perform repeated robotic edits to many articles, are focused on really tiny updates to articles, with some even correcting the grammar in vandalism humor or within hacked text that will soon be removed. For example, a bot that capitalizes the word "english" to be "English" assumes that people never use the word in any other manner, such as "putting english" (a spin) on a billiard ball. It is highly debatable whether bots should be allowed to run rampant and draw opinionated conclusions about lowercase words (such as "english"). Of course, thousands of revisions could be generated by changing "metre" to "meter" or by similar minor changes. Bots should not be allowed to make those minor changes.

The fixer-bots should be run on a rare basis, and perhaps even count how many corrections would be made to an article, then cleverly refuse to update an article just for a single minor word, unless the bot was running in a quarterly update-all-minor-issues mode. Let the bots analyze small problems, or even count the occurrences of lowercase "english" and such, but those bots should wait to fix minor issues, limiting severe precision to a few times a year, and then fix all problems with one revision to each article, at that later time.
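
The counting rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: `PendingFix`, `should_save` and the threshold `min_minor_fixes` are hypothetical names invented here, not part of any real bot framework, and the choice to save at once for a non-minor fix is an added assumption.

```python
from dataclasses import dataclass


@dataclass
class PendingFix:
    """One correction a bot would like to make (hypothetical structure)."""
    description: str
    minor: bool  # True for tiny fixes such as changing "english" to "English"


def should_save(fixes: list[PendingFix], quarterly_mode: bool = False,
                min_minor_fixes: int = 5) -> bool:
    """Decide whether the bot should save one combined revision now.

    min_minor_fixes is an assumed threshold, not a real bot setting.
    """
    if not fixes:
        return False  # nothing to fix, so no revision at all
    if quarterly_mode:
        return True  # quarterly sweep: fix every issue in a single revision
    if any(not fix.minor for fix in fixes):
        return True  # assumption: a non-minor problem justifies an edit now
    # Only minor fixes are pending: wait until enough of them have piled up.
    return len(fixes) >= min_minor_fixes


one_word = [PendingFix('"english" to "English"', minor=True)]
print(should_save(one_word))                       # single minor word: skip
print(should_save(one_word, quarterly_mode=True))  # quarterly sweep: save
```

In this sketch the bot still analyzes and counts every problem; it only postpones the SAVE, so each article gets at most one bot revision per sweep.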

Combined effect across many users

Although techniques such as deleting user-space articles and copy-renaming old articles might seem extreme, the combined effect of thousands of users, each redoing articles in their specialty, could cut the revisions in portions of Wikipedia by a factor of 10, 100 or 1000. After that point, users taught to combine numerous small edits (in the future), and to avoid panic saving, would reduce the subsequent revision lists and omit most of the small stuff that cluttered Wikipedia during 2006–2008.

The overall effect would be simple to measure: compare the two averages below to see how far the average revision count has fallen:

  • the 15-November-2008 figure of 100.81 revisions per article;
  • today's figure of 37.95 revisions per article.

Progress could be measured by comparing those averages against a future, anticipated reduction in average revisions per article.
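
As a worked example of that measurement, the arithmetic can be written out using only figures quoted in this essay (the 10-September-2024 article and revision counts, and the 2008 average). The function names are illustrative choices, not anything provided by MediaWiki.

```python
# Figures quoted in this essay.
ARTICLES = 256_553           # article count on 10-September-2024
TOTAL_REVISIONS = 9_736_592  # total revisions on the same date
AVG_2008 = 100.81            # revisions per article on 15-November-2008


def revisions_per_article(total_revisions: int, articles: int) -> float:
    """Average number of stored revisions per article."""
    return total_revisions / articles


def percent_drop(old_avg: float, new_avg: float) -> float:
    """Percentage by which the average fell from old_avg to new_avg."""
    return (old_avg - new_avg) / old_avg * 100


current_avg = revisions_per_article(TOTAL_REVISIONS, ARTICLES)
print(f"current average: {current_avg:.2f} revisions per article")
print(f"drop since 2008: {percent_drop(AVG_2008, current_avg):.1f}%")
```

The same two functions could be rerun with later counts to check whether the anticipated reduction has happened.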

Pruning older revisions

Many revisions of Wikipedia articles are very minor and could be "pruned" from the overall History of article revisions, to leave only the more major revisions. Even if the minor revisions were actually left in storage, perhaps they could be bypassed when listing all other article revisions. However, actually erasing specific revisions from an article might require changing, or reconfiguring, the software behind Wikipedia, the MediaWiki system.

Perhaps 50% of all revisions are hackings/jokes + revert: it's not just the hacking of articles that escalates the total revisions, but the instant reverting that doubles the total revisions to recover from hacking.

If the Wikipedia system for storing pages were changed, then over half of all revisions could be erased from Wikipedia servers and no longer listed under the "History" tab of logged revisions:

  • Once a revision is judged to be a hacking/humor, then "erase" rather than revert: the hacked version would be erased from storage and from the logs (at some point).
  • If a single user makes several small consecutive edits to an article in one day, then combine those small edits as one revision.

Again, erasing or combining old revisions might require changing, or adjusting, the MediaWiki software behind Wikipedia.
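
The second idea, combining one user's small consecutive edits from a single day, might look roughly like the sketch below. It does not describe how MediaWiki stores pages: the `Revision` record, the `combine_small_edits` function and the 200-byte limit for a "small" edit are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Revision:
    """Simplified stand-in for a stored revision (hypothetical fields)."""
    user: str
    timestamp: datetime
    bytes_changed: int  # size of the change made by this edit
    text: str           # full article text after this edit


def combine_small_edits(history: list[Revision], small: int = 200) -> list[Revision]:
    """Merge runs of small consecutive edits by one user on one day.

    history must be ordered oldest first; small is an assumed byte limit.
    """
    combined: list[Revision] = []
    last_was_small = False
    for rev in history:
        is_small = abs(rev.bytes_changed) <= small
        prev = combined[-1] if combined else None
        if (prev is not None and last_was_small and is_small
                and prev.user == rev.user
                and prev.timestamp.date() == rev.timestamp.date()):
            # Keep only the newest text; add up the sizes of the changes.
            combined[-1] = Revision(rev.user, rev.timestamp,
                                    prev.bytes_changed + rev.bytes_changed,
                                    rev.text)
        else:
            combined.append(rev)
        last_was_small = is_small
    return combined


history = [
    Revision("Ann", datetime(2008, 5, 1, 9, 0), 12, "v1"),
    Revision("Ann", datetime(2008, 5, 1, 9, 5), 8, "v2"),
    Revision("Ann", datetime(2008, 5, 1, 9, 9), 20, "v3"),
    Revision("Bob", datetime(2008, 5, 1, 10, 0), 15, "v4"),
    Revision("Ann", datetime(2008, 5, 2, 8, 0), 5, "v5"),
]
print(len(combine_small_edits(history)))  # the three morning edits become one
```

Because each revision holds the full text, keeping the last revision of a run loses no final wording; only the intermediate steps disappear.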

Social impact of hackings

Long lists of revisions showing hackings, each followed by instant-reverts, can cause the History listing of revisions to become much longer. Beyond a survey of prankster edits, there's not much use for the long-term logging of a hacking+revert. It's analogous to bird droppings: it might be beneficial to notice some short-term patterns, such as when automobiles parked under some trees get bombarded with bird droppings, but it is less useful to record all the millions of bird droppings, everywhere in the world, in a giant database of history listings. Just forget about those bird droppings, rather than log each for eternity. Seek to have a "clean car", and focus on issues such as dents or tree branches falling on an automobile.

Implications

The process of erasing a revision should require some special authority to avoid edit-wars that would use erase to negate other contributions. The 3-revert rule (WP:3RR) is intended to limit edit-wars during a one-day period. However, unrestricted use of erase could prolong edit-wars where revisions were removed without reverting.

Also, an erase could complicate the wikiserver databases, unless an erase was treated as a "delete" transaction in the storage. The database transaction log could handle the erase as a delete of that particular revision. Internally, the article history could record a hacking as a "Try" followed by an "Erase" so that the list of revisions would not show either transaction under the "History" tab when listing all the true revisions. However, long term, it might be more efficient if the hacking+erase really disappeared from the article storage at some point. Those combined "Try+Erase" transactions could be removed from the system after some period of time. Obviously, each transaction must be recorded, short-term, to allow backup/recovery of the edited articles, in the event of storage failure mid-way before the erase took effect.
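
A minimal sketch of that "Try" plus "Erase" bookkeeping follows. It is not the MediaWiki database design: the `Transaction` record, the three kind labels, and the 30-day retention period are assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Transaction:
    """One entry in a hypothetical article transaction log."""
    rev_id: int
    kind: str                     # "Save", "Try" (a hacking) or "Erase"
    when: datetime
    erases: Optional[int] = None  # rev_id of the "Try" cancelled by an "Erase"


def erase(log: list[Transaction], rev_id: int, when: datetime) -> None:
    """Relabel revision rev_id as a "Try" and append the matching "Erase"."""
    target = next((t for t in log if t.rev_id == rev_id and t.kind == "Save"), None)
    if target is None:
        raise ValueError(f"no saved revision {rev_id} to erase")
    target.kind = "Try"
    next_id = max(t.rev_id for t in log) + 1
    log.append(Transaction(next_id, "Erase", when, erases=rev_id))


def visible_history(log: list[Transaction]) -> list[Transaction]:
    """Entries shown under the History tab: neither "Try" nor "Erase"."""
    return [t for t in log if t.kind == "Save"]


def purge_erased(log: list[Transaction], now: datetime,
                 keep_for: timedelta = timedelta(days=30)) -> list[Transaction]:
    """Drop each Try+Erase pair once its Erase is older than keep_for."""
    expired = {t.erases for t in log
               if t.kind == "Erase" and now - t.when >= keep_for}
    return [t for t in log
            if not (t.kind == "Try" and t.rev_id in expired)
            and not (t.kind == "Erase" and t.erases in expired)]


log = [Transaction(1, "Save", datetime(2008, 6, 1)),
       Transaction(2, "Save", datetime(2008, 6, 2))]  # revision 2 is a hacking
erase(log, 2, datetime(2008, 6, 3))
print([t.rev_id for t in visible_history(log)])
print(len(purge_erased(log, now=datetime(2008, 8, 1))))
```

Both transactions stay in the log during the retention period, which is what allows backup and recovery if storage fails before the erase takes effect.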

Alternatives

If hackings cannot be erased, then perhaps they could be skipped in History listings. A special type of revert could indicate that the prior revision was merely hacked text, with the result that both the hacked+reverted entries would be skipped under a History listing. It can be very tedious, when stepping through multiple revisions, to see "Page blanked with" and then "Reverted" to show the entire text of the article changed in both revisions. Skipping over the hacked+reverted revisions would cut the History listing of many high-traffic articles by about 50% or more. However, many low-traffic articles have gone years without being hacked.
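
The skipping rule could be sketched as a filter over the History listing. The `Entry` record and its `reverts_hack` flag are hypothetical; the flag stands in for the "special type of revert" proposed above, which does not exist in MediaWiki as far as this essay states.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One line of a History listing (hypothetical structure)."""
    rev_id: int
    summary: str
    reverts_hack: bool = False  # the proposed special type of revert


def listing_without_hacks(history: list[Entry]) -> list[Entry]:
    """Skip each flagged revert together with the revision just before it.

    history must be ordered oldest first.
    """
    shown: list[Entry] = []
    for entry in history:
        if entry.reverts_hack:
            if shown:
                shown.pop()  # drop the hacked revision that was reverted
            continue         # and do not list the revert itself
        shown.append(entry)
    return shown


history = [
    Entry(1, "Expanded intro"),
    Entry(2, "Page blanked with"),
    Entry(3, "Reverted", reverts_hack=True),
    Entry(4, "Added source"),
]
print([e.rev_id for e in listing_without_hacks(history)])
```

This sketch assumes each hacking is a single revision; a hacking spread over several edits would need the revert to record how many revisions it undoes.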

  1. The average revisions per article is calculated as the total revisions divided by total pages: avg = #total_edits / #pages, as {{NUMBEROFEDITS:R}} / {{NUMBEROFPAGES:R}}.