Robots exclusion standard

standard used to advise web crawlers and scrapers not to index a web page or site

The robots exclusion standard (also called the robots exclusion protocol or robots.txt protocol) is a way of telling Web crawlers and other Web robots which parts of a Web site they may access.

To give robots these instructions, site owners put a text file called robots.txt in the top-level (root) directory of their Web site, e.g. http://www.example.com/robots.txt.[1] This text file tells robots which parts of the site they can and cannot access. However, the file is only a request. Robots can ignore it, and malicious (bad) robots often do.[2] If the robots.txt file does not exist, Web robots assume that they can access all parts of the site.
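
Because the file always sits at the top level of the site, a robot can work out where to look from the address of any page. The following is a minimal sketch in Python of that step. The function name robots_txt_url and the page address are made up for illustration and are not part of the standard.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path with /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page address on the example site used above.
print(robots_txt_url("http://www.example.com/shop/item.html?id=3"))
```

A polite robot would fetch that address first and read the rules before asking for any other page on the site.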

Examples of robots.txt files
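
The sketch below shows three common robots.txt files and how a well-behaved robot might read them. It uses Python and the RobotFileParser class from its standard library. The robot name "ExampleBot", the page addresses, and the helper function is_allowed are assumptions made up for this illustration. The file contents are typical examples, not text taken from a real site.

```python
from urllib.robotparser import RobotFileParser

# Lets every robot visit every page (an empty Disallow blocks nothing).
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

# Asks every robot to stay away from the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Asks every robot to stay out of one directory only.
BLOCK_ONE_DIRECTORY = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, robot_name: str, url: str) -> bool:
    """Check whether robots_txt lets robot_name fetch url (illustrative helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(robot_name, url)


page = "http://www.example.com/private/notes.html"
print(is_allowed(ALLOW_ALL, "ExampleBot", page))
print(is_allowed(BLOCK_ALL, "ExampleBot", page))
print(is_allowed(BLOCK_ONE_DIRECTORY, "ExampleBot", page))
print(is_allowed(BLOCK_ONE_DIRECTORY, "ExampleBot", "http://www.example.com/index.html"))
```

The check only tells a robot what the site owner has asked for. Nothing in the file stops a robot that chooses not to run such a check.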

References

  1. "Robot Exclusion Standard". HelpForWebBeginners.com. Archived from the original on 2011-12-08. Retrieved 2012-02-13.
  2. "About /robots.txt". Robotstxt.org. Retrieved 2012-02-13.