Robots exclusion standard

standard used to advise web crawlers and scrapers not to index a web page or site

The robots exclusion standard (also called the robots exclusion protocol or robots.txt protocol) is a way of telling Web crawlers and other Web robots which parts of a Web site they may visit.

To tell robots which pages of a Web site they can access, site owners put a text file called robots.txt in the top-level (root) directory of their Web site, e.g. http://www.example.com/robots.txt.[1] This text file tells robots which parts of the site they can and cannot access. However, the file is only a request: robots can ignore it, and malicious (bad) robots are the most likely to do so.[2] If the robots.txt file does not exist, Web robots assume that they can access all parts of the site.
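As a rough sketch of how a well-behaved robot might follow these rules, the Python code below works out where a site's robots.txt file lives and then decides whether a page may be fetched. It uses Python's standard urllib.robotparser module. The function names robots_txt_url and may_fetch and the robot name ExampleBot are made up for this example, and the code assumes the robot has already downloaded the file's text (or found that the file does not exist).

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_txt_url(page_url):
    """Return the robots.txt URL for the site that hosts page_url.

    Hypothetical helper: keeps the scheme and host, and replaces the
    path with /robots.txt at the top level of the site.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def may_fetch(robots_txt, user_agent, page_url):
    """Decide whether a polite robot should fetch page_url.

    robots_txt is the text of the site's robots.txt file, or None if
    the site does not have one (an assumption of this sketch).
    """
    if robots_txt is None:
        # No robots.txt file: the robot assumes the whole site is open.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)


if __name__ == "__main__":
    page = "http://www.example.com/private/notes.html"
    print(robots_txt_url(page))
    print(may_fetch(None, "ExampleBot", page))
    print(may_fetch("User-agent: *\nDisallow: /private/\n", "ExampleBot", page))
```

Nothing in this code forces a robot to obey the answer. A robot that never calls a check like may_fetch simply ignores the file, which is why robots.txt is not a security measure.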

Examples of robots.txt files
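The three short files below are illustrative examples, not taken from any real site. They are written as Python strings so that they can be checked with Python's standard urllib.robotparser module. The helper name is_allowed and the robot name ExampleBot are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Example 1: every robot may visit every page.
# An empty Disallow line means nothing is blocked.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

# Example 2: no robot may visit any page.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Example 3: every robot must stay out of one directory.
BLOCK_ONE_DIRECTORY = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Hypothetical helper: check url against the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    home = "http://www.example.com/index.html"
    secret = "http://www.example.com/private/secret.html"
    for name, text in [("allow all", ALLOW_ALL),
                       ("block all", BLOCK_ALL),
                       ("block one directory", BLOCK_ONE_DIRECTORY)]:
        print(name, is_allowed(text, "ExampleBot", home),
              is_allowed(text, "ExampleBot", secret))
```

In each file, the User-agent line says which robots the rules are for (an asterisk means all robots), and each Disallow line names a path that those robots are asked not to visit.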

References

  1. "Robot Exclusion Standard". HelpForWebBeginners.com. Retrieved 2012-02-13.
  2. "About /robots.txt". Robotstxt.org. Retrieved 2012-02-13.