New standards are being developed to extend the Robots Exclusion Protocol and Meta Robots tags, allowing them to block all AI crawlers from using publicly available web content for training purposes. The proposal, drafted by Krishna Madhavan, Principal Product Manager at Microsoft AI, and Fabrice Canel, Principal Product Manager at Microsoft Bing, will make it easy to block all mainstream AI training crawlers with one simple rule, which can also be applied to individual crawlers.
Virtually all legitimate crawlers obey robots.txt and Meta Robots tags, which makes this proposal a dream come true for publishers who don’t want their content used for AI training purposes.
Internet Engineering Task Force (IETF)
The Internet Engineering Task Force (IETF) is an international Internet standards-making group, founded in 1986, that coordinates the development and codification of standards that everyone can voluntarily agree on. For example, the Robots Exclusion Protocol was independently created in 1994, and in 2019 Google proposed that the IETF adopt it as an official standard with agreed-upon definitions. In 2022 the IETF published an official Robots Exclusion Protocol that defines what it is and extends the original protocol.
Three Ways To Block AI Training Bots
The draft proposal for blocking AI training bots suggests three ways to block the bots:
Robots.txt Protocols
Meta Robots HTML Elements
Application Layer Response Header
1. Robots.txt For Blocking AI Robots
The draft proposal seeks to create additional rules that will extend the Robots Exclusion Protocol (Robots.txt) to AI training robots. This will bring about some order and give publishers a choice over which robots are allowed to crawl their websites.
Adherence to the Robots.txt protocol is voluntary, but legitimate crawlers tend to obey it.
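To illustrate what voluntary adherence looks like in practice, here is a minimal Python sketch of the check a well-behaved crawler might run before fetching a page. It uses the standard library's urllib.robotparser and covers only the existing access rules (User-agent, Disallow), not the proposed AI training rules. The robots.txt content, the "ExampleAIBot" and "OtherBot" user agents, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: "ExampleAIBot" is an invented user agent that is
# blocked site-wide; every other crawler is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "OtherBot"):
        for path in ("/article.html", "/private/notes.html"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

Nothing in this check is enforced by the server. A crawler that never calls something like is_allowed() can still request the page, which is why the protocol only works for bots that choose to follow it.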
The draft explains the purpose of the new Robots.txt rules:
“While the Robots Exclusion Protocol enables service owners to control how, if at all, automated clients known as crawlers may access the URIs on their services as defined by [RFC8288], the protocol doesn’t provide controls on how the data returned by their service may be used in training generative AI foundation models.
Application developers are requested to honor these tags. The tags are not a form of access authorization however.”
An important quality of the new robots.txt rules and the meta robots HTML elements is that legitimate AI training crawlers, like other legitimate bots, tend to follow these protocols voluntarily. This will simplify bot blocking for publishers.
The following are the proposed Robots.txt rules: