The Modern Guide To Robots.txt: How To Use It And Avoid The Pitfalls

The Modern Guide To Robots.txt

Robots.txt just turned 30 – cue the existential crisis! Like many hitting the big 3-0, it’s wondering if it’s still relevant in today’s world of AI and advanced search algorithms.

Spoiler alert: It definitely is!

Let’s take a look at how this file still plays a key role in managing how search engines crawl your site, how to leverage it correctly, and common pitfalls to avoid.

What Is A Robots.txt File?

A robots.txt file provides crawlers like Googlebot and Bingbot with guidelines for crawling your site. Like a map or directory at the entrance of a museum, it acts as a set of instructions at the entrance of the website, including details on:


Which crawlers are and aren’t allowed to enter.
Any restricted areas (pages) that shouldn’t be crawled.
Priority pages to crawl – via the XML sitemap declaration.

Its primary role is to manage crawler access to certain areas of a website by specifying which parts of the site are “off-limits.” This helps ensure that crawlers focus on the most relevant content rather than wasting the crawl budget on low-value content.

While a robots.txt file guides crawlers, it’s important to note that not all bots follow its instructions, especially malicious ones. But for most legitimate search engines, adhering to the robots.txt directives is standard practice.

What Is Included In A Robots.txt File?

Robots.txt files consist of lines of directives for search engine crawlers and other bots.

Valid lines in a robots.txt file consist of a field, a colon, and a value.

Robots.txt files also commonly include blank lines to improve readability and comments to help website owners keep track of directives.
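For example, a minimal robots.txt combining all three elements – a comment, blank lines, and field: value directives – might look like the following (the path and sitemap URL here are just placeholders):

#Block all crawlers from internal search results and declare the XML sitemap

user-agent: *

disallow: /search

sitemap: https://www.example.com/sitemap.xml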

Image from author, November 2024
To get a better understanding of what is typically included in a robots.txt file and how different sites leverage it, I looked at robots.txt files for 60 domains with a high share of voice across health, financial services, retail, and high-tech.

Excluding comments and blank lines, the average number of lines across 60 robots.txt files was 152.

Large publishers and aggregators, such as hotels.com, forbes.com, and nytimes.com, typically had longer files, while hospitals like pennmedicine.org and hopkinsmedicine.com typically had shorter files. Retail sites’ robots.txt files typically fell close to the average of 152 lines.

All sites analyzed included the fields user-agent and disallow within their robots.txt files, and 77% of sites included a sitemap declaration with the field sitemap.

Fields leveraged less frequently were allow (used by 60% of sites) and crawl-delay (used by 20% of sites).




Field            % of Sites Leveraging
user-agent       100%
disallow         100%
sitemap          77%
allow            60%
crawl-delay      20%

Robots.txt Syntax

Now that we’ve covered what types of fields are typically included in a robots.txt, we can dive deeper into what each one means and how to use it.

For more information on robots.txt syntax and how it is interpreted by Google, check out Google’s robots.txt documentation.

User-Agent

The user-agent field specifies what crawler the directives (disallow, allow) apply to. You can use the user-agent field to create rules that apply to specific bots/crawlers or use a wild card to indicate rules that apply to all crawlers.

For example, the below syntax indicates that any of the following directives only apply to Googlebot.

user-agent: Googlebot

If you want to create rules that apply to all crawlers, you can use a wildcard instead of naming a specific crawler.

user-agent: *

You can include multiple user-agent fields within your robots.txt to provide specific rules for different crawlers or groups of crawlers, for example:

user-agent: *

#Rules here would apply to all crawlers

user-agent: Googlebot

#Rules here would only apply to Googlebot

user-agent: otherbot1

user-agent: otherbot2

user-agent: otherbot3

#Rules here would apply to otherbot1, otherbot2, and otherbot3

Disallow And Allow

The disallow field specifies paths that designated crawlers should not access. The allow field specifies paths that designated crawlers can access.

Because Googlebot and other crawlers will assume they can access any URLs that aren’t specifically disallowed, many sites keep it simple and only specify what paths should not be accessed using the disallow field.

For example, the below syntax would tell all crawlers not to access URLs matching the path /do-not-enter.

user-agent: *

disallow: /do-not-enter

#All crawlers are blocked from crawling pages with the path /do-not-enter

If you’re using both allow and disallow fields within your robots.txt, make sure to read the section on order of precedence for rules in Google’s documentation.

Generally, in the case of conflicting rules, Google will use the more specific rule.

For example, in the below case, Google won’t crawl pages with the path /do-not-enter because the disallow rule is more specific than the allow rule.

user-agent: *

allow: /

disallow: /do-not-enter

If neither rule is more specific, Google will default to using the less restrictive rule.

In the instance below, Google would crawl pages with the path /do-not-enter because the allow rule is less restrictive than the disallow rule.

user-agent: *

allow: /do-not-enter

disallow: /do-not-enter

Note that if there is no path specified for the allow or disallow fields, the rule will be ignored.

user-agent: *

disallow:

This is very different from only including a forward slash (/) as the value for the disallow field, which would match the root domain and any lower-level URL (translation: every page on your site).  

If you want your site to show up in search results, make sure you don’t have the following code. It will block all search engines from crawling all pages on your site.

user-agent: *

disallow: /

This might seem obvious, but believe me, I’ve seen it happen.

URL Paths

URL paths are the portion of the URL after the protocol, subdomain, and domain beginning with a forward slash (/). For the example URL https://www.example.com/guides/technical/robots-txt, the path would be /guides/technical/robots-txt.

Image from author, November 2024
URL paths are case-sensitive, so be sure to double-check that the use of capitals and lower cases in the robots.txt aligns with the intended URL path.
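For example, the following hypothetical rule would block https://www.example.com/Guides but not https://www.example.com/guides, because the capitalized path /Guides does not match the lowercase /guides:

user-agent: *

disallow: /Guides

#Crawlers are blocked from /Guides, but /guides remains crawlable because paths are case-sensitive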

Special Characters

Google, Bing, and other major search engines also support a limited number of special characters to help match URL paths.

A special character is a symbol that has a unique function or meaning instead of just representing a regular letter or number. Special characters supported by Google in robots.txt are:


Asterisk (*) – matches 0 or more instances of any character.
Dollar sign ($) – designates the end of the URL.

To illustrate how these special characters work, assume we have a small site with the following URLs:


https://www.example.com/
https://www.example.com/search
https://www.example.com/guides
https://www.example.com/guides/technical
https://www.example.com/guides/technical/robots-txt
https://www.example.com/guides/technical/robots-txt.pdf
https://www.example.com/guides/technical/xml-sitemaps
https://www.example.com/guides/technical/xml-sitemaps.pdf
https://www.example.com/guides/content
https://www.example.com/guides/content/on-page-optimization
https://www.example.com/guides/content/on-page-optimization.pdf

Example Scenario 1: Block Site Search Results

A common use of robots.txt is to block internal site search results, as these pages typically aren’t valuable for organic search results.

For this example, assume when users conduct a search on https://www.example.com/search, their query is appended to the URL.

If a user searched “xml sitemap guide,” the new URL for the search results page would be https://www.example.com/search?search-query=xml-sitemap-guide.

When you specify a URL path in the robots.txt, it matches any URLs with that path, not just the exact URL. So, to block both the URLs above, using a wildcard isn’t necessary.

The following rule would match both https://www.example.com/search and https://www.example.com/search?search-query=xml-sitemap-guide.

user-agent: *

disallow: /search

#All crawlers are blocked from crawling pages with the path /search

If a wildcard (*) were added, the results would be the same.

user-agent: *

disallow: /search*

#All crawlers are blocked from crawling pages with the path /search

Example Scenario 2: Block PDF Files

In some cases, you may want to use the robots.txt file to block specific types of files.

Imagine the site decided to create PDF versions of each guide to make it easy for users to print. The result is two URLs with exactly the same content, so the site owner may want to block search engines from crawling the PDF versions of each guide.

In this case, using a wildcard (*) would be helpful to match the URLs where the path starts with /guides/ and ends with .pdf, but the characters in between vary.

user-agent: *

disallow: /guides/*.pdf

#All crawlers are blocked from crawling URLs where the path starts with /guides/ and ends with .pdf

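While the wildcard alone covers this scenario, the dollar sign ($) described earlier can make the rule more precise. As a sketch rather than the only valid approach, anchoring the rule to the end of the URL ensures that only URLs actually ending in .pdf are blocked:

user-agent: *

disallow: /guides/*.pdf$

#All crawlers are blocked from crawling URLs that start with /guides/ and end in .pdf

The difference is that /guides/*.pdf would also match a hypothetical parameterized URL such as /guides/technical/robots-txt.pdf?print=true, while /guides/*.pdf$ only matches URLs that end exactly in .pdf.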