Understanding how to use the robots.txt file is crucial for any website’s SEO strategy. Mistakes in this file can impact how your website is crawled and your pages’ search appearance. Getting it right, on the other hand, can improve crawling efficiency and mitigate crawling issues.
Google recently reminded website owners about the importance of using robots.txt to block unnecessary URLs.
Those include add-to-cart, login, or checkout pages. But how do you use robots.txt properly?
In this article, we will walk you through every nuance of doing just that.
What Is Robots.txt?
Robots.txt is a simple text file that sits in the root directory of your site and tells crawlers which parts of your site they may crawl and which they should avoid.
The table below provides a quick reference to the key robots.txt directives.
| Directive | Description |
| --- | --- |
| User-agent | Specifies which crawler the rules apply to. See user agent tokens. Using * targets all crawlers. |
| Disallow | Prevents specified URLs from being crawled. |
| Allow | Allows specific URLs to be crawled, even if a parent directory is disallowed. |
| Sitemap | Indicates the location of your XML sitemap, helping search engines discover it. |
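Putting these directives together, a minimal robots.txt might look like the following (the paths and sitemap URL are illustrative, not taken from any real site):

User-agent: *
Disallow: /checkout/
Allow: /checkout/help/
Sitemap: https://www.example.com/sitemap.xml

Here, everything under /checkout/ is blocked except the /checkout/help/ subdirectory, and the sitemap location is declared for all crawlers.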
This is an example of a robots.txt file from ikea.com with multiple rules.
Example of robots.txt from ikea.com
Note that robots.txt doesn’t support full regular expressions and only has two wildcards:
Asterisk (*), which matches zero or more characters.
Dollar sign ($), which matches the end of a URL.
Also, note that its rules are case-sensitive, e.g., “filter=” isn’t equal to “Filter=.”
Order Of Precedence In Robots.txt
When setting up a robots.txt file, it’s important to know the order in which search engines decide which rules to apply in case of conflicting rules.
They follow these two key rules:
1. Most Specific Rule
The rule that matches more characters in the URL will be applied. For example:
User-agent: *
Disallow: /downloads/
Allow: /downloads/free/
In this case, the “Allow: /downloads/free/” rule is more specific than “Disallow: /downloads/” because it matches more characters of the URL path.
Google will allow crawling of subfolder “/downloads/free/” but block everything else under “/downloads/.”
2. Least Restrictive Rule
When multiple rules are equally specific, for example:
User-agent: *
Disallow: /downloads/
Allow: /downloads/
Google will choose the least restrictive one. This means Google will allow access to /downloads/.
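The two precedence rules above can be sketched in code. The following is a simplified illustration (not Google's actual implementation): each rule pattern is translated into a regex where * matches any characters and a trailing $ anchors the end of the URL, the longest matching rule wins, and on a tie the Allow rule is preferred.

```python
import re

def rule_to_regex(path):
    # Translate a robots.txt path pattern into a regex:
    # '*' matches any sequence of characters, a trailing '$' anchors the end.
    pattern = re.escape(path).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.compile(pattern)

def is_allowed(url_path, allows, disallows):
    # Collect (rule_length, allowed?) for every rule that matches the path.
    matches = []
    for rule in allows:
        if rule_to_regex(rule).match(url_path):
            matches.append((len(rule), True))
    for rule in disallows:
        if rule_to_regex(rule).match(url_path):
            matches.append((len(rule), False))
    if not matches:
        return True  # no matching rule: crawling is allowed by default
    # Most specific (longest) rule wins; on a tie, True (Allow) sorts
    # ahead of False (Disallow), i.e., the least restrictive rule wins.
    matches.sort(reverse=True)
    return matches[0][1]

# Rule 1: the more specific Allow overrides the broader Disallow.
print(is_allowed("/downloads/free/file.zip", ["/downloads/free/"], ["/downloads/"]))  # True
print(is_allowed("/downloads/paid/file.zip", ["/downloads/free/"], ["/downloads/"]))  # False
# Rule 2: equally specific rules resolve to the least restrictive one.
print(is_allowed("/downloads/", ["/downloads/"], ["/downloads/"]))  # True
```

This mirrors the two examples above: /downloads/free/ stays crawlable while the rest of /downloads/ is blocked, and the equal-length conflict resolves to Allow.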
Why Is Robots.txt Important In SEO?
Blocking unimportant pages with robots.txt helps Googlebot focus its crawl budget on valuable parts of the website and on crawling new pages. It also helps search engines save computing power, contributing to better sustainability.
Imagine you have an online store with hundreds of thousands of pages. Some sections, such as filtered category pages, can generate a practically infinite number of URL variations.
Those pages don’t offer unique value, essentially contain duplicate content, and can create an infinite crawl space, wasting your server’s and Googlebot’s resources.
That is where robots.txt comes in, preventing search engine bots from crawling those pages.
If you don’t do that, Google may try to crawl an infinite number of URLs with different (even non-existent) search parameter values, causing spikes and a waste of crawl budget.
When To Use Robots.txt
As a general rule, you should always ask why a given page exists and whether it contains anything worth crawling and indexing.
Based on this principle, you should generally block:
URLs that contain query parameters such as:
Internal search.
Faceted navigation URLs created by filtering or sorting options if they are not part of URL structure and SEO strategy.
Action URLs like add to wishlist or add to cart.
Private parts of the website, like login pages.
JavaScript files not relevant to website content or rendering, such as tracking scripts.
Scrapers and AI chatbots, to prevent them from using your content for training purposes.
Let’s dive into how you can use robots.txt for each case.
1. Block Internal Search Pages
The most common and absolutely necessary step is to block internal search URLs from being crawled by Google and other search engines, as almost every website has an internal search functionality.
On WordPress websites, it is usually an “s” parameter, and the URL looks like this:
https://www.example.com/?s=google
Gary Illyes from Google has repeatedly warned against leaving “action” URLs crawlable, as Googlebot may crawl them indefinitely, even non-existent URLs with different parameter combinations.
Here is the rule you can use in your robots.txt to block such URLs from being crawled:
User-agent: *
Disallow: *s=*
The User-agent: * line specifies that the rule applies to all web crawlers, including Googlebot, Bingbot, etc.
The Disallow: *s=* line tells all crawlers not to crawl any URLs that contain the query parameter “s=.” The wildcard “*” means the rule matches any sequence of characters before or after “s=.” However, it will not match URLs with an uppercase “S” like “/?S=” since the rule is case-sensitive.
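The case-sensitivity point can be checked with a quick regex approximation of the rule (an illustration only; robots.txt itself does not support full regular expressions):

```python
import re

# Regex approximation of "Disallow: *s=*": any URL containing "s=".
rule = re.compile(r".*s=.*")

print(bool(rule.match("/?s=google")))  # True: contains lowercase "s="
print(bool(rule.match("/?S=google")))  # False: uppercase "S=" does not match
```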
Here is an example of a website that managed to drastically reduce the crawling of non-existent internal search URLs after blocking them via robots.txt.
Screenshot from crawl stats report
Note that Google may still index those blocked pages if they are linked to from elsewhere, but you don’t need to worry about them, as they will be dropped over time.
2. Block Faceted Navigation URLs
Faceted navigation is an integral part of every ecommerce website. There can be cases where faceted navigation is part of an SEO strategy and aimed at ranking for general product searches.
For example, Zalando uses faceted navigation URLs for color options to rank for general product keywords like “gray t-shirt.”
However, in most cases this is not the goal; filter parameters are used merely for filtering products, creating dozens of pages with duplicate content.
Technically, those parameters are no different from internal search parameters, except that there may be several of them, and you need to make sure you disallow them all.
For example, if you have filters with the following parameters “sortby,” “color,” and “price,” you may use this set of rules:
User-agent: *
Disallow: *sortby=*
Disallow: *color=*
Disallow: *price=*
Based on your specific case, there may be more parameters, and you may need to add all of them.
What About UTM Parameters?
UTM parameters are used for tracking purposes.
As John Mueller stated in his Reddit post, you don’t need to worry about URL parameters that link to your pages externally.
John Mueller on UTM parameters
Just make sure to block any random parameters you use internally, and avoid linking internally to those pages, e.g., linking from your article pages to your internal search results page, such as “https://www.example.com/?s=google.”
3. Block PDF URLs
Let’s say you have a lot of PDF documents, such as product guides, brochures, or downloadable papers, and you don’t want them crawled.
Here is a simple robots.txt rule that will block search engine bots from accessing those documents:
User-agent: *
Disallow: /*.pdf$
The “Disallow: /*.pdf$” line tells crawlers not to crawl any URLs that end with .pdf.
By using /*, the rule matches any path on the website. As a result, any URL ending with .pdf will be blocked from crawling.
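One detail worth noting: because “$” anchors the end of the URL, a PDF URL with a trailing query string is not matched by this rule. A regex approximation (illustrative only; robots.txt is not full regex) shows the behavior:

```python
import re

# Regex approximation of "Disallow: /*.pdf$": any path ending in ".pdf".
pdf_rule = re.compile(r"/.*\.pdf$")

print(bool(pdf_rule.match("/guides/manual.pdf")))      # True: URL ends with .pdf
print(bool(pdf_rule.match("/guides/manual.pdf?v=2")))  # False: "$" requires .pdf at the very end
```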
If you have a WordPress website and want to disallow PDFs from the uploads directory where you upload them via the CMS, you can use the following rule:
User-agent: *
Disallow: /wp-content/uploads/*.pdf$
Allow: /wp-content/uploads/2024/09/allowed-document.pdf$
You can see that we have conflicting rules here.
In case of conflicting rules, the more specific (longer) one takes priority, so the Allow line ensures that only the specific file “wp-content/uploads/2024/09/allowed-document.pdf” can be crawled, while all other PDFs in the uploads directory stay blocked.