A Guide To Robots.txt: Best Practices For SEO

4. Block A Directory

Let’s say you have an API endpoint that receives the data submitted from a form. Your form likely has an action attribute such as action="/form/submissions/".

The issue is that Google will try to crawl that URL, /form/submissions/, which you likely don’t want. You can block these URLs from being crawled with this rule:

User-agent: *
Disallow: /form/

By specifying a directory in the Disallow rule, you tell crawlers to avoid every URL under that directory. Because rules match by prefix, you don’t need a trailing (*) wildcard, as in “/form/*”.

Note that Disallow and Allow directives must always use relative paths, never absolute URLs such as “https://www.example.com/form/”.

Be careful to avoid overly broad rules. For example, /form without a trailing slash also matches /form-design-examples/, which may be a blog page that you want indexed.
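To see the difference the trailing slash makes, the following minimal sketch feeds both variants to Python's standard-library urllib.robotparser, which uses plain prefix matching for simple rules like these. The example URLs and the helper name blocked are illustrative assumptions, not part of any real site.

```python
from urllib.robotparser import RobotFileParser

def blocked(disallow_path: str, url: str) -> bool:
    """Return True if `url` is blocked for all crawlers by a single Disallow rule."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_path}"])
    return not parser.can_fetch("*", url)

# Hypothetical URLs used only for illustration.
form_url = "https://www.example.com/form/submissions/"
blog_url = "https://www.example.com/form-design-examples/"

# With the trailing slash, only URLs under the /form/ directory are blocked.
print(blocked("/form/", form_url))  # True
print(blocked("/form/", blog_url))  # False

# Without it, the rule is a bare prefix and also catches the blog page.
print(blocked("/form", blog_url))   # True
```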


5. Block User Account URLs

If you have an ecommerce website, you likely have directories that start with “/myaccount/,” such as “/myaccount/orders/” or “/myaccount/profile/.”

If the top-level “/myaccount/” page is a sign-in page that you want indexed and findable in search, you may want to disallow only its subpages from being crawled by Googlebot.

You can use the Disallow rule in combination with the Allow rule to block everything under the “/myaccount/” directory (except the /myaccount/ page).

User-agent: *
Disallow: /myaccount/
Allow: /myaccount/$

Again, because Google applies the most specific matching rule, everything under the /myaccount/ directory is disallowed, while the /myaccount/ page itself (matched exactly by the $ end-of-URL anchor) can still be crawled.

Here’s another use case for combining the Disallow and Allow rules: if your search lives under the /search/ directory, you can let the main search page be found and indexed while blocking the actual search result URLs:

User-agent: *
Disallow: /search/
Allow: /search/$
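The "most specific rule wins" behavior can be sketched in a few lines of Python. This is a simplified illustration of longest-match precedence with the * wildcard and $ anchor, not Google's actual parser: it ignores user-agent groups and percent-encoding, and the function names pattern_to_regex and is_allowed are made up for this example. Python's built-in urllib.robotparser is not used here because it does not interpret $ or *.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern ('*' wildcard, trailing '$' anchor) to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Apply longest-match precedence; Allow wins when matching rules tie in length."""
    best = None  # (pattern length, True if the rule is an Allow)
    for directive, pattern in rules:
        if pattern and pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    # No matching rule means the path is crawlable by default.
    return True if best is None else best[1]

ACCOUNT_RULES = [("Disallow", "/myaccount/"), ("Allow", "/myaccount/$")]

print(is_allowed(ACCOUNT_RULES, "/myaccount/"))         # True
print(is_allowed(ACCOUNT_RULES, "/myaccount/orders/"))  # False
print(is_allowed(ACCOUNT_RULES, "/blog/"))              # True
```

The same two-rule pattern applies unchanged to the /search/ example: swapping the paths in ACCOUNT_RULES gives the equivalent result.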

6. Block Non-Render Related JavaScript Files

Most websites use JavaScript, and many of these scripts are not related to the rendering of content, such as tracking scripts or those used for loading AdSense.

Googlebot can crawl and render a website’s content without these scripts. Therefore, blocking them is safe and recommended, as it saves the requests and resources needed to fetch and parse them.

Below is a sample rule that disallows a JavaScript file containing tracking pixels.

User-agent: *
Disallow: /assets/js/pixels.js

7. Block AI Chatbots And Scrapers

Many publishers are concerned that their content is being unfairly used to train AI models without their consent, and they wish to prevent this.

#ai chatbots
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: cohere-ai
User-agent: Bytespider
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Diffbot
User-agent: PerplexityBot
Disallow: /

#scrapers
User-agent: Scrapy
User-agent: magpie-crawler
User-agent: CCBot
User-agent: omgili
User-agent: omgilibot
User-agent: Node/simplecrawler
Disallow: /

Here, each user agent is listed individually, and the rule Disallow: / tells those bots not to crawl any part of the site.
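As a quick sanity check, a grouped rule like this can be tested locally with Python's standard-library urllib.robotparser. The sketch below uses a shortened subset of the bot list and a hypothetical page URL; it only shows how a compliant parser reads the group, not how any particular bot actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Shortened subset of the rules above, for illustration only.
ROBOTS_TXT = """\
#ai chatbots
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "https://www.example.com/any-page/"  # hypothetical URL

# Every agent named in the group shares the single Disallow rule.
for bot in ("GPTBot", "ClaudeBot", "CCBot"):
    print(bot, parser.can_fetch(bot, page))  # False for each listed bot

# Agents that are not listed fall through and remain allowed.
print("Googlebot", parser.can_fetch("Googlebot", page))  # True
```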

Besides preventing AI training on your content, this can help reduce the load on your server by minimizing unnecessary crawling.

For ideas on which bots to block, check your server log files to see which crawlers are exhausting your servers. And remember that robots.txt doesn’t prevent unauthorized access: it is only a request that well-behaved crawlers choose to honor.
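One way to do that log check is to count requests per user agent. The sketch below assumes the common "combined" access-log format, where the user agent is the last quoted field on each line; the sample log lines and the function name count_user_agents are made up for illustration, so adapt the pattern to your own server's format.

```python
import re
from collections import Counter

# Assumption: "combined" log format, where the user agent is the last quoted field.
USER_AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')

def count_user_agents(log_lines):
    """Count how many requests each user-agent string made."""
    counts = Counter()
    for line in log_lines:
        match = USER_AGENT_FIELD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical log lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /a HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
    '203.0.113.5 - - [10/Oct/2024:13:55:37 +0000] "GET /b HTTP/1.1" 200 734 "-" "GPTBot/1.0"',
    '203.0.113.5 - - [10/Oct/2024:13:55:38 +0000] "GET /c HTTP/1.1" 200 288 "-" "GPTBot/1.0"',
    '198.51.100.7 - - [10/Oct/2024:13:56:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "Bytespider"',
    '192.0.2.44 - - [10/Oct/2024:13:57:12 +0000] "GET / HTTP/1.1" 200 901 "-" "Googlebot/2.1"',
]

counts = count_user_agents(SAMPLE_LOG)
for agent, hits in counts.most_common():
    print(f"{hits:>5}  {agent}")
```

In practice you would pass an open log file instead of SAMPLE_LOG, since iterating over a file yields one line at a time.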

8. Specify Sitemap URLs

Including your sitemap URL in the robots.txt file helps search engines easily discover all the important pages on your website. This is done by adding a specific line that points to your sitemap location, and you can specify multiple sitemaps, each on its own line.

Sitemap: https://www.example.com/sitemap/articles.xml
Sitemap: https://www.example.com/sitemap/news.xml
Sitemap: https://www.example.com/sitemap/video.xml

Unlike Allow or Disallow rules, which allow only a relative path, the Sitemap directive requires a full, absolute URL to indicate the location of the sitemap.
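A small local check can confirm that every Sitemap line is picked up and is an absolute URL. This sketch relies on RobotFileParser.site_maps() from Python's standard library (available in Python 3.8 and later); the helper names sitemap_urls and is_absolute are illustrative assumptions.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
Sitemap: https://www.example.com/sitemap/articles.xml
Sitemap: https://www.example.com/sitemap/news.xml
Sitemap: https://www.example.com/sitemap/video.xml
"""

def sitemap_urls(robots_txt: str) -> list[str]:
    """Return every Sitemap URL declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []  # site_maps() returns None when none are declared

def is_absolute(url: str) -> bool:
    """True if the URL has an http(s) scheme and a host name."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

for url in sitemap_urls(ROBOTS_TXT):
    print(url, "OK" if is_absolute(url) else "NOT ABSOLUTE")
```

This only checks the syntax of the declared URLs; it does not fetch them, so it cannot tell you whether each sitemap is actually reachable.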

Ensure the sitemap URLs are accessible to search engines and use proper syntax to avoid errors.

Image: Sitemap fetch error in Search Console

9. When To Use Crawl-Delay

The crawl-delay directive in robots.txt specifies the number of seconds a bot should wait before crawling the next page. While Googlebot does not recognize the crawl-delay directive, other bots may respect it.

It helps prevent server overload by controlling how frequently bots crawl your site.

For example, if you want ClaudeBot to crawl your content for AI training but want to avoid server overload, you can set a crawl delay to manage the interval between requests.

User-agent: ClaudeBot
Crawl-delay: 60

This instructs the ClaudeBot user agent to wait 60 seconds between requests when crawling the website.
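To confirm how a compliant parser reads this rule, you can query it with Python's standard-library urllib.robotparser, whose crawl_delay() method returns the delay that applies to a given user agent. This is only a sketch of how the directive is parsed; whether a real bot honors it is up to that bot.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: ClaudeBot", "Crawl-delay: 60"])

# The delay applies only to the user agent named in the group.
claude_delay = parser.crawl_delay("ClaudeBot")
other_delay = parser.crawl_delay("Googlebot")

print(claude_delay)  # 60
print(other_delay)   # None, since no rule group matches this agent
```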

Of course, there may be AI bots that don’t respect crawl delay directives. In that case, you may need to use a web firewall to rate limit them.

Troubleshooting Robots.txt

Once you’ve composed your robots.txt, you can use these tools to check that the syntax is correct and that you didn’t accidentally block an important URL.

1. Google Search Console Robots.txt Validator

Once you’ve updated your robots.txt, you must check whether it contains any errors or accidentally blocks URLs you want to be crawled, such as resources, images, or website sections.

Navigate to Settings > robots.txt, and you will find the built-in robots.txt validator. Below is the video of how to fetch and validate your robots.txt.

2. Google Robots.txt Parser

This is Google’s official robots.txt parser, the one used in Search Console.

Installing and running it on your local computer requires advanced skills. Still, it is well worth taking the time to do so as instructed on that page, because it lets you validate changes to your robots.txt file against the official Google parser before uploading them to your server.

Centralized Robots.txt Management

Each domain and subdomain must have its own robots.txt, as Googlebot doesn’t recognize root domain robots.txt for a subdomain.

This creates challenges when you have a website with a dozen subdomains, as it means you have to maintain many robots.txt files separately.

However, it is possible to host a robots.txt file on a subdomain, such as https://cdn.example.com/robots.txt, and set up a redirect from https://www.example.com/robots.txt to it.

You can also do the reverse: host the file only on the root domain and redirect from the subdomains to it.

Search engines will treat the redirected file as if it were located on the root domain. This approach allows centralized management of robots.txt rules for both your main domain and subdomains.

It helps make updates and maintenance more efficient. Otherwise, you would need to use a separate robots.txt file for each subdomain.
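The redirect itself is usually configured at the web server or CDN level, but the idea can be sketched as a tiny Python WSGI app. This is a minimal illustration under stated assumptions: the central file is assumed to live at the cdn.example.com URL from the example above, and the names app and simulate are hypothetical.

```python
# Assumption: the shared file lives at this URL, as in the example above.
CENTRAL_ROBOTS_URL = "https://cdn.example.com/robots.txt"

def app(environ, start_response):
    """Minimal WSGI app: redirect /robots.txt to the centrally hosted file."""
    if environ.get("PATH_INFO") == "/robots.txt":
        start_response("301 Moved Permanently", [("Location", CENTRAL_ROBOTS_URL)])
        return [b""]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not found"]

def simulate(path):
    """Call the app without a server and return (status, headers)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    app({"PATH_INFO": path}, start_response)
    return captured["status"], captured["headers"]

status, headers = simulate("/robots.txt")
print(status, headers["Location"])
```

Each domain that should share the central rules would serve the same redirect, so only the one file at CENTRAL_ROBOTS_URL ever needs editing.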

Conclusion

A properly optimized robots.txt file is crucial for managing a website’s crawl budget. It ensures that search engine crawlers like Googlebot spend their time on valuable pages rather than wasting resources on unnecessary ones.

In addition, blocking AI bots and scrapers using robots.txt can significantly reduce server load and save computing resources.

Make sure you always validate your changes to avoid unexpected crawlability issues.

However, remember that while blocking unimportant resources via robots.txt may help increase crawl efficiency, the main factors affecting crawl budget are high-quality content and page loading speed.

Happy crawling!

Featured Image: BestForBest/Shutterstock
