Google Search Central has launched a new series called “Crawling December” to provide insights into how Googlebot crawls and indexes webpages.
Google will publish a new article each week this month, exploring aspects of the crawling process that are rarely discussed but can significantly affect how a site is crawled.
The first post in the series covers the basics of crawling and sheds light on essential yet lesser-known details about how Googlebot handles page resources and manages crawl budgets.
Crawling Basics
Today’s websites are complex, relying heavily on JavaScript and CSS, which makes them harder to crawl than older HTML-only pages. Googlebot works much like a web browser, but on a different schedule.
When Googlebot visits a webpage, it first downloads the HTML from the main URL, which may link to JavaScript, CSS, images, and videos. Then, Google’s Web Rendering Service (WRS) uses Googlebot to download these resources to create the final page view.
Here are the steps in order:
1. Initial HTML download
2. Processing by the Web Rendering Service
3. Resource fetching
4. Final page construction
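To make these steps concrete, the sketch below mimics the fetch-then-render pattern in Python. It is purely illustrative, not Googlebot or WRS code; the https://example.com/ URL and the ResourceExtractor helper are hypothetical placeholders.

```python
# Illustrative sketch of the fetch-then-render steps described above.
# This is not Googlebot/WRS code; https://example.com/ is a placeholder.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ResourceExtractor(HTMLParser):
    """Collects URLs of sub-resources referenced by the HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <script src>, <link href> (CSS), <img>/<video src> roughly mirror the
        # JavaScript, CSS, image, and video resources mentioned in the post.
        if tag == "script" and attrs.get("src"):
            self.resources.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            self.resources.append(urljoin(self.base_url, attrs["href"]))
        elif tag in ("img", "video") and attrs.get("src"):
            self.resources.append(urljoin(self.base_url, attrs["src"]))


def crawl(url):
    # 1. Initial HTML download
    html = urlopen(url).read().decode("utf-8", errors="replace")

    # 2. Processing/rendering step: discover which resources the page needs
    extractor = ResourceExtractor(url)
    extractor.feed(html)

    # 3. Resource fetching (each extra fetch costs crawl budget)
    fetched = {res: urlopen(res).read() for res in extractor.resources}

    # 4. Final page construction would combine the HTML and resources here
    return html, fetched


if __name__ == "__main__":
    page_html, page_resources = crawl("https://example.com/")
    print(f"Fetched {len(page_resources)} sub-resources")
```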
Crawl Budget Management
Crawling extra resources can reduce the main website’s crawl budget. To help with this, Google says that “WRS tries to cache every resource (JavaScript and CSS) used in the pages it renders.”
It’s important to note that the WRS cache lasts up to 30 days and is not influenced by the HTTP caching rules set by developers.
This caching strategy helps conserve a site’s crawl budget.
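For developers who want to check which caching headers their JavaScript and CSS resources currently send (keeping in mind that, per Google, WRS may cache them for up to 30 days regardless), a small sketch like the following could help. The resource URLs are hypothetical placeholders.

```python
# Illustrative sketch: inspect the HTTP caching headers a resource sends.
# Note: per Google's post, WRS may cache JavaScript and CSS for up to 30 days
# regardless of these headers, so they mainly govern browsers and other caches.
from urllib.request import urlopen

# Placeholder URLs; replace with your own site's JS/CSS files.
RESOURCES = [
    "https://example.com/static/app.js",
    "https://example.com/static/styles.css",
]

for url in RESOURCES:
    with urlopen(url) as response:
        cache_control = response.headers.get("Cache-Control", "(none)")
        expires = response.headers.get("Expires", "(none)")
        print(f"{url}\n  Cache-Control: {cache_control}\n  Expires: {expires}")
```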
Recommendations
The post also offers site owners tips for optimizing their crawl budget: