Google published an explainer that discusses how Content Delivery Networks (CDNs) influence search crawling, how they can benefit SEO, and how they can sometimes cause problems.
What Is A CDN?
A Content Delivery Network (CDN) is a service that caches a web page and serves it from the data center closest to the browser requesting that page. Caching a web page means the CDN creates a copy of the page and stores it. This speeds up delivery because the page is served from a server that’s closer to the site visitor, requiring fewer “hops” across the Internet from the origin server to the destination (the site visitor’s browser).
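If you’re curious whether a page is already being served from a CDN cache, many providers expose that in response headers. The sketch below is only a rough illustration, not a definitive check: the header names used here (“X-Cache”, “CF-Cache-Status”, “Age”) vary by provider and are assumptions.

```python
# Minimal sketch: inspect response headers that many CDNs attach to indicate
# whether a page was served from an edge cache. Header names differ by
# provider, so treat these as illustrative assumptions.
from urllib.request import urlopen

def describe_cache_status(url: str) -> None:
    with urlopen(url) as response:
        headers = response.headers
        # "Age" greater than 0 usually means the response came from a cache.
        age = headers.get("Age", "0")
        # Provider-specific hit/miss indicators (names are assumptions).
        cache_status = (
            headers.get("X-Cache")
            or headers.get("CF-Cache-Status")
            or "unknown"
        )
        print(f"{url}: cache status={cache_status}, age={age}s")

describe_cache_status("https://example.com/")
```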
CDNs Unlock More Crawling
One of the benefits of using a CDN is that Google automatically increases the crawl rate when it detects that web pages are being served from a CDN. This makes using a CDN attractive to SEOs and publishers who want to increase the number of pages crawled by Googlebot.
Normally, Googlebot reduces the amount of crawling from a server when it detects that crawling is reaching a threshold that causes the server to slow down, a behavior called throttling. That threshold is higher when a CDN is detected, resulting in more pages crawled.
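Google doesn’t publish the exact thresholds it uses, so the sketch below is only a conceptual illustration of throttling, not Google’s actual logic: a crawler that lowers its request rate when response times cross an assumed threshold, and recovers when the server looks healthy again.

```python
# Illustrative only: a crawler that backs off when the server responds slowly.
# The threshold and delay values are arbitrary assumptions for the example.
import time
import urllib.request

def crawl(urls, slow_threshold_s=1.0, base_delay_s=0.5, throttled_delay_s=5.0):
    delay = base_delay_s
    for url in urls:
        start = time.monotonic()
        with urllib.request.urlopen(url) as response:
            response.read()
        elapsed = time.monotonic() - start
        # Back off if the server looks slow; recover when it looks healthy.
        delay = throttled_delay_s if elapsed > slow_threshold_s else base_delay_s
        print(f"fetched {url} in {elapsed:.2f}s, next request in {delay}s")
        time.sleep(delay)

crawl(["https://example.com/", "https://example.com/"])
```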
Something to understand about serving pages from a CDN is that the first time a page is requested it must be served directly from your origin server. Google uses the example of a site with over a million web pages:
“However, on the first access of a URL the CDN’s cache is “cold”, meaning that since no one has requested that URL yet, its contents weren’t cached by the CDN yet, so your origin server will still need [to] serve that URL at least once to “warm up” the CDN’s cache. This is very similar to how HTTP caching works, too.
In short, even if your webshop is backed by a CDN, your server will need to serve those 1,000,007 URLs at least once. Only after that initial serve can your CDN help you with its caches. That’s a significant burden on your “crawl budget” and the crawl rate will likely be high for a few days; keep that in mind if you’re planning to launch many URLs at once.”
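The “cold cache” behavior Google describes can be pictured with a toy model: the first request for each URL has to be answered by the origin server, and only subsequent requests can be served from the edge. Everything in the sketch below (the in-memory dictionary standing in for an edge cache, the example URLs) is an assumption for illustration; real CDNs also honor TTLs and cache-control headers.

```python
# Toy model of a cold vs. warm CDN cache.
edge_cache: dict[str, str] = {}

def fetch_from_origin(url: str) -> str:
    print(f"cache MISS, origin serves {url}")
    return f"<html>content of {url}</html>"

def serve(url: str) -> str:
    if url not in edge_cache:          # cold: the origin must serve it once
        edge_cache[url] = fetch_from_origin(url)
    else:
        print(f"cache HIT for {url}")  # warm: the edge answers without the origin
    return edge_cache[url]

serve("https://shop.example/product/1")  # first request hits the origin
serve("https://shop.example/product/1")  # second request is served from cache
```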
When Using CDNs Backfires For Crawling
Google advises that there are times when a CDN may put Googlebot on a blacklist and subsequently block crawling. Google describes two kinds of blocks:
1. Hard blocks
2. Soft blocks
Hard blocks happen when a CDN responds with a server error. A 500 (internal server error) signals a major problem with the server; a 502 (bad gateway) is another bad response. Both will trigger Googlebot to slow down the crawl rate. Indexed URLs are retained internally at Google for a time, but continued 500/502 responses can cause Google to eventually drop those URLs from the search index.
The preferred response is a 503 (service unavailable), which indicates a temporary error.
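For publishers who control their origin server, the practical takeaway is to signal a temporary outage with a 503 and a Retry-After header rather than a 500 or 502. Below is a minimal sketch, assuming a Python origin with a simple maintenance flag; the flag and retry interval are illustrative assumptions, not a recommendation from Google’s explainer.

```python
# Minimal sketch: an origin signalling a *temporary* outage with 503 plus a
# Retry-After header, rather than a 500/502, so crawlers know to come back.
from http.server import BaseHTTPRequestHandler, HTTPServer

MAINTENANCE_MODE = True  # e.g. toggled during a deploy (assumption)

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if MAINTENANCE_MODE:
            self.send_response(503)                  # temporary, not a hard error
            self.send_header("Retry-After", "3600")  # suggest retrying in an hour
            self.end_headers()
            self.wfile.write(b"Temporarily unavailable")
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<html>OK</html>")

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()
```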