How to create a custom cache key for bot traffic?

We have a Cache Everything page rule to greatly reduce the load on our origin servers. Since we might serve slightly different content for the same page based on country, we have also enabled the Custom Cache Key setting and activated the “geo” user feature to shard the cache by country.

We want some pages to be completely hidden from users in the US, so we’ve implemented 302 redirects on our PHP server when the CF-IPCountry header matches US.
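The origin-side check might look something like this (a sketch, not the author’s actual code; `/not-available` is an illustrative placeholder target):

```php
<?php
// CF-IPCountry is added by Cloudflare when IP Geolocation is enabled.
// Returns the redirect target for blocked countries, or null to serve normally.
function redirectTargetFor(?string $country): ?string
{
    return $country === 'US' ? '/not-available' : null;
}

$target = redirectTargetFor($_SERVER['HTTP_CF_IPCOUNTRY'] ?? null);
if ($target !== null) {
    header('Location: ' . $target, true, 302); // temporary redirect
    exit;
}
```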

The problem is that Googlebot mainly uses US-based IP addresses, which means that it gets the 302 redirects. A few days into this, our pages were removed from the Google Search index. We don’t want that.


How could we bypass the cache just for Googlebot and serve it the original content, rather than the 302 redirect?

Since Googlebot uses its own user agent string, we could further shard the cache by the User-Agent header. Then, on our server, we would check for the presence of Googlebot in that header and return the original content. My main concern with this approach is that the cache will get too fragmented, greatly increasing the load on our server, since users have a wide range of browsers and devices with different user agents. The docs confirm this, warning that the User-Agent header has “high cardinality” and risks sharding the cache.

Is there a way to have a custom cache key just for bot and non-bot traffic? Or is there any other approach to this?

I’d probably look at using Workers, where you can write much more complex logic than what page rules allow.
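A minimal sketch of what such a Worker could do (the User-Agent regex, header name, and `cf` options are assumptions for illustration, not a tested configuration; a production check should verify crawler IPs rather than trust the User-Agent alone):

```javascript
// Naive known-bot check on the User-Agent string — an assumption for this
// sketch; Googlebot should really be verified via reverse DNS.
function isKnownBot(userAgent) {
  return /Googlebot|bingbot/i.test(userAgent || "");
}

const worker = {
  async fetch(request) {
    if (isKnownBot(request.headers.get("User-Agent"))) {
      // Tag the request so the origin can skip the US geo redirect...
      const tagged = new Request(request);
      tagged.headers.set("X-Known-Bot", "true");
      // ...and ask Cloudflare not to serve the (possibly redirecting)
      // cached copy. cacheTtl: 0 is assumed to disable caching here.
      return fetch(tagged, { cf: { cacheTtl: 0 } });
    }
    return fetch(request); // everyone else gets the normal cached path
  },
};
// export default worker;  // module-syntax entry point when deployed
```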


Since you are able to check request headers on your server, you could create a transform rule that adds a new header, and test for it there. Matching the incoming request on “SSL/not SSL” means every request gets the new header. This would flag all known bots, and you could then decide what to do depending on the user agent.
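Such a rule could look roughly like this (a hedged sketch: `cf.client.bot` is Cloudflare’s “known bot” field and `to_string()` is a Rules-language function, but the exact rule layout and header name here are illustrative):

```
Rule: tag known bots
  When:  ssl or not ssl                       (matches every request)
  Then:  Set dynamic request header
         x-known-bot = to_string(cf.client.bot)
```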


I had no idea Cloudflare had so many fields that can be used for custom headers. I ended up creating a transform rule:

…then I changed my PHP logic:

// value set by the Cloudflare transform rule
if (($_SERVER['HTTP_X_KNOWN_BOT'] ?? null) === 'true') {
    // do whatever
}

…and added the new x-known-bot header in the Custom Cache Key setting of my Cache Everything rule:



I’ll let this run for a while to see if our search results change and post the outcome here.


We just checked and our page is back in the Google index. It seems to work!

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.