We’ve recently started caching static HTML pages on our site. We’ve done this by ensuring every such page is served with a Cache-Control response header of public, max-age=14400, and we then added a Page Rule setting the Cache Level to Cache Everything.
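For reference, this is roughly how the header gets attached. Our real app isn’t Flask; this is just a minimal illustrative sketch of the behaviour:

```python
# Minimal sketch only - our actual app isn't Flask; the point is simply
# that every static HTML response carries this Cache-Control header.
from flask import Flask

app = Flask(__name__)

@app.after_request
def add_cache_headers(response):
    # Mark HTML responses as publicly cacheable for 4 hours (14400 s),
    # which is what the Cloudflare "Cache Everything" Page Rule relies on.
    if response.mimetype == "text/html":
        response.headers["Cache-Control"] = "public, max-age=14400"
    return response
```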
When testing with a normal browser, this works fine: we can see the Cloudflare cache being populated, and subsequent requests getting hits once the page is in the cache.
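This is the kind of spot-check we’ve been running (a sketch; https://example.com/ stands in for one of our actual pages). Cloudflare reports cache behaviour in the cf-cache-status response header, MISS on the first request and HIT once the page is cached:

```python
# Spot-check sketch; https://example.com/ is a placeholder for one of
# our cached HTML pages.
import requests

resp = requests.get("https://example.com/", timeout=10)

# cf-cache-status is set by Cloudflare: MISS on first fetch, HIT afterwards.
print(resp.status_code, resp.headers.get("cf-cache-status"))
print(resp.headers.get("cache-control"))
```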
However, our server-side logs still show a large number of requests that are not being served from the cache and are reaching the origin (e.g., 10,000+ requests in 24 hours). Checking these requests in the logs doesn’t show anything special: they have no querystring and nothing else looks different, although I don’t have access to the request headers to see whether anything is unusual there.
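For what it’s worth, this is the sort of tally we ran over the origin access logs to rule out querystring variants (a sketch; the log path and the combined-log-format assumption are placeholders for our actual setup):

```python
# Rough sketch of the origin-log tally; "access.log" and the combined
# log format are assumptions standing in for our real logging setup.
import re
from collections import Counter

# Matches the quoted request field of a combined-format line,
# e.g. "GET /path?qs HTTP/1.1", capturing the URL.
request_re = re.compile(r'"[A-Z]+ (\S+) HTTP/[^"]+"')

counts = Counter()
with open("access.log") as f:
    for line in f:
        m = request_re.search(line)
        if m:
            counts[m.group(1)] += 1

# The requests reaching the origin are overwhelmingly the bare path,
# with no querystring variants that would explain separate cache keys.
for url, n in counts.most_common(20):
    print(n, url)
```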
Checking the Cloudflare logs, I can see that this page has a 93% ‘bypass’ rate and only a 5% hit rate. That doesn’t match what we see for static resources (e.g., CSS or JS files), which are close to 100% hit. Digging into the Cloudflare data a bit further, a large share of the requests come from bots (specifically BingBot, which accounts for over 60% of traffic across the whole site).
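The breakdown above came from something like the following over our exported Cloudflare logs (a sketch, assuming Logpush NDJSON with the CacheCacheStatus and ClientRequestUserAgent fields; the filename is a placeholder):

```python
# Aggregation sketch behind the bypass/hit percentages, assuming Cloudflare
# Logpush NDJSON records with CacheCacheStatus and ClientRequestUserAgent
# fields; "cf_logs.ndjson" is a placeholder filename.
import json
from collections import Counter

status_by_agent = Counter()
with open("cf_logs.ndjson") as f:
    for line in f:
        rec = json.loads(line)
        agent = ("bingbot"
                 if "bingbot" in rec.get("ClientRequestUserAgent", "").lower()
                 else "other")
        status_by_agent[(agent, rec.get("CacheCacheStatus", "unknown"))] += 1

# Print each (agent, cache status) bucket with its share of total traffic.
total = sum(status_by_agent.values())
for (agent, status), n in status_by_agent.most_common():
    print(f"{agent:8s} {status:10s} {n:8d} ({100 * n / total:.1f}%)")
```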
My question is this: is it possible for a bot to bypass the cache for these static HTML pages, and if so, what can be done to prevent it?