APO and cache-control request header issues with crawlers

I have a WordPress site with APO, Tiered Caching, and Cache Reserve enabled. I’m getting very high cache-hit rates now, but I’m still seeing my origin server get slammed any time a bot starts crawling my site. After spending a lot of time debugging this, I’m pretty sure the issue is that APO does a cache bypass whenever the cache-control: no-cache request header is present.
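For anyone who wants to check this on their own site, here is a minimal sketch of the comparison I mean, assuming the `cf-cache-status` response header is a good enough signal of whether Cloudflare served the page from cache. The function names and the User-Agent string are my own placeholders, not anything Cloudflare provides.

```python
import urllib.request


def build_request(url: str, send_no_cache: bool) -> urllib.request.Request:
    """Build a GET request, optionally adding the Cache-Control: no-cache request header."""
    # Placeholder User-Agent; swap in whatever client you want to imitate.
    headers = {"User-Agent": "cache-bypass-check/0.1"}
    if send_no_cache:
        headers["Cache-Control"] = "no-cache"
    return urllib.request.Request(url, headers=headers)


def cache_status(url: str, send_no_cache: bool) -> str:
    """Fetch the URL and return Cloudflare's cf-cache-status response header."""
    request = build_request(url, send_no_cache)
    with urllib.request.urlopen(request, timeout=10) as response:
        # Returns a marker string if the header is absent (e.g. the site is not proxied).
        return response.headers.get("cf-cache-status", "(missing)")
```

Calling `cache_status(url, False)` and then `cache_status(url, True)` against a page that APO normally caches should show whether the second request is treated differently. I'm not quoting expected values here because they depend on the zone's configuration.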

According to some discussions such as this and this, that is apparently the intended behavior for APO.

This means that any time a web crawler like Bingbot or Pinterestbot starts crawling my site, 100% of its requests bypass the Cloudflare cache and hit my origin server.

It seems that the only method to prevent this is to create a Cache-Everything page rule on my entire site, but that presents a host of other problems (dynamic content issues) and basically makes the features of APO worthless.

Is there no other way to ignore the cache-control: no-cache request header than with a Cache-Everything page rule? I tried creating a request header transform rule, but I got an error that the cache-control request header cannot be modified or removed by a transform rule.

It honestly seems to me like a bug that APO is letting the client bypass the cache in this way. The Cloudflare cache behavior documentation here states that the cache should be bypassed if the origin server sends a cache-control: no-cache response header. It doesn’t say anything about request headers from the client being able to trigger a cache bypass.

@yevgen Can you weigh in on this? I see you commented on a similar issue here, and stated you would look into ignoring cache-control: no-cache when it comes from bots. But as far as I can tell that idea was never implemented.

Later in that thread you mention using Cache Everything rules, but despite the claim that they work well with APO, I’m finding that they break things for logged-in users on my WordPress site, just like in this thread.

I made another post suggesting that Cloudflare allow us to ignore cache-control: no-cache when it comes from bots via a request header transform rule, but I can’t seem to get a response from any Cloudflare employee on that thread either.

Please consider this reply an upvote for the issues described in this post.

I prefer using APO on our sites because it guarantees better integration with WordPress. However, other technology leaders in our company are not comfortable with the ease with which you can bypass APO and view it as an attack vector.

Regardless of its implementation, we must have the option to ignore a client’s request to bypass the cache when using APO, making APO consistent with the rest of Cloudflare’s caching behavior.

Thanks @paul.stengel

I have a proposed solution using request header transform rules over here. Please upvote that as well if you think that is a good proposal.

If my proposed solution doesn’t happen, perhaps Cloudflare Snippets will be a solution, once that is available.
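To make the idea concrete, here is a rough sketch of the logic I have in mind, written in plain Python only to illustrate the decision. It is not Snippets or transform rule syntax, the bot list and function name are hypothetical, and I don't know whether rewriting the header at that stage would actually change what APO does.

```python
# Hypothetical list of crawler tokens to match against the User-Agent header.
BOT_TOKENS = ("bingbot", "pinterestbot")


def strip_no_cache_for_bots(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of the request headers with no-cache removed for known bots."""
    # Header names are case-insensitive, so normalize them to lowercase first.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "").lower()
    if not any(token in user_agent for token in BOT_TOKENS):
        # Not a listed bot: leave the request headers untouched.
        return normalized
    directives = [d.strip() for d in normalized.get("cache-control", "").split(",")]
    # Keep every directive except no-cache (for example max-age=0 stays).
    kept = [d for d in directives if d and d.lower() != "no-cache"]
    if kept:
        normalized["cache-control"] = ", ".join(kept)
    else:
        normalized.pop("cache-control", None)
    return normalized
```

Regular browsers would keep their ability to force a refresh, and only requests whose User-Agent matches the list would lose the no-cache directive.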

Hi @mark.r.baird, thanks for reporting this. I think this is a very valid suggestion. We will assess this logic to see what our options are. At first glance, it seems reasonable that we could remove this logic or provide a way for users to disable it.