APO - Crawlers / robots hitting origin server instead of APO cache

I am testing the same URL on your site from different geographic regions. A given region’s datacenter might not have the page in its local cache yet, so the response comes from KV.

Yes, subsequent requests come via cache, not KV:

age: 488
cf-apo-via: cache
cf-cache-status: HIT
cf-edge-cache: cache,platform=wordpress
cf-ray: 5fbdc2edbbf5a516-NRT
cf-request-id: 06ca7828930000a516a18e4000000001

And what does it mean when it results in KV?

I mean there is a possibility that even after a few days, not all Cloudflare datacenters have been primed. But as I mentioned, even after a few days, I continue to see many crawler entries in my origin server’s log and it doesn’t seem to slow down.

What I’ll do is I won’t purge the cache for a month and I will keep an eye on repeat URLs in the access log.

CF APO’s advantage over the normal CF Worker/CF CDN cache is that on a cache miss in one CF datacenter, APO can consult the KV cache to see whether the asset has already been cached through another CF region’s datacenter. If it has, the CF datacenter can serve the KV cache entry instead of going back to your origin server for it. So cf-apo-via KV is just that.
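
A minimal sketch of how you could tell these cases apart when logging test responses. It assumes only the two cf-apo-via values that show up in this thread (cache and KV); the function name and the "origin-or-unknown" label are made up for illustration, and other header values may exist that this does not handle.

```python
def classify_apo_response(headers):
    """Label a response by where APO appears to have served it from.

    `headers` is a plain dict of response header names to values.
    Only the values seen in this thread are recognised; anything else
    falls through to "origin-or-unknown" (a hypothetical label).
    """
    # Header names are case-insensitive, so normalise them first.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    via = normalized.get("cf-apo-via", "").lower()
    status = normalized.get("cf-cache-status", "").upper()

    if via == "kv":
        return "kv"
    if via == "cache" or status == "HIT":
        return "edge-cache"
    return "origin-or-unknown"
```

Running this over responses collected from several regions would show how often a URL is answered from the local cache versus KV.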

Oh so this is almost like the Argo feature where datacenters can talk to each other to get a cached copy? I didn’t know that had been implemented for APO. That is a great feature.

But then again, that wouldn’t explain the requests to my origin server, since KV shouldn’t reach my origin.

Yeah it’s mentioned at https://blog.cloudflare.com/building-automatic-platform-optimization-for-wordpress-using-cloudflare-workers/

With the Automatic Platform Optimization release we wanted to improve loading time for cache cold start from any location in the world. We explored different approaches and decided to use Workers KV to improve Edge Caching.

In addition to Cloudflare’s CDN cache we put the content into Workers KV. It only requires a single request to the page to cache it and within a minute it is made available to be read back from KV from any Cloudflare data center.

I guess one reason CF APO might talk to your origin is after a cache purge and invalidation of the KV cache:

Updating content

After an update has been made to the WordPress website the plugin makes a request to Cloudflare’s API which both purges cache and marks content as stale in KV. The next request for the asset will trigger revalidation of the content. If the plugin is not enabled cache revalidation logic is triggered as detailed previously.

We serve the stale copy of the content still present in KV and asynchronously fetch new content from the origin, apply possible optimizations and then cache it (both regular local CDN cache and globally in KV).

So you might need to dig in and see when the CF WordPress plugin is doing cache purges. If you’re purging your local WordPress cache plugin, it may be triggering a CF WordPress plugin purge too?

Yes, when I purge the local cache, I also purge the Cloudflare cache. But as I said previously, even after many days of having a primed cache, I continue seeing many crawler entries in the NGINX access log, and continue seeing different requests for the same URLs still hitting my origin server, which is the reason why I still suspect some crawler requests are hitting my origin when they shouldn’t.

You might need to dig into your extended nginx origin logs, pick a few specific URLs to track over a longer period of time, and chart/map their request timestamps against when you run your cache purging/cache priming task, i.e. log the local server timestamp of the script/routine whenever it does a local/CF cache purge.
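
Here is a rough sketch of that kind of tracking. It assumes the access log layout quoted elsewhere in this thread (client IP, " - ", a bracketed timestamp, then the quoted request line) and that you keep your own list of purge timestamps; the function names are hypothetical and the regex will need adjusting for a different log_format.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed layout: IP - [timestamp] "METHOD /path HTTP/x.y" ...
# Both straight and curly opening quotes are accepted, since pasted logs vary.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) - \[(?P<ts>[^\]]+)\] ["“](?P<method>[A-Z]+) (?P<path>\S+) '
)
TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_line(line):
    """Return (ip, timestamp, path) for a log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    timestamp = datetime.strptime(match.group("ts"), TS_FORMAT)
    return match.group("ip"), timestamp, match.group("path")


def hours_since_last_purge(lines, purge_times):
    """Map each path to the hours elapsed between the latest purge and each origin hit.

    Hits that happened before the first recorded purge are skipped.
    """
    purges = sorted(purge_times)
    result = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        _, timestamp, path = parsed
        earlier = [purge for purge in purges if purge <= timestamp]
        if not earlier:
            continue
        elapsed = (timestamp - earlier[-1]).total_seconds() / 3600
        result[path].append(round(elapsed, 2))
    return dict(result)
```

A URL whose list keeps growing with large values long after the last purge is a candidate for requests that are not being answered from cache.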

Also, when you’re priming the cache, you could later inspect each primed URL’s cache age header at set intervals to see how old the cached copy is relative to the time you last purged the CF cache. Edit: actually, you can do that to track the URL’s cache age over time too. It would be per CF datacenter region, but you could use webpagetest.org’s API to script tests of specific URLs, log the JSON or XML result output, and inspect the response headers that way. That way you can leverage webpagetest.org’s API to test from several geographical regions.

It’s quite useful, as that is how I prime the CF cache in multiple geographic regions: via the webpagetest.org API.
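
As a small illustration of the age comparison: the Age response header is a number of seconds, so subtracting it from the time you observed the response gives roughly when that copy was cached. The helper names and the timestamps below are hypothetical; you would plug in your own observation time and recorded purge time.

```python
from datetime import datetime, timedelta, timezone


def cached_at(observed_at, age_seconds):
    """Estimate when a cached copy was stored, from the Age header value."""
    return observed_at - timedelta(seconds=int(age_seconds))


def cached_before_purge(observed_at, age_seconds, last_purge_at):
    """True if the copy looks older than the last purge, which would be unexpected."""
    return cached_at(observed_at, age_seconds) < last_purge_at


# Hypothetical observation: an "age 488" header, as in this thread, seen at noon UTC.
observed = datetime(2020, 12, 3, 12, 0, 0, tzinfo=timezone.utc)
print(cached_at(observed, 488))  # 2020-12-03 11:51:52+00:00
```

Since each CF datacenter keeps its own cache, the estimate only describes the datacenter that answered that particular test request.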

WPT API docs https://github.com/WPO-Foundation/webpagetest-docs/blob/master/dev/api.md

That’s exactly what I will do. I will stop the purge / prime cronjob from running every Friday and will make sure the cache stays primed for a few weeks. Then I will look at the top URLs in the access log and see if they eventually stop being accessed, or if they continuously get accessed even after a long time.

It’s still possible to bypass the cache with a cache-control: no-cache request header. I’m thinking that we could somehow ignore it for bots; I will look into the code.

Ah, that’s the obvious one we overlooked: whether or not crawlers are sending their requests with no-cache.
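
If you can get the request’s Cache-Control value into your origin log (for example, nginx exposes request headers as $http_ variables such as $http_cache_control; whether the header reaches the origin unchanged is an assumption here), a check like this sketch could flag the crawler requests in question. The function name is hypothetical.

```python
def has_no_cache(cache_control_value):
    """Return True if a Cache-Control header value contains the no-cache directive."""
    if not cache_control_value:
        return False
    # Directives are comma-separated and case-insensitive; some carry "=value".
    names = {
        directive.split("=", 1)[0].strip().lower()
        for directive in cache_control_value.split(",")
        if directive.strip()
    }
    return "no-cache" in names
```

Counting how many crawler hits on the origin carry the directive would confirm or rule out this explanation.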

So (in theory) could an attacker simply put that in the header, bypass CF and overwhelm the server?

Yup, they can. The same goes for inserting known cache-bypassing cookies into requests. That’s what other CF tools are for, i.e. CF Rate Limiting, CF Firewall Rules/WAF and, if you are on the Enterprise plan, Bot Management, which is huge in helping protect against malicious requests.

Yea I use those, unfortunately Bot Management is not in my pricing tier.

As per the earlier request: the “Cache Everything” page rule works nicely with APO. With that rule in place, APO will ignore the request cache-control header.

Which means if crawlers are sending a cache-control: no-cache request header, APO should ignore it with the Cache Everything Page Rule?

I found something rather interesting tonight while looking at my origin server’s access log.

Have a look:

54.236.1.13 - [03/Dec/2020:15:00:13 -0500] "GET /tache-de-rousseur-poignet-femmes/amp/ HTTP/1.1" 200 10673 "-" "Mozilla/5.0 (compatible; Pinterestbot/1.0; +http://www.pinterest.com/bot.html)" "54.236.1.13" "www.ipnoze.com" sn="www.ipnoze.com" rt=0.000 ua="-" us="-" ut="-" ul="-" cs="HIT"

54.236.1.13 - [03/Dec/2020:20:06:09 -0500] "GET /tache-de-rousseur-poignet-femmes/amp/ HTTP/1.1" 200 10673 "-" "Mozilla/5.0 (compatible; Pinterestbot/1.0; +http://www.pinterest.com/bot.html)" "54.236.1.13" "www.ipnoze.com" sn="www.ipnoze.com" rt=0.000 ua="-" us="-" ut="-" ul="-" cs="HIT"

These are two requests from the Pinterest crawler for the same URL, one at 15:00 (3:00 PM) and the other at 20:06 (8:06 PM), from the same IP address (so I would guess the same location), and both hit my origin server. This means both requests somehow bypassed APO and reached my origin, even though the cache should already have been primed by the first request.

But then again, unless we can have access to the CF headers in the origin server’s access log, there’s still not much we can do. But I believe it is proof that some crawler requests are somehow bypassing APO.

After doing more testing and looking at my origin server’s access log, I believe the only reason why some URLs are accessed more than once is because APO accesses the URLs on my origin server from different locations. I noticed that the only posts that seem to be requested many times during the same day from my origin server are only the ones that have been published on the same day. Since those URLs don’t have cache entries at first, I think APO / Cloudflare needs to prime the different cache servers around the world by accessing the same URLs on my origin server.

But this brings me to wonder if the KV Cache feature is working? Because after reading the info that @eva2000 posted earlier in this thread about KV Cache, I thought once a Cloudflare cache server was primed, the others would prime their cache from the KV Cache, preventing unnecessary trips to the origin. I wonder what @yevgen has to say about this.

After doing more testing and looking at my origin server’s access log, I believe the only reason why some URLs are accessed more than once is because APO accesses the URLs on my origin server from different locations.

That’s probably it. The primary goal of APO is to serve cached HTML to the eyeballs as much as possible. Cloudflare’s caching system is the main mechanism for that. Unfortunately, there are no guarantees on how long the content stays in the Cache; it could be evicted for a number of reasons:

  • site plan type, how often the content is requested, available disk space, etc

KV storage is used as a backup system to serve content to eyeballs when the Cache in a particular data center doesn’t contain a specific page. It seems to be a good fit, as we see around 30% of all requests served from KV storage. There are hidden bandwidth costs of serving content globally from KV instances, so KV storage for APO is an ongoing experiment as we are still learning the costs of operating it.
We made a design decision to make Cache and KV independent from each other. Cache and KV are each populated with content directly from the origin. Decreasing the load on the origin servers was never a design goal of APO; it’s a nice bonus we got out of the system by being smart about when to call origin servers.

So far the system works pretty well and we don’t have plans to make any drastic changes to how Cache and KV are populated.

Thank you @yevgen for this detailed answer.
