APO - Crawlers / robots hitting origin server instead of APO cache

Hi,

So after getting APO setup properly, I noticed that all crawlers / robots are hitting my origin server. So basically, any request coming from Googlebot / bingbot is hitting my origin and NOT hitting the Cloudflare APO cache. I would tend to say this is problematic since it means crawlers are seeing a much slower TTFB than they would if they hit the APO cache.

Is this something normal? Is this an issue that could be addressed? In the meantime, Is there any Page Rule I could setup to make crawlers hit the APO cache instead of my origin server?

I’d be curious to know if @eva2000 has noticed this issue before? Maybe he has a fix?

And I’m sure @yevgen will have an insightful answer for me.

Thank you.

It would be great to see some of the request signatures that hit the origin when APO cache is available. It’s hard to understand what could be a cause of this without any data.

Curious, how do you test the requests coming from Googlebot / bingbot?

Before APO existed, if you had cache everything rule and short edge cache ttl then yes search engine crawler and bots would more likely make a cache miss request due to fact CF CDN cache is per data center and short edge cache ttl. Now with APO and long 30 day edge cache ttl unless purged, would make it less likely but would need for a way to look at whether request was served from CF cache or CF kv cache or a cache miss to origin. Remember requests include all requests unless you’re filtering for only content type = HTML and HTTP 200 status. All requests means CF same origin insights beacon, /wp-admin/* requests etc which all wouldn’t be cacheable so result in a cache miss.

For Google at least, look into your Google Search Console/Webmaster tools and inspect URLs in both Google index and live URL and see the headers returned.

example from my Google Search Console inspecting Wordpress blog url’s headers on crawled page

Also just learned that for Google Chrome User Report (CRuX) and Google Search Console data, the crawler will also crawl and take into account robots defined noindex pages as well as logged in pages - so these might be non-cache requests by nature. These pages won’t be indexable but still factor into Google Core Web Vital real world field metrics for Google’s Page Experience signals.

would be nice if cache analytics could add additional filters for user agent, referrer so we can dig further into cache analytics

image

was curious so I dug into my blog’s Cloudflare logs (Enterprise feature) to filter the last 10 entries for today for Googlebot user agent where there was a cache miss (!= hit) to see what type of url requests Googlebot was requesting and it seems there invalid urls resulting in some with /?p 404 or /tag/ requests which actually come from my redirect from an old blog to my new blog so need to sort that out. But I am not using CF APO but my own custom CF Worker caching. So I’m going to add /tag/wordpress/ into my CF Worker’s extended CF cache ttl grouping which I configure for better cache hit rates :slight_smile:

where fields ctf = cache tier fill (Argo tiered cache), cs = CDN Cache Status hit/miss etc and crs = Cache Response Status

ctf-false cs-miss crs-404

and this refers to Bot Management score = 1 non-human and is CF verified bot

1 x Verified Bot
pzcat /home/cfcmm-logs/20201203/*.log.gz | jq -r 'select(.BotScore <=100 and .ClientRequestHost == "blog.centminmod.com" and .CacheCacheStatus != "hit") | "\(.ClientIP) \(.RayID) \(.ClientRequestURI) \(.ClientRequestMethod) \(.ClientRequestReferer) \(.EdgeResponseStatus) \(.OriginResponseStatus) \(.EdgeRequestHost) ctf-\(.CacheTieredFill) cs-\(.CacheCacheStatus) crs-\(.CacheResponseStatus) \(.EdgeColoCode) \(.ClientCountry) \(.ClientIPClass) \(.WorkerStatus)-\(.WorkerSubrequest)-\(.WorkerSubrequestCount) \(.BotScore) x \(.BotScoreSrc) \(.ClientRequestUserAgent) d-\(.ClientDeviceType) edgeip-\(.EdgeServerIP)"' | egrep -i -v 'cdn-cgi|index.rss|xidel|UptimeRobot|HetrixTools|CloudFlare-Prefetch' | grep -w 'Googlebot' | tail -10

66.249.65.20 5fb9b8454194ec9a /tag/wordpress/ GET  200 200 blog.centminmod.com ctf-false cs-miss crs-200 DFW us searchEngine unknown-true-0 1 x Verified Bot Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) d-desktop edgeip-172.69.69.188
66.249.65.20 5fb9b8450ed1ec9a /tag/wordpress/ GET  200 0 blog.centminmod.com ctf-false cs-miss crs-200 DFW us searchEngine ok-false-1 1 x Verified Bot Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) d-desktop edgeip-
66.249.65.20 5fb9c623c9fd0eee /tag/wordpress7/ GET  200 0 blog.centminmod.com ctf-false cs-miss crs-200 DFW us searchEngine ok-false-1 1 x Verified Bot Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) d-desktop edgeip-
66.249.65.20 5fb9c62427bc0eee /tag/wordpress7/ GET  200 200 blog.centminmod.com ctf-false cs-miss crs-200 DFW us searchEngine unknown-true-0 1 x Verified Bot Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) d-desktop edgeip-162.158.75.204
66.249.65.16 5fba42f25df4ec8a /tag/wordpress/feed/ GET  200 0 blog.centminmod.com ctf-false cs-miss crs-200 DFW us searchEngine ok-false-1 1 x Verified Bot Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) d-desktop edgeip-
66.249.65.16 5fba42f28693ec8a /tag/wordpress/feed/ GET  200 200 blog.centminmod.com ctf-false cs-miss crs-200 DFW us searchEngine unknown-true-0 1 x Verified Bot Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) d-desktop edgeip-162.158.75.60
66.249.65.4 5fba6c8b96d82fc9 /tag/wordpress7/feed/ GET  200 200 blog.centminmod.com ctf-false cs-miss crs-200 DFW us searchEngine unknown-true-0 1 x Verified Bot Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) d-desktop edgeip-162.158.75.48
66.249.65.4 5fba6c8b7d7b2fc9 /tag/wordpress7/feed/ GET  200 0 blog.centminmod.com ctf-false cs-miss crs-200 DFW us searchEngine ok-false-1 1 x Verified Bot Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) d-desktop edgeip-
66.249.65.30 5fbac1084f78e027 /?p=110%2Fwordpress-static-file-cdn-hosting-page-speed-amazon-aws-cloudfront GET  404 0 blog.centminmod.com ctf-false cs-miss crs-404 DFW us searchEngine ok-false-1 1 x Verified Bot Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) d-mobile edgeip-
66.249.65.30 5fbac108f494e027 /?p=110%2Fwordpress-static-file-cdn-hosting-page-speed-amazon-aws-cloudfront GET  404 404 blog.centminmod.com ctf-false cs-miss crs-404 DFW us searchEngine unknown-true-0 1 x Verified Bot Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) d-mobile edgeip-172.69.70.195

I am also seeing CF-Cache-Status: HIT and cf-apo-via: cache from the result found in Google Search Console, even after a Live Test.

But when the APO cache has been primed for a few days and when looking at my origin’s access log, 99% of requests to my origin come from crawlers.

Unfortunately, there isn’t any header info in the NGINX logs, so I cannot know the headers for those crawler requests.

Here’s an example of a few log entries:

66.249.73.21 - [03/Dec/2020:07:20:41 -0500] “GET /mur-des-mensonges-donald-trump-manhattan-radio-free-brooklyn/ HTTP/1.1” 200 17309 “-” “Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)” “66.249.73.21” “www.ipnoze.com” sn=“www.ipnoze.com” rt=0.000 ua=“-” us=“-” ut=“-” ul=“-” cs=“HIT”

207.46.13.123 - [03/Dec/2020:07:15:17 -0500] “GET /photos-victoriens-droles/ HTTP/1.1” 200 16843 “-” “Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)” “207.46.13.123” “www.ipnoze.com” sn=“www.ipnoze.com” rt=0.000 ua=“-” us=“-” ut=“-” ul=“-” cs=“HIT”

77.204.144.117 - [03/Dec/2020:07:20:42 -0500] “GET /adultes-coinces-terrains-de-jeux-enfants/ HTTP/1.1” 200 18406 “https://pinterest.com/” “Mozilla/5.0 (Linux; Android 10; COL-L29 Build/HUAWEICOL-L29; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/87.0.4280.86 Mobile Safari/537.36 [Pinterest/Android]” “77.204.144.117” “www.ipnoze.com” sn=“www.ipnoze.com” rt=0.000 ua=“-” us=“-” ut=“-” ul=“-” cs=“HIT”

As you can see, they all seem to be normal 200 GET requests for regular posts.

Most definitely a great idea.

you can setup a custom nginx origin log format with added fields including cf ray/request id and query string and accept header metrics too. As those requests could have query strings too. This is what I use

log_format cf_apo '$remote_addr - $remote_user [$time_local] $request '
               '"$status" $body_bytes_sent "$http_referer" '
               '"$http_user_agent" "$http_x_forwarded_for" "$gzip_ratio"'
               ' "$connection" "$connection_requests" "$request_time" $http_cf_ray $http_cf_request_id '
               '$ssl_protocol $ssl_cipher'
               ' "$http_accept"'
               ' "$args"';
1 Like

Awesome! Will definitely try that right away.

By the way @eva2000, is there a reason why you’re using $remote_addr and NOT $http_cf_connecting_ip?

no particular request, works the same AFAIK

oh if you’re into json, you can also use nginx logging in json format https://community.centminmod.com/threads/how-to-configure-nginx-for-json-based-access-logging.19641/ :slight_smile:

log_format cf_json escape=json '{'
  '"msec": "$msec", ' # request unixtime in seconds with a milliseconds resolution
  '"connection": "$connection", ' # connection serial number
  '"connection_requests": "$connection_requests", ' # number of requests made in connection
  '"pid": "$pid", ' # process pid
  '"request_id": "$request_id", ' # the unique request id
  '"request_length": "$request_length", ' # request length (including headers and body)
  '"remote_addr": "$remote_addr", ' # client IP
  '"remote_user": "$remote_user", ' # client HTTP username
  '"remote_port": "$remote_port", ' # client port
  '"time_local": "$time_local", '
  '"time_iso8601": "$time_iso8601", ' # local time in the ISO 8601 standard format
  '"request": "$request", ' # full path no arguments if the request
  '"request_uri": "$request_uri", ' # full path and arguments if the request
  '"args": "$args", ' # args
  '"status": "$status", ' # response status code
  '"body_bytes_sent": "$body_bytes_sent", ' # the number of body bytes exclude headers sent to a client
  '"bytes_sent": "$bytes_sent", ' # the number of bytes sent to a client
  '"http_referer": "$http_referer", ' # HTTP referer
  '"http_user_agent": "$http_user_agent", ' # user agent
  '"http_x_forwarded_for": "$http_x_forwarded_for", ' # http_x_forwarded_for
  '"http_host": "$http_host", ' # the request Host: header
  '"server_name": "$server_name", ' # the name of the vhost serving the request
  '"request_time": "$request_time", ' # request processing time in seconds with msec resolution
  '"upstream": "$upstream_addr", ' # upstream backend server for proxied requests
  '"upstream_connect_time": "$upstream_connect_time", ' # upstream handshake time incl. TLS
  '"upstream_header_time": "$upstream_header_time", ' # time spent receiving upstream headers
  '"upstream_response_time": "$upstream_response_time", ' # time spend receiving upstream body
  '"upstream_response_length": "$upstream_response_length", ' # upstream response length
  '"upstream_cache_status": "$upstream_cache_status", ' # cache HIT/MISS where applicable
  '"ssl_protocol": "$ssl_protocol", ' # TLS protocol
  '"ssl_cipher": "$ssl_cipher", ' # TLS cipher
  '"scheme": "$scheme", ' # http or https
  '"request_method": "$request_method", ' # request method
  '"server_protocol": "$server_protocol", ' # request protocol, like HTTP/1.1 or HTTP/2.0
  '"pipe": "$pipe", ' # “p” if request was pipelined, “.” otherwise
  '"gzip_ratio": "$gzip_ratio", '
  '"http_cf_ray": "$http_cf_ray", '
  '"http_cf_request_id": "$http_cf_request_id", '
  '"http_accept": "$http_accept", '
  '"http_args": "$args"'
'}';
1 Like

So after adding the logging vars, I did not see any info that could help me figure out why those requests are hitting my origin server.

Here’s an example request:

66.249.73.21 - - [03/Dec/2020:07:53:08 -0500] GET /photos-embarrassantes-enfance/ HTTP/1.1 “200” 19792 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)” “66.249.73.21” “-” “www.ipnoze.com” sn=“www.ipnoze.com” rt=0.000 ua=“-” us=“-” ut=“-” ul=“-” cs=“HIT” “378” “1” 5fbd6dc81336e01f-DFW 06ca42f10e0000e01fb0298000000001 TLSv1.2 ECDHE-RSA-AES256-GCM-SHA384 “text/html,application/xhtml+xml,application/signed-exchange;v=b3,application/xml;q=0.9,/;q=0.8” “-”

is that cf cache status header or a x-cache header for nginx and not cloudflare ? you have another level of caching at nginx origin that is serving that x-cache header

You’re right, that’s my origin server’s header and NOT Cloudflare. Will edit my post. I don’t think there’s any way to get the Cloudflare cf-cache-status header in the NGINX logs.

Yeah you need access to Cloudflare edge server logs which is only available in Cloudflare Enterprise plans. If @yevgen has access to the CF edge server logs, he could dig deeper into requests for a domain I suppose.

1 Like

You did verify for a specific url known to be primed to check against origin nginx logs ? are the URLs repeated or 1 off requests. I see all my cache miss requests in nginx URL for actual Wordpress blog pages (besides tags) are 1 off requests meaning they don’t show up again as first request of a URL not in cache would be a cache miss until repeat visit of URL = cache hit.

From https://support.cloudflare.com/hc/en-us/articles/360049822312 also check if any Wordpress plugins you use are not compatible with APO too.

I know all post URLs are primed on the origin server because I have a system in place that primes all post and AMP URLs after a cache purge. A cache purge / prime is done systematically by a cron job every Friday. I do NOT prime the Cloudflare cache as it is practically impossible to prime all Cloudflare servers from all locations.

They are mostly 1 off requests, but not always. As you can see from the log entries below, some requests can hit my origin server for the same post in a short period of time, which means they are probably bypassing the APO cache, unless they are being served from fresh new locations each time, which I doubt, because this also happens frequently when the cache has been primed for a few days.

109.129.234.7 - [03/Dec/2020:08:21:50 -0500] “GET /fille-garcon-deguises-infirmiere-medecin-reactions/ HTTP/1.1” 200 16676 “https://www.google.be/” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36 Edg/87.0.664.47” “109.129.234.7” “www.ipnoze.com” sn=“www.ipnoze.com” rt=0.000 ua=“-” us=“-” ut=“-” ul=“-” cs=“HIT”

66.249.90.212 - [03/Dec/2020:08:22:14 -0500] “GET /fille-garcon-deguises-infirmiere-medecin-reactions/ HTTP/1.1” 200 16676 “-” “Mediapartners-Google” “66.249.90.212” “www.ipnoze.com” sn=“www.ipnoze.com” rt=0.000 ua=“-” us=“-” ut=“-” ul=“-” cs=“HIT”

66.249.90.212 - [03/Dec/2020:08:22:14 -0500] “GET /fille-garcon-deguises-infirmiere-medecin-reactions/amp/ HTTP/1.1” 200 9490 “-” “Mediapartners-Google” “66.249.90.212” “www.ipnoze.com” sn=“www.ipnoze.com” rt=0.000 ua=“-” us=“-” ut=“-” ul=“-” cs=“HIT”

152.67.137.35 - [03/Dec/2020:08:24:32 -0500] “GET /fille-garcon-deguises-infirmiere-medecin-reactions/ HTTP/1.1” 200 16676 “-” “Mozilla/5.0 (compatible; GrapeshotCrawler/2.0; +http://www.grapeshot.co.uk/crawler.php)” “152.67.137.35” “www.ipnoze.com” sn=“www.ipnoze.com” rt=0.000 ua=“-” us=“-” ut=“-” ul=“-” cs=“HIT”

152.67.137.35 - [03/Dec/2020:08:24:32 -0500] “GET /fille-garcon-deguises-infirmiere-medecin-reactions/ HTTP/1.1” 200 16676 “-” “Mozilla/5.0 (compatible; GrapeshotCrawler/2.0; +http://www.grapeshot.co.uk/crawler.php)” “152.67.137.35” “www.ipnoze.com” sn=“www.ipnoze.com” rt=0.000 ua=“-” us=“-” ut=“-” ul=“-” cs=“HIT”

I don’t have any of the incompatible plugins installed, but thanks for the info.

Are you doing the cache via CF API ? FYI it has a 30 url at a time limit, so if you are feeding >30 urls at a time for CF API for cache purge it may not work unless you break them down to <= urls at a time.

Or you mean purge/prime for nginx origin/web app level cache plugin/system?

I was talking about priming the local cache and NOT Cloudflare / APO. I cannot know for sure that the post was primed on Cloudflare. But what makes me think something isn’t right is the fact that even after the cache has been primed for a few days, the logs keep being populated with entries from cralwers. Maybe I should test for longer periods of time?

You need to log the cf request id on nginx origin for those URLs you think should of be cache hits and then dig into the request id on cache analytics side but the cache analytics web gui can’t filter on cf request id. So maybe @yevgen can do that on his end.

I did some tests on your site and all are coming from KV as cache hits

cf-apo-via kv
cf-cache-status HIT
cf-edge-cache cache,platform=wordpress
cf-ray 5fbdb6fc1a8f1d6b-NRT
cf-request-id 06ca70b18c00001d6b1687e000000001

if you are already logging cf-request-id or cf-ray header on origin nginx side, you can try and see if that request id shows in your logs which is shouldn’t if CF APO caching is working

First requests seem to be coming from KV, second from Cache. What does it mean exactly when they are coming from KV?