Hey there, just wanted to share this information for reference and to hear the “public opinion” on it.
We discovered that the WAF “Known Bots” rule was not triggering for many Google IPs. Our setup allows “Known Bots” in one WAF rule and, in a later WAF rule, sends all traffic from certain geo regions (including the US) to the CF Managed Challenge.
We expected the “Known Bots” list to be complete for all major search engines, Google in particular. Sadly, this is not the case: several legitimate Google requests ended up in the CF Managed Challenge.
After some lengthy discussions with CF support, I got the feedback that there are “missing IPs” in the “Known Bots” rule.
We were quite irritated, as several of those missing IPs have been publicly known Google crawler IPs for years…
Because of this, several of our domains dropped out of the Google index after the crawler reported “403”.
From our current perspective, “Known Bots” is not usable, as it doesn’t do what it promises to do.
For the time being we are adding validated IPs manually to a new WAF rule. Still, this is highly irritating and unexpected.
For now, this is a WAF filter to patch the missing “Known Bots” data:
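(As an illustration, here is a trimmed version with two of the ranges Google publishes for its crawler; check Google’s googlebot.json for the current list and extend the rule with all the IPs you have validated yourself.)

```
(ip.src in {66.249.64.0/27 2001:4860:4801:10::/64})
```

Set the action to “Allow” and place the rule before any challenge or block rules.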
Hey,
I went ahead and escalated this issue. Do you have any Ray IDs or logs showing that the crawlers were being blocked despite a rule allowing known bots? Do you have a ticket number?
Thanks for the escalation. Check out our ticket #2649240. Ray IDs and WAF logs are all in there.
It’s been open for a month now. It took some time to convince someone that this is actually a real issue…
To spice things up, it’s not just Google IPs missing from “Known Bots”: the WAF also logged Google requests (we are 100% sure they are from Google) coming from a CF Worker IP (2a06:98c0:3600::103). So some things around the WAF and Known Bots are obviously broken.
Regarding the CF Worker IP: we were able to link those WAF events with 99.9% certainty to legitimate Google requests. They matched in the WAF and in Google Search Console to the second, so I am very certain these calls were not spoofed. In addition, the WAF displays Google’s ASN for them. So the WAF info doesn’t add up.
You can match them against the WAF rule I posted above.
And using the Wayback Machine / Internet Archive it is easy to confirm that these IPs have been on that list for years! So this is not about “new” IPs that CF simply hasn’t had time to add yet…
Hi @JanTh, was there any update on this? It’s interesting that I came across this, as I suspected something similar mid-to-late last year but never considered this angle. I would like to know if there is any update or an official comment from the Cloudflare team. Do we need to manually add the IPs from Google’s JSON to a WAF allow list?
There is a WAF rule in my first post that you can copy and paste. You can add it on the WAF rule screen via the “Expression Preview” text field at the bottom. You have to enable the code view with the text link above that field (“Edit expression”) before you can paste into it.
Use “Allow” as the action and make sure the rule is in position #1.
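Put together (shown here with the sample ranges from my first post; swap in your full validated list), the rule looks like this in the UI:

```
Action:     Allow
Expression: (ip.src in {66.249.64.0/27 2001:4860:4801:10::/64})
Position:   #1 (above the geo challenge rule)
```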
It’s your choice. Just read the text below the dropdown and decide what you want. We only ever use “Allow”.
Is it a good idea to add this?
Again, up to you, but I wouldn’t do this, as you don’t know what Google Cloud traffic might be routed through that ASN. Do not assume it’s just legitimate crawlers. The CF security features are there for good reasons, and there are only very rare use cases where it is “smart” or “good” to disable them the way you did in your rule.
Just set the fake Googlebot rule to log only; you can do this for Yandex, Bing, Baidu, etc. as well (see the sketch below). Do not give the Google ASN a bypass of the whole WAF. Also, if this were a major issue, others would have been deindexed as well and many more complaints would have been logged. I suspect this blocking isn’t the issue; just think about it.
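A sketch of such a rule, assuming the usual pattern of matching a crawler user agent that Cloudflare has not verified as a known bot (cf.client.bot is the field behind “Known Bots”):

```
(http.user_agent contains "Googlebot" and not cf.client.bot)
```

Use “Log” as the action if your plan offers it; the same pattern works with “YandexBot”, “bingbot”, “Baiduspider”, and so on.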
Check the Google Search Console indexing report: how many URLs are reported as 403? This is clearly stated there. And what do the Googlebot logs under Settings → Crawl stats show? You did not post these, so there are a few things left to check.
If you post these and Googlebot is crawling with HTTP status 200, then sadly you have been spanked by Google with an algorithm update. With only several Googlebot IP addresses out of thousands being blocked, I would be surprised if the blocking were the cause.
Keen to see the indexing report’s 403s and the crawl stats. Curious to see them.
The issue only becomes apparent if you use a geo WAF rule like “challenge all traffic from outside the EU with a Managed Challenge” and combine it with an “Allow” for “Known Bots”.
I believe most people are not doing this, and thus the IPs missing from “Known Bots” are not that obvious. See the sketch below for the setup I mean.
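For reference, the combination in question looks roughly like this (country list shortened to an example; cf.client.bot is the field behind the “Known Bots” checkbox):

```
Rule #1, action Allow:
  (cf.client.bot)

Rule #2, action Managed Challenge:
  (not ip.geoip.country in {"DE" "AT" "CH"})
```

Since “Allow” stops evaluation of the later rules, every request Cloudflare recognises as a known bot skips the challenge, while anything it fails to recognise, including the missing Google crawler IPs, falls through to rule #2.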
But let me share my data with you; maybe it can convince you.
The screenshot shows a case in which a Google crawler request is reported with a Cloudflare Worker IPv6 address and blocked.
There is no reason to believe this has anything to do with spoofing or any Google algorithm updates. We are very aware of these possibilities and have ruled them out.
Let me attach one recent example. It is backed by a matching 403 in the Google Search Console. You can see a legitimate Google crawler IP that should be allowed per the “Known Bots” rule:
You are also dead wrong about challenging traffic with a setup like yours; I have many similar rules, and many people here do, since Bot Fight Mode is useless, dude. Again, if this were a real issue we would all be getting deindexed, hence I doubt Cloudflare is solely to blame for your current problem. I also cannot see legitimate crawlers being blocked in my logs, and my site gets over 20 million crawls per 90 days.
Post your Google Search Console logs; there is no point dismissing this until you do some actual investigation on the Google side. Even if some Google IPs have issues, it takes Google ages to deindex a site unless there was a manual penalty or an algorithm update. If you follow SEO you know Google is adjusting things like a mad scientist every few days these days, making life miserable for most white-hat websites.
The Google Search Console page indexing section should show all your pages as 403, and the Googlebot crawl stats section should reflect this.
CF confirmed to us that Google crawler IPs are missing from their “Known Bots”. So the issue is real.
Only a small(?) subset of Google crawler IPs is missing; at least those are the ones I added to the WAF rule in my first post.
From our data points it looks like Google is doing some kind of IP sharding for its crawler. We can see that particular domains have this issue while others don’t, and the sites that have the issue keep being crawled by the same Google IPs, while the sites that work fine with “Known Bots” are consistently crawled by other IPs. So the IPs are not used randomly, round-robin or anything like that.
Only a small set of our clients is affected.
We are a SaaS company and run several hundred sites/domains through our CF account, so I am certain we have a relevant number of data points.
When the Google crawler keeps getting 4xx errors for a given URL, that URL will most probably be delisted at some point. What else should the crawler do when it can’t access the content anymore? (As discussed here, for example: What does Google do with indexed pages returning 403? - Webmasters Stack Exchange.)
This is also backed by the 403 errors in the Google Search Console. They are there, but I cannot publish client data publicly. We also had clients who were unable to verify their “property” (DNS TXT record in CF) in the GSC because of this; verification instantly worked after “fixing” the allow list with the missing crawler IPs. That is again as close to hard proof as it gets with a black box like Google.
From my point of view your arguments are fair and valid, based on the fact that the crawler IPs missing from Known Bots are simply not used to crawl your site(s), independent of the amount of traffic.
I do not see any technical or SEO issues on our pages; it wouldn’t make sense for there to be any when the crawler isn’t able to reach the site at all. And as a SaaS company, all our clients are set up 100% identically on the technical side. If we failed at something basic, we would know instantly from a mob of clients with torches and pitchforks at our office.
Yet the fact that Google IPs are missing from “Known Bots” remains, and thus that WAF rule is unreliable (besides that strange CF Worker IP…). The potential danger of being delisted is there as well; how likely it is to actually happen certainly varies a lot.
That’s all I wanted to make people aware of while CF takes its time to fix this.