Known Bots rule lacks many Google Crawler IPs

Hey there, just wanted to share this information for reference and to gather some “public opinion” on it.

We discovered that the WAF “Known Bots” rule was not triggering for many Google IPs. Our setup allows “Known Bots” in one WAF rule and, in a later WAF rule, sends all traffic from certain geo regions (the US among them) to the CF Managed Challenge.
We expected the “Known Bots” list to be complete for all major search engines, Google in particular. Sadly, this is not the case: several legitimate Google requests ended up in the CF Managed Challenge.
After some lengthy discussions with CF support, I got the feedback that there are “missing IPs” in the “Known Bots” rule.

We were quite irritated, as several of those missing IPs have been publicly known Google crawler IPs for years…
Because of this, several of our domains dropped out of the Google index after the crawler reported “403”.

From our current perspective, the “Known Bots” rule is not usable, as it doesn’t do what it promises.
For the time being we add validated IPs manually to a new WAF rule. Still, this is highly irritating and unexpected.

For now, here is a WAF filter to patch in the missing “Known Bots” data:

(http.user_agent contains "Google" and ip.geoip.asnum eq 15169 and ip.src in {66.249.66.129 66.249.66.131 66.249.66.128 66.249.66.159 66.249.66.133 66.249.65.186 66.249.65.188 66.249.65.184 66.249.65.169 66.249.65.171 66.249.65.173 66.249.65.190 66.249.77.30 66.249.65.185 66.249.65.183 66.249.64.104 66.249.64.106 66.249.64.120 66.249.64.122 66.249.77.28 66.249.64.0/27 66.249.64.128/27 66.249.64.160/27 66.249.64.192/27 66.249.64.224/27 66.249.64.32/27 66.249.64.64/27 66.249.64.96/27 66.249.65.0/27 66.249.65.128/27 66.249.65.160/27 66.249.65.192/27 66.249.65.224/27 66.249.65.32/27 66.249.65.64/27 66.249.65.96/27 66.249.66.0/27 66.249.66.128/27 66.249.66.192/27 66.249.66.0/24})

Yet this list will not be complete, as we still see additional IPs popping up. I will keep updating the filter to document the missing IPs…
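Since Google publishes these crawler ranges as JSON (the link is further down in this thread), the filter can also be regenerated instead of hand-maintained. A minimal sketch (stdlib only, not an official tool), assuming the published file keeps its current shape of a top-level "prefixes" array with "ipv4Prefix" / "ipv6Prefix" entries:

    # Sketch: rebuild the WAF expression from Google's published crawler ranges.
    # Assumes googlebot.json keeps its {"prefixes": [{"ipv4Prefix": ...}, ...]} shape.
    import json
    import urllib.request

    GOOGLEBOT_RANGES = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

    with urllib.request.urlopen(GOOGLEBOT_RANGES) as resp:
        prefixes = json.load(resp)["prefixes"]

    # Each entry is either {"ipv4Prefix": "..."} or {"ipv6Prefix": "..."}.
    ranges = [p.get("ipv4Prefix") or p["ipv6Prefix"] for p in prefixes]

    print(
        '(http.user_agent contains "Google" and ip.geoip.asnum eq 15169 '
        "and ip.src in {" + " ".join(ranges) + "})"
    )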


Hey,
I went ahead and escalated this issue. Do you have any Ray IDs or logs showing that the crawlers were being blocked despite a rule allowing known bots? Do you have a ticket number?


Hey @jnperamo

Thanks for the escalation. Check out our ticket #2649240; Ray IDs and WAF logs are all in there.
It has been open for a month now. It took some time to convince someone that this is actually a real issue…

To spice things up, it’s not just Google IPs missing from “Known Bots”: the WAF also logged Google requests (100% sure they are from Google) coming from a CF Worker IP (2a06:98c0:3600::103). So some things around the WAF and Known Bots are obviously broken.

Best Regards,

Jan


This might be expected, as some people try to abuse Workers and, as a result, CF has some built-in rules to prevent people from launching such attacks.

Just to confirm: are the IPs you saw being blocked listed in the Google dev guide, and not just requests reporting a Google crawler user agent?


Regarding the CF Worker IP: we were able to link those WAF events with 99.9% certainty to legit Google calls. The requests matched in the WAF and in Google Search Console to the second, so I am very certain these calls were not spoofed. In addition, the WAF displays Google’s ASN. So the WAF info doesn’t add up.

Google IPs: the missing IPs (at least the ones I checked) are well documented here: https://developers.google.com/static/search/apis/ipranges/googlebot.json

        {"ipv4Prefix": "66.249.64.0/27"},
        {"ipv4Prefix": "66.249.64.128/27"},
        {"ipv4Prefix": "66.249.64.160/27"},
        {"ipv4Prefix": "66.249.64.192/27"},
        {"ipv4Prefix": "66.249.64.224/27"},
        {"ipv4Prefix": "66.249.64.32/27"},
        {"ipv4Prefix": "66.249.64.64/27"},
        {"ipv4Prefix": "66.249.64.96/27"},
        {"ipv4Prefix": "66.249.65.0/27"},
        {"ipv4Prefix": "66.249.65.128/27"},
        {"ipv4Prefix": "66.249.65.160/27"},
        {"ipv4Prefix": "66.249.65.192/27"},
...

You can match them against the WAF rule I posted above.
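If you want to automate that cross-check, here is a small stdlib-only sketch that tests a logged client IP against the published prefixes:

    # Sketch: test a client IP against Google's published Googlebot ranges.
    import ipaddress
    import json
    import urllib.request

    GOOGLEBOT_RANGES = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

    with urllib.request.urlopen(GOOGLEBOT_RANGES) as resp:
        prefixes = json.load(resp)["prefixes"]

    networks = [
        ipaddress.ip_network(p.get("ipv4Prefix") or p["ipv6Prefix"])
        for p in prefixes
    ]

    def is_googlebot_ip(ip: str) -> bool:
        """Return True if `ip` falls inside one of Google's published crawler ranges."""
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in networks)

    print(is_googlebot_ip("66.249.66.129"))  # True: contained in 66.249.66.128/27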

And using the Wayback Machine / Internet Archive it is easy to confirm that these IPs have been on that list for years! So this is not about “new” IPs that CF simply hadn’t had time to add yet…

It’s all documented within the ticket as well :wink:

Hi @JanTh, was there any update on this? Interestingly, I came across this thread because I had suspected something similar in mid-to-late last year but never considered this explanation. I would be interested to know if there is any update or official comment from the Cloudflare team. Do we need to manually add the IPs from the Google JSON to a WAF allow list?

was there any update on this

They are still “investigating”… no real updates. It’s just the Google crawler; CF doesn’t seem to think dropping out of the index is critical.


This is very important information to me. Do I set the rules in the WAF menu?
Should I add the IPs with a “Bypass” action in the WAF?

There is a WAF rule in my first post that you can copy and paste. Add it on the WAF rule screen, in the “Expression Preview” text field at the bottom. You have to enable the code view via the “Edit expression” link above the field before you can paste into it.

Use “Allow” as the action and make sure the rule is in position #1.
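If you manage rules via the API rather than the dashboard, here is a hedged sketch using the (now legacy) Firewall Rules endpoint; ZONE_ID and API_TOKEN are placeholders, and you would paste in the full expression from my first post:

    # Sketch: create the allow rule via Cloudflare's (legacy) Firewall Rules API.
    # ZONE_ID, API_TOKEN and the shortened EXPRESSION below are placeholders.
    import json
    import urllib.request

    ZONE_ID = "your_zone_id"
    API_TOKEN = "your_api_token"  # needs zone firewall edit permission

    # Shortened for readability; use the full expression from the first post.
    EXPRESSION = ('(http.user_agent contains "Google" and ip.geoip.asnum eq 15169 '
                  "and ip.src in {66.249.64.0/27})")

    payload = [{
        "filter": {"expression": EXPRESSION},
        "action": "allow",
        "priority": 1,  # run before the geo challenge rule
        "description": "Missing Known Bots Google crawler IPs",
    }]

    req = urllib.request.Request(
        f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/firewall/rules",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {API_TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["success"])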

Best,
Jan


Oh, and make sure to open a ticket with CF support to let them know that being kicked out of the Google index is somewhat of a critical issue :wink:


Thank you for your kind reply. Is this the right way to do it? But isn’t “bypass” a better option than “allow”?

Is it a good idea to add this?

But isn’t “bypass” a better option than “allow”?

It’s your choice. Just read the text below the dropdown and decide what you want. We only ever use “Allow”.

Is it a good idea to add this?

Again, up to you, but I wouldn’t do this, as you do not know what Google Cloud traffic might be routed through that ASN. Do not assume it’s just legit crawlers. The CF security features are there for good reasons, and there are very few use cases where it is “smart” or “good” to disable them the way your rule does.

Thank you, you are probably right.
I’m using a translator; forgive me if anything I said came across as rude.

Just set the fake Google bot rule to log only; you can do this for Yandex, Bing, Baidu, etc. as well. Do not give the Google ASN a bypass of the whole WAF. Also, if this were a major issue, others would be getting deindexed as well and many more complaints would be logged. I suspect this blocking isn’t the real issue; just think about it.
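To illustrate (my sketch, using Cloudflare’s verified-bot field cf.client.bot): a “fake Googlebot” rule matches requests that send a Googlebot user agent but are not on the verified bots list, with the action set to Log (where available):

    (http.user_agent contains "Googlebot" and not cf.client.bot)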

Check the Google Search Console indexing report: how many URLs are reported as 403? (This is clearly stated there.) And what do the Googlebot logs in the crawl stats (under Settings) show? You did not post these, so there are a few things left to check.

If you post these and it turns out Googlebot is crawling and getting HTTP 200, then sadly you have been spanked by a Google algorithm update instead. With only several Googlebot IP addresses blocked out of thousands, I would be surprised if the blocking were the cause.

Keen to see the indexing report of 403s and the crawler page stats. Curious to see them.

No reason to apologize, it’s all good :slight_smile:

The issue only becomes apparent if you use a geo WAF rule like “challenge all traffic from outside the EU with a Managed Challenge” and combine it with an “Allow” for “Known Bots”.
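Roughly sketched, that combination is two rules in this order (the continent code for “outside the EU” is my assumption):

    Rule 1, action “Allow”:
    (cf.client.bot)

    Rule 2, action “Managed Challenge”:
    (ip.geoip.continent ne "EU")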

I believe most people are not doing this and thus the missing IPs in the “Known Bots” are not that obvious.

But let me share my data with you, in case it can convince you.
The screenshot shows a Google crawler request that is reported with a Cloudflare Worker IPv6 address and blocked.


There is no reason to believe this has anything to do with spoofing or any Google algorithm updates. We are well aware of these possibilities and have ruled them out.

Let me attach one recent example, backed by a corresponding 403 in the Google Search Console. You can see a legit Google crawler IP which should have been allowed per the “Known Bots” rule:

Google console results?

Also, you’re dead wrong about challenge traffic set up the way you have it; I have many similar rules, as do many here, since Bot Fight Mode is useless. Again, if this were a general issue we would all be getting deindexed, hence I doubt Cloudflare is solely to blame for your current issue. I also cannot see legit crawlers being blocked in my logs, and my site gets over 20 million crawls per 90 days.

Post your Google Search Console logs; there is no point dismissing other causes until you do some actual investigation into Google. Even if some Google IPs have issues, it takes Google ages to deindex a site unless there was a manual penalty or an algorithm update. If you follow SEO you know that Google is adjusting like a mad scientist every few days these days, making life miserable for most white-hat websites.

The Google Search Console page indexing section should show all your pages as 403, and the Googlebot crawl stats section should reflect this.


Let me sum up my current knowledge:

  • CF confirmed to us that Google crawler IPs are missing from their “Known Bots”. So the issue is real.
  • Only a small(?) subset of Google crawler IPs is missing; at least those I added to the WAF rule in my first post.
  • From our data points it looks like Google is doing some kind of IP sharding for its crawler. Particular domains have this issue while others don’t, and the sites with the issue keep being crawled by the same Google IPs, while the sites working well with “Known Bots” are consistently crawled by other IPs. So the IPs are not used randomly, round-robin style.
  • Only a small set of our clients is affected.
  • We are a SaaS company and run several hundred sites / domains through our CF account, so I am certain we have a relevant number of data points.
  • When the Google crawler keeps getting 4xx errors for a given URL, that URL is most probably delisted at some point. What else should the crawler do when it can’t access the content anymore? (As discussed here, for example: What does Google do with indexed pages returning 403? - Webmasters Stack Exchange)
  • This is also backed by the 403 errors in the Google Search Console. They are there, but I cannot publish client data publicly. We also had clients who were unable to verify their “property” (DNS TXT record in CF) in the GSC because of this; it instantly worked after “fixing” the allow list with the missing crawler IPs. That, again, is as close to hard proof as it gets with a black box like Google.

From my point of view your arguments are fair and valid, based on the fact that the crawler IPs missing from Known Bots are simply not the ones used to crawl your site(s), independent of the amount of traffic.

I do not see any technical or SEO issues on our pages; it wouldn’t make sense for there to be any when the crawler isn’t able to reach the site at all :wink: . And as a SaaS company, all our clients are set up 100% identically on the technical side. If we failed at something basic, we would know instantly from a mob of clients with torches and pitchforks at our office :wink:

Yet the fact remains that Google IPs are missing from “Known Bots”, and thus that WAF rule is unreliable (besides that strange CF Worker IP…). The potential danger of being delisted is real as well; the actual risk of it happening most certainly varies a lot from site to site.

That’s all I wanted to make people aware of while CF takes its time to fix this.
