Known Bots rule lacks many Google Crawler IPs

Let me sum up my current knowledge:

  • CF confirmed to us that Google Crawler IPs are missing from their “Known Bots” list, so the issue is real.
  • Only a small(?) subset of Google Crawler IPs is missing: at least those I added to the WAF rule in my first post.
  • From our data points it looks like Google is doing some kind of IP sharding for its crawler. Particular domains have this issue while others don’t. The sites that have the issue keep being crawled by the same Google IPs, whilst the others work well with “Known Bots” and are constantly crawled by other IPs. So the IPs are not picked randomly, round robin or the like.
  • Only a small set of our clients is affected.
  • We are a SaaS company and run several hundred sites / domains through our CF account, so I am certain that we have a relevant number of data points.
  • When the Google crawler keeps getting 4xx errors for a given URL, that URL will most probably be delisted at some point. What else should the crawler do when it can no longer access the content? (As discussed here, for example: What does Google do with indexed pages returning 403? - Webmasters Stack Exchange)
  • This is also backed by the 403 errors in the Google Search Console. They are there, yet I cannot publish client data publicly. We also had clients who could not verify their “Property” (DNS TXT record in CF) in the GSC because of this. It instantly worked after “fixing” the allow list with the missing crawler IPs. That’s again as close to hard proof as it can get with a black box like Google.
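
For anyone who wants to check their own logs for the same pattern, here is a minimal Python sketch. It assumes you have exported your firewall events as (domain, client IP, status) rows and that you have a copy of the crawler ranges Google publishes as JSON (googlebot.json, with a "prefixes" list of "ipv4Prefix" / "ipv6Prefix" entries). The function names, the sample prefixes and the example domains are made up for illustration; this is not our actual tooling and the sample prefixes are not the current list.

```python
import ipaddress
from collections import defaultdict


def load_networks(ranges_doc):
    """Turn a googlebot.json-style dict into a list of ip_network objects."""
    networks = []
    for entry in ranges_doc.get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            networks.append(ipaddress.ip_network(prefix))
    return networks


def is_google_crawler(ip, networks):
    """True if the address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


def blocked_crawler_ips(log_rows, networks):
    """Per domain, collect Google crawler IPs that were answered with a 403."""
    blocked = defaultdict(set)
    for domain, ip, status in log_rows:
        if status == 403 and is_google_crawler(ip, networks):
            blocked[domain].add(ip)
    return dict(blocked)


if __name__ == "__main__":
    # Sample data in the published format (assumption: not the live list).
    sample_ranges = {
        "prefixes": [
            {"ipv4Prefix": "66.249.64.0/27"},
            {"ipv6Prefix": "2001:4860:4801:10::/64"},
        ]
    }
    # Hypothetical log rows: (domain, client IP, HTTP status).
    sample_rows = [
        ("a.example", "66.249.64.5", 403),   # crawler IP, blocked
        ("b.example", "66.249.64.9", 200),   # crawler IP, allowed
        ("a.example", "203.0.113.7", 403),   # not a crawler IP
    ]
    nets = load_networks(sample_ranges)
    for domain, ips in blocked_crawler_ips(sample_rows, nets).items():
        print(domain, sorted(ips))
```

If the same few IPs keep showing up for the same domains, that matches the sharding pattern described above, and those IPs are the candidates for a manual allow rule.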

From my point of view your arguments are fair and valid, provided that the crawler IPs missing from Known Bots are simply not used to crawl your site(s), independent of the amount of traffic.

I do not see any technical or SEO issues on our pages. It wouldn’t make sense for there to be any when the crawler isn’t able to reach the site at all :wink: . And as a SaaS company, all our clients are technically set up 100% identically. If we failed at something basic, we would know instantly from a mob of clients with torches and pitchforks at our office :wink:

Yet the fact remains that Google IPs are missing from “Known Bots”, and thus that WAF rule is unreliable. Besides that strange CF Worker IP… The potential danger of being delisted is there as well. How likely it is that this actually happens most certainly varies a lot.

That’s all I wanted to make people aware of while CF takes its time to fix this.
