Fake Google Bot

Hmm, yeah, those are probably tricky to filter out.

Possibly capture some request headers from legit and non-legit Googlebot requests, and see if legit Google requests always use certain headers in a certain way that dodgy ones don’t?

Or maybe Google has published IP ranges for Googlebot, and you could whitelist Googlebot user agents only from those IPs?

Good idea, but Google doesn’t list IPs. The only way to verify Googlebot is by checking the host name (a reverse DNS lookup on the requesting IP):
https://support.google.com/webmasters/answer/80553
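
A rough sketch of that host check in Python, in case it helps anyone reading along. It does a reverse lookup on the IP, checks that the host name ends in googlebot.com or google.com, then does a forward lookup to confirm the name resolves back to the same IP. The function name and the injectable lookup parameters are my own (hypothetical, added so the logic can be tried without network access), and the default forward lookup only handles IPv4.

```python
import socket

# Host name suffixes for genuine Googlebot hosts, per the Google help page above.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` passes a reverse + forward DNS check.

    `reverse_lookup` and `forward_lookup` are hypothetical hooks so the
    check can be exercised offline; by default they use the system resolver.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliases, ipv4_addresses)
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip)
    except OSError:  # no PTR record or resolver failure
        return False

    # The leading dot in each suffix rejects names like "evilgooglebot.com".
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False

    try:
        # The forward lookup must map the host name back to the same IP.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

Calling `is_verified_googlebot(ip)` on the addresses that show up in the logs with a Googlebot user agent should separate real crawler hits from spoofed ones, assuming the server can do DNS lookups.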

These bots can even solve the Cloudflare challenge. I added some ranges to the CF firewall and they solved the Google captcha, because I can still see IPs from those ranges in my server logs.

:wave: @katarzynastarzewska,

If you are on a paid plan, make sure you have the WAF enabled; there is a standard rule for fake Googlebots.

— OG

I just bought a paid plan and enabled the WAF, but it didn’t help. Now they send about 120k requests per hour.

I’d say one of these two options might be the most feasible way to block them:

(http.user_agent contains "Googlebot" and ip.geoip.asnum eq 14618)

and

(http.user_agent contains "Googlebot" and not cf.client.bot)

The first blocks requests with a user agent containing “Googlebot” that come from Amazon’s networks (ASN 14618). The second blocks requests with the same user agent that Cloudflare has not internally marked as “known crawler” requests.
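
To make the logic of the two expressions concrete, here is a small Python sketch that mirrors them so they can be tried against values pulled from server logs. It is not the Cloudflare rules engine; the function and parameter names are made up, and `known_bot` merely stands in for whatever cf.client.bot evaluates to on Cloudflare's side.

```python
AMAZON_ASN = 14618  # the ASN used in the first expression


def rule_asn(user_agent, asn):
    # Mirrors: http.user_agent contains "Googlebot" and ip.geoip.asnum eq 14618
    return "Googlebot" in user_agent and asn == AMAZON_ASN


def rule_not_known_bot(user_agent, known_bot):
    # Mirrors: http.user_agent contains "Googlebot" and not cf.client.bot
    return "Googlebot" in user_agent and not known_bot


def should_block(user_agent, asn, known_bot):
    # A request is blocked if either expression matches.
    return rule_asn(user_agent, asn) or rule_not_known_bot(user_agent, known_bot)
```

Note that the second expression does not depend on the ASN at all, so it should also catch fake Googlebot requests from networks other than Amazon, as long as they are not flagged as known crawlers.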

Thank you, but unfortunately neither of these options works.

The first one doesn’t work, because I already used blocking by ASN.

Cloudflare shows about 700k requests from Googlebot in Analytics > Security, so I guess cf.client.bot is determined by user agent, not by bot IP, hostname or ASN.

If you already blocked the network and still get requests from that network, these requests must come directly to your server and you should have a look at your server’s firewall instead.

The details are not public, but I am pretty sure that cf.client.bot is based on IP blocks and not on the user agent.
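
One way to test the direct-request theory is to check whether the address that actually connected to the server belongs to Cloudflare. A minimal sketch, assuming you log the real TCP peer address (not the visitor IP restored from a header). The ranges below are only an illustrative subset; the full, current list is published at https://www.cloudflare.com/ips/ and should be loaded from there. The function name is made up.

```python
import ipaddress

# Illustrative subset of Cloudflare's IPv4 ranges (assumption: incomplete,
# and may change over time; use the published list in practice).
CLOUDFLARE_RANGES_SAMPLE = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "173.245.48.0/20",
        "103.21.244.0/22",
        "104.16.0.0/13",
        "172.64.0.0/13",
        "162.158.0.0/15",
    )
]


def came_through_cloudflare(peer_ip, ranges=CLOUDFLARE_RANGES_SAMPLE):
    """Return True if the connecting address falls inside one of `ranges`."""
    addr = ipaddress.ip_address(peer_ip)
    # An IPv6 address is never "in" an IPv4 network, so this is safe for both.
    return any(addr in net for net in ranges)
```

If the peer addresses of the unwanted requests are outside Cloudflare's ranges, they are bypassing Cloudflare and no firewall rule there can stop them. If another proxy or CDN sits between Cloudflare and the server, the peer address will be that proxy's, so the check would have to be done at that layer instead.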

Log entries on my server end with “X-Middleton/1”, so I think the requests come from Cloudflare, because that string is added by the Apache CF module.

Which Apache CF module? There is one module and it just rewrites IP addresses and does not add any such data. Also, that string does not seem very Cloudflare related. Furthermore, if requests pass through that module, the string will always be added.

The way you described it, the most likely explanation is direct requests. What’s the domain, and would you feel comfortable sharing the server IP address?

Which Apache CF module? There is one module and it just rewrites IP addresses.
https://support.cloudflare.com/hc/en-us/articles/200170786-Restoring-original-visitor-IPs-Logging-visitor-IP-addresses-with-mod-cloudflare-

Requests aren’t direct because I’m now blocking about 700k of them by IP ranges:

That is the module I was referring to and that module does not add any such data.

Anyhow, if you can rule out direct requests and requests still hit your server, you have the wrong ASN and should check that.

In any case, the second expression should still block them, of course assuming they come with the user agent in question.

“X-Middleton/1”: my mistake, it is added by the Ezoic CDN, but it means that the requests aren’t direct.

Where do you get that from? That value appears to be appended regardless of where the request came from.

Again, if you can rule out direct requests, you have the configuration wrong. But I am still pretty sure these are direct requests.

Once again, what’s the domain and server IP?

So just one rule? Can you post a screenshot of that rule? Also, try blocking instead.

For the fourth time, what is the domain?

I’m sorry, but I can’t share the domain and IP.

In that case it is impossible for the community to say anything.

If you applied the expressions as I described, the requests should be blocked. If they are not, either the configuration is wrong, or the requests do not match the configuration (different values), or the requests are direct.

I am afraid this is all that can be said at this point.