Identify Google Bots (or good bots in general) at application level

Hello,

I’d like to expose our subscriber-only content to Googlebot.

To do that, I need to be able to determine whether the client really is Googlebot, so that I can grant or deny access accordingly.

Cloudflare currently lets firewall rules tell whether a client is a known good bot (cf.client.bot); how can I use this information in my application?

Is there an HTTP header where this information is stored (similar to the way the real client IP is passed in the CF-Connecting-IP header)?

A workaround would probably be to block all requests whose user agent contains “Googlebot” but which are not a known good bot, and then enable the subscriber-only content for all remaining user agents containing “Googlebot”, but I don’t really like that solution. Any ideas?

Thanks,

Marco

You can do this by following the host verification process Google recommends in

In PHP, the code would look like this:

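        // Assumes Laravel: $request is an Illuminate\Http\Request, Str is Illuminate\Support\Str.
        $ua = $request->header('User-Agent', '');
        if ($ua === '') {
            return new JsonResponse(['error' => 'user-agent not supplied.'], 400);
        }
        if (strpos(Str::lower($ua), 'googlebot') !== false) {
            // Behind Cloudflare, the real client IP is in the CF-Connecting-IP header.
            $ip = $request->header('CF-Connecting-IP', $request->ip());
            // Reverse lookup; returns the IP unchanged (or false) on failure.
            $addr = gethostbyaddr($ip);
            // Ensure the host belongs to Google.
            if (!$addr || !Str::endsWith($addr, ['.googlebot.com', '.google.com'])) {
                return self::errorFakeGoogleBot();
            }
            // Prevent a faked PTR record: the forward lookup must return the same IP (IPv4 only).
            if (!in_array($ip, gethostbynamel($addr) ?: [], true)) {
                return self::errorFakeGoogleBot();
            }
        }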

(this is effectively pseudocode, you won’t be able to just copy and paste it)
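
For comparison, here is a minimal, self-contained sketch of the same reverse-then-forward DNS check in Python. The function names and the injectable `reverse`/`forward` parameters are my own assumptions for illustration (they make the logic testable without network access); only the two hostname suffixes come from the code above. It covers IPv4 only.

```python
import socket
from typing import Callable, List

# Suffixes taken from the PHP example above; the leading dot prevents
# a match on names such as "evilgooglebot.com".
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """Return the PTR hostname for ip, or '' if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ""


def forward_lookup(host: str) -> List[str]:
    """Return the IPv4 addresses host resolves to, or [] on failure."""
    try:
        return socket.gethostbyname_ex(host)[2]
    except OSError:
        return []


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Forward-confirmed reverse DNS check for a client IP."""
    host = reverse(ip).rstrip(".").lower()
    # Step 1: the PTR record must sit under one of the expected domains.
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    # Step 2: the hostname must resolve back to the same IP, since anyone
    # who controls the reverse zone of their own IP can fake a PTR record.
    return ip in forward(host)
```

In an application you would call `is_verified_googlebot(client_ip)` only for requests whose user agent claims to be Googlebot, and cache the result, because each check costs two DNS lookups.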

Hello,

Thanks for your reply.
I had implemented something similar myself (just a proof of concept in PowerShell):

$isGoogleBot = $false
# Reverse lookup (PTR), then confirm that the forward lookup returns the same IP.
$names = (Resolve-DnsName $ip).NameHost
$names | % { Write-Host $_; if (($_ -match '\.(googlebot|google)\.com$') -and ((Resolve-DnsName $_).IPAddress -contains $ip)) { $isGoogleBot = $true } }

I had hoped that, instead of verifying this myself, it could be done directly by Cloudflare.

I then noticed that “Fake Google Bot” detection is already enabled on our domains (WAF: Cloudflare Specials -> Rule ID 100201).

I’ve checked our logs, and indeed all requests arriving with a “googlebot” user agent seem to be legitimate.

I’ve also run some quick tests to verify the behavior, and indeed my test requests get blocked.

I think we can therefore assume that every request with the user agent “googlebot” that reaches the application is a legitimate one, so I guess the application could just rely on that. What do you think?
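
To make the question concrete, the application-side check would then reduce to something like the sketch below (Python, with a hypothetical function name and parameters of my own). It assumes that the Cloudflare rule really does drop spoofed Googlebot requests and that the origin server cannot be reached except through Cloudflare; if either assumption fails, anyone could get in by setting the user agent.

```python
def may_view_subscriber_content(user_agent: str, is_subscriber: bool) -> bool:
    """Decide access, trusting the edge to have filtered fake Googlebots.

    Assumption: spoofed "Googlebot" user agents never reach this code
    because they are blocked upstream.
    """
    if is_subscriber:
        return True
    # No DNS verification here: only a case-insensitive user-agent match.
    return "googlebot" in user_agent.lower()
```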

By the way, the info on the Google page you linked (which I had also found) seems to be imprecise.
Among the IPs I’ve retrieved from our logs, there are also some whose reverse DNS matches neither .googlebot.com nor .google.com.

For example, 107.178.231.94 and 130.211.96.77 both belong to Google, but their reverse DNS entries point to something.bc.googleusercontent.com (so neither .googlebot.com nor .google.com).
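
This is roughly how I filtered the reverse DNS names from the logs, rewritten as a small Python sketch. The function name and the sample hostnames in the usage example are made up for illustration; only the two suffixes and the googleusercontent.com name come from this thread.

```python
from typing import Iterable, List

EXPECTED_SUFFIXES = (".googlebot.com", ".google.com")


def unexpected_hosts(hostnames: Iterable[str]) -> List[str]:
    """Return the hostnames that do not end in one of the expected suffixes."""
    return [
        h for h in hostnames
        if not h.rstrip(".").lower().endswith(EXPECTED_SUFFIXES)
    ]


# Hypothetical sample of reverse DNS names collected from access logs.
sample = [
    "crawl-example.googlebot.com",
    "something.bc.googleusercontent.com",
    "proxy-example.google.com",
]
leftovers = unexpected_hosts(sample)
```

Here `leftovers` contains only the googleusercontent.com entry, which is exactly the kind of address the suffix check rejects.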

This is also interesting: Cloudflare Managed Special rules are blocking Googlebot - #14 by dmz

  • 100035 - Fake Google bot, based on partial user-agent match and ASN
  • 100035 C - Fake Google bot, based on exact user-agent match and DNS lookup
  • 100035 D - Fake Google bot, based on partial user-agent match and DNS lookup

So, I’d say that Cloudflare is already doing it.
