Help Needed Please To Exempt Google Bots From Firewall Rules

We have recently been attacked by some malicious scripts that are trying to flood our site with queries. We identified the traffic as coming from some less common user agents, so we created firewall rules to block those user agents, which seems to have largely mitigated the problem.

And in order to reduce the chance of false positives on those user agents, we also added a condition to the blocking rules so that they only match the typical queries being sent by the scripts.

It seems, though, that we might have accidentally blocked some of Google’s search bots too. I say this because I received an email from Google Search Console saying that some of the pages on our site can’t be accessed because of a “Server error (5xx)”.

So I went in and turned off “Bot Fight Mode” under the Bots tab within Firewall settings.

I also added another condition for “Known Bots” to each of the user agent blocking rules and switched that condition to “off”. But I am a bit confused about whether or not I added the Known Bots condition correctly in order to exempt known bots from the blocking rules.

Here is an example of what I have for my user agent blocking rules:

User Agent > Contains > “The Bad User Agent”
And
URL Full > Contains > “The Query Typically Being Sent By The Script”
And
Known Bots > Off
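
For reference, here is roughly how I think that rule renders in the expression editor. The quoted strings are just placeholders, and I believe the “Known Bots” toggle corresponds to the cf.client.bot field, so “Off” becomes “not cf.client.bot”:

(http.user_agent contains "The Bad User Agent" and http.request.full_uri contains "The Query Typically Being Sent By The Script" and not cf.client.bot)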

Thus, I have the following questions please:

1 – Should I disable or enable Bot Fight Mode?

2 – In my example rule above did I add in the “Known Bots” command correctly in order to exempt all known bots from the user agent blocking rules I created?

3 – I also have some Captcha and JS Challenge rules set up separately for certain countries. Besides the USA, are there any other countries that Google sends its bots from which could be affected by these country rules and also be a possible cause of the “Server error (5xx)”?

4 – I noticed under Firewall > Tools that you can also set up some User Agent blocking rules there, but I have avoided using that function because it seems you can’t add any conditions to the rules, nor can you set up a rule that matches only a portion of the user agent. So is it best that I continue to set up these user agent blocking rules within the normal Firewall Rules settings, as I have been doing, since that offers more flexibility?

The rule will work for what you want, but it must be at the top of the list in case any of the subsequent BLOCK rules have side effects on Googlebot. I would probably just use an ALLOW rule with “Known Bots is on” unless you want to block non-Google bots.
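
As a minimal sketch, assuming the dashboard’s Known Bots toggle maps to the cf.client.bot field, that ALLOW rule would simply be the following, placed above all of your blocking rules:

Action “Allow”:
(cf.client.bot)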

I would also leave Bot Fight Mode off.

Thank you for replying. Sorry if my message was a bit long and confusing. I already have the rule at the top of my list of rules. What I am trying to do, though, is avoid blocking Google bots or any other important bots. The problem right now is that I think some Google bots may be getting blocked, and I think that is why I received the email from Google Search Console saying that some of the pages on our site can’t be accessed, resulting in Google receiving a “Server error (5xx)”.

So if I want to make sure I don’t block any bots, it sounds like I have to turn “Known Bots” to On?

Also, do you know if Google sends its bots only from IP addresses in the USA, or from other countries too?

I will also keep Bot Fight Mode off.

Thank you again.

Well, as far as I have checked and from what I see in my Firewall Events, the true ones are coming from the USA and from Google IPs.

But there are also some PetalBot and ZoomInfoBot requests coming from Google Cloud servers, I guess.

Source:

So, to avoid confusion, and to exclude everyone except Googlebot, I decided to create a firewall rule to allow the real Googlebot, and only Googlebot, to access and read my robots.txt file, which contains the location of my sitemap.xml file, as below:

The rule with action “Block” - block robots.txt file access to everyone except the true Googlebot:
(http.request.uri.path contains "robots.txt" and not (ip.geoip.asnum eq 15169 and http.user_agent contains "Googlebot"))

  • explanation: any request for the robots.txt file that does not both come from a Google LLC IP address (by its ASN, 15169) and carry a user-agent string containing “Googlebot” (per the source linked above) gets blocked

I could be wrong here, since a lot of bots, whether spiders, search crawlers, or SEO tools, try to access the robots.txt file (as far as I can see in my Firewall Events).

Nevertheless, keep in mind that some bots do not even look for the robots.txt file; instead they go directly to /sitemap.xml or a similar sitemap URL and crawl it that way, and anyone else could open it directly that way too.

Maybe, if wanted, I could also secure the sitemap.xml file itself and lock it down to Googlebot too.
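
A minimal sketch of that, reusing the same ASN and user-agent checks as the robots.txt rule above, could be:

The rule with action “Block”:
(http.request.uri.path eq "/sitemap.xml" and not (ip.geoip.asnum eq 15169 and http.user_agent contains "Googlebot"))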

Just to keep in mind, I am also using another firewall rule to block these and other bots; I am even blocking the Bing and Yandex bots. But that’s me. Maybe you want and/or need them in some cases.

I keep the Bot Fight Mode option enabled too.

And as feedback: with that approach I am not getting 5xx errors in Google Search Console, and no Firewall Events are logged for Googlebot being challenged or blocked.

Furthermore, also note regarding this: keep the robots meta tag defined and added to each webpage, and have the correct URL value in the canonical meta tag, both inside the <head> HTML element.

Again, that’s what it is and how it’s working in my case.
Maybe it would not be suitable or would not work for someone else’s case, but I hope it helps a bit to get the experience of another use case and a view from a different perspective too.

Thank you for all your insight, and yes, we have only seen Google bots coming from USA-based IP addresses so far too, which is good because it keeps things simple. But some of Google’s bots don’t identify themselves as Google bots even though they have legitimate Google IP addresses, so it can be confusing trying to identify all the real Google bots sometimes.

Also, if you are trying to block search engine bots then a few things to possibly keep in mind:

1 – Robots.txt files are often ignored by bots and are not very reliable, so we don’t rely on ours at all. We just set it up to allow everything and then block things in other ways, because Google might respect robots.txt, but many other bots don’t.

2 – If you want to block a non-malicious bot, meaning one that is not flooding your site with malicious queries and doesn’t need to be blocked at the proxy level to keep the queries from overwhelming your server, then it seems the surest way to do that is via .htaccess with a user agent blocking rule rather than with a firewall rule. We use the following code:

# Return 403 Forbidden for any request whose user agent matches one of these
# patterns (quotes are needed because the placeholder patterns contain spaces)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "Bot 1 To Be Blocked" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Bot 2 To Be Blocked" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Bot 3 To Be Blocked" [NC]
RewriteRule .* - [F,L]

3 – Also, actually identifying the Google bots can be confusing because Google uses multiple user agents, and some of them contain no Google identifier, as I mentioned above. For example, we have seen Google use at least two different user agents on our site (below), and the first one doesn’t even identify itself as Google, yet both are legitimate Google bots. This is why I say it can be a bit confusing to know exactly what is real Google sometimes:

IP: 66.249.65.202 – Google IP
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.90 Mobile Safari/537.36 (compatible

IP: 66.249.65.200 – Google IP
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

As for us, we don’t want to block any legitimate search engine bots. But we have a whole list of bad bots in our .htaccess file that we do block using code similar to the above, and I believe this also prevents those bots from ever getting to your sitemap.xml file. To be effective with this method, though, you need to make sure all the bots you want to block are correctly added to your .htaccess. Then you are good, and you don’t have to keep creating rules for each one; you can just keep adding bot user agents to the .htaccess file as you discover new ones. The following link has lists of some of the common bad bots we are already blocking, and we also have code to block any queries sent from an IP with no user agent at all:

https://privatebin.support-tools.com/?e19125780c0668d8#vtwFliZfIPphLe3Kk3zIXyegLSEqqtYySm/EI6Mkzec=
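
If you wanted to catch that last case at the Cloudflare level instead of in .htaccess, a minimal firewall rule sketch would be the following; I believe a missing user agent shows up as an empty http.user_agent string:

The rule with action “Block”:
(http.user_agent eq "")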

If you want to block anonymous traffic from the Tor network, you can also add a firewall rule on Cloudflare to block Tor completely; it is listed as an option at the end of the countries list when blocking by country.
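
If I remember right, Tor appears under the special country code “T1”, so in the expression editor that rule would come out as something like:

(ip.geoip.country eq "T1")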

Our problem really is just bad scripts that are trying to flood our site with queries. So what I am trying to understand is how to properly use the “Known Bots” condition so that I can add it to any rule that blocks a user agent without accidentally blocking any known search engine bots. I still want all search engines to get through any of our attempts to block by user agent. Thus, if I use the following rule format, will all search engine bots still be able to get through on any of our user agent blocking rules?

User Agent > Contains > “The Bad User Agent”
And
URL Full > Contains > “The Query Typically Being Sent By The Script”
And
Known Bots > On

Thanks again.

Can you clarify this? As I recall, Google’s cloud hosting is on the same ASN as Google’s real bots.

I have two Firewall Rules to block bad bots:

  1. Allow if it’s a Known Bot AND its user agent doesn’t match a handful of bot strings I don’t want crawling my site.
  2. Block a bunch of ASNs where bots typically come from, like AWS, Hetzner, OVH, and several large hosting companies. Every time I see a bot on my site, I note the IP address, track down the ASN, and add it to the list.

The way I do it is ALLOW Known Bots followed by a BLOCK Google ASN (and AWS, Azure).

This way I get Googlebot but also block spam/scam/SQL injection attempts coming from hosted instances on Google Cloud.
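
As a rough sketch of that two-rule setup (the bot name here is a made-up example, and the ASNs are just the obvious big ones: Google 15169, Microsoft/Azure 8075, AWS 16509 and 14618):

Rule 1, action “Allow”:
(cf.client.bot and not http.user_agent contains "SomeBotIDontWant")

Rule 2, action “Block”:
(ip.geoip.asnum in {15169 8075 16509 14618})

The order matters: the Allow rule has to sit above the Block rule so that known bots are let through before the ASN block is evaluated.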

Sorry if what I wrote about real Google bots was confusing. I was trying to say that the user agents Google uses don’t always make it fully obvious at a glance that it is a Google bot. Sometimes they do, but a lot of the time Google is using the following user agent on our site, and it doesn’t identify itself as being Google at all:

Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.90 Mobile Safari/537.36 (compatible

Where you wrote "ASNs where bots typically come from, like AWS, Hetzner, OVH, and several large hosting companies" can I ask two questions please?

1 - If you completely block the ASNs from those big companies then is there a chance of accidentally blocking any normal visitor traffic?

2 - How are you able to easily and quickly determine if something is a bot on your site? Perhaps from the user agent or do you grab the IP and look it up?

And one last question, please: if you add ALLOW Known Bots (with the condition in the On position) to an existing rule that is blocking a user agent, then the known bots still get through from that user agent, correct? This is what I want and what I am trying to do. Sorry to keep coming back to this question, but I am still a bit confused about the use of this Known Bots condition.

Thank you all for your kind patience with my repeated queries.

That sure doesn’t look like a Google bot to me, unless it’s because you (or your server) cut off the end of the string. Every user agent string Google lists has “google” somewhere in it:

  1. No. “Normal” visitors should not be coming through a cloud hosting company. Some might be using a small or self-hosted VPN, but I don’t block outright; I use a JS Challenge, which has been pretty effective.
  2. Bots are usually hitting only HTML and not loading any JS or CSS. Or they hit a bunch of 404 errors, usually because they’re probing for vulnerabilities. Or they have a bot string that’s up to no good, like “python”.

I see you mention ALLOW and “blocking” in the same rule. It’s one or the other, and you’d adjust the logic accordingly. I already gave my setup in my first example, and it requires two rules.

Thank you. I think I see what is happening now with that Google user agent I posted. Yes, I think it is one of Google’s mobile bots noted on the link you provided, but the UA is getting truncated on our site so that it doesn’t show the Google bot identifier part. That makes sense. Sorry, my bad.

Thanks also for explaining how you block those cloud servers and how you identify bots. If I just want to block a few of the biggest hosts with a JS Challenge, would the three you mentioned (AWS, Hetzner, OVH) be enough, or are there other big ones I should include too? Sorry, I don’t know the names very well, which is why I am asking.

I think I now understand how to block a user agent without blocking any Google bots, since you explained that it has to be one or the other. So I think I have a better solution than using Known Bots at all for my purposes.

So say, for example, I want to create a rule to block the following user agent used by Google, but I don’t actually want to block Google itself, only any other bots that might use that agent too. Should I set it up as follows?

User Agent > Contains > Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.90 Mobile Safari/537.36
And
URL Full > Contains > “The Query Typically Being Sent By The Malicious Bot”
And
User Agent > Does Not Contain > Google

Would the above work please?
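
In the expression editor I believe that would come out roughly as follows (the query string is a placeholder, and I have shortened the user agent string for readability):

(http.user_agent contains "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P)" and http.request.full_uri contains "The Query Typically Being Sent By The Malicious Bot" and not http.user_agent contains "Google")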

There’s a list of ASNs belonging to hosting providers:

Thank you. That is quite a list (741 total). Would you block all of them with a JS Challenge? And is there a way to add that CSV file as a list to Cloudflare?

UPDATE:

I blocked the 2 ASNs from AWS, the 2 from Hetzner, and the 1 from OVH by adding each as an ASN rule manually. If there are others that should be added, or a way to add a bunch of them as a list, then I would be interested in doing that too. Thank you.

UPDATE 2:

I ended up adding a few more of the big ones. Below are the 10 ASNs I went with. Seems like a good start:

32244, 22611, 54641, 47583, 55293, 37153, 24940, 16509, 14618, 16276

(ip.geoip.asnum eq 16276) or (ip.geoip.asnum eq 14618) or (ip.geoip.asnum eq 16509) or (ip.geoip.asnum eq 24940) or (ip.geoip.asnum eq 37153) or (ip.geoip.asnum eq 32244) or (ip.geoip.asnum eq 22611) or (ip.geoip.asnum eq 54641) or (ip.geoip.asnum eq 47583) or (ip.geoip.asnum eq 55293)
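
As a side note, if I am not mistaken the same rule can be written more compactly with the in operator:

(ip.geoip.asnum in {32244 22611 54641 47583 55293 37153 24940 16509 14618 16276})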
