Tons of bot traffic. How to limit it with Cloudflare?

Hi,

I’ve just found that I’m receiving tons of hits per minute from Googlebot, Bingbot, Yandex bots, AhrefsBot, Applebot…

I’m only interested in the bots of the most important search engines (Google, Bing), and would like to limit the traffic from the rest. I’m aware of the ‘Crawl-delay’ directive for ‘robots.txt’, but I guess that not all bots will respect it.
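
For context, I mean something like this in ‘robots.txt’ (just a sketch; as far as I know Googlebot ignores ‘Crawl-delay’ entirely, Bing and Yandex have historically honored it, and the value is in seconds):

  User-agent: bingbot
  Crawl-delay: 10

  User-agent: YandexBot
  Crawl-delay: 10

  User-agent: AhrefsBot
  Disallow: /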

I’ve been browsing the options for limiting bot traffic in Cloudflare (Dashboard > Firewall > Tools), but I’m not sure which one is best.

Any tip would be welcome. Thank you very much.

Is this also related to any of these topics?

Have you got “Bot Fight Mode” enabled in the Cloudflare dashboard?
Are you using any firewall rules or page rules?
What is the Security Level set to in your Cloudflare dashboard?

I use firewall rules like this:

If it’s a Verified Bot, AND it’s one of these bots, then Allow.

  1. ALLOW (cf.client.bot and (http.user_agent contains "UptimeRobot" or http.user_agent contains "DuckDuckBot" or http.user_agent contains "Googlebot" or http.user_agent contains "bingbot"))

And any other Verified Bots get blocked.

  1. BLOCK (cf.client.bot)
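
If you’d rather keep it to a single rule, something roughly equivalent (an untested sketch, adjust the bot names to taste) would be to block any verified bot that isn’t on your allow list:

  1. BLOCK (cf.client.bot and not (http.user_agent contains "Googlebot" or http.user_agent contains "bingbot" or http.user_agent contains "DuckDuckBot" or http.user_agent contains "UptimeRobot"))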

Hi @sdayman,

Thank you very much for your nice answer. I’ve just implemented the rules you suggested.

In just 5 minutes, CF

  • allowed 1.47k events from Google
  • allowed 900 events from Microsoft
  • blocked 1.18k from Facebook: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

Is it worthwhile to allow visits from Facebook’s bot? I guess these visits are used to crawl my pages’ Open Graph tags so that FB can show a preview of their content to its users, but I don’t know if the load on my server is worth it.

Thank you.

That’s quite an interesting number for Google, as it matches their 5 hits per second example here:

https://support.google.com/webmasters/answer/48620?hl=en

The Microsoft rate equates to exactly 3 hits per second. So those numbers don’t seem to be out of line with standard practice.
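
(Quick math: 1,470 requests over 300 seconds is roughly 4.9 per second, and 900 over 300 seconds is exactly 3 per second.)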

As for Facebook, it’s completely up to you how to handle that.
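
If you decide the link previews are worth the load, you could simply add it to the allow rule, e.g. (a sketch only, and it assumes Facebook’s crawler keeps matching cf.client.bot, as it does in your logs):

  1. ALLOW (cf.client.bot and http.user_agent contains "facebookexternalhit")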


Hi,

By default the crawl rate was set to “Let Google optimize for my site (recommended)”, but when I selected “Limit Google’s maximum crawl rate”, the slider was positioned at 3.5 requests per second.

I’ve just moved it manually to 0.7 requests per second.
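
(If my math is right, 0.7 requests per second caps Googlebot at roughly 2,500 requests per hour, versus about 12,600 per hour at the 3.5 setting.)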

Thank you for your nice answer.

Edit: Since I set up the Firewall Rules, Googlebot has visited my website 16.5k times.


Hi again,

I’m seeing that I’m still receiving tons of requests from ‘SemrushBot’. I’ve disallowed it in my ‘robots.txt’ and also handled it through PHP, detecting its user agent in order to block it.
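
In case it’s useful to anyone, the PHP check is essentially this kind of thing (a rough sketch, not my exact code):

  // Refuse requests whose user agent mentions SemrushBot.
  $ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
  if (stripos($ua, 'SemrushBot') !== false) {
      header('HTTP/1.1 403 Forbidden');
      exit;
  }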

Why isn’t ‘SemrushBot’ flagged by ‘cf.client.bot’? It should have been blocked, right?

I don’t think SEMrush is on the cf.client.bot list. The list isn’t up to date, but here it is:


Thank you very much again, @sdayman.

I’ve built a boolean condition to catch all those “bad bots” by using

(http.user_agent contains "SemrushBot") or ....
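
(The rest of the expression is just more of the same pattern; for example, adding AhrefsBot, which was also hammering my site, it would look like this, with the exact bot names being whatever shows up in your own logs:)

  1. BLOCK ((http.user_agent contains "SemrushBot") or (http.user_agent contains "AhrefsBot"))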

In 5 minutes, it has already caught 500 requests from these bad bots.


This topic was automatically closed 24 hours after the last reply. New replies are no longer allowed.