As I understand it I can use firewall rules to block these user agent strings.
But I did some research and some of the people who run these bots advise just blocking the UA name.
For example instead of blocking “AhrefsBot/6.1” I would block “AhrefsBot”.
But I am not sure whether this would work with the Cloudflare setup.
I would like to do it this way if possible, as it would block every version of a bot rather than just the current version hitting my site.
Is this possible?
I hope that makes sense. This is all very complicated for me!
(http.user_agent contains "AhrefsBot" and not cf.client.bot) or
(http.user_agent contains "SemrushBot" and not cf.client.bot) or
(http.user_agent contains "YandexBot" and not cf.client.bot) or
(http.user_agent contains "CCBot" and not cf.client.bot)
Here we use "and not cf.client.bot" to make sure legitimate, verified bots (the ones that index the web) don't get blocked. Anyone faking one of these bot user agents would then be blocked.
If you don’t care about being indexed by legitimate bots, you don’t need the “known bots” part of these rules.
I don't know about the other two, but Yandex is a legitimate search engine, and AhrefsBot is used by (some) legitimate services such as schools. Both of those crawlers respect robots.txt, and that is the preferred method.
The problem with using a firewall to control bots is that, since they never see the robots.txt telling them to stop, they may keep retrying and flood the firewall logs to the point where people start ignoring them.
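If you want to try the robots.txt route first, a minimal file that asks all four of these crawlers to stay away entirely would look something like this (the bare bot names are the user-agent tokens those crawlers match on; a group may list several User-agent lines before one Disallow):

```
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: YandexBot
User-agent: CCBot
Disallow: /
```

This only works for bots that actually honor robots.txt; anything that ignores it still needs a firewall rule.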
(http.user_agent contains "AhrefsBot" or http.user_agent contains "SemrushBot" or http.user_agent contains "YandexBot" or http.user_agent contains "CCBot") and not cf.client.bot
Note that this will block Semrush and CCBot outright, as those two are not covered by the cf.client.bot flag.
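To see why blocking the bare name ("AhrefsBot") also catches every versioned string ("AhrefsBot/6.1", "AhrefsBot/7.0"), here is a small Python sketch of the substring semantics of Cloudflare's "contains" operator combined with the known-bots exemption. The function name and the example user-agent strings are illustrative, not anything Cloudflare provides:

```python
# Sketch of the rule's logic: block if the UA contains any listed bot name,
# unless Cloudflare has verified the request as a known good bot (cf.client.bot).
BLOCKED_NAMES = ["AhrefsBot", "SemrushBot", "YandexBot", "CCBot"]

def would_block(user_agent: str, is_verified_bot: bool = False) -> bool:
    """Mirror of: (UA contains any blocked name) and not cf.client.bot."""
    return any(name in user_agent for name in BLOCKED_NAMES) and not is_verified_bot

# Matching on the bare name covers current and future versions alike.
print(would_block("Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)"))  # True
print(would_block("Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"))  # True
# A verified known bot is exempt, even though the name matches.
print(would_block("Mozilla/5.0 (compatible; YandexBot/3.0)", is_verified_bot=True))  # False
# Ordinary browsers are untouched.
print(would_block("Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"))  # False
```

So yes, blocking on the bare name is the version-proof way to write the rule.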