If there is any pattern in the bot’s requests, you may be able to create a custom firewall rule to block it specifically or challenge a wider range of traffic that may contain the bot.
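For example, a rule like the following (a minimal sketch - the user-agent fragment and path are placeholders, replace them with whatever pattern you actually see in your Firewall Events), with the action set to Challenge or Block:

```
(http.user_agent contains "some-bot-string") or (http.request.uri.path contains "/path-the-bot-hits")
```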
You could also try Bot Fight Mode, which is very unforgiving and may block the bot. Just be aware that there's no way to bypass it for specific traffic, so it's either on or off.
Thank you for your help. Bot Fight Mode is already enabled. I found the following entry in the Cloudflare firewall (NOT blocked):
“Access allowed, manage definite bots rule”
Does that mean that Cloudflare detected the bot and therefore presented a challenge? But in the end the bot was able to access our website and copied around 1000 pages last night.
Same as my approach.
I allow only Google to visit the robots.txt file and the sitemap (.xml) files.
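A sketch of how that can look as a Firewall Rule (assuming the `cf.client.bot` field, which is true for Cloudflare-verified crawlers like Googlebot), with the action set to Block:

```
(http.request.uri.path in {"/robots.txt" "/sitemap.xml"}) and not cf.client.bot
```

Note that `cf.client.bot` matches all verified bots, not only Google, so add a user-agent check on top if you really want Google only.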
On a Pro plan, under "Configure Super Bot Fight Mode" I have set "Definitely automated" to "Challenge". That way I managed to identify a few bots/crawlers, and daily it catches approx. 1000 requests. They mostly go to /rss or /feed and crawl each page from the links they find there.
I am not saying all crawlers go to /feed and that we should challenge every request which contains /feed - users can have an RSS reader app installed on a desktop PC, mobile phone, or some other device, and we would end up blocking them too - that's not good.
You might want to check which AS numbers the requests are coming from, and then block a few ASNs completely - that could be a good start, at least to try out. For example:
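Something like the following, with the action set to Block (the ASNs below are placeholders from the documentation-example range - put in the ones you actually see):

```
(ip.geoip.asnum in {64496 64511})
```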
In terms of Managed Rules on a Pro plan, I have enabled all of them under "Package: OWASP ModSecurity Core Rule Set", with the sensitivity set to "Medium" and the action set to "Challenge".
Nevertheless, I found out that requests coming over HTTP/1.0 are usually bots too, so blocking them helps:
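For example, with the action set to Block:

```
(http.request.version eq "HTTP/1.0")
```

Just check your Firewall Events first - a few legitimate old clients and health checks still use HTTP/1.0.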
You could set up some custom Firewall Rules as @domjh suggested, for example: if the user-agent of the incoming request contains "crawl", "feed", or "parser", then block:
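A sketch of such a rule (using `lower()` so the match is case-insensitive), with the action set to Block:

```
(lower(http.user_agent) contains "crawl") or (lower(http.user_agent) contains "feed") or (lower(http.user_agent) contains "parser")
```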
There are online tools like code.google.com/p/feedparser - is it good or not? Depends.
Others include crawlson.com, the Comscore crawler, and other user-agents like JetSlide, mojeek, rssapi, aiohttp, SimplePie, and CrowdTangle.
There are even some search engines which have user-agents like omgili.com.
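If you want to target specific ones from a list like that, you have to OR the substrings together, since `contains` takes a single string - for example (pick your own set):

```
(http.user_agent contains "JetSlide") or (http.user_agent contains "SimplePie") or (http.user_agent contains "CrowdTangle")
```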
You can even try to block Bingbot - just to see if any requests from it are actually hitting your server - or, at least, rely on the Managed Firewall Rules, which block the "fake bingbot" and other "fake" bots:
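If you want a custom rule for that yourself, a sketch would be: block anything claiming to be Bingbot that is not on Cloudflare's verified bot list:

```
(lower(http.user_agent) contains "bingbot") and not cf.client.bot
```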
Haven’t seen it yet. I think it could be a “good bot” like Yandex or some other one from the verified/good bot list (cf.client.bot), like facebookexternalhit (which can be abused for scraping / DDoS), so it was allowed? I am not sure.