Same as my approach.
I allow only Google to visit the robots.txt file and the sitemap (.xml) files.
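If it helps, a custom Firewall Rule roughly along these lines should cover that - this is only a sketch, so adjust the sitemap path check and the Google match to your own setup:

```
(http.request.uri.path eq "/robots.txt" or http.request.uri.path contains "sitemap")
and not (cf.client.bot and http.user_agent contains "Google")
```

with the action set to Block (or Challenge, if you want to be more careful at first).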
On a Pro plan, under "Configure Super Bot Fight Mode" I have set "Definitely automated" to "Challenge". That way I managed to identify a few bots/crawlers, and it clears out roughly 1,000 requests daily. They mostly go to /rss or /feed, and then crawl every page from the links they find there.
I am not saying all crawlers go to /feed and that we should challenge every request containing /feed - users may have an RSS reader app installed on a desktop PC, mobile phone, or some other device, and I would end up blocking them too - that's not good.
You might want to check which AS numbers the requests are coming from and block a few ASNs completely - that could be a good start, at least to try out.
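If you spot a few noisy ASNs in the Firewall Events, a rule like the one below with the action set to Block takes care of them - the ASN values here are just placeholders, put in the ones you actually see:

```
(ip.geoip.asnum in {64496 64511})
```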
In terms of Managed Rules on a Pro plan, I have enabled all of the rules under "Package: OWASP ModSecurity Core Rule Set", with the sensitivity set to "Medium" and the action set to "Challenge".
Nevertheless, I have found that requests coming in over HTTP/1.0 are usually from bots too, so blocking those helps.
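A sketch of such a rule (note that a few legitimate old clients and monitoring tools still use HTTP/1.0, so you may prefer Challenge over Block as the action):

```
(http.request.version eq "HTTP/1.0")
```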
You could set up some custom Firewall Rules as @domjh suggested, e.g. if the user-agent of the incoming request contains "crawl", "feed", or "parser", then block it.
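As a sketch, such a rule with the action set to Block could look like this - keep in mind that `contains` is case-sensitive, so you may want to add capitalised variants of the keywords as well:

```
(http.user_agent contains "crawl") or (http.user_agent contains "feed") or (http.user_agent contains "parser")
```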
There are online tools like code.google.com/p/feedparser - is it good or not? It depends.
There are others like crawlson.com, the Comscore crawler, and other user-agents like JetSlide, mojeek, rssapi, aiohttp, SimplePie, and CrowdTangle.
There are even some search engines with user-agents like omgili.com.
You can even try to block Bingbot - just to see if any requests from it are actually hitting your server - or, at the very least, the Managed Firewall Rules will block the "fake bingbot" and other "fake" bots for you.
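For the "fake bingbot" part, a sketch of a rule (action Block) that catches requests claiming to be Bingbot but not on Cloudflare's verified bot list:

```
(http.user_agent contains "bingbot" and not cf.client.bot)
```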
Haven’t seen it yet. I think it could be a “good bot” like Yandex or some other from the verified good bot list (cf.client.bot), like Facebook externalhit (which can be abused for scraping / DDoS), so it was allowed? I am not sure.