Some months ago, Hostpapa technical support asked me to sign up for Cloudflare because some of the WordPress installations on my account were running their own cron jobs and eating up too many server resources. I corrected that issue. A few weeks ago Hostpapa contacted me again and, long story short, told me that for a time recently I was getting too much web-crawling traffic to the WordPress installations in my public_html/webs folder; they could not or would not identify which WP installations were at fault.

I have set up a robots.txt file that specifically disallows web crawlers from crawling that folder, so I am at a loss as to how to prevent the excessive crawling. Is there any way to forcibly prevent it, short of the simple/stupid option of deleting my WordPress installations?
An example of the robots.txt file:
User-agent: *
Disallow: /webs/
webs is the folder in which the WP installations are present.
I am sorry to hear that.
I also receive thousands of requests daily (from user agents like Bing, Yandex, iar-something, Admantex, DotBot, Semrush, Ahrefs, MJ12bot, Seekport, python-something, PetalBot, Huawei-something, BLEXBot, etc.), but I block most of those requests either by their user agent or by their AS number.
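For example, that kind of blocking can be done on the Cloudflare side with a Firewall Rule matching on the http.user_agent and ip.geoip.asnum fields, with the action set to Block. A rough sketch (the bot names and AS numbers here are only placeholders, swap in whatever actually shows up in your logs):
(http.user_agent contains "MJ12bot") or (http.user_agent contains "SemrushBot") or (http.user_agent contains "AhrefsBot") or (ip.geoip.asnum in {12345 67890})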
Nevertheless, many crawlers do not respect what is written in the robots.txt file.
Furthermore, many of them go straight for the sitemap.xml or sitemap_index.xml file and simply crawl the URLs listed there.
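One way to cut that off is a Firewall Rule that blocks or challenges requests for the sitemap files unless they come from a verified bot. A minimal sketch, assuming your sitemaps sit at the site root and you pair the expression with the Block or Managed Challenge action:
(http.request.uri.path contains "sitemap") and (not cf.client.bot)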
I believe so, yes.
Kindly apply Firewall Rules as needed using the articles below, which cover most of the known “bad bots”, including the aggressive crawlers:
A list of ASNs to block (not all of them, but at least some) is here:
You could also create a Firewall Rule to allow the good bots using the cf.client.bot field.
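A sketch of that allow rule, which you would place above the blocking rules (action set to Allow) so that verified crawlers such as Googlebot get through before anything else matches:
(cf.client.bot)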