Problems with web crawlers not respecting robots.txt file

Some months ago, Hostpapa technical support asked me to sign up for Cloudflare because some of the WordPress installations on my account were running their own cron jobs that consumed too many server resources. I corrected this issue. A few weeks ago Hostpapa contacted me again and, long story short, indicated that for a time recently I had been getting too much web-crawler traffic to the WordPress installations in my public_html/webs folder; they could not or would not identify which WP installations were at fault. I have set up a robots.txt file that specifically disallows web crawlers from crawling that folder, so I am at a loss as to how to prevent the excessive crawling. Is there any way to forcibly prevent it without resorting to the simple/stupid option of deleting my WordPress installations?

An example of the robots.txt file:
User-agent: *
Disallow: /webs/

webs is the folder in which the WP installations are present.

Thank you for your help!

I am sorry to hear that.
I also receive thousands of requests daily (from user agents such as bing, yandex, iar-something, admantex, dotbot, semrush, ahrefs, MJ12bot, seekport, python-something, petalbot, huawei something, blexbot, etc.), but I block most of them either by user agent or by AS number.
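
To illustrate the idea of blocking by user agent, here is a minimal Python sketch of the matching logic. It is not how Cloudflare implements it; the blocklist and the function name `is_blocked` are hypothetical, with the substrings borrowed from some of the crawler names listed above.

```python
# Hypothetical blocklist: substrings borrowed from crawler names mentioned above.
BLOCKED_AGENT_SUBSTRINGS = (
    "dotbot", "semrush", "ahrefs", "mj12bot",
    "seekport", "petalbot", "blexbot",
)


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a blocked substring."""
    ua = user_agent.lower()  # compare case-insensitively
    return any(token in ua for token in BLOCKED_AGENT_SUBSTRINGS)


print(is_blocked("Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"))  # True
print(is_blocked("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/115.0"))        # False
```

Keep in mind that a User-Agent header can be spoofed, which is one reason to block by AS number as well.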

Nevertheless, many crawlers do not respect what is written in the robots.txt file.
Furthermore, many of them go straight to the sitemap.xml or sitemap_index.xml file, if they can access it, and crawl the URLs listed there.
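
For context, robots.txt is only advisory: a well-behaved crawler checks it voluntarily before each request, and nothing on the server enforces it. The sketch below uses Python's standard `urllib.robotparser` to show what such a check looks like for the rules posted above. The `example.com` URLs, the `ExampleBot` agent name, and the helper `allowed` are placeholders for illustration only.

```python
from urllib.robotparser import RobotFileParser

# The rules from the robots.txt quoted in the question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /webs/
"""


def allowed(url: str, agent: str = "ExampleBot") -> bool:
    """Return True if a compliant crawler would consider this URL fetchable."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


print(allowed("https://example.com/webs/blog/"))  # False
print(allowed("https://example.com/index.html"))  # True
```

A crawler that never runs a check like this simply ignores the Disallow line, which is why blocking has to happen at the firewall or server level.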

Yes, I believe there is.
Kindly apply Firewall Rules as needed, using the articles below, which cover most of the “bad bots”, including the crawlers:

A list of ASNs to block (not all of them, but at least some) is here:

You could also create a Firewall Rule that allows the good bots, using the cf.client_bot field.

I went and added the 7G Firewall to my .htaccess file. Hopefully it will help. Thanks for the suggestions!
