Firewall rule only partially filtering out bot traffic

We’ve got some trouble with bot traffic (likely scraping) on a website that’s hosted on a fairly small server. The bot requests

  • a large set of changing pages
  • at irregular intervals
  • on the order of ~1x/h
  • from various IPs
  • around the clock
  • all at once, i.e. some 200 base pages in 5 s (see the sketch below)
  • and no assets.
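The burst in particular is easy to see in the access log. A rough sketch of the kind of check I mean (assuming nginx’s default “combined” log format, where the timestamp sits in square brackets):

    # Sketch: show the busiest seconds in the access log.
    # Field 2 after splitting on [ and ] is the timestamp; cutting at the
    # first space drops the timezone, leaving per-second resolution.
    awk -F'[][]' '{print $2}' access.log |
      cut -d' ' -f1 | sort | uniq -c | sort -rn | head

With ~200 pages hit in 5 s, the top entries stand out immediately.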

So it’s definitely bot traffic. Caching has proven useless (the pages keep changing), and we switched to Cloudflare in order to be able to soft-block the bot.

Currently, it always makes requests with this user agent:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36

That’s useful because our Linux userbase is otherwise very small. So I first created a firewall rule matching this exact user agent with “equals” and presenting it with a captcha. That caught some 100 requests in 24 h while some 5,000 still got through. I therefore changed the filter to just “contains”

Mozilla/5.0 (X11; Linux x86_64)

which catches ~1k requests per day while ~4k still get through.
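For reference, the “contains” version in Cloudflare’s rule expression language looks roughly like this (a sketch; the action is set to present a challenge):

    http.user_agent contains "Mozilla/5.0 (X11; Linux x86_64)"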

What’s happening here? It’s all the exact same user agent (as far as is visible to me/nginx). All the requests do have Cloudflare IPs in the X-Forwarded-For header shown in the access logs; I evaluated that with a bash script, so they’re not circumventing CF somehow. (There are a few IPs for which Whois doesn’t return OrgName: Cloudflare, but I see legitimate requests coming from those same IPs, which very likely means the Whois data just isn’t completely accurate. It’s a small share overall anyway.)
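Roughly, the check was along these lines (a sketch, not my exact script; it assumes the address to verify is the first log field, so adjust the awk field for your format, and that grepcidr is installed):

    # Sketch: print addresses from the log that are NOT in
    # Cloudflare's published IPv4 ranges.
    curl -s https://www.cloudflare.com/ips-v4 -o cf-ranges.txt
    awk '{print $1}' access.log | sort -u | grepcidr -v -f cf-ranges.txt

An empty result means everything came through Cloudflare (for IPv4; there is an analogous list at https://www.cloudflare.com/ips-v6).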

Before I start speculating about how they could be working around the captcha: does anyone have a simple explanation and solution for the problem?
