We’ve got some trouble with bot traffic (likely scraping) on a website that’s hosted on a fairly small server. The bot requests
- a large, changing set of pages
- at irregular intervals, on the order of once per hour
- from various IPs
- around the clock
- all at once, i.e. some 200 base pages within 5 s
- with no assets.
So it’s definitely bot traffic. Caching has proven useless, so we put the site behind Cloudflare to be able to soft-block the bot.
Currently, it always makes requests with this user agent:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36
That’s useful because our Linux user base is otherwise very small. So I first created a firewall rule matching (“equals”) this exact user agent and presenting a captcha. That caught some 100 requests in 24 h while some 5,000 still got through. I therefore relaxed the filter to just “contains”
Mozilla/5.0 (X11; Linux x86_64)
which catches ~1k per day while ~4k still get through.
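For concreteness, in Cloudflare’s rule-expression syntax the two filters were roughly the following (first the exact match, then the relaxed substring match; `http.user_agent` is Cloudflare’s field for the request’s User-Agent header, and the action in both cases was the captcha/challenge):

```
http.user_agent eq "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"

http.user_agent contains "Mozilla/5.0 (X11; Linux x86_64)"
```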
What’s happening here? It’s all the exact same user agent (as far as it’s visible to me / to nginx). All the requests do have Cloudflare IPs in the X-Forwarded-For shown in the access logs; I verified that with a bash script, so they’re not somehow circumventing Cloudflare. (For a few IPs, whois doesn’t return OrgName: Cloudflare, but I also see legitimate requests coming from those same IPs, which very likely means the whois data is simply inaccurate. And that’s a small share overall.)
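In case it helps, a minimal version of that kind of log check looks like the sketch below. It assumes nginx’s combined log format with "$http_x_forwarded_for" appended as one extra quoted field; the field numbers are an assumption and will differ for other log_format definitions.

```shell
# Tally user agents and forwarded client IPs in an nginx access log.
# Assumption: combined log format plus "$http_x_forwarded_for" appended
# as one more double-quoted field -- adjust the field numbers otherwise.

# Usage: top_user_agents /var/log/nginx/access.log
top_user_agents() {
  # With -F'"', the user agent is the 6th field in the combined format.
  awk -F'"' '{print $6}' "$1" | sort | uniq -c | sort -rn | head
}

# Usage: top_client_ips /var/log/nginx/access.log
top_client_ips() {
  # X-Forwarded-For can be a comma-separated list; the first entry
  # is the original client, later entries are intermediate proxies.
  awk -F'"' '{split($8, xff, ", "); print xff[1]}' "$1" | sort | uniq -c | sort -rn | head
}
```

Cross-referencing the resulting IP list against Cloudflare’s published ranges is what told me the traffic really does arrive via Cloudflare.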
Before I start speculating about how they could be working around the captcha: does anyone have a simple explanation (and solution) for this?