Which is the best option to fight against web scraping?


I’d like to prevent web scrapers from stealing content from my site. Before I started using Cloudflare, I had a PHP script that checked whether an IP address belonged to a known search bot. If not, it checked whether that IP address was crawling my pages at a high rate (e.g. 100 pages per minute). If so, it was banned.

After moving to Cloudflare, I stopped using that PHP script since, obviously, it was only seeing Cloudflare IP addresses, and I could no longer distinguish between a legitimate Googlebot visit and a malicious scraping script.

Now I wonder what the best way is to protect my website against these scraping visits. I’ve seen that I can configure the Firewall options (Pro Plan) with a ‘Security Level’ and a ‘Bot Fight Mode’, and I wonder whether anyone has experience with the best settings.

Thank you very much.

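Just restore the original visitor IP addresses on your server. Cloudflare passes the visitor's IP in the CF-Connecting-IP request header, so your PHP script can keep working with the real addresses.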
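
As a rough sketch of that idea (in Python rather than PHP, and not anyone's actual script): read the visitor IP from the CF-Connecting-IP header only when the connection really comes from a trusted proxy, then apply a per-IP request limit like the one described in the question. The names `TRUSTED_PROXIES`, `client_ip` and `RateLimiter` are made up for illustration, the proxy range is a documentation network standing in for the ranges Cloudflare publishes, and `headers` is assumed to be a plain dict with that exact key.

```python
import ipaddress
import time
from collections import defaultdict, deque

# Placeholder range for illustration only (a documentation network).
# A real setup would load the proxy ranges that Cloudflare publishes.
TRUSTED_PROXIES = [ipaddress.ip_network("203.0.113.0/24")]


def client_ip(remote_addr, headers):
    """Return the visitor IP, trusting CF-Connecting-IP only from a known proxy."""
    peer = ipaddress.ip_address(remote_addr)
    if any(peer in net for net in TRUSTED_PROXIES):
        forwarded = headers.get("CF-Connecting-IP", "").strip()
        try:
            return str(ipaddress.ip_address(forwarded))
        except ValueError:
            pass  # missing or malformed header: fall back to the peer address
    return remote_addr


class RateLimiter:
    """Sliding-window counter: at most `limit` requests per `window` seconds per IP."""

    def __init__(self, limit=100, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        recent = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        recent.append(now)
        return len(recent) <= self.limit
```

Checking the peer address first matters: if the header were trusted from any connection, a scraper hitting the origin directly could fake its IP.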


For example, you could use firewall rules and challenge visitors based on their threat score:

(cf.threat_score gt XXX)

With “Security Level” set to Medium and “Bot Fight Mode” enabled, I noticed a sharp decline in the more aggressive bots.

If your raw server logs reveal the user-agent of a scraper that is slipping through, you can create a Cloudflare User Agent Blocking rule to block it. Some scrapers are kinda dumb/lazy and don’t bother cloaking their activities or changing their user-agent.
You can also add a bot-specific blocking firewall rule like (cf.client.bot and cf.threat_score gt 10): https://developers.cloudflare.com/firewall/known-issues-and-faq/#how-does-firewall-rules-handle-traffic-from-known-bots
I did the opposite and set a rule allowing known “good” bots as the first rule on the list:
(cf.client.bot and cf.threat_score lt 10)
Then, after 3500 entries in 24 hours, I reviewed which bots had a low score and the AS numbers where they were hosted. I then started crafting rules for the misbehaving bots that Cloudflare flagged as “good” with a low score.
My plan was to experiment with lowering the score until Sogou and other rude bots were filtered out. I grew tired of reviewing logs and simply blocked the AS numbers and the identified bots that were generating useless traffic.
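
If reviewing those entries by hand gets tedious, a small script can do the tallying. This is only a sketch under assumptions: it supposes the firewall events have been exported as a list of dicts, and the field names (`asn`, `user_agent`, `known_bot`, `threat_score`) are invented for the example, not Cloudflare's actual export format.

```python
from collections import Counter


def top_offenders(events, max_score=10, top=5):
    """Count requests per (ASN, user agent) among known bots with a low threat score.

    `events` is assumed to be an iterable of dicts with the hypothetical keys
    "asn", "user_agent", "known_bot" and "threat_score".
    """
    counts = Counter(
        (event["asn"], event["user_agent"])
        for event in events
        if event["known_bot"] and event["threat_score"] < max_score
    )
    # Busiest (ASN, user agent) pairs first.
    return counts.most_common(top)
```

The busiest pairs in the result are the candidates for an AS number or user-agent rule.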

Sadly, the Cloudflare User Agent Blocking configuration is only designed to block a very specific user-agent you have found in your logs. It does not accept wildcards to block anything with “python” or “Java” in it. You have to copy and paste the entire user-agent text.
With custom firewall rules, however, you can test expressions like (http.user_agent contains "wget") or rules with regex: https://developers.cloudflare.com/firewall/cf-firewall-rules/fields-and-expressions/
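
Before turning a substring into a firewall rule, it can help to preview what it would catch among the user-agents already in your logs. The sketch below is a local dry run, not the Cloudflare rule engine; `would_match` and `preview_block` are hypothetical helper names, and the case-insensitive comparison is a choice made for this sketch that may not mirror how the firewall expression behaves.

```python
def would_match(user_agent, fragments):
    """True if any fragment occurs in the user agent (case-insensitive in this sketch)."""
    ua = user_agent.lower()
    return any(fragment.lower() in ua for fragment in fragments)


def preview_block(user_agents, fragments):
    """Split logged user agents into those a substring rule would catch and the rest."""
    blocked = [ua for ua in user_agents if would_match(ua, fragments)]
    allowed = [ua for ua in user_agents if not would_match(ua, fragments)]
    return blocked, allowed
```

For example, calling `preview_block(logged_user_agents, ["wget", "python", "java"])` shows at a glance whether a legitimate crawler would be caught by accident. Short fragments deserve extra care: "java" also matches any user-agent that mentions "JavaScript".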
