Which is the best option to fight against web scraping?

Hi,

I’d like to prevent web scrapers from stealing content from my site. Before I started using Cloudflare, I had a PHP script that checked whether an IP address belonged to a known search bot. If not, it checked whether that IP address was crawling my pages at a high rate (e.g. 100 pages per minute). If so, the address was banned.
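
For readers who want to see the kind of check described above, here is a minimal sketch of a sliding-window rate limit. It is written in Python rather than PHP purely for illustration, and the class name, parameters, and in-memory storage are all assumptions, not the original script.

```python
import time
from collections import defaultdict, deque


class CrawlRateLimiter:
    """Hypothetical helper: bans an IP that exceeds max_hits requests per window."""

    def __init__(self, max_hits=100, window=60.0, known_bots=()):
        self.max_hits = max_hits            # e.g. 100 pages ...
        self.window = window                # ... per 60 seconds
        self.known_bots = set(known_bots)   # IPs already verified as search bots
        self.hits = defaultdict(deque)      # ip -> timestamps of recent requests
        self.banned = set()

    def allow(self, ip, now=None):
        """Record one request from ip and return False if it is (or becomes) banned."""
        now = time.monotonic() if now is None else now
        if ip in self.known_bots:
            return True                     # known search bots are never rate-limited
        if ip in self.banned:
            return False
        recent = self.hits[ip]
        recent.append(now)
        # Forget requests that have fallen out of the time window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) > self.max_hits:
            self.banned.add(ip)             # the ban is permanent in this sketch
            return False
        return True
```

A real deployment would keep the counters in shared storage (a database or cache) instead of process memory, and would probably expire bans after some time.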

After moving to Cloudflare, I disabled that PHP script since, obviously, it was only seeing Cloudflare IP addresses, so it could not distinguish between a legitimate Googlebot visit and a malicious scraping script.

Now I wonder what the best way is to protect my website against these scraping visits. I’ve seen that I can configure the Firewall options (Pro Plan) with a ‘Security Level’ and a ‘Bot Fight Mode’, and I’d like to hear about others’ experiences with which settings work best.

Thank you very much.

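Just restore the original visitor IP addresses on your server, so that your script sees the real client IP instead of Cloudflare’s: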

https://support.cloudflare.com/hc/en-us/sections/200805497-Restoring-Visitor-IPs
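
As a rough illustration of what restoring the visitor IP means at the application level: Cloudflare passes the original client address in the CF-Connecting-IP request header, and that header should only be trusted when the connection really comes from a Cloudflare address. The sketch below is in Python and assumes a hypothetical `visitor_ip` helper, a plain dict of headers, and a caller-supplied list of trusted networks (the real ranges are published by Cloudflare and are not hard-coded here).

```python
import ipaddress


def visitor_ip(remote_addr, headers, trusted_proxies):
    """Return the real client IP, trusting CF-Connecting-IP only from trusted proxies.

    remote_addr     -- address of the TCP peer, as seen by the web server
    headers         -- dict of request headers (exact key "CF-Connecting-IP" assumed)
    trusted_proxies -- list of ipaddress networks that belong to Cloudflare
    """
    peer = ipaddress.ip_address(remote_addr)
    if any(peer in net for net in trusted_proxies):
        candidate = headers.get("CF-Connecting-IP", "").strip()
        try:
            # Validate the header so a malformed value is never used as an IP.
            return str(ipaddress.ip_address(candidate))
        except ValueError:
            pass
    # Direct connection, or a missing/invalid header: fall back to the peer address.
    return str(peer)


# Placeholder range for the example only; load the real Cloudflare ranges instead.
proxies = [ipaddress.ip_network("198.51.100.0/24")]
print(visitor_ip("198.51.100.7", {"CF-Connecting-IP": "203.0.113.9"}, proxies))
```

With the real address restored this way (or by the web server modules described in the linked article), an existing per-IP rate check can keep working behind Cloudflare.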

For example, you could use firewall rules to challenge visitors based on their threat score:

(cf.threat_score gt XXX)

With “Security Level” set to Medium and “Bot Fight Mode” enabled, I noticed a sharp decline in the more aggressive bots.

If your raw server logs reveal the user-agent of a scraper that is slipping through, you can create a Cloudflare User Agent Blocking rule in the firewall to block it. Some scrapers are kinda dumb/lazy and don’t bother cloaking their activities or changing their user-agent.
You can also add a bot-specific blocking firewall rule such as (cf.client.bot and cf.threat_score gt 10); see https://developers.cloudflare.com/firewall/known-issues-and-faq/#how-does-firewall-rules-handle-traffic-from-known-bots
I did the opposite: the first rule on my list allows known “good” bots with a low threat score,
(cf.client.bot and cf.threat_score lt 10).
Then, after 3,500 log entries in 24 hours, I reviewed which bots had a low score and the AS number where they were hosted. I then started crafting rules for the misbehaving bots that Cloudflare had flagged as “good” with a low score.
My plan was to experiment with lowering the score threshold until Sogou and other rude bots were filtered out. In the end I grew tired of reviewing logs and simply blocked the AS numbers and the identified bots that were generating useless traffic.
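
If reviewing the firewall log by hand gets tedious, the grouping step can be scripted. The following Python sketch assumes the events have already been exported into a list of dicts with hypothetical "asn" and "user_agent" keys; that format is an assumption for illustration, not Cloudflare's actual export schema.

```python
from collections import Counter


def top_offenders(events, min_hits=2):
    """Count events per (AS number, user-agent) pair, most frequent first.

    events   -- iterable of dicts with assumed keys "asn" and "user_agent"
    min_hits -- pairs seen fewer times than this are left out
    """
    counts = Counter((e["asn"], e["user_agent"]) for e in events)
    return [(pair, n) for pair, n in counts.most_common() if n >= min_hits]


# Made-up sample data using documentation-reserved AS numbers.
sample = [
    {"asn": 64500, "user_agent": "ExampleBot/1.0"},
    {"asn": 64500, "user_agent": "ExampleBot/1.0"},
    {"asn": 64500, "user_agent": "ExampleBot/1.0"},
    {"asn": 64501, "user_agent": "OtherBot/2.0"},
]
for (asn, agent), hits in top_offenders(sample):
    print(asn, agent, hits)
```

The pairs at the top of the list are the natural candidates for an AS number or user-agent rule.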

Sadly, the Cloudflare User Agent Blocking configuration is only designed to block one specific User-Agent that you have found in your logs. It does not accept wildcards, so you cannot block everything with “python” or “Java” in it; you have to copy and paste the entire user-agent string.
With custom firewall rules, however, you can experiment with expressions like (http.user_agent contains "wget") or with regex-based rules; see https://developers.cloudflare.com/firewall/cf-firewall-rules/fields-and-expressions/
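
Before deploying a substring rule, it can help to preview which user-agents from your own logs it would catch, so that legitimate visitors are not blocked by accident. This is a minimal Python sketch with a hypothetical `preview_block` helper; it compares case-insensitively, which may differ from how a given firewall expression behaves, so treat the result as a rough preview only.

```python
def preview_block(user_agents, needles):
    """Return the user-agents that contain any of the given substrings.

    The comparison is case-insensitive (an assumption of this sketch).
    """
    lowered = [n.lower() for n in needles]
    return [ua for ua in user_agents if any(n in ua.lower() for n in lowered)]


# Sample user-agent strings as they might appear in a server log.
seen = [
    "Wget/1.20.3 (linux-gnu)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "python-requests/2.25.1",
    "Java/1.8.0_201",
]
for ua in preview_block(seen, ["wget", "python", "java"]):
    print("would match:", ua)
```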

