Stopping Screen Scrapers

Within the last 24 hours, we have experienced a spike of about 50% in traffic to our site. I know this is unusual for us because I track human vs. bot traffic (with custom code and Google Analytics), and we’ve established our baseline traffic patterns.

I’m certain that it is bot/screen-scraping traffic, but I’m unable to prove it because they appear to be using many random IPs and common User-Agent strings.

Cloudflare doesn’t appear to be doing anything about it. What are my options for combating whoever is doing this? Aside from spiking our traffic volumes, they are scraping data off the site that they shouldn’t be.

What options do I have to stop this?

Thank you all!

What platform are you using? WordPress?

No, it’s entirely custom-built: PHP, Python, JS.

If your screen scrapers use fresh IPs and residential proxies, there’s nothing you or Cloudflare can do to stop them in the traditional sense.

Common techniques against screen scrapers are:

  • hide your sensitive information behind slightly complicated APIs [called using JavaScript]
  • require users to solve a reCAPTCHA to view sensitive information
  • require a login to view the sensitive information (see the sketch below)

All of these can be bypassed if the scrapers are targeting your website specifically (reCAPTCHA can be defeated with captcha-solving farms).

LinkedIn has been fighting this battle for a while, and their solution is to hide profile information from users with a low number of connections and require a login or sign-up if you come from Google.
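
For the login approach, here’s a minimal sketch assuming a Flask app; the route, the session check, and get_sensitive_rows() are hypothetical names, not anything Cloudflare-specific:

from flask import Flask, abort, jsonify, session

app = Flask(__name__)
app.secret_key = 'change-me'  # required for session cookies

def get_sensitive_rows():
    # Stand-in for a real database query (hypothetical).
    return [{'id': 1, 'value': 'example'}]

@app.route('/api/sensitive-data')
def sensitive_data():
    # Serve the data only to logged-in users; a scraper without a
    # valid session cookie gets a 403 instead of the content.
    if not session.get('user_id'):
        abort(403)
    return jsonify(get_sensitive_rows())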

I’ve implemented Server-Side Excludes; however, I don’t think they are actually being applied (I don’t know how to test it).

I’m debating implementing a captcha just to view content (as opposed to only when users submit a form). I’m hoping it costs a decent amount of money to run a captcha-solving farm, and I don’t think I’m THAT valuable. :slight_smile: This might be the way to go.
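
If I go that route, the server-side check looks small enough. A sketch assuming reCAPTCHA v2 and Python’s requests library (the siteverify endpoint is Google’s documented verification URL; the function name and placeholder key are just illustrative):

import requests

def captcha_passed(token, client_ip):
    # Verify the token the browser obtained from the reCAPTCHA widget
    # before rendering the protected content.
    resp = requests.post(
        'https://www.google.com/recaptcha/api/siteverify',
        data={
            'secret': 'YOUR_SECRET_KEY',  # placeholder reCAPTCHA secret
            'response': token,
            'remoteip': client_ip,
        },
    )
    return resp.json().get('success', False)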

I wish there was a more intelligent way to handle this via CF.

How effective is “Under Attack Mode” at stopping scraping? It seems fairly unobtrusive … thoughts on engaging that for extended periods of time?

Yeah, I’ve only seen SSE trigger when the trust factor for a client IP isn’t perfect but isn’t bad enough to be considered a bot (e.g., they have a hostile VPN extension like Hola that sells their connection as a residential proxy). If the scrapers are using really fresh IPs, then SSEs probably won’t trigger.

Under Attack Mode is pretty good at forcing your attackers to put more effort into the scraping. They’ll either have to run Node and parse the JavaScript CF gives them, or run something heavy like Chromium to continue scraping. You can control it via page rules (security level) if you want to use it only on your information pages and not your homepage or what have you, although there’s no issue with leaving it on globally for extended periods of time, or forever.

In the past, DigitalOcean (a CF Enterprise customer) has used Under Attack Mode on its login page to prevent automated sign-in attempts.
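
If you want to scope it like that, here’s a sketch of creating such a page rule through Cloudflare’s v4 API, using Python’s requests library; the zone ID, token, and URL pattern below are placeholders:

import requests

ZONE_ID = 'your-zone-id'      # placeholder
API_TOKEN = 'your-api-token'  # placeholder

# Create a page rule that raises the security level to "under attack"
# on the information pages only, leaving the rest of the site alone.
resp = requests.post(
    f'https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/pagerules',
    headers={'Authorization': f'Bearer {API_TOKEN}'},
    json={
        'targets': [{'target': 'url',
                     'constraint': {'operator': 'matches',
                                    'value': 'example.com/data/*'}}],
        'actions': [{'id': 'security_level', 'value': 'under_attack'}],
        'status': 'active',
    },
)
print(resp.json())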

That’s not true; UAM is just a JavaScript challenge that something as simple as a Node.js or Python module will solve and bypass easily. There’s nothing hard about using these modules, and they aren’t heavy either.

import cfscrape

# create_scraper() returns a requests.Session subclass that solves
# Cloudflare's JavaScript challenge automatically.
scraper = cfscrape.create_scraper()
scraper.get('https://website.com/')

If they aren’t residential IP addresses, then you can grab a list of ASNs (I’m not sure if Cloudflare has a built-in way to block cloud-based ISPs) and set a captcha challenge against those.

Scary that there is a package out there already …

@smalldoink - are you saying to somehow check if the incoming IP is residential and if not, block it? I don’t believe CF supports this natively, right?

I did mention that:

Maybe “run Node and parse the JavaScript” is giving CF too much credit, but the theory is the same: it’s another hurdle screen scrapers have to account for.

A firewall rule can block ASNs, although I couldn’t find a free list of ASNs used by cloud computing companies.

(ip.geoip.asnum in {16276 14061 20473 16509})

(the example ASNs are OVH, DigitalOcean, one Vultr DC, and AWS/Amazon)

However, at least based on the information so far, the scrapers might be purchasing and using residential proxies, so blocking cloud providers might have little effect.

I’d recommend sampling some IPs you think might be scrapers and running them through something like https://ipinfo.io to get the ASN. If it’s a cloud computing company, blocking that ASN might help.
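
A quick sketch of that sampling with Python’s requests library against ipinfo.io’s JSON endpoint (the suspect IPs below are documentation placeholders):

import requests

# IPs sampled from the access logs (placeholders from TEST-NET ranges).
suspect_ips = ['203.0.113.5', '198.51.100.7']

for ip in suspect_ips:
    # ipinfo.io returns an "org" field such as "AS16276 OVH SAS";
    # the leading token is the ASN you'd put in the firewall rule.
    info = requests.get(f'https://ipinfo.io/{ip}/json').json()
    print(ip, info.get('org', 'unknown'))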


Cloudflare doesn’t support this natively, but they could in the future to help prevent scraping. There are many databases out there (or they could build their own) that check whether an IP belongs to an ISP, a business, or a hosting provider. If the IP’s ASN is classified as hosting, they could issue a captcha challenge or a temporary block. I see this being easily done and quite effective.

There is a publicly available list of cloud computing ASNs. The list is here


Very helpful - thank you both!


In addition to the excellent suggestions above, Cloudflare has a Bot Management solution designed to deal with scrapers, credential stuffing, inventory hoarding, etc.


A link for the bot management product: https://www.cloudflare.com/products/bot-management/ :wink:

Thanks guys. It’s very cool but unfortunately only available on the Enterprise plan :frowning:

I’ve already enabled Bot Fight Mode, but honestly I don’t know if it’s doing anything.

Note that there are legitimate bots on cloud ASNs that check email links and cache/compress remote email images. I discovered this when sending to an email list, and I also saw cloud-ASN requests related to password-reset and welcome emails.
