Within the last 24 hours, we have experienced a spike of about 50% in traffic to our site. I know this is unusual for us because I track human vs. bot traffic (with custom code and Google Analytics) and we’ve established what our normal baseline traffic patterns are.
I’m certain that it is bot/screen scraping traffic, but I’m unable to prove it because it appears they are using many random IPs and common User Agent strings.
Cloudflare doesn’t appear to be doing anything about it. What are my options to combat whoever is doing this? Aside from spiking our traffic volumes, they are scraping data off the site that they shouldn’t be.
If you have screen scrapers that use fresh IPs and residential proxies, there’s nothing you or Cloudflare can do to stop them in a traditional sense.
Common techniques against screen scrapers are:
hide your sensitive information behind slightly complicated APIs [called using JavaScript]
require users to solve a reCAPTCHA to view sensitive information
require a login to view the sensitive information
That said, all of these can be bypassed if the screen scrapers are targeting your website specifically (reCAPTCHA, for example, can be bypassed with captcha-solving farms).
LinkedIn has been fighting this battle for a while, and their solution is to hide profile information from users with a low number of connections and to require a login or sign-up if you come from Google.
I’ve implemented Server-side Excludes; however, I do not think they are being triggered (I don’t know how to actually test them).
I’m debating implementing a captcha just to view content (as opposed to only when users submit a form). I’m hoping it costs a decent amount of money to use a captcha-solving farm, and I don’t think I’m THAT valuable. This might be the way to go.
I wish there was a more intelligent way to handle this via CF.
Yeah, I’ve only known SSE to trigger when the trust factor for a client IP isn’t perfect but isn’t bad enough for the client to really be considered a bot (e.g. perhaps they have a hostile VPN extension like Hola that sells your connection as a residential proxy). If they’re using really fresh IPs, then SSEs probably won’t trigger.
Under Attack Mode is pretty good at requiring your attackers put more effort into the scraping. They’ll either have to run node and parse the javascript CF gives them, or run something heavy like Chromium to continue scraping. You can control it via page rules (security level) if you want to only use it on your information pages and not your homepage or what have you, although there’s no issue with leaving it on globally for extended periods of time/forever.
In the past DigitalOcean (CF Enterprise customer) has used Under Attack Mode on their login page to prevent automated sign-in attempts.
That’s not true. UAM is just a JavaScript challenge that something simple like a Node.js or Python module can solve and bypass easily. There is nothing hard about using these modules, and they aren’t heavy either.
If the traffic isn’t coming from residential IP addresses, you can grab a list of ASNs for cloud-based ISPs and set a captcha challenge to block those (I’m not sure whether Cloudflare has a built-in way to block cloud-based ISPs).
@smalldoink - are you saying to somehow check if the incoming IP is residential and if not, block it? I don’t believe CF supports this natively, right?
Maybe “run node and parse the javascript” is giving CF too much credit but it’s the same theory of it being another hurdle screen scrapers have to account for.
A firewall rule can block ASNs, although I couldn’t find a free list of ASNs used by cloud computing companies.
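As a rough sketch of what such a rule could look like, the snippet below builds a firewall rule expression that matches a set of ASNs. It assumes the ASN field is named `ip.geoip.asnum` (newer Cloudflare documentation may call it `ip.src.asnum`, so check the current field reference), the helper name `asn_challenge_expression` is made up for illustration, and the ASNs are placeholders from the range reserved for documentation, not real cloud providers.

```python
def asn_challenge_expression(asns):
    """Build a rule expression that matches requests from any of the given ASNs.

    Assumption: the ASN field is `ip.geoip.asnum`; verify the name against
    the current Cloudflare rules documentation before using it.
    """
    unique = sorted({int(a) for a in asns})  # de-duplicate and normalise to ints
    if not unique:
        raise ValueError("need at least one ASN")
    return "(ip.geoip.asnum in {" + " ".join(str(a) for a in unique) + "})"


# Placeholder ASNs (reserved for documentation), including one duplicate.
expression = asn_challenge_expression([64496, 64511, 64496])
print(expression)  # (ip.geoip.asnum in {64496 64511})
```

You would paste the resulting expression into a firewall rule and pick Challenge rather than Block as the action if you are worried about false positives.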
However, at least based on the earlier posts, the scrapers might be purchasing and using residential proxies, so blocking cloud providers might have little effect.
I’d recommend sampling some IPs you think might be scrapers and running them through something like https://ipinfo.io to get the ASN. If it’s a cloud computing company, blocking that ASN might help.
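A minimal sketch of that sampling step, assuming ipinfo.io’s JSON endpoint at `https://ipinfo.io/<ip>/json` returns an `org` field of the form `AS<number> <name>`. The function names are hypothetical, and unauthenticated lookups are rate limited, so keep the sample small or add your own API token.

```python
import json
import re
from collections import Counter
from urllib.request import urlopen


def lookup_org(ip):
    """Fetch the 'org' field for an IP from ipinfo.io (assumed response shape)."""
    with urlopen(f"https://ipinfo.io/{ip}/json", timeout=10) as resp:
        return json.load(resp).get("org", "")


def parse_asn(org):
    """Split 'AS64496 Example Hosting' into (64496, 'Example Hosting')."""
    match = re.match(r"AS(\d+)\s*(.*)", org or "")
    if not match:
        return None, org or ""
    return int(match.group(1)), match.group(2)


def tally_asns(ips, lookup=lookup_org):
    """Count how many distinct sampled IPs belong to each ASN, most common first."""
    counts = Counter()
    for ip in set(ips):  # look each IP up only once
        counts[parse_asn(lookup(ip))] += 1
    return counts.most_common()
```

Calling `tally_asns(suspect_ips)` on a list of suspect IPs pulled from your logs gives you `((asn, name), count)` pairs. If one or two hosting ASNs dominate the list, those are the candidates for a challenge rule; if the list is spread thinly across consumer ISPs, you are probably looking at residential proxies.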
Cloudflare doesn’t support this natively, but they could add it to help prevent scraping in the future. There are many databases out there (or they could build their own) that classify whether an IP belongs to an ISP, a business, or a hosting provider. If the IP’s ASN is classified as hosting, they could send a captcha challenge or temporarily block it. I see this being easy to do and quite effective.
In addition to the excellent suggestions in this thread, Cloudflare has a Bot Management solution designed to deal with scrapers, credential stuffing, inventory hoarding, etc.
Note that there are legitimate bots on cloud ASNs that check email links and cache or compress remote email images. I discovered this when sending to an email list, and I also found cloud ASN requests related to password reset and welcome emails.