I'm trying to use Browser Integrity Check to stop a bot that scrapes our pages. I have enabled Browser Integrity Check on the page rule, but if I curl the page I still get the complete HTML.
And the bot still continues to access the pages. (I'm blocking it by IP for now, but I temporarily disabled the block to check whether Browser Integrity Check stops it.)
Cloudflare’s Browser Integrity Check (BIC) is similar to Bad Behavior: it looks for common HTTP headers most frequently abused by spammers and denies access to your page. It will also challenge visitors that do not have a user agent, or that have a non-standard user agent (also commonly used by abusive bots, crawlers, or visitors).
The sad truth is: there is no really easy way of stopping scrapers. You can add rate limiting and other tricks, like hidden URLs that only scrapers will access, and block every IP that visits them (except good bots).
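To make the hidden-URL ("honeypot") trick concrete, here's a rough stdlib sketch of the idea. Everything here is illustrative — the path, the bot list, and the helper names are made up, and this is application-side logic, not a Cloudflare feature:

```python
# Honeypot sketch: real browsers never follow the hidden link, so any IP
# that requests it is almost certainly a scraper. All names are hypothetical.

HONEYPOT_PATH = "/internal/do-not-follow"   # linked invisibly in the page HTML
GOOD_BOT_UAS = ("Googlebot", "bingbot")     # crawlers we never want to block

blocked_ips = set()

def check_request(ip: str, path: str, user_agent: str) -> bool:
    """Return True if the request should be served, False if blocked."""
    if ip in blocked_ips:
        return False
    if path == HONEYPOT_PATH and not any(b in user_agent for b in GOOD_BOT_UAS):
        blocked_ips.add(ip)      # anyone hitting the trap gets blocklisted
        return False
    return True
```

In practice you'd feed the collected IPs into a firewall rule or IP access rule rather than checking them in application code.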
Thank you for your reply @dmz , I read that documentation and thought that if I tried to curl the page URL, it would return a Cloudflare block page, but I'm still able to scrape the HTML.
Yes @boynet2 that's really sad indeed, thanks for the suggestion though. In our case I think the bot is configured to scrape our images, so the hidden-link trick won't do it — maybe rate limiting, though.
Does Cloudflare provide this kind of feature?
I don’t believe Browser Integrity Check will be able to handle the problem you are describing, so I mentioned the firewall and user-agent rules. Have you tested any of these features?
If you can provide more information about this bot behavior, maybe we can help you better:
Does the User Agent vary?
Is the IP/Country always the same?
Cloudflare offers a Rate Limiting solution, however if you can deal with the issue using the features I mentioned, you’ll save some money (since Rate Limiting charges are based on usage).
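For context, rate limiting counts requests per client against a threshold over a time window, scoped to a URL pattern. A rough stdlib sketch of the concept — this is only an illustration of the idea, not Cloudflare's implementation, and the pattern/threshold values are made up:

```python
# Conceptual sketch of URL-pattern-scoped rate limiting: allow at most
# THRESHOLD requests per client IP to URLs matching PATTERN within PERIOD
# seconds. Values are illustrative only.
import fnmatch
from collections import defaultdict, deque

PATTERN = "/image/*"     # the URL pattern discussed in this thread
THRESHOLD = 10           # max matching requests per IP...
PERIOD = 60.0            # ...within this many seconds

hits = defaultdict(deque)  # ip -> timestamps of recent matching requests

def allow(ip: str, path: str, now: float) -> bool:
    """Return True if the request is under the limit, False if rate-limited."""
    if not fnmatch.fnmatch(path, PATTERN):
        return True                  # non-matching URLs are never limited
    window = hits[ip]
    while window and now - window[0] > PERIOD:
        window.popleft()             # drop hits outside the sliding window
    if len(window) >= THRESHOLD:
        return False
    window.append(now)
    return True
```

The key point is that the counter only advances for requests matching the pattern, so normal pages stay unaffected.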
I’m able to block the bot by IP and so far it worked.
The user-agent used was a “legit” Mozilla browser user agent.
“Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0”
so it does a decent job of trying to look like a legit request.
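This is exactly why header-based checks struggle here: a scraper can send byte-for-byte the same User-Agent a real Firefox sends. A minimal sketch (the URL is hypothetical, and no request is actually sent — we just build it and inspect the header):

```python
# A scraper trivially spoofing the exact Firefox UA quoted above.
# Sketch only: the request is constructed and inspected, never sent.
import urllib.request

FIREFOX_UA = ("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) "
              "Gecko/20100101 Firefox/66.0")

req = urllib.request.Request(
    "https://example.com/image/123",        # hypothetical URL pattern
    headers={
        "User-Agent": FIREFOX_UA,
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.5",
    },
)
# urllib stores header keys capitalized, hence "User-agent" here
print(req.get_header("User-agent"))
```

From the server's point of view the headers are indistinguishable from a real browser's, so only behavioral signals (rate, honeypots, JS challenges) can tell the two apart.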
The bot hits our images page, following the URL pattern /image/id.
After we blocked its IP a few hours ago, it changed its behavior and tried to hit our image CDN directly using a Python user agent… but it's still using the same IP, and the CDN is also behind Cloudflare, so it is still blocked.
However, since it looks like the bot is custom-built and targets our website specifically, I think it's only a matter of time before the attacker switches to a serverless architecture.
I’ve looked into the rate limiting and I think it will be the solution if it comes to that…
Do you know if we can limit that feature to a specific URL pattern?