Problem with automated website scraper

Hi,
On our website we have a problem with automated bots scraping our content / data.

What we’ve done so far:

  • Sitemaps are hidden and only registered with web services that are important to us, like Google or Bing
  • The URL design of our site no longer contains numbers (e.g. this-page-2, this-page-3), so the URLs of other (new) pages cannot be guessed

New:

  • Cloudflare Pro with rate limiting

Do you think this is enough, or can we do something else to prevent unwanted bots from scraping our site?

Thanks

Hi @user42010,

If there is any pattern in the bot’s requests, you may be able to create a custom firewall rule to block it specifically or challenge a wider range of traffic that may contain the bot.
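
For example, if the scraper identifies itself, a rule expression like the one below with the Block action would stop it. This is only a sketch and "SomeScraperBot/1.0" is a placeholder - substitute whatever user agent or request pattern you actually see in your Firewall Events:

    (http.user_agent eq "SomeScraperBot/1.0")

For the wider net, switch the action to Challenge and broaden the expression, e.g. match a path the bot keeps hitting and add "and not cf.client.bot" so Cloudflare's verified good bots are left alone.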

You could also try Bot Fight Mode, which is very unforgiving and may block the bot. Just be aware that there's no way to bypass it for specific traffic, so it's either on or off.


Hi @domjh

thank you for your help. Bot Fight Mode is already enabled. I found the following entry in the Cloudflare firewall (NOT blocked):

“Access allowed, manage definite bots rule”

Does that mean that Cloudflare detected the bot and therefore showed a challenge? But in the end the bot could still access our website and copied around 1,000 pages last night.

I’m not too sure, perhaps another MVP who has used Bot Fight Mode more can confirm.

That's the same as my approach.
I allow only Google to visit the robots.txt file and the sitemap (.xml) files.
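
In Cloudflare terms that can be expressed as a firewall rule roughly like the one below, with the action set to Block. It is only a sketch - adjust "/sitemap.xml" to your actual sitemap path(s). cf.client.bot is true for Cloudflare's verified good bots, and the extra user agent check narrows the exception down to Google:

    (http.request.uri.path in {"/robots.txt" "/sitemap.xml"}) and not (cf.client.bot and http.user_agent contains "Googlebot")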

On the Pro plan, under "Configure Super Bot Fight Mode" I have set "Definitely automated" to "Challenge". That way I managed to identify a few bots/crawlers, and it filters out roughly 1,000 requests per day. They mostly go to /rss or /feed and then crawl every page linked from it.

I am not saying that all crawlers go to /feed and that we should challenge every request containing /feed - users may have an RSS reader app on a desktop PC, mobile phone, or some other device, and a rule like that would block them too - that's not good.
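
For reference, that blanket rule would be something like the expression below with a Challenge action - and the drawback is exactly the one above: a challenge page is nothing a desktop or mobile RSS reader app can solve, so legitimate subscribers would be cut off together with the crawlers ("not cf.client.bot" only spares verified bots, not ordinary reader apps):

    (http.request.uri.path contains "/feed" or http.request.uri.path contains "/rss") and not cf.client.bot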

You might want to check which AS numbers the requests are coming from, and then block a few ASNs completely - that could be a good start, at least to try out.
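
Something like the expression below with a Block action does that. The AS numbers here are placeholders (64496 and 64497 are from the documentation-reserved range), so first look up the real ones in your Firewall Events or with a whois lookup on the offending IPs:

    (ip.geoip.asnum in {64496 64497})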

As for Managed Rules on the Pro plan, I have enabled all of them under "Package: OWASP ModSecurity Core Rule Set", with "Medium" selected for the sensitivity and "Challenge" for the action.

Nevertheless, I found that requests coming in over HTTP/1.0 are usually bot traffic as well, so blocking those helps too.
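
That rule is a one-liner; I would start it on Challenge rather than Block, since a few old monitoring tools and proxies still speak HTTP/1.0:

    (http.request.version eq "HTTP/1.0")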

You could set up some custom Firewall Rules, as @domjh suggested - for example, if the incoming request's user agent contains "crawl", "feed", or "parser", then block it.
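
A sketch of that rule with a Block action is below. Note that "contains" is case-sensitive, so you may want to add capitalised variants, and matching on "feed" can also catch legitimate feed fetchers - review the Firewall Events after enabling it:

    (http.user_agent contains "crawl") or (http.user_agent contains "feed") or (http.user_agent contains "parser")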

There are online tools like code.google.com/p/feedparser - is it good or not? That depends.
There are others like crawlson.com, plus the Comscore crawler and further user agents like JetSlide, mojeek, rssapi, aiohttp, SimplePie, and CrowdTangle.

There are even some search engines with user agents like omgili.com.

You can even try to block Bingbot - just to see whether any requests from it are actually hitting your server - or at least, with the Managed Firewall Rules, the "fake bingbot" and other fakes get blocked :wink:
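
If you want to catch only the fakes yourself, a rule like the one below with a Block action hits traffic that claims to be Bingbot but that Cloudflare has not verified as a good bot, while the real Bingbot (which passes the cf.client.bot check) keeps working:

    (http.user_agent contains "bingbot") and not cf.client.bot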

I haven't seen that entry yet. I think it could be a "good bot" like Yandex or some other bot from the verified good bot list (cf.client.bot), such as Facebook's externalhit (which can be used for scraping / DDoS), so it was allowed? I am not sure.


@fritex thanks for your input. I enabled the "OWASP ModSecurity Core Rule Set" (it was previously disabled). I hope that gives a bit more security.

Most of the other tips (like blocking ASNs and identifying the user agents) I already applied today. Thanks again for your advice - it helped me a lot.


This topic was automatically closed 15 days after the last reply. New replies are no longer allowed.