How do I block automated SEO searches for "scraping footprints"?

I run a search engine and have a problem specific to search engines: I’m sometimes getting over 160,000 searches a day for “scraping footprints”, which SEO practitioners use to generate lists of pages to “target”. I think they are run from “SEO proxy farms” so as to be more difficult to detect, e.g. globally distributed residential IPs, ordinary browser user agents, etc. The giveaway is the long search query that combines the targeted term with a fragment of a page template for the page type or system they want to target.

There’s more detail on the issue in “Almost all searches are currently from an unknown origin, impacting running costs, so try to block these” (issue #55 on searchmysite/searchmysite.net, GitHub), and some comments at “Almost all searches on my independent search engine are now from SEO spam bots” on Hacker News.

At the moment I’m blocking in my reverse proxy by returning a 403 for all search requests which don’t have a referrer. This is successfully blocking most of the requests, although it has a couple of unfortunate side-effects: (i) it stops direct links to search results from working, and (ii) it stops the Firefox search integration from working because Firefox doesn’t set a referrer.
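For reference, the reverse proxy rule is essentially the minimal sketch below (assuming nginx and a /search path; the upstream name is a placeholder rather than my exact config):

    # Reject search requests that arrive without a Referer header.
    # Side effect: direct links and Firefox's search integration also
    # send no Referer, so they get blocked too.
    location /search {
        if ($http_referer = "") {
            return 403;
        }
        proxy_pass http://backend;   # placeholder upstream
    }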

Various comments have suggested I use Cloudflare to solve the problem, so I’ve set up Cloudflare on my site. However, I’ve tried a few things that haven’t slowed the requests:

  • Set up a firewall rule to block known bots. This has only blocked a handful of bots. It seems to look for bot strings in the user agents. Unfortunately the bots I’m up against aren’t playing nicely enough to set identifiable user agents.
  • Switched on Bot Fight Mode. Again this has had no effect, because these requests are disguising themselves so as not to look like they come from bots.
  • Switched Security level to High.

Any ideas for other settings to try?

Known bots in the firewall rules means known, friendly bots - like Google’s official crawler.
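For example, the known-bots option corresponds to a rules-language expression along the lines of the sketch below (a minimal sketch, not your exact rule). Since cf.client.bot only matches Cloudflare’s list of verified crawlers, a Block action on it will only ever hit the likes of Googlebot, not scrapers hiding behind residential IPs and ordinary user agents:

    Expression: (cf.client.bot)
    Action:     Block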

Thanks. The friendly bots aren’t causing an issue, and if they were they’d be easy to block. It is the unfriendly ones which are causing the issue, and they’re difficult to block because of the lengths they go to in order to disguise themselves via the proxy farms. The impression I got was that Cloudflare had a team identifying the compromised residential machines used in the proxy farms, and maintained lists of their constantly changing IPs or fingerprints of the user agents or something like that.

Yeah - my point was just that the rule you made blocks known (i.e. friendly) bots, so it wouldn’t help in your attempt at blocking the bad ones.

There are a few metrics available, and which ones you get mostly depends on your plan.

Threat Score - this is the ‘reputation’ of an IP address and your Security Level dictates the threshold at which Cloudflare serves a Managed Challenge to the request.

Bot Score - this isn’t really exposed to you as something that you can change, but Super Bot Fight Mode, available on Pro plans and above, gives you 4 categories for which you can block, serve a challenge or allow based on these scores.

I’m not too sure which of the 4 thresholds available in SBFM the regular Bot Fight Mode uses to block requests.

https://developers.cloudflare.com/bots/ overall is a good read, specifically the Concepts tab, which talks about the different metrics and whatnot. If you look at the ‘Plans’ tab you’ll see the increasing number of methods Cloudflare uses to identify automated requests to a zone, e.g. Business has machine learning whereas Pro does not.

As you’ve realised, globally distributed residential IPs are a way around Threat Score/Security Level, and mimicking a browser fully (i.e. with JavaScript support) is a way to try to get around Managed Challenges.

Once someone is using methods like that, there’s a need for more involved rules (or a higher plan) to block them effectively. I’d have a read through the bots documentation that I linked, which should be a good foundation.

Many thanks for the detailed response. I’ve read through the documentation now.

FYI I’ve been running with the following rules for 9 days now:

  • A Web Application Firewall (WAF) rule to block requests to /search where the threat score is > 80 (out of 100). It says this has blocked 6 requests (out of an 11.28K total) in the past 24 hours, suggesting it isn’t doing much.
  • Switched Bot Fight Mode on. This includes JavaScript Detections (JSD), which I really thought would work because it injects some JavaScript onto every page and rejects requests which don’t execute the JavaScript, similar to how the analytics solution avoids counting bots. I can’t see how to find stats on how many requests have been blocked by this, but initial informal scanning of the logs suggested it wasn’t doing much. That said, over the past week there has been a dramatic decline in searches reaching the site - I suspect that is a result of the “attack” burning out, but I guess it could be the JSD. One way to confirm would be to switch it off and see if the numbers start increasing dramatically again.

This is really high - Cloudflare recommend challenging scores above 10 and blocking scores above 50.

Represents a Cloudflare threat score from 0–100, where 0 indicates low risk. Values above 10 may represent spammers or bots, and values above 40 identify bad actors on the Internet.

It is rare to see values above 60. A common recommendation is to challenge requests with a score above 10 and to block those above 50.

https://developers.cloudflare.com/ruleset-engine/rules-language/fields/
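In practice that would look something like the two rules sketched below, with the block rule listed above the challenge rule (a rough sketch using the cf.threat_score field and your /search path; the thresholds are the documentation’s suggestion, not something I’ve tested on your zone):

    Rule 1 - Expression: (http.request.uri.path contains "/search" and cf.threat_score gt 50)
             Action:     Block
    Rule 2 - Expression: (http.request.uri.path contains "/search" and cf.threat_score gt 10)
             Action:     Managed Challenge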

A lot of bots emulate a browser as best they can - the anti-bot measures get more complex as you go up the plans but yeah, BFM could be hit and miss.

Thanks. I’ve lowered the threat score threshold and am seeing more requests blocked now, e.g. around 0.1% of requests were blocked with a threshold of 80 versus around 4% at 40.

Separately, the automated SEO searches “attack” (if you can call it that) seems to be over now, subsiding from a peak of over 160K searches a day three weeks ago to a much more manageable 3K requests a day for the past two weeks, so I don’t think I’m going to investigate further. Not sure I’ll mark this as having a solution though, because I think it subsided by itself rather than as a result of any specific Cloudflare config. Worth also reiterating that this particular problem is very specific to search engines, i.e. large numbers of searches for “scraping footprints” combined with SEO search terms to get lists of URLs to target, run from “SEO proxy farms” so as to be difficult to trace and block.

Thanks again for your input - it has been very interesting and useful to learn about Cloudflare.
