The Never-Ending Nightmare of Bot Fight Mode blocking legitimate APIs

The problem

For the past few months, the community forums have been flooded with horror stories from Cloudflare users who turned on Bot Fight Mode and suddenly lost access to all sorts of third-party APIs: services attempting to connect to their Cloudflare-protected systems would fail the JavaScript challenge and immediately get blocked.

At the time of writing, Cloudflare has identified around a hundred ‘good’ bots — these will never be presented with a JS challenge, although, naturally enough, you can still block them with your own rules.

However, this is not the case for third-party APIs, especially those served from a cloud, which may have thousands (or millions…) of IP addresses legitimately attempting to connect to your system — and getting promptly blocked by the JS challenge.

Bot Fight Mode, alas, at least for the free service, has the following fundamental characteristics:

  1. You can either turn it on or off. There is no middle ground. Either you protect your whole domain, or you protect none of it. The choice is yours.

  2. Bot Fight Mode always takes precedence over your WAF rules. This means that if Bot Fight Mode is enabled, you can, at best, block some of the ‘good’ bots with additional rules. What you cannot do is allow traffic from certain well-known sources to bypass Bot Fight Mode — by design, it was simply not implemented that way.

  3. In theory, based on my understanding of the explanations posted here, if you have access to the IP Access Rules, you might be able to place a rule there to allow some servers to go through, bypassing Bot Fight Mode:

    Bot Fight Mode keeps blocking my Next.js app calls to the backend API:

That’s the theory, but… read on!
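To make the theory concrete, here is a minimal sketch of what creating such an IP Access Rule through the Cloudflare API might look like. The endpoint and field names follow Cloudflare’s public API for IP Access Rules as I understand it; the zone ID, token, and IP address are placeholders, and the `"whitelist"` mode (shown as “Allow” in the dashboard) is an assumption you should verify against the current API documentation:

```python
import json
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"

def build_allow_rule(ip: str, note: str = "Allow known API client") -> dict:
    """Payload for an IP Access Rule letting this IP bypass challenges."""
    return {
        "mode": "whitelist",  # rendered as "Allow" in the dashboard
        "configuration": {"target": "ip", "value": ip},
        "notes": note,
    }

def create_access_rule(zone_id: str, token: str, ip: str) -> dict:
    """POST the rule to the zone's IP Access Rules endpoint."""
    req = urllib.request.Request(
        f"{API_BASE}/zones/{zone_id}/firewall/access_rules/rules",
        data=json.dumps(build_allow_rule(ip)).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Whether a rule created this way actually bypasses Bot Fight Mode is exactly the open question discussed below.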

Cloudflare’s assumptions

The idea behind Bot Fight Mode (and, I assume, Super Bot Fight Mode too, although I haven’t tested it) is very simple: every day, 30–40% of all web traffic comes from bots, most of it benign (legitimate crawlers), but with an ever-growing number of malicious ones that simply probe for weak spots in your web services. To address those bots, Cloudflare has always had a simple mechanism: CAPTCHAs, or their successor, JavaScript challenges. These are extremely effective against malicious bots, since, by definition, they are run by automated tools rather than web browsers with JavaScript engines, and thus will fail such challenges and can be safely blocked. This mechanism has been the subject of much research (academic and corporate), and it can be said that it is ‘proven’ to work rather well.

Therefore, Cloudflare makes a simple assumption: if it’s a human, it has a browser running JavaScript. If so, it will pass any challenge easily. If it’s a bot, there is no ‘browser’ (so to speak), and thus all validation attempts will immediately fail. This gives Cloudflare a very effective way to distinguish between humans and bots, allowing the former and blocking the latter.

Of course, legitimate bots — those from legitimate crawling and search engines — should be allowed to access one’s websites without any restriction in place. This is where things get tricky: how do you identify a bot as legitimate? Obviously, you cannot rely upon the User-Agent header, which is easily forged.

From what I understand of Cloudflare’s system, their ‘benign bot’ validation mechanism relies on two principles. Firstly, for any ‘new’ bot beyond those that are already well-known, Cloudflare will profile its activity using some sort of AI-driven pattern matching. The idea is to capture the signature of a legitimate bot; with access to gazillions of data points — logs of bots coming in from the same range of IP addresses, with the same User-Agent header, retrieving content in a predictable way, and respecting the robots.txt file (if it exists) — I can imagine that it’s not too hard to determine precisely whether a certain bot is, in fact, what it claims to be, based on its behaviour. In a sense, Cloudflare’s data-crunching AI reverse-engineers the crawling algorithm used by a ‘benign bot’, and can thus figure out whether a given request for data is indeed legitimate.

Considering that the JS Challenge mechanism is built on similar assumptions, I can imagine that the Bot Fight Mode uses something similar to automatically let legitimate bots go through the firewall.

The second principle, of course, is to examine a freshly submitted bot for evaluation, placing it in a restricted environment, and figuring out if it behaves in a way that is consistent with the algorithm of a ‘benign bot’. Some posts here on the community forums tend to imply that such requests for a new bot to be allowed through CF’s own firewall are often ignored/disregarded; I’d claim, however, that each request may, indeed, be honoured, but it will require a ‘quarantine period’ during which CF runs its tests.

Because Cloudflare is so sure that the overall mechanism works, they don’t even consider a few exceptions — some of which are sadly quite frequent.

Providing APIs behind Cloudflare protection

Here is my use-case scenario: how to successfully provide a web services API on a single server behind the Cloudflare protections, when the connections to it can come from any IP address in the world, and not just a limited set?

So, I have several domains (almost all of them registered with the Cloudflare Registrar), each of which may have more than one server — usually many more, even if they’re (possibly) pointing to the same physical machine. From the perspective of an external client, this is irrelevant: there are many different servers, each with its own FQDN. In fact, since clients will only see Cloudflare’s IP addresses — and possibly never gain access to the real server’s IP address — they will not even know where their request is being reverse-proxied to, and that’s exactly how we want Cloudflare to work.

But clients can come from any number of IP addresses. Consider an API consumed by residential or mobile users — all of whom get IP addresses randomly assigned from a pool, which often may not even be known in advance — especially if you wish your API to be accessible from everywhere, not just one provider or even one country.

In the specific case where I stumbled upon the overeager Bot Fight Mode issue, things were even more complicated. The requests actually come from a virtual server out of a pool managed by AWS. At any time, Amazon might switch the IP address where that particular instance is running — and from within the instance I might not be able to know that instance’s specific IP address before contacting my own server, hidden behind Cloudflare’s services. In essence, I’m making a request from a server that is virtually spread across one cloud, to a final destination which is a server under my control, but which, in turn, is also virtualised by Cloudflare’s cloud.

In such circumstances, the client never knows the IP address of the remote connection in advance; conversely, my physical server has no idea of the real IP address of the next request. It only knows what Amazon tells it to use. Although it’s possible for the client to make a quick local request to learn what IP address it’s running on, that address is useless from the perspective of someone setting up a WAF on my server — because you will only see requests coming from the Amazon cloud, not from the ‘real’ server. And because such servers are not real but virtual, created on demand, it’s not even guaranteed that any correlation between the current IP address (as reported by the local operating system) and the IP address currently assigned to a cloud proxy instance will hold up in subsequent requests, since Amazon may assign a different IP address the next time the instance is launched.

What this means is that one cannot even create a ‘dynamic’ configuration (via the Cloudflare API) where somehow this correlation is ‘translated’ into a well-formed WAF API rule, that can be selectively added in a fully automatic way. Somewhere in this process, a human will have no other choice but to intervene and manually change whatever rule is in place.

This can be done (to a degree) but it’s not deployable beyond testing purposes.
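To illustrate why such automation is fragile rather than to recommend it, here is a rough sketch of what it would involve on an EC2 instance: ask the instance metadata service for the current public IP, then push a matching allow rule to Cloudflare. The metadata URL is AWS’s documented IMDSv1 endpoint (IMDSv2, which requires a token, is the default on newer instances); the Cloudflare endpoint, field names, and `"whitelist"` mode are my assumptions about the IP Access Rules API; zone ID and token are placeholders:

```python
import json
import urllib.request

METADATA_URL = "http://169.254.169.254/latest/meta-data/public-ipv4"
API_BASE = "https://api.cloudflare.com/client/v4"

def current_public_ip() -> str:
    """Ask the EC2 instance metadata service for this instance's public IP."""
    with urllib.request.urlopen(METADATA_URL, timeout=2) as resp:
        return resp.read().decode().strip()

def allow_rule_payload(ip: str) -> dict:
    """IP Access Rule payload for the IP we just discovered."""
    return {
        "mode": "whitelist",
        "configuration": {"target": "ip", "value": ip},
        # The rule goes stale the moment Amazon reassigns the address:
        "notes": "ephemeral EC2 client IP",
    }

def sync_rule(zone_id: str, token: str) -> dict:
    """Push an allow rule for whatever IP this instance has right now."""
    ip = current_public_ip()  # only valid until the instance is relaunched
    req = urllib.request.Request(
        f"{API_BASE}/zones/{zone_id}/firewall/access_rules/rules",
        data=json.dumps(allow_rule_payload(ip)).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Even if this ran on every instance launch, the zone would accumulate stale allow rules for addresses Amazon has since handed to someone else — which is precisely why a human ends up having to intervene.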

Different concepts, different views

The culprit, IMHO, is in the priorities given to the many layers of filtering. At the WAF level, you can specify several ways to filter requests and mark them as allowed — by examining the headers and checking for specific markers (including the name — or URI — of the server to be contacted). You can certainly create a rule to allow requests through to a single server.
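As an illustration, a WAF custom rule scoped to a single API host could use an expression like the following (the hostname and path are placeholders; the field names follow Cloudflare’s Rules language, so check the current documentation before relying on them):

```
(http.host eq "api.example.com" and http.request.uri.path contains "/v1/")
```

Paired with an “Allow” (or “Skip”) action, this would cleanly carve out one FQDN — if only something at this layer were allowed to override Bot Fight Mode.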

However, WAF rules cannot override the filtering done by the Bot Fight Mode. The rules for Bot Fight Mode will always override whatever the WAF rules say.

On the other hand, Bot Fight Mode is either turned on or off for the whole domain (misleadingly described in the current documentation as the server’s name). There are a few more options (filtering by some headers, for example, or adding further headers to the request), but what you cannot do is restrict Bot Fight Mode to a single FQDN. It might be possible with Super Bot Fight Mode, but most definitely not with the ‘simple’ Bot Fight Mode.

This essentially means that if you have one web server providing API services to an unknown number of clients, coming from an unlimited pool of possible IP addresses, you basically have to turn Bot Fight Mode off for all web servers under that domain. There is no alternative — at least, as far as I can see and understand the documentation.

A proposal to Cloudflare

Bot Fight Mode is extremely useful if you’re running plain old web servers, serving content from a popular CMS (or even static pages!). The problem only manifests itself when trying to protect web services, where clients make API requests to a server behind Cloudflare’s infrastructure — because such requests have a signature very similar to bot attacks, and, as a precaution, Cloudflare will send such clients a challenge (usually via JavaScript). This, unfortunately, will fail for legitimate clients.

As a consequence of the way the many rule systems work at the different layers, for such sites, the only solution, for now, is shutting Bot Fight Mode off — which will also deprive Cloudflare of data worth analysing (so that future generations of their machine-learning models can be trained on legitimate traffic and learn to recognise it better).

There are a few solutions that come to mind:

  1. Introduce a quarantine mode, like what some complex email filtering systems do (e.g. SPF, DKIM, but also spam-fighting tools such as Rspamd, SpamAssassin…). This would allow Cloudflare’s machine-learning engine to benefit from being trained on the extra traffic from legitimate APIs providing web services, while at the same time not blocking such traffic at the domain level.
  2. Create a special rule at the WAF level that can override Bot Fight Mode. That would be the best solution, especially if it could be deployed simultaneously with the first rule, e.g. instruct the WAF to enter quarantine mode for some FQDNs, leave others on full active mode, and on some, turn it off. The complexity — and number — of rules allowed would depend on the user level (Free, Pro, Enterprise…).
    An additional advantage of this approach would be the ability to check via the WAF’s rules if a request had an expected header or not (known to exist for legitimate requests only); if not, it would be very easy to block any simple-minded bot checking for WordPress vulnerabilities, for example.
  3. Allow Bot Fight Mode to be activated per FQDN as opposed to the whole domain. That would work, too, although it’s worth noting that WAF rules are finer-grained and more useful.

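To make point 2 concrete: assuming such a WAF-level override existed, a rule combining it with the header check described above might look like this. The header name and value are entirely hypothetical, and the syntax follows my reading of Cloudflare’s Rules language, where header values are arrays:

```
(http.host eq "api.example.com"
 and any(http.request.headers["x-api-token"][*] == "expected-secret"))
```

A request lacking the agreed-upon header would never match, so a simple-minded bot scanning for WordPress vulnerabilities would still face the full challenge.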
I believe this could be a starting point to make Bot Fight Mode capable of protecting perfectly legitimate, browser-less API entry points, where there is no human to click on CAPTCHAs or run JavaScript to gain access.

TBD!
