Cloudflare Bot detection miscategorization

There is a very specific use case that is triggering Cloudflare’s Bot detection feature to deny access to legitimate users, but Cloudflare does not provide a clear mechanism for external platform providers to connect with them to identify problems and improve the Bot detection feature’s ability to differentiate between this legitimate use case and actual bots.

What is the best way to make contact with the Cloudflare group responsible for Bot detection so that we can help them to better understand this specific use case, and to improve this feature to avoid refusing access to valid connections when customers enable this feature?

This is how you request an exemption:

Are you running some kind of service that is being blocked (i.e., it is a bot, but you want to become a Verified Bot), or is this impacting users using standard browsers?

Can you provide some detail?

I have directed a few providers to the form above, but I’m unclear if they have been verified successfully.

This use case involves proxy servers commonly used in academic environments. They are not “Bots” per se, but because of the way they work (injecting themselves into the middle of HTTP transactions in order to force IP authentication to content providers by making all user requests originate from the proxy’s IP address), they alter the HTTP requests and responses just enough to trigger Bot Management rules. If you are not familiar with rewriting proxy software, think of it like a combination of a forward proxy minus the PAC file to the client browser, and a reverse proxy to the upstream content provider.

So it’s not something that is trivial to classify using a simple User-Agent header match, and it’s not software that behaves as a crawler or traditional bot, because it is leveraged by real end-users using regular browsers. But to the Cloudflare platform, it is going to appear to be software masquerading as a browser: it passes the User-Agent header through unmodified (so that sites still doing header-sniffing instead of feature detection don’t break), but it does not perfectly emulate native browser behavior (e.g., QUIC support does not exist yet in any of the rewriting proxy software that I know of, and certain headers that the browser would normally send may be removed).
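To make the mismatch concrete, here is a minimal sketch of what the upstream site ends up seeing. Everything in it is an assumption for illustration: the header names, the made-up User-Agent string, and the set of dropped headers are hypothetical and not taken from any particular rewriting proxy product.

```python
# Hypothetical illustration only: the header names, the User-Agent value, and
# the dropped set are assumptions, not the behavior of a specific proxy.

BROWSER_REQUEST = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Accept": "text/html",
    "Accept-Language": "en-US",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
}

# Assumed set of headers that the proxy does not forward upstream.
DROPPED_BY_PROXY = {"Sec-Fetch-Mode", "Sec-Fetch-Site"}


def forward_through_proxy(headers, dropped):
    """Return the headers the upstream site would see.

    The User-Agent is passed through untouched; headers in `dropped` are removed.
    """
    return {name: value for name, value in headers.items() if name not in dropped}


def missing_headers(expected, observed):
    """List headers a native browser would send that are absent upstream."""
    return sorted(set(expected) - set(observed))


upstream = forward_through_proxy(BROWSER_REQUEST, DROPPED_BY_PROXY)

# The User-Agent still claims to be the browser...
print(upstream["User-Agent"] == BROWSER_REQUEST["User-Agent"])
# ...but the rest of the request no longer matches what that browser sends.
print(missing_headers(BROWSER_REQUEST, upstream))
```

The request still claims to be a given browser, but the surrounding headers no longer match what that browser would send, which is the kind of inconsistency a bot detection heuristic could plausibly key on.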

I went ahead and submitted information to that form, but because this is such a different situation, I’m not sure the form asks the questions Cloudflare needs answered to fully understand and effectively categorize this traffic. That is why I’m trying to figure out how to engage with Cloudflare directly: to help them handle this kind of traffic more effectively, so that legitimate user access is not impacted even if a site needs to enable some of the Cloudflare platform’s protection features.

What would really help narrow this down is if you could work with a site owner to check their Firewall Events Log in their Cloudflare dashboard to see which Cloudflare setting is blocking access.
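If the site owner can export those events, a quick tally along these lines might be enough to spot the culprit. This is only a rough sketch: the record shape, the field names (`client_ip`, `action`, `setting`), and the setting labels are hypothetical placeholders rather than Cloudflare’s actual export format, and the IP addresses are documentation-range examples.

```python
from collections import Counter

# Hypothetical event records. The field names and setting labels are
# placeholders; adjust them to whatever the real export contains.
events = [
    {"client_ip": "192.0.2.10", "action": "block", "setting": "browser_integrity_check"},
    {"client_ip": "192.0.2.10", "action": "challenge", "setting": "bot_fight_mode"},
    {"client_ip": "192.0.2.10", "action": "block", "setting": "browser_integrity_check"},
    {"client_ip": "198.51.100.7", "action": "block", "setting": "security_level"},
]


def settings_acting_on(events, proxy_ip):
    """Count, per setting, the non-allow events recorded for one client IP."""
    return Counter(
        event["setting"]
        for event in events
        if event["client_ip"] == proxy_ip and event["action"] != "allow"
    )


tally = settings_acting_on(events, "192.0.2.10")
for setting, count in tally.most_common():
    print(setting, count)
```

Filtering on the proxy’s IP address should work here because, as described above, all user requests originate from that one address.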

We have been working with the specific site owner that was involved with this immediate problem, and they suggested that we engage with Cloudflare directly as well to make it easier for them to update their rulesets through mechanisms like Bot Tags.

Also note that this problem is not limited to a single site. Cloudflare is used by multiple major academic content providers, each hosting content for hundreds of different journals and societies, so this is a general issue across multiple Cloudflare customer organizations, and there are 4 major rewriting proxy platforms in common usage in the academic market that behave similarly. That is why we think this needs to be looked at beyond the scope of just a single Cloudflare customer site so that the next customer site that needs to enable these features will be less likely to impact legitimate users.

It’s probably the same issue with all these customers.

Which feature is that? Their Security Level? Super Bot Fight Mode for Definitely Automated Bots? Browser Integrity Check?

Yes, this is Monster in the Middle.

A few resources to get you moving in the right direction:

Unfortunately, I do not have an answer for that; the account holder who ran into this most recent issue did not share that level of detail with me, and I do not have access to these features to test for myself. I suspect it was Browser Integrity Check based on the conversations that I had with them.

Yes, this is exactly a MITM scenario, but one that is over 20 years in the making, and deeply entrenched. Without going too deeply into the world of academia, there are basically 5 choices available to academic libraries for authenticating users to publisher platforms:

  1. Use an identity federation. This is an expensive proposition to join an identity federation, and available support for federated identity is hit-or-miss on the publisher platforms: some support only one specific identity platform, some support a small handful, many support none at all.

  2. Use direct SAML authentication outside of an identity federation. This is even less well supported by the publisher platforms than identity federations are, though there is some very early work being done in this area. OIDC has not been an option that is generally supported by an academic publishing platform for external authentication purposes at all. Further, not all institutions have access to SAML IdPs to support this.

  3. Use a campus VPN so that students appear to come from the campus IP range. While generally effective and secure, this can be a VERY expensive proposition due to the hardware and connection licensing requirements, as well as the IT user support load that VPNs can require.

  4. Use LTI integrations, but those are only now starting to be supported by some publishers, and the LTI protocol has the limitation that all connections must originate from within the LMS environment, so sharing URLs is not available currently if using LTI.

  5. Use a rewriting proxy (nay, MITM) solution that sits in the middle of the conversation and presents a single IP address to the publisher platforms. This is the current solution used by literally thousands of libraries around the world to access academic content.

Options 1, 2, and 3 require a certain amount of budget allocation and technical sophistication in order to support, which are not always available, especially at the smaller end of the college spectrum.

Option 4 requires sacrificing certain key features and capabilities due to a lack of protocol support for external entry points.

Option 5 has been the de facto method for over 2 decades now, and as much as I would like to see it go away, sometimes we have to live in the world we are in while working towards the world we would like to be in.

The software vendors have designed their rewriting proxy software so that it can be installed and maintained by users with very little technical training and on very modest hardware, putting that capability within reach of practically any library. Until a few key technology pieces fall into place, and a few “layer 8” issues get resolved, the use of rewriting proxies in general is going to persist in academia for the foreseeable future.

So, returning to the original issue, is Cloudflare willing to work with proxy software publishers and proxy hosting providers to make it less likely that academic publishers using your service will unintentionally block traffic from legitimate users when activating the various defensive features available on your platform? How can this be achieved, and what information does Cloudflare need? Would mitmengine fingerprints be the best starting point, or is there a different place to begin?

Tell ya what, open a ticket via email: support AT cloudflare DOT com and that should get the ball rolling. Be sure to cram it full of as much information as you can, including impacted domains and a link to this thread, plus that other one you were in earlier. The ticket will auto-close, but you’ll get a ticket # in the reply that we can pass along if you post that # here.

It would be most helpful if you could find out which setting is blocking you. Otherwise, Cloudflare will have to poke around in user accounts, which is difficult when those users haven’t opened tickets.

Sure thing – 2310805 has more than you ever wanted to be able to forget about this topic, and is bursting at the seams with information.

Thanks for running this up someone’s flagpole internally to see who salutes it.

The situation looks rather complicated; however, if enough people present it as a problem, I bet there is something that can be done.

In the meantime, I went ahead and escalated the ticket for you.

Thanks! I’m going to loop in the Cloudflare customers that I know to reference that ticket as well so that you can have a feel for how much of an impact this is among your subscribers, and suggest that interested 3rd parties follow and/or like this topic directly to help you gauge how widespread this is among academic libraries and the desire from proxy operators (who are closer to end-users) for time to be invested in working on this.

Is there any value in gathering fingerprints to add to mitmengine? It does not appear that the project has been actively maintained for a couple of years now, so while I am able and willing to contribute and generate a PR, I want to do so effectively.