Rate Limiting by User Agent Anomaly

So I wanted to rate limit by User Agent (Googlebot, of all UAs), but that’s not directly possible. That’s one for the roadmap, really: because these bots spread their activity across many, many IP addresses within the same class C network, they don’t reach high RPMs per IP, but they do in total.

Here’s the use case:

This coming Black Friday, we’d like to respectfully ask certain bots like Googlebot to temporarily stop crawling. A 429 response is perfect for that. When the site is under heavy load from real users, the last thing we want is downtime due to hungry bots. So Rate Limiting seemed like a good fit, but it can only key on IP address, not on ASN or UA.

But…

With a work-around, we can still Rate Limit by UA.

Step 1: We create a Firewall Rule that matches where the UA does not match the regex (Googlebot|Pinterestbot|bingbot|MJ12bot), and we select the action Bypass > Rate Limiting. Essentially we’re saying: everybody is welcome to bypass Rate Limiting, except you, Googlebot, Pinterestbot, bingbot and MJ12bot! We effectively disable Rate Limiting for all but those four UAs.
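Assuming the Firewall Rules expression language, the rule from Step 1 would look something like this (`http.user_agent` is the relevant field; `matches` needs a plan with regex support, otherwise a chain of `contains` clauses achieves the same):

```
not (http.user_agent matches "(Googlebot|Pinterestbot|bingbot|MJ12bot)")
```

with the action set to Bypass and Rate Limiting selected as the feature to bypass.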

Step 2: We enable Rate Limiting in Simulation mode, include cached hits, and set a very low threshold, currently 2 requests per minute. We want to detect and throttle these bots early when the proverbial ■■■■ hits the Black Friday fan, and because their traffic is spread across IPs, the per-IP RPM is low.
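To illustrate why a per-IP threshold misses distributed crawlers, here is a toy model of fixed-window per-IP rate limiting (my own simplification, not Cloudflare’s actual algorithm): the same request volume trips the limit when it comes from one IP, but sails through when spread across many.

```python
from collections import defaultdict

def over_threshold(requests, window_seconds=60, rpm_limit=2):
    """Return the set of IPs that exceed rpm_limit requests within any
    fixed window of window_seconds. `requests` is a list of
    (timestamp_in_seconds, ip) tuples. A toy model only."""
    counts = defaultdict(int)   # (window index, ip) -> request count
    flagged = set()
    for ts, ip in requests:
        key = (int(ts) // window_seconds, ip)
        counts[key] += 1
        if counts[key] > rpm_limit:
            flagged.add(ip)
    return flagged

# 10 requests in one minute from a single IP: flagged.
single_ip = [(i * 6, "198.51.100.7") for i in range(10)]
# The same 10 requests spread over 10 IPs: nobody exceeds 2 RPM.
spread = [(i * 6, f"198.51.100.{i}") for i in range(10)]

print(over_threshold(single_ip))  # {'198.51.100.7'}
print(over_threshold(spread))     # set()
```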

This works in theory. It even works in practice with MJ12bot. But not for Googlebot.

So my Rate Limiting simulation does log bots, just not Googlebot. And I am sure they exceed the threshold.

If I go to Firewall > Overview, I can filter by the Googlebot UA. I then see many, many IP addresses; in the last 30 minutes they have together hit the site 63K times. Looking down the list of IPs used by Googlebot, the top one has clocked 298 requests in the past 30 minutes. Call it 300: that’s 10 per minute, enough to trigger my 2 RPM Rate Limiting rule. But I’m not seeing Googlebot entering my simulation. Why not?

Googlebot is matched by other Firewall Rules with Allow actions, but according to these docs, that shouldn’t affect Rate Limiting:

So is Google getting preferential treatment somehow? Why do ‘slower’ bots get matched by this setup, while the hungrier Googlebot is not?

There’s no IP Whitelist affecting these Googlebot IPs.

For reference: MJ12bot has hit the site 84 times in the past 30 minutes from a single IPv6 address and has been rate limited.
Googlebot, on the other hand, has many IPs in the region of 300 requests per 30 minutes, and none of them seem to match the rate limiting rule.
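Spelling out the arithmetic behind these observations (numbers taken from the thread above):

```python
# Per-IP request rates from the Firewall > Overview numbers above.
googlebot_top_ip = 298 / 30   # busiest Googlebot IP, requests per minute
mj12bot_ip = 84 / 30          # MJ12bot's single IPv6 address

print(round(googlebot_top_ip, 1))  # 9.9 RPM, well above the 2 RPM rule
print(round(mj12bot_ip, 1))        # 2.8 RPM, also above it, and it was caught
```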

I don’t understand why. Am I missing something?

It seems my suspicion is true. After running for almost 24 hours, I’ve “caught” one Googlebot-UA IP in Rate Limiting, but it’s a fake Googlebot, not from a Google ASN.
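Telling a fake Googlebot from the real thing is exactly what Google’s documented verification procedure covers: a reverse DNS lookup on the IP must yield a hostname under googlebot.com or google.com, and a forward lookup on that hostname must resolve back to the same IP. A sketch (the function name is mine; the resolver arguments are injectable so the logic can be exercised without live DNS):

```python
import socket

def is_real_googlebot(ip,
                      reverse=lambda ip: socket.gethostbyaddr(ip)[0],
                      forward=socket.gethostbyname):
    """Reverse + forward DNS check per Google's published procedure."""
    try:
        host = reverse(ip)          # e.g. crawl-66-249-66-1.googlebot.com
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return forward(host) == ip  # must round-trip to the same IP
    except OSError:
        return False
```

The fake Googlebot caught above would fail at the first step: its IP does not reverse-resolve to a Google hostname.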

So is it true that what I’m seeing here is Cloudflare having special measures that prevent the “Known Bots” from being rate limited?

FYI: unfortunately, CF Support confirms what I found:

I have checked with colleagues on this and unfortunately, Google bot is allow listed to the extent that it is not possible to rate-limit it. It is possible to block it using Firewall rules, but as you mention this is not ideal in this case. This will be the case until rate-limiting is part of Firewall rules.

That last sentence is intriguing: it would be perfect if Rate Limiting were an action under Firewall Rules!
