So I wanted to rate limit by User Agent (Googlebot of all UAs), but that’s not directly possible. That’s one for the roadmap really: these bots spread their activity across many, many IP addresses within the same class C (/24) range, so they don’t reach high RPMs per individual IP, but they do in total.
Here’s the use case:
This coming Black Friday, we’d like to respectfully ask certain bots like Googlebot to temporarily stop crawling. A 429 response is perfect for that. When the site is under heavy load from real users, the last thing we want is downtime due to hungry bots. So Rate Limiting seemed nice, but it can only be keyed by IP address, not by ASN or UA.
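For illustration, the 429-by-UA idea is simple to express in code, e.g. in a Worker in front of the site. This is only a sketch of the intent, not a Cloudflare Rate Limiting feature; the function names are my own and the bot regex mirrors the one used in Step 1 below:

```typescript
// Sketch: return 429 for the listed crawlers, 200 for everyone else.
// shouldThrottle/handle are hypothetical names, not a Cloudflare API.
const THROTTLED_BOTS = /(Googlebot|Pinterestbot|bingbot|MJ12bot)/;

function shouldThrottle(userAgent: string): boolean {
  return THROTTLED_BOTS.test(userAgent);
}

function handle(userAgent: string): number {
  // 429 Too Many Requests politely asks well-behaved bots to back off;
  // a real handler would also send a Retry-After header.
  return shouldThrottle(userAgent) ? 429 : 200;
}
```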
With a work-around, we can still Rate Limit by UA.
Step 1: We create a firewall rule whose UA condition does not match the regex (Googlebot|Pinterestbot|bingbot|MJ12bot), with the action set to Bypass Rate Limiting. Essentially saying: everybody is welcome to bypass Rate Limiting, but not you, Googlebot, Pinterestbot, bingbot and MJ12bot! We effectively disable Rate Limiting for all but those four UAs.
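In expression form, Step 1 looks something like this (field and operator names as in Cloudflare’s firewall rule expression language; treat the exact syntax as an approximation, only the regex itself is the one I actually use):

```
not (http.user_agent matches "(Googlebot|Pinterestbot|bingbot|MJ12bot)")
```

with the rule’s action set to Bypass > Rate Limiting.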
Step 2: I enable Rate Limiting in Simulation mode, selecting to include cached hits and a very low threshold, currently 2 requests per minute for testing. We want to detect and throttle them early when the proverbial ■■■■ hits the Black Friday fan, and because these IPs are spread out, each individual IP only shows a low RPM.
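As a config sketch, Step 2 would roughly match the shape of Cloudflare’s (legacy) rate-limit rule objects. The threshold and period are the values from this post; the URL pattern is a placeholder and the exact schema (including how “include cached hits” is expressed) should be checked against the API docs:

```json
{
  "match": { "request": { "url": "*example.com/*" } },
  "threshold": 2,
  "period": 60,
  "action": { "mode": "simulate", "timeout": 60 }
}
```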
This works in theory. It even works in practice with MJ12bot. But not for Googlebot.
So my Rate Limiting Simulation does log bots, but not Googlebot. And I am sure they trigger the threshold.
If I go to Firewall > Overview I can filter on Googlebot’s UA. There I see many, many IP addresses; in the last 30 minutes they have together hit the site 63K times. Looking down the list of IPs used by Googlebot, the top one has clocked 298 requests in the past 30 minutes. Call it 300: that’s 10 per minute, well over my Rate Limiting rule’s 2 RPM threshold. But I’m not seeing Googlebot entering my simulation. Why not?
Googlebot is matched in other Firewall Rules with Allow actions, but that shouldn’t affect Rate Limiting according to these docs:
So is Google getting preferential treatment somehow? Why do ‘slower’ bots get matched by this setup, but the hungrier Googlebot is not?
There’s no IP Whitelist affecting these Googlebot IPs.
For reference, this MJ12bot has hit the site 84 times in the past 30 minutes from one IPv6 address and has been rate limited.
Googlebot on the other hand has many IPs in the region of 300 requests per 30 minutes and does not seem to match the rate limiting rule.
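Spelling out the arithmetic (request counts are the ones observed above; the 2 RPM threshold is my simulation setting):

```typescript
// Requests per minute over a sampled window.
const rpm = (requests: number, windowMinutes: number): number =>
  requests / windowMinutes;

const thresholdRpm = 2;                 // the simulation's configured threshold
const mj12botRpm = rpm(84, 30);         // 2.8 RPM from one IPv6 address: rate limited
const googlebotTopIpRpm = rpm(298, 30); // ~9.93 RPM from its busiest single IP: not matched

console.log(mj12botRpm, googlebotTopIpRpm);
```

Both rates sit above the 2 RPM threshold, and Googlebot’s busiest IP exceeds it more than three times as much as MJ12bot’s, which is what makes the mismatch so puzzling.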
I don’t understand why. Am I missing something?