Hello,
I’m seeing a strange pattern in my logs involving a Googlebot IP address and a Cloudflare Workers IP address (2a06:98c0:3600::103), where the Cloudflare address triggers a ModSec rule. I would like to understand it, and resolve it if possible, because this pattern is occurring approximately 600 times per day.
It goes like this (not necessarily in this order since all 3 log entries have the same time stamp):
1. There is a request for a URL from 66.249.77.104 with a 200 response. This is a Googlebot IP.
2. There is a request for the same URL from 2a06:98c0:3600::103 with a 403 response. This is an IP associated with Cloudflare Workers.
3. There is an Apache error entry (again for 2a06:98c0:3600::103):
—for example—
[client 2a06:98c0:3600::103] ModSecurity: [file "/etc/httpd/modsecurity.d/00_asl_z_searchengines.conf"] [line "105"] [id "303800"] [rev "5"] [msg "Atomicorp.com WAF Rules: Fake Googlebot webcrawler"] [data ""] [severity "ERROR"] Access denied with code 403 (phase 1). Lua Data: 2a06:98c0:3600::103 reverse and forward records did not match. [hostname "www.rockbrookcamp.com"] [uri "/blog/traits-beautiful-people/"] [unique_id "ZGjle-KrQ4sP2-p0kKAHrAAAANc"]
I should mention that the user agent for these requests (#1 and #2 above) is:
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.142 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
I should also mention that the ModSecurity rules here are written by atomicorp and are hosted on our server.
In the error, “Lua Data: 2a06:98c0:3600::103 reverse and forward records did not match” makes me think this is related to this IP address “intentionally not identifying the original client’s IP” for some reason. https://news.ycombinator.com/item?id=26690788
I’m not sure if that’s relevant.
Why is this Cloudflare IP address triggering this ModSec rule, and how is it related to the Googlebot?
Other than allowlisting that IP address (seems risky) on my server’s WAF, I’m not sure how to configure things better to avoid this repeated error.
Yes. It is possible the rule is simply blocking a bad actor using a Cloudflare Worker, but that doesn’t explain the parallel nature of these entries, or the consistency and regularity of the pattern. The Googlebot IP is not being blocked, but the CF Worker IP is. So the mystery arises because both are trying to get the same URL at the same time, one successfully and the other not.
So do you mean… someone is using a Worker to send one request from a fake Googlebot IP and another request (hidden behind the Worker IP) at the same time? That makes me wonder why that fake Googlebot IP is not being caught by the WAF.
Well, nothing related to workers except Zaraz for Google Analytics 4. And as I mentioned, I toggled it off and saw no change. Argo is on. Early hints on. APO off. Verified Bots and Definitely Automated are both allowed. I can’t think of any rules that would be involved. Maybe something else?
I’m still puzzled how this Cloudflare Worker IP and the Googlebot IP can be related. For each instance of this, both entries have the Googlebot user agent when trying to get a URL, one blocked and one allowed.
The block is not happening in Cloudflare. It is the WAF at the server being triggered. And the rule blocking is complaining about a mismatch of “reverse and forward records.”
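For anyone unfamiliar with what a “reverse and forward records” check involves, here is a minimal Python sketch of the general technique (forward-confirmed reverse DNS). It is not the Atomicorp Lua code, which I have not seen. The function name, the injectable lookup parameters, and the accepted hostname suffixes (googlebot.com and google.com, following Google’s published crawler verification guidance) are my own assumptions for illustration.

```python
import ipaddress
import socket

# Assumption: hostname suffixes accepted as genuine Googlebot.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip):
    # PTR lookup: IP address -> hostname
    return socket.gethostbyaddr(ip)[0]


def _forward(host):
    # A/AAAA lookup: hostname -> set of IP address strings
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def is_verified_googlebot(ip, reverse_lookup=_reverse, forward_lookup=_forward):
    """Return True only if the reverse and forward DNS records agree."""
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record at all
    if not host.endswith(GOOGLE_SUFFIXES):
        return False  # PTR points somewhere other than Google
    try:
        resolved = {ipaddress.ip_address(a) for a in forward_lookup(host)}
    except (OSError, ValueError):
        return False  # hostname does not resolve cleanly
    # The hostname must resolve back to the original address.
    return ipaddress.ip_address(ip) in resolved
```

A Cloudflare Workers address would be expected to fail a check like this even when it is relaying a genuine Googlebot request, because its PTR record (if any) does not point to a Google hostname.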
Checking my logs, this Apache error has been occurring at least since October 3. It may have started earlier, since the log file has probably rotated. It really looks like legitimate Googlebot traffic is somehow tied to traffic from 2a06:98c0:3600::103, which is then getting a 403 from my server’s WAF.
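In case it helps anyone confirm the same pattern, this is roughly how the pairing can be checked once the access log is parsed. It is only a sketch: the tuple layout (timestamp, IP, URL, status) and the function name are hypothetical, and parsing your own log format into those tuples is left out.

```python
from collections import defaultdict


def find_paired_requests(entries):
    """Group entries by (timestamp, url) and report groups that contain
    both an allowed (200) and a blocked (403) request.

    entries: iterable of (timestamp, ip, url, status) tuples (assumed layout).
    """
    groups = defaultdict(list)
    for timestamp, ip, url, status in entries:
        groups[(timestamp, url)].append((ip, status))

    pairs = []
    for (timestamp, url), hits in groups.items():
        allowed = [ip for ip, status in hits if status == 200]
        blocked = [ip for ip, status in hits if status == 403]
        if allowed and blocked:
            pairs.append((timestamp, url, allowed, blocked))
    return pairs
```

If nearly every blocked request from 2a06:98c0:3600::103 has a matching 200 from a Googlebot IP in the same second, that points to linked traffic rather than an unrelated bad actor.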
I had a similar issue with one of my domains. Same double requests, same requests coming from Known Bots (Googlebot, Twitterbot, Applebot etc.) arriving at the origin with Cloudflare Workers public IP. It was recently determined to have been caused by AMP Real URL. The issue went away immediately after turning this feature off.
In my case I had turned it on for testing, but soon gave up the idea of an AMP site and forgot to turn that feature off. For that reason, it was a no-brainer to just turn it off. If you use AMP Real URL in a production setting, and in case you test and confirm yours is also a result of AMP Real URL, you’d then need to evaluate whether it’s worth keeping it while Cloudflare fixes the issue for good.
The Cloudflare Worker that is behind AMP Real URL seems to somehow remove some of the request headers, even those added by a Transform Rule. That’s how, in my case, I first learned there was a problem, as I have a Transform Rule that adds a header without which the origin will block the request. And that’s probably why mod_security is also blocking those requests, as the CF-Connecting-IP header is probably arriving at the origin empty.
With the help of Cloudflare support, the issue has been tied to the Automatic Signed Exchanges (SXGs) feature. I’m now working on a possible explanation and (ideally) a workaround.
Yes, the Automatic Signed Exchanges feature is legitimately causing traffic on that IPv6 address. For me, I was seeing a 403 response because that traffic was triggering a rule in the Atomicorp WAF installed on my server (not the Cloudflare WAF).
In other words, there was nothing that could be done in Cloudflare to fix this. In the end, I left SXG activated, and modified the server WAF a bit to avoid the 403 error.
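To illustrate the kind of logic such a modification might follow (this is an assumption, not the exact change made to the Atomicorp rules), the Python sketch below skips the Googlebot DNS check when the connecting address belongs to Cloudflare. The only range listed is 2a06:98c0::/29, which I believe is on Cloudflare’s published IP list; verify it against the current list before relying on it. The function names are hypothetical.

```python
import ipaddress

# Assumption: one of Cloudflare's published IPv6 ranges.
# Check Cloudflare's current IP list before using this anywhere real.
CLOUDFLARE_RANGES = [ipaddress.ip_network("2a06:98c0::/29")]


def is_cloudflare_address(ip):
    """True if ip falls inside one of the listed Cloudflare ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in network for network in CLOUDFLARE_RANGES)


def should_run_googlebot_check(ip):
    """Skip the fake-Googlebot DNS check for Cloudflare-originated requests."""
    return not is_cloudflare_address(ip)
```

The tradeoff is the one raised earlier in the thread: anything arriving from a Cloudflare Workers address with a Googlebot user agent is no longer verified, so a narrower exemption is safer than allowlisting the address outright.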