Crawl bots still passing through with AS number firewall rule in place

I created a firewall rule to allow access only from a particular ISP's AS number.

(ip.geoip.asnum ne 1221 and http.host eq "www.example.com") with the action set to Block

On the origin server, I have a ufw rule in place that only allows Cloudflare IP ranges to reach the web ports. But in the nginx reverse proxy log, I still see a lot of bots (Googlebot, bingbot, etc.) knocking on the door via the Cloudflare source IP ranges.
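For reference, that kind of origin lockdown is usually configured along these lines. This is only a sketch: it assumes ufw is the firewall and pulls the current ranges from www.cloudflare.com/ips-v4 (the published list changes occasionally, so it should be refreshed periodically):

```shell
# Sketch only: allow Cloudflare's published IPv4 ranges to reach port 443,
# one ufw rule per range, and deny everything else via the default policy.
for range in $(curl -s https://www.cloudflare.com/ips-v4); do
    sudo ufw allow proto tcp from "$range" to any port 443
done
sudo ufw default deny incoming
```

The IPv6 list at www.cloudflare.com/ips-v6 would need the same treatment if the server is reachable over IPv6.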

I am trying to understand the logic of the CF firewall rule. Shouldn't my rule allow web requests only from AS number 1221? Why can those bots still pass through? I don't believe all those bots are hosted in AS 1221.

Thanks.

The expression is fine and should allow access to that hostname only if the request comes from the indicated AS.

In these cases the requests are usually direct, but you seemingly already ruled that out. Did you double-check that this really can't be the case and that your server only accepts connections from Cloudflare?

If so, I'd check whether these requests might be for a different hostname. In that case, the rule would certainly not fire.

You can also check the order of the rules: if there is, for example, an earlier allow rule, it would prevent this rule from firing as well.

What’s the actual hostname?

Thanks Sandro for your prompt reply.
I am also thinking the rule is not being hit because of the hostname.
My server's DNS A record is managed by CF. For example: abc.example.com 123.123.123.123, proxied. I have another web server, with www.example.com and example.com pointing to a different IP address in the same zone. If a bot request does not match the hostname 'abc.example.com', how does the traffic reach my server behind CF? This is the part I am not sure about.
Please advise, thanks.

So you only have one hostname configured for that IP address? In that case it really shouldn’t be requests for other hostnames.

But again, if you posted the actual hostnames it would be easier :wink:

I do have another A record pointing to the same IP, which exposes the origin server. This is purely for SSH and non-web direct access. The ufw rule only allows the CF ranges to access port 443. If those bots try to access my server's port 443 via the origin IP directly, ufw will drop the connection as long as they are not sourced from a CF IP.

I am still a bit hesitant to post the actual hostname due to privacy concerns. I tried to send you a private message but could not find how in this community.

I am afraid you cannot send a private message, but you can post it briefly, notify me, and then remove your posting.

Alternatively you can run a check at sitemeer.com and tell me the exact time you did so.

All right, you can edit your posting to remove it and then delete the posting.

Just sent a request for that hostname from a number of locations and always got a 403, so I’d assume the block does work. Could it be that you have some networks whitelisted somewhere?

I’d check the firewall event log for that particular hostname.

Thanks, I see CF firewall successfully blocked some requests for this hostname.

However, my concern is that in my web server log I sometimes still see bots knocking on the door.

172.69.63.28 - - [13/Oct/2021:16:35:19 +1100] "GET /yeezy-boost-700-v2-static-white-kanye-west-sneakers-running-shoes-ef2829 HTTP/1.1" 403 196 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
108.162.246.205 - - [13/Oct/2021:17:35:41 +1100] "GET /robots.txt HTTP/1.1" 403 134 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
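Those source addresses can be checked against Cloudflare's published IPv4 ranges. A quick sketch in Python's standard `ipaddress` module (the CIDRs below are only a subset of the list at www.cloudflare.com/ips-v4, current at the time of writing; the full, refreshed list should be used in practice):

```python
import ipaddress

# Subset of Cloudflare's published IPv4 ranges (www.cloudflare.com/ips-v4).
CLOUDFLARE_V4 = [
    "108.162.192.0/18",
    "141.101.64.0/18",
    "162.158.0.0/15",
    "172.64.0.0/13",
    "198.41.128.0/17",
]

def is_cloudflare(ip: str) -> bool:
    """Return True if the address falls inside one of the listed ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(net) for net in CLOUDFLARE_V4)

# Both source IPs from the log lines above fall inside Cloudflare's ranges,
# i.e. the requests were proxied through Cloudflare, not sent directly.
print(is_cloudflare("172.69.63.28"))     # True
print(is_cloudflare("108.162.246.205"))  # True
```

This confirms the bots are reaching the origin *through* Cloudflare, which is exactly why the firewall rule (or a whitelist in front of it) is the place to look.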

I have no idea how they get in. So I created some user-agent blocks in the nginx configuration to return 403. I expected the CF firewall rule to filter out all the noise instead of me doing it manually on the server.
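A user-agent block of that kind is typically done with a `map` in nginx. A minimal sketch, where the patterns and the `$blocked_ua` variable name are purely illustrative (the `map` block must sit at the `http` level, outside any `server` block):

```nginx
# Illustrative only: flag common crawler user agents, then deny them.
map $http_user_agent $blocked_ua {
    default     0;
    ~*googlebot 1;
    ~*bingbot   1;
}

server {
    listen 443 ssl;
    server_name www.example.com;

    if ($blocked_ua) {
        return 403;
    }
}
```

The 403s with a 134/196-byte body in the logs above are consistent with a block of this sort firing at the proxy rather than at Cloudflare.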

And one more recent nginx log entry:

162.158.94.218 - - [13/Oct/2021:19:03:37 +1100] "GET / HTTP/1.1" 401 94 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36"

The 401 means the request successfully passed through the CF firewall but hit the web server's login authentication. The user agent is definitely not me coming from AS 1221.

For starters, I would recommend rewriting the IP addresses, as it is otherwise difficult to tell where these requests came from.

And I would check any whitelisted entries. If you have whitelisted any networks, your firewall rule won’t fire either.

That seems to be a general ISP, so these requests might come from another customer of your ISP.

Maybe limit it to your IP address or consider using Cloudflare Access instead.

As with the original poster, have you blocked all access that isn't from the list at cloudflare.com/ips?

I can almost confirm it is the hostname condition that keeps the firewall rule from being hit.
I narrowed the rule down from AS number control to a source IP list (4 IP addresses), but I still see crawl bots accessing my server.

This is some nginx reverse proxy access log from the origin server after applying the source IP + hostname firewall rule in CF, (not ip.src in $trustip and http.host eq "www.example.com"), with the action set to Block.

162.158.107.122 - - [14/Oct/2021:13:00:15 +1100] "GET /image/cache/catalog/p/d4f9a4f8a0c711ea87f2f33e5b103bfa-74x74.png HTTP/1.1" 403 134 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
172.70.130.7 - - [14/Oct/2021:13:22:09 +1100] "GET /index.php?route=product/product/review&product_id=22 HTTP/1.1" 403 196 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
172.69.63.96 - - [14/Oct/2021:13:23:29 +1100] "GET / HTTP/1.1" 403 134 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Then my question becomes: does the 'hostname' condition in the firewall rule behave as expected? How should firewall rules be defined when there are multiple web servers under the same zone?

I have no idea how crawl bots come through CF without matching the 'hostname' condition.

I am tempted to test removing the 'hostname' condition and using only the ip.src control, but that would impact the other servers in the same zone.

@sdayman yes, ufw only allows the CF IPs to access inbound port 443.

Interesting: after I applied the firewall rule (not ip.src in $trustip) with the action set to Block, I still see access log entries from some bots on the origin server.

The only guess I can think of is that someone else is using CF and pointing an A record at my origin server's real IP. Their traffic would not hit my zone's CF firewall rules but would still arrive from CF IPs. Isn't that a way for attackers to bypass some CF security controls?

Based on what you described the hostname should not be relevant.

As I said earlier, the block generally appears to work

however it apparently still allows certain requests, and that will most likely be because of some whitelisting on your side or, as I also mentioned, because you are allowing an entire AS.

Hi @sandro, thank you for looking into this issue.
Were you just sending a bunch of requests with the user agent 'okhttp'? I have already updated the CF firewall rule to only allow my IP addresses, not the AS number any more. But I still see requests coming through from CF IPs.

Some 403s are returned by the nginx reverse proxy configuration, some by the web application. None are returned by CF as I expected.

If you only allow your IP address you really should not get requests from anywhere else. Can you post the exact expression?

Also, as I mentioned earlier, did you check the rule order?

Also, also, I’d still rewrite the IP addresses as you currently have no way of telling who sent those requests.
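For what it's worth, rewriting the logged addresses to the real visitor IP is typically done with nginx's `ngx_http_realip_module`: trust Cloudflare's proxy ranges and read the `CF-Connecting-IP` header Cloudflare adds to every proxied request. A sketch (only a subset of the ranges from www.cloudflare.com/ips-v4 is shown; all of them, plus the IPv6 list, belong in a real config):

```nginx
# Illustrative only: restore the real client IP for requests arriving
# from Cloudflare's proxy ranges, using the CF-Connecting-IP header.
set_real_ip_from 108.162.192.0/18;
set_real_ip_from 162.158.0.0/15;
set_real_ip_from 172.64.0.0/13;
real_ip_header CF-Connecting-IP;
```

With this in place the access log shows the original visitor's address instead of the Cloudflare edge IP, which makes it much easier to tell who actually sent a request.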

I am on the CF free plan and have only created three rules in the zone.

First block rule

(http.request.uri contains "xmlrpc.php") or (http.request.uri.path contains "/wp-login.php" and ip.geoip.asnum ne 1221) or (http.request.uri.path contains "/wp-admin" and ip.geoip.asnum ne 1221) or (cf.client.bot)

Second block rule

(http.request.uri.query contains "author_name=") or 
(http.request.uri.query contains "author=" and not http.request.uri.path contains "/wp-admin/export.php") or 
(http.request.full_uri contains "wp-config.") or 
(http.request.uri.path contains "/wp-json/") or 
(http.request.uri.path contains "/wp-content/" and http.request.uri.path contains ".php") or 
(http.request.uri.path contains "phpmyadmin") or 
(http.request.uri.path contains "/phpunit") or 
(http.request.full_uri contains "<?php") or 
(http.cookie contains "<?php") or 
(http.request.full_uri contains "passwd") or 
(http.request.uri contains "/dfs/") or 
(http.request.uri contains "/autodiscover/") or 
(http.request.uri contains "/wpad.") or 
(http.request.full_uri contains "webconfig.txt") or 
(http.request.full_uri contains "vuln.") or 
(http.request.uri.query contains "base64") or 
(http.request.uri.query contains "<script") or (http.request.uri.query contains "%3Cscript") or 
(http.cookie contains "<script") or (http.referer contains "<script") or 
(upper(http.request.uri.query) contains " UNION ALL ") or (upper(http.request.uri.query) contains " SELECT ") or 
(http.request.uri.query contains "$_GLOBALS[") or (http.request.uri.query contains "$_REQUEST[") or (http.request.uri.query contains "$_POST[")

Third block rule

(not ip.src in $trustip and http.host eq "example.example.com")

The $trustip list contains only 4 IP addresses.