I can almost confirm that the firewall rule is not being hit because of the hostname condition.
I narrowed the rule down from AS-number control to a source-IP list (4 IP addresses), but I still see crawler bots reaching my server.
Here is some nginx access-log output from the origin server, taken after applying the source IP + hostname firewall rule in Cloudflare: (not ip.src in $trustip and http.host eq "www.example.com") with action Block.
188.8.131.52 - - [14/Oct/2021:13:00:15 +1100] "GET /image/cache/catalog/p/d4f9a4f8a0c711ea87f2f33e5b103bfa-74x74.png HTTP/1.1" 403 134 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
184.108.40.206 - - [14/Oct/2021:13:22:09 +1100] "GET /index.php?route=product/product/review&product_id=22 HTTP/1.1" 403 196 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
220.127.116.11 - - [14/Oct/2021:13:23:29 +1100] "GET / HTTP/1.1" 403 134 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
So my question becomes: is the 'hostname' condition in the firewall rule behaving as expected? And how should firewall rules be defined when multiple web servers sit under the same zone?
I have no idea how crawler bots are getting through Cloudflare without matching the 'hostname' condition.
I'm inclined to test removing the 'hostname' condition and using only the ip.src control, but that would impact the other servers in the same zone.
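For reference, one way to keep per-server rules separate in a single zone is one block rule per hostname, each with its own trusted-IP list. This is only a sketch: api.example.com and the $trustip_api list are hypothetical names, not from my actual config. Note the parentheses, which make the intended grouping explicit:

```
(http.host eq "www.example.com" and not ip.src in $trustip)      → Block
(http.host eq "api.example.com" and not ip.src in $trustip_api)  → Block
```

Each rule then only ever fires for its own hostname, so removing or editing one cannot affect traffic to the other servers in the zone.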
@sdayman Yes, ufw only allows the Cloudflare IPs to reach inbound port 443.
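For anyone wanting to reproduce that ufw setup, here is a minimal sketch that prints the `ufw allow` commands for a list of Cloudflare CIDRs. The two sample ranges below are illustrative, not the full current list; in practice you would feed in the live list from https://www.cloudflare.com/ips-v4 (e.g. via `curl -s`):

```shell
# Illustrative subset of Cloudflare's published IPv4 ranges (not the full list).
CF_IPS="173.245.48.0/20
103.21.244.0/22"

# Print (rather than run) one ufw rule per CIDR, allowing inbound TCP 443 only.
cf_ufw_rules() {
  for cidr in $CF_IPS; do
    echo "ufw allow proto tcp from $cidr to any port 443"
  done
}

cf_ufw_rules
```

Piping the output through `sh` (as root) would apply the rules; printing first makes it easy to review them, and you would still want `ufw default deny incoming` so everything else is dropped.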