Cloudflare WAF false positive: BingBot

Auto-closing topics too quickly really is an issue here. Dozens of auto-closed topics cover this problem, yet it was never resolved. I link just one here and open this as a follow-up, in the hope that it is not auto-closed again: Cloudflare blocking bingbot crawl

@tye730 @fritex pinging you here, as you are probably interested.

Today I switched to the new managed WAF rules, and watched the event log. Cloudflare’s own ruleset triggered a block of the following request:

{
  "action": "block",
  "clientASNDescription": "MICROSOFT-CORP-MSN-AS-BLOCK",
  "clientAsn": "8075",
  "clientCountryName": "US",
  "clientIP": "40.77.202.147",
  "clientRequestHTTPHost": "dietpi.com",
  "clientRequestHTTPMethodName": "POST",
  "clientRequestHTTPProtocol": "HTTP/2",
  "clientRequestPath": "/matomo/matomo.php",
  "clientRequestQuery": "?action_name=Profile%20-%20helio58%20-%20DietPi%20Community%20Forum&idsite=1&rec=1&r=936160&h=9&m=57&s=18&url=https%3A%2F%2Fdietpi.com%2Fforum%2Fu%2Fhelio58&_id=23275a695d662683&_idn=1&send_image=0&_refts=0&pv_id=ufOegz&pf_net=0&pf_srv=18&pf_tfr=0&pf_dm1=63&uadata=%7B%7D&cookie=1&res=320x568",
  "datetime": "2024-02-24T17:57:18Z",
  "rayName": "85a99713ba28307c",
  "ruleId": "ae20608d93b94e97988db1bbc12cf9c8",
  "rulesetId": "efb7b8c949ac4650a09736fc376e9aee",
  "source": "firewallManaged",
  "userAgent": "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Mobile Safari/537.36 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
  "matchIndex": 0,
  "metadata": [
    {
      "key": "ruleset_version",
      "value": "184"
    },
    {
      "key": "version",
      "value": "184"
    },
    {
      "key": "type",
      "value": "customer"
    }
  ],
  "sampleInterval": 1
}

This rule is named Anomaly:Header:User-Agent - Fake Bing or MSN Bot. However, looking at the user agent, it is the correct BingBot. Probably Cloudflare expects an old user agent, which changed 2 years ago: Announcing user-agent change for Bing crawler bingbot | Bing Webmaster Blog

Comparing the user agents, with the expected one on the first line and the one that triggered the WAF rule on the second:

Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Mobile Safari/537.36 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

A perfect match, I would say, hence the Cloudflare managed rule is wrong.
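The comparison above can also be done mechanically. This is a minimal sketch, assuming that `W.X.Y.Z` in the documented template stands for a four-part numeric Chrome version; the helper names `bingbot_ua_pattern` and `matches_template` are made up for illustration and say nothing about how the Cloudflare rule works internally.

```python
import re

# Documented template as quoted above; W.X.Y.Z is the Chrome version placeholder.
TEMPLATE = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile "
    "Safari/537.36 (compatible; bingbot/2.0; "
    "+http://www.bing.com/bingbot.htm)"
)

# User agent from the blocked request in the event log above.
OBSERVED = TEMPLATE.replace("W.X.Y.Z", "112.0.0.0")


def bingbot_ua_pattern(template: str) -> re.Pattern:
    """Turn the template into a regex (hypothetical helper)."""
    # Escape the whole template, then swap the escaped placeholder
    # for a four-part numeric version (an assumption about W.X.Y.Z).
    escaped = re.escape(template)
    placeholder = re.escape("W.X.Y.Z")
    return re.compile(escaped.replace(placeholder, r"\d+\.\d+\.\d+\.\d+"))


def matches_template(user_agent: str) -> bool:
    """True if the whole user agent string matches the template."""
    return bingbot_ua_pattern(TEMPLATE).fullmatch(user_agent) is not None


print(matches_template(OBSERVED))
```

A match here only shows that the string has the documented shape. Anyone can send this user agent, so it says nothing about whether the client really is BingBot.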

Or do I misunderstand the rule, and it triggers when a client sends the BingBot user agent but cannot be BingBot, based on its IP range or other request parameters?

It’s not a false positive. The PTR doesn’t match the format Microsoft says requests will take. If you believe this is a valid request from Microsoft Bing you should contact them because they have misconfigured their client based on the information they have provided to others to ensure requests are legit.

By PTR, do you mean the IP address? I just checked 40.77.202.147 with Microsoft’s BingBot verification tool, and it says it is indeed a BingBot IP address: Bing - Verify Bingbot Tool

Where do you get other information from?

EDIT: The other tool in Bing Webmaster Tools gives the same result; it is probably the same tool embedded in a different site: Bing Webmaster Tools

EDIT2: Last but not least, here is a JSON file with a list of all prefixes: https://www.bing.com/toolbox/bingbot.json
Among them "ipv4Prefix": "40.77.202.0/24", matching the IP we got.

So wherever you/Cloudflare get the information from, it is wrong/outdated and this WAF rule IS a false positive.

Or again, I do not understand what you mean by “PTR”, which I only know as the answer to reverse DNS requests, i.e. the hostname an IP is mapped to, based on information given in the response for this IP.

https://www.bing.com/webmasters/help/how-to-verify-bingbot-3905dc26

dig +short 147.202.77.40.in-addr.arpa ptr

No PTR, thus no Bing Bot.
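For those who want to script the same check, here is a minimal Python sketch. It assumes the verification works as I understand the Bing help page linked above: the reverse lookup must return a name ending in `search.msn.com`, and a forward lookup of that name must return the original IP. The function names are made up for illustration, and the resolvers are passed in as parameters so the logic can be tried without network access.

```python
import ipaddress
import socket


def ptr_query_name(ip: str) -> str:
    """Name that `dig ... ptr` queries, e.g. 147.202.77.40.in-addr.arpa."""
    return ipaddress.ip_address(ip).reverse_pointer


def reverse_lookup(ip: str):
    """Return the PTR hostname for an IP, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return None


def forward_lookup(hostname: str):
    """Return the IPv4 addresses a hostname resolves to (may be empty)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except (socket.herror, socket.gaierror):
        return []


def is_verified_bingbot(ip, reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS check (sketch, see assumptions above)."""
    hostname = reverse(ip)
    if hostname is None:
        return False  # no PTR at all, which is the case discussed here
    if not hostname.lower().rstrip(".").endswith(".search.msn.com"):
        return False  # PTR exists but has the wrong format
    return ip in forward(hostname)  # the name must map back to the same IP


print(ptr_query_name("40.77.202.147"))
```

Calling `is_verified_bingbot("40.77.202.147")` with the default resolvers performs real DNS lookups, so the result depends on what Microsoft's DNS returns at that moment.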

Edit: fixed dig command

Okay, so based on the IP address being part of the official BingBot IP prefixes, it is definitely BingBot. But indeed the PTR does not match what this page says it should be. More confusingly, the verification tool linked right above that text says that it is BingBot.

It is still a false positive, but one caused by contradictory information given by MS. Since blocking BingBot is pretty damaging to SEO, I still suggest adjusting or disabling this rule out of the box, so it does not hurt Cloudflare customers.

I’ll see whether I can send a ticket to Bing/MS regarding this.

Bing says it is.

That link about reporting issues is only about behavior. Not about how the IP address resolves.

Exactly. The IP address is correct, the PTR is not. I just sent a support request to Bing Webmaster.

Only some particular BingBot instances have no PTR, while most do have one. I just checked a bunch of other IPs from the JSON list, and most of them do have a PTR record as expected, like msnbot-20-43-120-16.search.msn.com for 20.43.120.16.

Microsoft has told others how to verify that a bot is Bingbot: check whether the IP has a properly formatted PTR. 40.77.202.147 doesn’t, based on the information they say proves an IP is part of Bingbot. If another tool of theirs happens to say it is valid, that indicates a disconnect in their tooling. However, it’s not a false positive, as the answer to a PTR query for the IP address says it does not match. When or if Microsoft corrects the PTR record for this IP it will pass; until then it’s not a false positive, it is an accurate verification (failure) of whether or not the IP is part of Bingbot.

https://www.bing.com/webmasters/help/how-to-verify-bingbot-3905dc26

Could it be an IP in use by Bingbot? Sure. But it’s going to be correctly flagged and blocked by any service using the standards Microsoft has laid out for verification until Microsoft corrects the failure condition that exists.

Not sure about your definition of a false positive: In fact Cloudflare blocks some real BingBot instances from crawling Cloudflare-protected websites, as long as admins do not manually disable or override the default WAF rule which does this. This is all that matters for me, as it hurts our SEO, regardless whether it is caused by MS/Bing not serving the intended PTR, or by Cloudflare, and hence whether it matches your definition of a false positive or not.

I see that the issue indeed lies more on MS/Bing side, hence I contacted them. Whether and when this will be fixed, i.e. all BingBot instances serve the intended PTR, or the documentation regarding it is updated, is another question. And as long as it is not fixed, it would help Cloudflare customers in terms of SEO to have this rule disabled by default, or at least have this topic here as information about the matter, so they know why BingBot may be blocked, and how it can be prevented.

Valid crawlers publish ways for security tools to identify them from malicious crawlers. Microsoft is a multi-billion dollar company that is fully capable of managing their tools. Their failure to do so puts web administrators at risk. Disabling a tool to prevent fake bots because the real bot can’t be bothered to meet their own criteria is certainly an option.

Generally Microsoft screws this up every year or so. Trusting the tool that says this is a valid crawler (with no insight into how it operates) over their published mechanism for validation is certainly a decision you can choose to make. But there is no evidence that the tool is correct, just that it returns a different response.

A false positive would mean that Cloudflare has failed in its check against published mechanisms to determine if a crawler is valid. It demonstrably has not. Cloudflare has had false positives with crawlers before, based on incorrect logic by the development team. This is not such an instance.

Instead of checking the PTR, the IP could be checked against the list from Bing I linked above: https://www.bing.com/toolbox/bingbot.json
Their validator seems to do the same, i.e. seems to not check the PTR, otherwise it would/should fail as well.
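A minimal sketch of such a prefix check in Python. The embedded sample only contains the one prefix quoted earlier in this topic, and the top-level `prefixes` key and the `ipv6Prefix` field are my assumptions about the file layout; in practice the document would be downloaded from the URL above. The function names are made up for illustration.

```python
import ipaddress
import json

# Stand-in for bingbot.json. Only the "ipv4Prefix" entry is taken from
# this topic; the surrounding "prefixes" list is an assumed layout.
SAMPLE_JSON = """
{
  "prefixes": [
    {"ipv4Prefix": "40.77.202.0/24"}
  ]
}
"""


def load_networks(document: str):
    """Parse the prefix list into ipaddress network objects."""
    data = json.loads(document)
    networks = []
    for entry in data.get("prefixes", []):
        # "ipv6Prefix" is an assumed field name for IPv6 entries.
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            networks.append(ipaddress.ip_network(prefix))
    return networks


def ip_in_prefixes(ip: str, networks) -> bool:
    """True if the IP falls into any of the published prefixes."""
    addr = ipaddress.ip_address(ip)
    return any(
        addr.version == net.version and addr in net for net in networks
    )


networks = load_networks(SAMPLE_JSON)
print(ip_in_prefixes("40.77.202.147", networks))
```

This check needs no DNS at all, which is why it does not fail when a PTR record is missing. The trade-off is that the list has to be refreshed regularly, since prefixes can change.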

Otherwise, I personally would rate the downside of worse SEO larger than the downside of some random crawler on our website. If sensitive information is publicly available, then one has a problem either way.

Thank you for asking and providing more feedback information.

The first thing that caught my eye: why is Bingbot using the POST method at all? :thinking:

By definition, and as far as is known, crawling should use the GET method.

Nevertheless, I wonder whether this request was made manually through the input form in Bing Webmaster Tools and was therefore a POST for that particular URL, or whether Bingbot somehow figured the URL out, or the URL was submitted via the sitemap or a URL list file.

I’ve encountered a lot of fake ones as well as genuine ones. I was also “hit” by Bingbot crawl requests (GET) and indexing, with 20k+ requests causing my website bandwidth to rise for no particular reason.

Since both were mixed together, I decided to block the whole Microsoft ASN, as I neither use nor depend on their services, and little web traffic comes from their search engine.
As a result, 20k+ requests from the Microsoft ASN are now blocked daily.
99% of visitors come to my websites from Google search.

Could be, but cannot tell for sure since I don’t know what stuff is running in the background.

I went to double-check the IP on AbuseIPDB, just in case, since fake Bingbots do exist and even abused Bing IP addresses are out there:

This seems to be the case with this one too :thinking:

Furthermore, Cloudflare surely uses multiple factors before taking action for a particular request.

In that case you should turn off every fake crawler rule in the WAF for a service you deem essential to your SEO and leave them off permanently.

Problem solved.

Matomo is a self-hosted tracker. The original request is done via GET. The HTML document contains a script which gathers the information and sends it with this request to the PHP backend via POST.

Most requests are served by Cloudflare, and our server is more than capable of handling search engine and other crawlers, even when they go crazy at times, like the Amazon voice assistant a while ago with thousands of requests a day. Some search engine crawlers also crawl in waves, especially Yandex. Google and Bing crawl our website at a more or less constant rate. We also use the webmaster/search console tools and configure them to prefer crawls when visitor numbers are lower (MET night time). If you have problems with the bots, go block them. The usual way would be to send X-Robots-Tag: noindex, nofollow, but of course a firewall does the job as well. We, however, want search engines to index our website and actively do everything for this. A firewall which counteracts this is a problem for us. Even some fake bots slipping through would not bother us at all.

Anyone can report any IP at this database. People who feel offended/annoyed by (official) search engine crawlers may report them there. This IP range is controlled by MS/Bing, it is in their own public list of BingBot IP ranges, and their own validation tool verifies that it is a regular Bing bot. So there is zero reason to suspect it is a fake BingBot.

This may be the exact problem: It checks the PTR, probably in addition to the IP, while the IP alone would not fail a check. The PTR can be set or not set, and obviously MS/Bing failed to set it consistently for their BingBot servers. An IP address, however, is always there, so it is the 100% reliable factor to check, especially since MS/Bing provides the list of BingBot prefixes.

For me it was solved before I opened this topic. I was opening this with the aim to solve this at its core for everyone, since all Cloudflare customers are affected, and most surely do not recognise it. And SEO is not exactly something unimportant for people who want their website to be recognised. Bing is at least the 2nd largest search engine, hence not exactly unimportant for SEO.

I think I am having the same issue.

I understand that ‘Bing is not reporting the ips correctly’.

I can confirm this because when you run the test tool inside of the bing webmaster console for a page with ?THISISUNIQUE2943928821 and then check the logs, indeed Bing is being flagged with a False Positive.

There is literally no way to see that this is a fake bing bot, except by checking the RDNS.

When I check the rDNS of the IPs that Cloudflare is blocking, they have an rDNS of *.search.msn.com, indicating that they are indeed real.

My website was removed from the Bing index and this is literally the only thing I can find. Since turning off the “cloudflare option” I have now 10x the amount of “Bing Bots” visiting my website and they all have a RDNS of *.search.msn.com

I hope my website comes back to the index soon. Cloudflare also doesn’t give me the option of checking the rDNS anywhere.

Something similar happened to me after I added my website to Bing Webmaster Tools and later removed it. It continued to crawl, which made me block the Microsoft ASN AS8075 to get rid of so many unneeded gigabytes of daily traffic and of bad bots coming to scan and probe my website for possible vulnerabilities, since everything comes from the same ASN.

Exaggerated by some orders of magnitude :smile:. For any larger content, such as images, stylesheets, scripts etc., there is the Cloudflare cache, automatically. If that WAF rule was fixed, it would block fake Bing bots without hurting your SEO or having your website removed from Bing Webmaster Tools.

Nah, back then I just got tired of :point_up: and needed an easy exit :grin: