Crawler Hints sends wrong URLs to search engines

Hi everyone,

I decided to open a new topic because I think there’s a problem with Crawler Hints. Crawler Hints is a new tool, currently in beta, that aims to help search engines pick up the latest content from websites behind Cloudflare. It’s a noble goal and I admire the work Cloudflare does every day. Thank you all for that.

Unfortunately, errors can happen, as in this case. I do not know exactly which search engines use Crawler Hints (or the IndexNow protocol, for the tech-savvy), but Bing shows a helpful list of the URLs that Cloudflare submits.

While most of the URLs are correct and show that Cloudflare pushes updates really fast, some others look like they were taken from logs. Let me explain, and please forgive any mistakes, as I’m not a native English speaker. We all know that the internet is full of bots that visit websites daily and scrape their content. Those bots probe, for example, sensitive logs, update scripts, WordPress files and more, just to do some reconnaissance on the website.

As mentioned in “From 0 to 20 billion - How We Built Crawler Hints”, the first signal that a URL is a candidate for submission is a cache miss, which is totally reasonable. If I publish a new article or page on my blog and visitors start to arrive, Cloudflare’s logs will show a high number of cache MISSes. However, let’s think about the bot checks too. If my website does not have wp-login.php (a PHP script that bots commonly request to check whether WordPress is installed), most of the time the request will be a cache miss. The origin checks and answers 404. The problem is that this same URL, https://example.com/wp-login.php, is sent to Bing.
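
To make the point concrete, here is a minimal Python sketch of what a miss-only signal looks like. It is not Cloudflare’s actual pipeline: the log records, their field layout and the count_misses helper are all made up for illustration. It only shows that, if you count cache misses without looking at the origin status, a probed wp-login.php looks just as “fresh” as a real new article.

```python
from collections import Counter

# Hypothetical, simplified edge log records: (path, cache_status, origin_status).
# The field layout is an assumption for this example, not a real log format.
LOG = [
    ("/blog/new-article", "MISS", 200),
    ("/blog/new-article", "MISS", 200),
    ("/blog/new-article", "HIT", 200),
    ("/wp-login.php", "MISS", 404),
    ("/wp-login.php", "MISS", 404),
    ("/wp-login.php", "MISS", 404),
]


def count_misses(records):
    """Count cache misses per path, ignoring the origin status entirely."""
    return Counter(path for path, cache, _status in records if cache == "MISS")


misses = count_misses(LOG)
print(misses.most_common())
# → [('/wp-login.php', 3), ('/blog/new-article', 2)]
```

With a miss count alone, the non-existent page ranks above the real article.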

So I noticed that when a series of bots or real humans hit a 404 page (producing cache MISSes because Cloudflare did not cache the response), Crawler Hints marks that URL as “potentially helpful” for the IndexNow protocol and submits it to Bing. That makes no sense, and I’m writing this post precisely because Crawler Hints matters. Your mission is to keep useless requests from search engines to a minimum, but at the moment you’re sending Bing and others completely wrong URLs, which wastes tons of resources.

Ideally, Cloudflare should check the origin’s HTTP status before sending a Crawler Hint.
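
As a rough sketch of the kind of check I mean (again in Python, and again purely hypothetical: EdgeEvent, should_submit_hint and select_hints are names I invented, and I have no idea how Cloudflare implements this internally), a URL would only become a hint if the request was a cache miss and the origin answered with a 2xx status. The sketch also keeps each URL only once, assuming that repeated misses for the same URL should not produce repeated hints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeEvent:
    """Hypothetical record of one request seen at the edge."""
    url: str
    cache_status: str   # e.g. "MISS" or "HIT"
    origin_status: int  # HTTP status code returned by the origin


def should_submit_hint(event: EdgeEvent) -> bool:
    """True only for cache misses that the origin answered with a 2xx status."""
    return event.cache_status == "MISS" and 200 <= event.origin_status < 300


def select_hints(events):
    """Return qualifying URLs once each, in first-seen order."""
    qualifying = (e.url for e in events if should_submit_hint(e))
    return list(dict.fromkeys(qualifying))  # dict keys preserve insertion order


events = [
    EdgeEvent("https://example.com/blog/new-article", "MISS", 200),
    EdgeEvent("https://example.com/wp-login.php", "MISS", 404),
    EdgeEvent("https://example.com/blog/new-article", "MISS", 200),
    EdgeEvent("https://example.com/about", "HIT", 200),
]

hints = select_hints(events)
print(hints)
# → ['https://example.com/blog/new-article']
```

Here the 404 for wp-login.php never reaches the search engine, while the new article is still submitted.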

I’m available to discuss more.

Pinging @akrivit, who seemed to have more information about Crawler Hints. Since I manage several websites, I’m seeing the same issue on every website that has Crawler Hints enabled. If you wish, I can open a new ticket.

We do, and we will also run some de-duplication of the redundant misses you mentioned earlier. Keep testing it and playing with it though, and lmk if you find issues.

Hi akrivit, I think the main issue here is this chain: someone visits a non-existent page on my website => the origin returns a 404 page => Cloudflare records a cache miss => Cloudflare submits that URL to IndexNow/Crawler Hints. If you could take a look and investigate a little, that would be awesome.