How does Crawler Hints work, in detail?

I’ve read the two blog posts

…which are kinda heavy on hype and light on technical details. I wish engineers would write these blog posts instead of (or in tandem with) salespeople.

So, how does Cloudflare determine that a page has “changed”?
identical bytes?
last-modified header?
percentage of change?
can we ignore the surrounding layout (e.g. an always-updated sidebar)?

Each dynamic page we serve always has some bytes that change, so I’m worried that every page request will be considered “changed”, which would send incorrect signals to search engines and possibly increase crawling.
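To make the concern concrete, here’s a minimal sketch (hypothetical code, not anything Cloudflare has published) of what a naive byte-level hash check would do with two renders of the same article that differ only in a sidebar timestamp:

```python
import hashlib

def page_hash(body: bytes) -> str:
    """Naive 'has this page changed?' check: hash the raw response bytes."""
    return hashlib.sha256(body).hexdigest()

# Two renders of the same article; only a sidebar timestamp differs.
render_1 = b"<html><aside>Generated 10:00</aside><main>Article text</main></html>"
render_2 = b"<html><aside>Generated 10:05</aside><main>Article text</main></html>"

# A byte-level hash reports a change even though the article itself is identical.
assert page_hash(render_1) != page_hash(render_2)
```

So with byte-identity as the criterion, every request to a dynamic page would look like a change.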

Also, the first blog post mentions sitemaps, but how does this fit in?
do you auto-generate sitemaps? at which URL(s)?
do you override existing sitemaps?

:rofl:
The authors are engineers. But the blog posts are for more general consumption.

From the blog post:

We see what pages our customers are serving, we know which ones have changed (either by hash value or timestamp) and so can automatically build a complete record of when and which pages have changed.

@akrivit is in the Community and can probably go deeper into the details.

Hype and general consumption are fine, but as an engineer I also really need technical details in order to make my decisions.

In the blog I did see “either by hash value or timestamp”, but that’s too light on detail. In particular, “hash value” is such an unrealistic criterion for “has this page changed?” that I would have to assume the Crawler Hints feature is only suited to static sites. And that doesn’t really make sense, given what I know of Cloudflare.

Oh, a related question: do you notify crawlers about URLs that are Disallowed in robots.txt?

Cache status, some status codes, ETag/Last-Modified. We’re not looking at dynamic content right now. Crawl cadence is still determined by the crawler, which uses additional signals like sitemaps; we’re just providing an extra heuristic to help make that decision. Working on adding more variables soon.
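Reading between the lines, a heuristic along those lines might look roughly like this. This is a guess at the shape of the logic, not Cloudflare’s actual code; the function name, parameters, and rules are all assumptions:

```python
from typing import Optional

def looks_changed(old_etag: Optional[str], new_etag: Optional[str],
                  old_lastmod: Optional[str], new_lastmod: Optional[str],
                  cache_status: str) -> bool:
    """Rough change heuristic in the spirit of 'cache status + ETag/Last-Modified'.

    Hypothetical sketch: treat a page as changed when its validators differ,
    and stay quiet when there's nothing reliable to compare.
    """
    if cache_status == "HIT":
        # Served from cache: origin content unchanged since it was stored.
        return False
    if old_etag and new_etag:
        return old_etag != new_etag
    if old_lastmod and new_lastmod:
        return old_lastmod != new_lastmod
    # No validators available: don't emit a possibly wrong hint.
    return False
```

Under this sketch, a page with a stable ETag would never be reported as changed even if it’s re-fetched on every request.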

Thanks. As long as weak ETags are OK, we’d be able to provide a correct signal of when the page last changed semantically.
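For example, an origin could derive a weak ETag (the `W/` prefix from RFC 9110) from just the semantically meaningful content, ignoring layout churn. The helper below is a hypothetical illustration, not an established API:

```python
import hashlib

def weak_etag(semantic_content: str) -> str:
    """Weak ETag (W/ prefix) computed only from the parts of the page that
    matter semantically, so layout noise like sidebars doesn't affect it."""
    digest = hashlib.sha256(semantic_content.encode("utf-8")).hexdigest()[:16]
    return f'W/"{digest}"'

# Same article body => same weak ETag, whatever the surrounding layout does.
assert weak_etag("Article text") == weak_etag("Article text")
assert weak_etag("Article text") != weak_etag("Article text, revised")
```

A heuristic that compares ETags would then only see a change when the article itself changes.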

I understand that, but if the heuristic is consistently wrong it might influence the crawler in the wrong way. I’m also confused by the sitemap comment above: according to the blog, sitemaps are exactly the mechanism through which Cloudflare provides additional heuristics to the crawlers.

Well, I realize it’s a work in progress; keep it up! :+1: