Enhancement for crawler/bot matching

Currently cf.client.bot is a boolean flag indicating whether the request came from a known crawler. While this allows for some generic filtering, it would be nice to have more granular control over such requests, for example to block (or allow) only specific crawlers.
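For reference, the only option today is an all-or-nothing expression that matches every known crawler at once, e.g. a rule whose whole expression is

cf.client.bot

paired with a block or allow action.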

For this I’d like to suggest cf.client.crawler (a different name to keep backwards compatibility with the current flag, though any other name would be fine too), which would contain an optional lowercase string identifying the crawler that sent the request (or null/undefined/etc. when no crawler is detected).

This would allow for a more flexible configuration along the lines of the example below.
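For example (a sketch of the proposed syntax, reusing the set operator from the existing rules language; the field name and the lowercase values are hypothetical):

cf.client.crawler in {"google" "bing"}

This would match only those two crawlers while leaving every other known bot untouched.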

A gentle nudge :slight_smile:

A mighty unpopular idea apparently :smile:

Comments appreciated.

My brain overloaded when you wanted to allow for “more flexible.”

Why not combine the existing Known Crawler rule with a User Agent string rule?


That’s a fair point, and I didn’t consider it earlier.
Though I’d still believe that

cf.client.crawler in {"google" "bing"}

is easier, more straightforward, and less error-prone than

cf.client.bot and (http.user_agent contains "Google" or http.user_agent contains "Bing")

The latter also requires one to either know all applicable user agents or find a common pattern and match it with contains.
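For instance (a hypothetical sketch; the exact user-agent tokens vary per vendor and would need to be checked against each crawler’s documentation), covering just Google’s crawlers by user agent already takes several patterns:

cf.client.bot and (http.user_agent contains "Googlebot" or http.user_agent contains "AdsBot-Google")

and any crawler the pattern list misses is silently not matched.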


I like the idea, and I wish CF would do something about its known-bot list to let users exclude some bots. Many of those on the list, while not malicious, may have no business crawling a user’s website.

Having to resort to a combination of the known-bot filter and another operator, while OK in a small rule, may add to the complexity of a larger rule and force the user to give up the rule-builder UX for the raw expression editor, where one is more likely to make a mistake.
