This post is part of a series on Cloudflare’s firewall engine and discusses rules which might make your site just a tad less welcoming to automated robots and crawlers.
Benefit of the doubt
So far, all firewall tips in this series have been about blocking bots. While that is fair, and usually the whole point of a firewall, today we’d like to be generous for once.
You are most likely familiar with it already, but just in case: ever since 1994 there has been a relatively straightforward standard for telling search engines and other crawlers what they should and should not request. It is a plain text file which follows a certain syntax, is saved in your site’s root directory, and is named robots.txt - robotstxt.org.
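To make the idea concrete, here is a minimal sketch of how a well-behaved crawler consults such a file. The robots.txt content and the bot names are made up for illustration; the parsing uses Python’s standard-library urllib.robotparser:

```python
from urllib.robotparser import RobotFileParser

# A minimal, hypothetical robots.txt: shut one crawler out entirely,
# and keep everyone else away from /admin/.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler asks before every request:
print(parser.can_fetch("BadBot", "/index.html"))       # False
print(parser.can_fetch("SomeCrawler", "/index.html"))  # True
print(parser.can_fetch("SomeCrawler", "/admin/page"))  # False
```

Of course, nothing technically forces a crawler to honor these answers; robots.txt is a polite convention, which is exactly why the rule below only grants access to the file itself.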
Of course, we could simply block every request which follows a certain pattern. However, it might be a good idea to give even the shadiest-looking crawler the benefit of the doubt - maybe it will actually be robots.txt-compliant and follow those instructions. And that is exactly what we are going to do with the following rule.
(http.request.uri.path eq "/robots.txt")
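If you want to be a little stricter, the expression could be narrowed to GET requests only - a variant, not part of the original rule, using Cloudflare’s http.request.method field:

```
(http.request.uri.path eq "/robots.txt" and http.request.method eq "GET")
```

Whether that restriction makes sense depends on your setup; no legitimate crawler should POST to robots.txt.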
Challenge or Block?
Neither. The point of this very rule is to allow these requests - and it should sit relatively high up in the rule list so it fires as soon as possible.
As always, don’t just copy/paste things: first evaluate whether a new rule fits your site’s setup, and be careful when making such changes, as they could break your site if not implemented with care. Also, pay attention to the order of your firewall rules, as they are evaluated in order.
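Why the order matters can be sketched with a toy first-match evaluator. This is an illustration of the general principle, not Cloudflare’s actual engine, and the rule predicates and request fields are made up:

```python
# Hypothetical first-match rule list: the allow rule sits above a
# broad bot-blocking rule, so robots.txt requests never reach the block.
rules = [
    ("allow", lambda req: req["path"] == "/robots.txt"),
    ("block", lambda req: "bot" in req["user_agent"].lower()),
]

def evaluate(request):
    for action, matches in rules:
        if matches(request):
            return action       # first matching rule decides
    return "allow"              # default: let the request through

print(evaluate({"path": "/robots.txt", "user_agent": "ShadyBot/1.0"}))  # allow
print(evaluate({"path": "/wp-admin", "user_agent": "ShadyBot/1.0"}))    # block
```

Swap the two rules around and the shady bot would be blocked before it ever gets to read your robots.txt - defeating the whole point of giving it the benefit of the doubt.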
Ceterum censeo, Flexible mode is insecure and should be deprecated for the sake of the security of the Internet. Cast your vote at Header indicating encryption status of the origin connection and get more transparency and security on the Internet.