Inexplicable, undesired HTTPS redirect

What is the issue you’re encountering

My website uses Cloudflare Pages and has Cloudflare proxying enabled. It also has HSTS enabled and Full (Strict) HTTPS encryption. I had “Always use HTTPS” enabled, too, since I thought I wanted no insecure HTTP requests at all.

However, I want Google not to index the insecure HTTP version of my website, and in Google Search Console I can see that it fetched ‘robots.txt’ over both insecure HTTP and HTTPS. I’d like ‘robots.txt’ to be a different file, configured to disallow indexing, when served over insecure HTTP as opposed to HTTPS. I thought I’d use a Rewrite rule to rewrite insecure HTTP requests for ‘robots.txt’ to a different Pages file configured the way I want, but “Always use HTTPS” prevented that from working, as requests became HTTPS before matching any rules. I therefore disabled “Always use HTTPS” (and will set up other rules to make the rest of the site inaccessible over HTTP), but somehow requests are still being redirected against my will.

Why would that be? Is it possible to fix it, or is it related to HSTS? (Note that it’s not only browsers doing this, which I would expect, but also raw HTTP requests such as those made with curl.) Alternatively, is there another way to achieve what I want, that is, preventing the insecure-HTTP pages from being indexed?

Cloudflare Pages does not support HTTP; it is HTTPS-only.

I believe you should be able to create a Worker to intercept the HTTP request before it reaches Pages, though.
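
To illustrate the idea, here is a minimal sketch of such a Worker, assuming it is bound to a route like example.com/robots.txt on the custom domain (example.com standing in for the real one) and that “Always use HTTPS” stays disabled so the plain-HTTP request actually reaches it:

```ts
// Minimal sketch, not a drop-in solution: answer plain-HTTP requests for
// /robots.txt with a "do not index" body and let everything else fall
// through to the Pages project behind the custom domain.
export default {
  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);

    if (url.protocol === "http:" && url.pathname === "/robots.txt") {
      // Insecure request: serve a robots.txt that disallows crawling.
      return new Response("User-agent: *\nDisallow: /\n", {
        headers: { "content-type": "text/plain" },
      });
    }

    // HTTPS requests (and any other path on the route) pass through to
    // the origin, i.e. the Pages deployment.
    return fetch(request);
  },
};
```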

I wasn’t aware of Pages’ lack of support for HTTP; thanks for letting me know.

I will, however, restate my point: I want Google not to index the HTTP URL of my domain. Since Googlebot seems to be fine with handling redirects (in fact, the HTTP ‘/robots.txt’ redirects to the HTTPS one), I don’t strictly need to serve it over insecure HTTP. What I would like is to make sure that, upon an insecure-HTTP request to ‘/robots.txt’, a different Pages file is served instead. And if that only works through a redirect to HTTPS, I’m fine with that; it may actually be even better. My concern is that the HTTPS redirect seems to happen before any rules are matched, so I couldn’t manage to get a Rewrite transform to rewrite insecure-HTTP requests to ‘/robots.txt’.

Is there a way to have some rules run before the HTTPS redirect happens? It may be worth pointing out that, even though I’m using Pages to serve my content, I’m talking about a custom domain (attached to the Pages app) that is part of my zone. If a rule that runs before the redirect is not possible, I’d like to know whether there’s another way to solve this problem, preferably one that does not require a Worker.

The transform rule does apply before the HTTPS redirect; I’ve just tested that.

My bad, a different rule was conflicting with it.

Now, I actually needed that conflicting rule, but maybe there’s a different way to handle it. As I said, I was trying to rewrite ‘/robots.txt’ to a different file (‘/insecure-robots.txt’) when requested via insecure HTTP. I’ve got that working now, but I’d like ‘/insecure-robots.txt’ to fail with a 404 if queried directly, as if it weren’t there (that entire file is just a trick to change the ‘robots.txt’ contents using a Transform rule).

Bonus question: when I opened this thread, it required me to enter my domain name. However, I’m now concerned because this thread appears in search results when searching for my website’s name. Can the domain name of my site be redacted from this thread?

EDIT: The conflicting rule I was talking about was my attempt to make ‘/insecure-robots.txt’ invisible: a Rewrite rule that rewrites requests for that file to ‘/404’, which is my 404 error page. However, with that rule enabled (regardless of its ordering relative to the insecure-HTTP ‘/robots.txt’ Rewrite rule), the latter would stop working.

Can you edit your post yourself? If not, I can flag it for a moderator.

You can’t if you want it to be served. http://example.com/robots.txt will redirect to https://example.com/insecure-robots.txt, and that file then needs to be accessible.

Can you edit your post yourself? If not, I can flag it for a moderator.

It doesn’t seem like it. These are the buttons I have on my post:
[Screenshot: the buttons available on my post]

It would be perfect if you could flag it for a moderator to redact. Thanks!

You can’t if you want it to be served. http://example.com/robots.txt will redirect to https://example.com/insecure-robots.txt, and that file then needs to be accessible.

I imagined that was why it wasn’t working, but conceptually it would work if I could set up a rule that causes subsequent rules to be skipped. If that’s possible, the rule that redirects /robots.txt to /insecure-robots.txt could make the “hidden files” rule be skipped.

Did that. Might take a while though, as it’s Sunday.

I don’t understand what you mean by that. What are you trying to achieve? That the insecure-robots.txt is only accessible via the http://example.com/robots.txt URL? Then I believe what you want is a worker.

Did that. Might take a while though, as it’s Sunday.

Perfect, thanks.

I don’t understand what you mean by that. What are you trying to achieve? That the insecure-robots.txt is only accessible via the http://example.com/robots.txt URL? Then I believe what you want is a worker.

That’s exactly it: I want ‘/insecure-robots.txt’ to be accessible only via the http://example.com/robots.txt URL. I understand a Worker would solve the issue, but since Worker requests are limited (even though the limit is admittedly very generous), I’d like to avoid using a Worker for such a simple use case. Even worse, as long as I’m on the free tier, someone could attempt a DoS attack by making a bunch of requests to any Worker URL until I run out of the quota.

I’m going to have to use Workers anyway for some operations on my site, but the more I limit those, the better. Since I think Workers will only be strictly required for operations that users perform via the frontend UI, and not for service pages per se (where it’s an HTTP client making requests), I might find a way to make sure that none of the Worker URLs respond to requests that aren’t made from within the frontend site code. Even if a bulletproof solution turns out not to be possible without a chicken-and-egg situation (for example, I think Workers would be the only way to implement CSRF tokens), a way to at least make a DoS harder would still rule out many attacks.

Getting back to the point: I’m not asking for help figuring out a way to prevent Worker requests from outside the frontend UI (that would be out of scope and is for me to work out depending on the architecture of my website), but to clarify the question I asked in my previous post:

What I meant was something like what’s possible with WAF custom rules, where I can configure a rule that causes certain requests to skip all other rules. Is something like that possible here?

Alternatively, is there any other way I can have a Rewrite rule target a destination URL that is otherwise inaccessible to direct requests? Interestingly, I have a similar, but working, arrangement with a continent-blocking rule: a Rewrite rule shows a page saying the website is unavailable in the visitor’s country by rewriting the request to a ‘/blocked-country’ URL, which matches a ‘/blocked-country.html’ file, while a subsequent rule rewrites direct ‘/blocked-country’ requests to a dummy URL in order to produce a 404 response. Any clue why it works for the continent-blocking arrangement but not for the ‘/insecure-robots.txt’ rule? The only difference I can spot is that ‘/insecure-robots.txt’ has an extension and matches the file as-is, while ‘/blocked-country’ has no ‘.html’ extension and Cloudflare internally rewrites the path to match the ‘/blocked-country.html’ file. It might be that this internal Cloudflare machinery is what makes my continent-blocking arrangement work when it otherwise wouldn’t. Or maybe I’m just missing something. What do you think?

Then I think the answer is no. Pages does not support HTTP, so the file always needs to be accessible via HTTPS.

It doesn’t have to respond over insecure HTTP. I’ve set it up and it works correctly by redirecting to HTTPS while simultaneously rewriting. So it goes like this:

  1. Request to http://example.com/robots.txt
  2. Rewrite to http://example.com/insecure-robots.txt
  3. Redirect to https://example.com/insecure-robots.txt

That’s perfect. The issue is that I’d like ‘/insecure-robots.txt’ not to be accessible on a direct request, and instead to return a 404 as if it didn’t exist. If I add a rule to make such a rewrite, it appears to conflict with the other Rewrite rule and then neither of them works. As of now, with ‘/insecure-robots.txt’ still not hidden, it works. I’m trying to find a way to keep it hidden while still having it be a usable target for my Rewrite rule. The interesting part is that, as I mentioned, the exact same arrangement works for the ‘/blocked-country’ path. That’s why I theorized that the latter might only work coincidentally, because of the internal machinery Cloudflare uses to automatically rewrite ‘/blocked-country’ to ‘/blocked-country.html’.
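
Setting aside the request-quota concern raised earlier: if a Worker does end up being the answer here, the earlier sketch could be extended so that the helper file is not needed at all, serving the “do not index” body directly for insecure requests and always returning 404 for ‘/insecure-robots.txt’. A hedged sketch, using the example paths from this thread and assuming routes covering both URLs:

```ts
// Sketch only: a single Worker standing in for both the Rewrite rule and
// the hidden /insecure-robots.txt Pages file. Assumes routes covering
// example.com/robots.txt and example.com/insecure-robots.txt.
export default {
  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);

    if (url.pathname === "/insecure-robots.txt") {
      // The helper file never has to exist: direct hits always get a 404.
      return new Response("Not found", { status: 404 });
    }

    if (url.pathname === "/robots.txt" && url.protocol === "http:") {
      // Insecure request: serve the "do not index" variant inline instead
      // of rewriting to a separate Pages file.
      return new Response("User-agent: *\nDisallow: /\n", {
        headers: { "content-type": "text/plain" },
      });
    }

    // The regular HTTPS /robots.txt (and anything else on the routes)
    // still comes from the Pages deployment.
    return fetch(request);
  },
};
```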

Thanks for responding, though.

This topic was automatically closed after 15 days. New replies are no longer allowed.