Reducing redirects and HSTS

Hi, I have a Joomla 4 website running on Fedora 38 with Apache 2.4. It’s a legacy site that has many tens of thousands of articles going back decades, to before HTTPS and before we stopped including the “www” as part of a URL.

This has created many external links of the form http://www.linuxsecurity.com/content/view/123456 that have since been redirected using a PHP script to the modern form, like https://linuxsecurity.com/privacy/eu-opens-public-consultation-on-rfid

The problem I’m having is that reaching the final URL takes several redirects: one from http to https, one to strip off the “www”, and then the translation from /content/view/123456 to /privacy/eu-opens-public-consultation-on-rfid.

These redirects are creating user delays and impacting our SEO, as per Google’s best practices.

I’m thinking that if I can run the redirection script from port 80 at the apache level that I can do all the redirections in the script there, rather than first having cloudflare (and apache?) convert from http to https.

However, I believe HSTS is preventing me from doing that. How do I balance the HSTS requirements/benefits with cloudflare settings? Do I disable HSTS in cloudflare? With it enabled, it appears port 80 is never even consulted on the webserver.

I know your site has been around a while so I can imagine there’s a lot of historic links you want to maintain :slight_smile: Your redirecting does show your problem…
https://cf.sjr.org.uk/tools/check?60f49e191d254ca4be8af35c8b5874ae#connection-server

Ideally you want to redirect each of those steps direct to the destination if Google doesn’t like the number of redirects, rather than step through the history.

Do not force HTTP as you have been using HSTS. As Cloudflare warns you when you set it up, you are telling a visitor’s browser to never connect to your site over HTTP for some time (the default is 6 months), and that won’t change for those visitors even if you disable HSTS. Forcing HTTP will mean previous visitors can’t reach your site for those 6 months.

If you want to do the redirects on your origin to tidy this up, assume that they will come in on HTTP or HTTPS (make sure SSL/TLS is set to “Full (strict)” on Cloudflare). You could turn off “Always use HTTPS” if you really want to minimise redirects and combine with other redirects on your origin.
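As a rough sketch of combining the scheme and hostname redirects on the origin, the rules below send any request that arrived over plain HTTP, or for the www hostname, straight to the canonical https://linuxsecurity.com URL in a single hop. This assumes the same rules sit in both the port 80 and port 443 virtual hosts, that “Always use HTTPS” is off so HTTP requests actually reach the origin, and that mod_rewrite is loaded; none of that is confirmed here.

```apache
RewriteEngine On

# Redirect if the request was plain HTTP ...
RewriteCond %{HTTPS} !=on [OR]
# ... or if it was addressed to the www hostname.
RewriteCond %{HTTP_HOST} ^www\.linuxsecurity\.com$ [NC]
# One 301 straight to the canonical scheme and host, keeping the path.
# The original query string is carried over automatically.
RewriteRule ^ https://linuxsecurity.com%{REQUEST_URI} [R=301,L]
```

Any rule that translates a legacy path would need to come before this one, otherwise the visitor is sent to the canonical host first and the path translation costs a second redirect.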

A free account may limit how many you can have, but consider bulk redirects on Cloudflare if you have a list of what you want to redirect from and to. Where possible, I’d always prefer to get Cloudflare to do the redirects…

[add] Or consider a Worker that can match the path and calculate or look up the redirect (I think the logic needed would be too much for a dynamic redirect rule).


I know your site has been around a while so I can imagine there’s a lot of historic links you want to maintain :slight_smile: Your redirecting does show your problem…
https://cf.sjr.org.uk/tools/check?60f49e191d254ca4be8af35c8b5874ae#connection-server

Yes, that’s essentially the problem I’m trying to fix. I’ve been using https://httpstatus.io to check links, but wget on the command line shows it equally well.

Ideally you want to redirect each of those steps direct to the destination if Google doesn’t like the number of redirects, rather than step through the history.

Isn’t this a distinction without a difference? I’m not sure I understand. If it’s http://linuxsecurity or http://www.linuxsecurity, both would just go directly to https://linuxsecurity and the new URL.

Do not force HTTP as you have been using HSTS. As Cloudflare warns you when you set it up, you are telling a visitor’s browser to never connect to your site over HTTP for some time (the default is 6 months), and that won’t change for those visitors even if you disable HSTS. Forcing HTTP will mean previous visitors can’t reach your site for those 6 months.

This was also effectively my question - you are saying I should disable HSTS at the cloudflare level, then perhaps implement it at the apache https level, correct?

I’m already using a “full strict” cert as part of their business plan. We also have far more static redirects than even the business plan will support. I’m thinking about abandoning the bulk of those static redirects and relying on the redirection PHP script we’ve developed that will translate the old /content/view/123456 links into their more modern SEF equivalent.

I’m not sure there’s much I can do about previous visitors, but what is the difference between HSTS and forcing HTTPS with cloudflare?

I hoped you could clarify now with this new information how specifically you think I should be managing this.

No, I wasn’t. However you implement HSTS, it tells a browser that has connected to your site to only use HTTPS in the future for a time period. When proxying through Cloudflare, that will be set at 6 months by default unless you set it differently. That browser will then only request https:// for all requests to your site even if the user tries http, so if you force HTTP, the browser will fail to connect to your site as it has been told to only use HTTPS for 6 months.

See here…

So you might as well leave HSTS enabled as you need to service HTTPS now anyway.
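For reference, this is roughly what the policy looks like if it were sent from the origin rather than set in the Cloudflare dashboard. It is only a sketch: it assumes mod_headers is loaded and that the directive sits inside the HTTPS virtual host, and the max-age value is my own conversion of the six-month figure rather than anything taken from the site’s configuration.

```apache
# Inside the port 443 virtual host only; browsers ignore this header
# when it arrives over plain HTTP.
# 15552000 seconds = 180 days, roughly the six-month default mentioned above.
Header always set Strict-Transport-Security "max-age=15552000"
```

Once a browser has stored this policy it rewrites http:// requests to https:// on its own until max-age expires, which is why removing the header later does not help visitors who already have it.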

What I meant was, using this example…
https://cf.sjr.org.uk/tools/check?60f49e191d254ca4be8af35c8b5874ae#connection-server
…is you need to redirect http://www.linuxsecurity.com/content/view/123456 direct to https://linuxsecurity.com/news/privacy/eu-opens-public-consultation-on-rfid without the intermediate steps. That’s the only way to compress the number of redirects. Whether you do that on Cloudflare (with redirect rules or a Worker) or your origin is up to you and how you achieve that, either programmatically or with a lookup table.


Yes, that is exactly what I want to do. This is exactly the reason for my post. But it appears cloudflare is bypassing http on our server, so I’m unable to run the redirection script on port 80 to perform those translations and compress the number of redirects.

It sounds like the next step would be to make sure “Always use HTTPS” is disabled?

  1. HSTS is not much of an issue, as modern browsers will enforce HTTPS regardless of HSTS or not
  2. As long as the redirects are handled by the proxies, there should not be any noticeable delay for the user - I would say you can safely ignore that
  3. Search engines are mostly looking for one authoritative link - if you redirect everything to the naked domain, you already do what they want and that should be all right
  4. Don’t do anything with local port 80 setups - everything in this regard should be handled by the proxies

In short, I don’t think you need to do anything extra here and your setup should be fine, as long as your encryption mode is Full Strict (as @sjr suggested)


Do you mean specifically the http and www redirects? The actual article redirects from /content/view/123456 are handled by apache.

The issue I was told I needed to solve was the delay that’s imposed with all the redirects and even that the redirects existed in the first place. Have I received bad information?

I wanted to follow up on this. This is the exact problem I’m trying to solve. Do you have any suggestions on how I might do this?

This is why I was proposing running our redirect script on port 80, as not only are the majority of the external links to port 80, but I could then perform all the redirection there, including stripping off the www and translating the old /content/view/123456 links to their SEF equivalent.

If I can’t/shouldn’t perform any operations on port 80 and let Cloudflare take control of it, then I lose the ability to combine the http and https redirects already.

But without doing it ourselves, we won’t pass a security scan.

That’s part of the problem - we can’t let the proxies handle the requests because there are too many for a static list and it doesn’t have access to the articles in the old database. It wouldn’t be able to run our redirection script that does this translation.

Eventually the user gets to the right authoritative link, but it sometimes involves three or more redirects.

That makes sense, but it automatically adds at least one additional redirect for every request and makes it much more difficult to flatten/compress the rest of them.

Ideas greatly appreciated.

Why not just run it on both http (80) and https (443)? Otherwise you’d have to redirect https to http to search redirects, then you’ll be going to https again anyway, making more work.

Assuming your problem is you have a very big list of redirects, you want to redirect in one step, and you are going to do this on the origin with a script then what I would do is…

Ignore http/https - service both, as the input could be http or https but the output will always be https.

then

Have a 404 handling script to see if any requests are old link formats, then either…

  • Have a lookup table or database to map the old link to the new (seemingly 4 per current page link given the example link you provided) OR
  • In code, simulate the redirects yourself from one format to the next, if that can be done in a programmatic way, then just output the final link as a single redirect (so instead of outputting each redirect, process them yourself to calculate the end redirect).

Once you have the logic, you could later transfer that to a Cloudflare worker if you wanted to offload it from the origin.


Just some additional information on what happens when someone tries to open one of these links:

  1. The visitor first needs to perform a DNS query to find out the IP address associated with your hostname. If Always use HTTPS is enabled, Cloudflare will publish HTTPS records that tell the browser that the site is also available on HTTPS.
    This shows in the browser as an Internal Redirect with Non-Authoritative-Reason: DNS and would not slow the site down.
  2. The browser will check if it has an HSTS policy for either the hostname or one of its parent hostnames. This could be supplied via an HSTS header from a previous visit or via the HSTS preload list.
    It would also show up as an Internal Redirect with Non-Authoritative-Reason: HSTS and not slow the site down.
  3. Chrome will try to use HTTPS anyway. This would show up as a Temporary Redirect with Non-Authoritative-Reason: HttpsUpgrades
  4. Only if 3 failed will Chrome try an HTTP connection.

So there is pretty much no point trying to redirect from HTTP directly. Forcing a connection via HTTP would actually slow your site down.


I really appreciate all of your help. I think I’m still a bit confused.

Yes, we have this redirect working already - that’s the redirection.php seen in the redirect chain from your app:

Received HTTP redirect code 301, absolute redirect to https://www.linuxsecurity.com/redirection/index.php?type=view&ids=123456&uri=/content/view/123456

There was an error in the script that created an additional redirect which has now been fixed, but I’m unsure what more can be done, if anything, given it’s probably not worth it to bother redirecting at port 80.

Here’s what it looks like now. We are redirecting from /content/view/123456 to the redirection script which is then redirecting to the eventual final destination.

$ wget -O /dev/null http://www.linuxsecurity.com/content/view/123456/
URL transformed to HTTPS due to an HSTS policy
--2024-02-12 17:11:09--  https://www.linuxsecurity.com/content/view/123456/
Resolving www.linuxsecurity.com (www.linuxsecurity.com)... 104.26.4.94, 104.26.5.94, 172.67.73.242, ...
Connecting to www.linuxsecurity.com (www.linuxsecurity.com)|104.26.4.94|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.linuxsecurity.com/redirection/index.php?type=view&ids=123456&uri=/content/view/123456 [following]
--2024-02-12 17:11:10--  https://www.linuxsecurity.com/redirection/index.php?type=view&ids=123456&uri=/content/view/123456
Reusing existing connection to www.linuxsecurity.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://linuxsecurity.com/news/privacy/eu-opens-public-consultation-on-rfid [following]
--2024-02-12 17:11:10--  https://linuxsecurity.com/news/privacy/eu-opens-public-consultation-on-rfid
Resolving linuxsecurity.com (linuxsecurity.com)... 104.26.4.94, 172.67.73.242, 104.26.5.94, ...
Connecting to linuxsecurity.com (linuxsecurity.com)|104.26.4.94|:443... connected.
HTTP request sent, awaiting response... 200 OK

Why? That seems like it would be handled much better by an internal rewrite.

Also, why do you need to run a script on every request? It seems like you could create a redirect map by running the script once across all articles and be done.

We do that with some of these, but there are tens of thousands (60k?) of articles of this form. Certainly we could just do a bulk redirect, but that would significantly affect the user experience.

We’re only running the script on requests to the old /content/view/123456 URLs.

We’re currently spawning the script with a rewriterule in apache.

Can you explain why you believe this to be the case?

Yes, but your RewriteRule seems to be redirecting to the script instead of just rewriting the request internally. I mean that your RewriteRule ends in [R] or [R=301], which is unusual for what you’re doing.

These are external links. If the user is expecting to read an article on, say, firewall security, and we direct them to something other than that article, they’re going to feel cheated or at least disappointed that they didn’t see what they were expecting, particularly without any kind of explanation.

Can you explain? I’m not sure of another way to do this.

That’s not what I meant. I mean that you run the script once to generate a file like this:

/content/view/1 linuxsecurity.com/some-article
/content/view/2 linuxsecurity.com/some-other-article
...
/content/view/60000 linuxsecurity.com/foo-article

You can then use such a file with mod_rewrite to send people directly to the target instead of invoking your script on each request.
https://httpd.apache.org/docs/2.4/mod/mod_rewrite.html#rewritemap
https://httpd.apache.org/docs/2.4/rewrite/rewritemap.html#dbm
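A minimal sketch of how such a map could be wired up, assuming server or virtual-host context (RewriteMap is not allowed in .htaccess). The map name and file path are placeholders, and unlike the listing above I key the map on the numeric ID alone and store the target path as the value, purely to keep the lookup simple.

```apache
# Hypothetical map file, one "id target-path" pair per line, for example:
#   123456 /news/privacy/eu-opens-public-consultation-on-rfid
RewriteEngine On
RewriteMap examplemap "txt:/path/to/file/map.txt"

# Only act when the ID is present in the map (an unknown key expands to
# an empty string), so unknown IDs fall through to normal handling.
RewriteCond "${examplemap:$1}" "!^$"
# One 301 directly to the final canonical URL.
RewriteRule "^/content/view/([0-9]+)/?$" "https://linuxsecurity.com${examplemap:$1}" [R=301,L]
```

Because the substitution already carries the final scheme and hostname, a request for an old link gets a single redirect from the origin no matter which hostname it came in on.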

mod_rewrite without the [R] option does not redirect, but simply changes the URL internally. So the user opens /foo, but you rewrite it to /bar.
Apache then responds as if the user had requested /bar and will send the correct response, without the need for a redirect that changes the URL in the address bar.

See this example:

RewriteEngine On
RewriteCond %{REQUEST_URI} ^/rewrite/
RewriteRule ^/rewrite/(.*)$ /$1 [L]
	
RewriteCond %{REQUEST_URI} ^/rewrite2/
RewriteRule ^/rewrite2/(.*)$ /$1 [R=302,L]

I have a test file at https://test.laudian.de/test.html
You can also access that file via these 2 links:

https://test.laudian.de/rewrite/test.html
https://test.laudian.de/rewrite2/test.html

You will notice that with the first link, the URL in the address bar doesn’t change - the rewrite happens internally on the server, whereas with the second option, the browser is redirected and forced to send a second request.

To avoid the extra redirect, you want to use option 1 to call your redirect script, and let Cloudflare handle the HTTPS redirect.

That’s very helpful. Thanks so much.

So just to be clear, using RewriteMap with a db hash is obviously going to be faster than a one-to-one RewriteRule for 60k+ URLs, correct?

The approach would be to create a map of all conceivable /content/view articles to their modern equivalent, correct? More specifically, it would take the form:

RewriteEngine On
RewriteMap examplemap "txt:/path/to/file/map.txt"
RewriteRule "^/content/view/(.*)" "${examplemap:$1}"

I’m still unsure I understand how the RewriteCond/RewriteRule entries pertain to this. I do understand how you are replacing the link rather than redirecting to it, but how does that relate to the rewritemap?

RewriteCond %{REQUEST_URI} ^/rewrite/
RewriteRule ^/rewrite/(.*)$ /$1 [L]

In my case, I’m spawning my redirection PHP script in this way:

RewriteRule ^/content/view/([0-9]+)/? /redirection/index.php?type=view&ids=$1&uri=/content/view/$1 [L,R=301]

These are really 2 different approaches.

Option 1 is to just remove the R=301 part from your RewriteRule. Everything keeps working as it is with the redirect script, except you have one less redirect.
You can just try that and it should work.
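Concretely, that would be the rule quoted above with only the flag list changed. This assumes the script reads its input from the type, ids and uri query parameters, as the rule suggests, and does not depend on the browser having requested /redirection/index.php itself.

```apache
# Same rule as before, minus R=301: Apache hands the request to the script
# internally, so the first response the visitor sees is the script's own
# redirect to the final article URL.
RewriteRule ^/content/view/([0-9]+)/? /redirection/index.php?type=view&ids=$1&uri=/content/view/$1 [L]
```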

Option 2 is to completely remove the script and instead use a file that maps all IDs to the new path. But instead of using the txt file directly, you would convert it into a dbm hash file. A txt file with 60k entries would be way too slow.
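A sketch of the conversion, reusing the examplemap name and placeholder path from the question above. httxt2dbm ships with Apache; the output file name is my own choice, and depending on the DBM type it may produce more than one file, in which case the common base name is what goes in the RewriteMap line.

```apache
# Build the hash once from the text map (run in a shell, not in this file):
#   httxt2dbm -i /path/to/file/map.txt -o /path/to/file/map.dbm
#
# Then point the map at the hash instead of the text file. Rules that look
# up ${examplemap:...} stay exactly as they were.
RewriteMap examplemap "dbm:/path/to/file/map.dbm"
```

The hash has to be rebuilt whenever the text map changes, so this suits a list of legacy IDs that is no longer growing.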

Option 3 would assume you have all entries in a database and can perform a simple lookup. I assume that is what your script does? In that case, you could also use a map of type fastdbd:

RewriteMap myquery "fastdbd:SELECT destination FROM rewrite WHERE source = %s"

I’d just use option 1 and see if you are happy with the performance. As you are already using the script, you probably are.


Thanks very much for all of your help. Great ideas. It’s going to take me some time to make this work, but I’d like to see if I can make the database/rewritemap option work. Definitely always seeking options to improve performance.