Random HTTP 525 SSL Handshake Failed errors that go away after ~20 minutes

Hello,

I know there are a few topics on the forums about this issue, as well as an official support document, but I still haven’t been able to diagnose the problem I’m having, let alone solve it. I’m sorry to ask for your time, but I’d be grateful for any advice on this matter.

I’ve been using Cloudflare with pretty much the same configuration for about half a year, and I haven’t touched my server’s configuration in about the same length of time. For the first five months of this year everything went smoothly, but since roughly the beginning of June I’ve been getting random spikes of SSL Handshake Failed errors across my domains and subdomains.

I understand that this would typically mean that the certificate on my server is invalid, except… it isn’t, really: I have configured auto-renewal of Let’s Encrypt certificates for each of my subdomains, and the renewal process seems to be working. Out of curiosity, I have disabled the Cloudflare proxy for one subdomain, so you can see that there is no problem accessing the site due to an invalid, revoked or expired certificate: https://office.milanvit.net/. In contrast, https://www.milanvit.net/ is not accessible at this moment, but I suspect it will be in a few minutes (details in the last paragraph). As expected, we’re back at full strength :muscle:

It could also, according to the support document, mean that my server doesn’t support SNI – but according to the Qualys SSL Server Test, that doesn’t seem to be the case, as running the test for the subdomain office.milanvit.net reports “This site works only in browsers with SNI support.”
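In case anyone wants to reproduce that check outside of the Qualys test, here is a rough sketch of how SNI behaviour can be verified from any machine with Python – office.milanvit.net is just my example subdomain, and the second handshake deliberately skips certificate verification, since it only exists to show whether the server answers at all without SNI:

    import socket, ssl

    HOST = "office.milanvit.net"   # my example subdomain; substitute your own origin hostname
    PORT = 443

    # Handshake WITH SNI: server_hostname is sent in the ClientHello.
    ctx = ssl.create_default_context()
    with socket.create_connection((HOST, PORT), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
            print("With SNI:", tls.version())

    # Handshake WITHOUT SNI: no server_hostname and no verification, so this only
    # shows whether the origin completes a handshake at all without the extension.
    ctx_no_sni = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx_no_sni.check_hostname = False
    ctx_no_sni.verify_mode = ssl.CERT_NONE
    with socket.create_connection((HOST, PORT), timeout=10) as sock:
        try:
            with ctx_no_sni.wrap_socket(sock) as tls:
                print("Without SNI: handshake completed,", tls.version())
        except ssl.SSLError as exc:
            print("Without SNI: handshake failed:", exc)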

Finally, the problem happens with the SSL setting at both Full and Full (Strict). What seems very strange to me is that the problem always appears randomly (sometimes while I’m asleep, so definitely not as a result of a Cloudflare or server configuration change), and it also always goes away on its own after around 20–40 minutes.

I’m using Nginx on my server (managed fully by Dokku), and I’d be happy to try any advice you could kindly provide to me. Thank you so much for your time and guidance.

Edit: I wanted to add that I don’t see anything suspicious (or rather, anything at all) in the Nginx error logs. In case this is related to Cloudflare not being able to establish a connection to my server, it should probably also be noted that my server is a dedicated machine hosted in Hetzner’s data center – and while that doesn’t have to mean anything, even while Cloudflare is giving me 525 errors, I can still connect to the server from my current location.

Edit 2: My server is located in Hetzner’s German data center, and I’m noticing that all traffic from Germany has been re-routed, according to https://www.cloudflarestatus.com/. Could that be the cause of the issue? But surely re-routing the traffic would not take 30+ minutes, so perhaps not…

Not exactly. An invalid certificate would throw a 526; in your case it appears as if Cloudflare can’t establish an SSL connection at all, and that wouldn’t necessarily be certificate related.
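Roughly speaking – and only as an approximation, since Cloudflare connects from its own edge network rather than from your machine – you can reproduce the distinction locally: if even an unverified handshake to the origin fails, you are in 525 territory; if the handshake works but certificate verification fails, that is 526 territory. A minimal sketch in Python, using the hostname from this thread as a placeholder:

    import socket, ssl

    HOST = "www.milanvit.net"   # hostname from this thread; substitute your own origin
    PORT = 443

    # 1) Unverified handshake: roughly what has to succeed for the proxy to avoid a 525.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((HOST, PORT), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
                print("Handshake OK:", tls.version())
    except (OSError, ssl.SSLError) as exc:
        print("Handshake failed (525 territory):", exc)

    # 2) Verified handshake: roughly what Full (Strict) additionally requires to avoid a 526.
    ctx_strict = ssl.create_default_context()
    try:
        with socket.create_connection((HOST, PORT), timeout=10) as sock:
            with ctx_strict.wrap_socket(sock, server_hostname=HOST) as tls:
                print("Certificate verified OK")
    except ssl.SSLCertVerificationError as exc:
        print("Certificate invalid (526 territory):", exc)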

So the issue appears without you touching the server at all and then disappears after about half an hour?

What exactly do you mean by connect? Via HTTPS directly, or via SSH? If it is the former, that would be a key bit of information, as it would suggest that TLS is still working.

A few questions regarding your setup

  • How many IP addresses do you have configured on Cloudflare? Do you proxy to just one machine or more?
  • Is there anything in front of Nginx?
  • Do you run anything loadbalance-ish directly on your server?
  • You said you are running Nginx. Is that all or does Nginx proxy anywhere onwards?
  • Anything particular about your Nginx TLS configuration?
  • Do you have anything along the lines of fail2ban configured?

Based on your description - and assuming it is not a Cloudflare issue - my best guess would be that you have some sort of rate limit or temporary ban configured (hence the question about fail2ban) which occasionally kicks in (too many requests over a period of time?) and blocks Cloudflare. Usually I’d expect that to show up as a 523 or 524 rather than a TLS error, but that might depend on how the ban is implemented. Again, just speculation :slight_smile:


Exactly! On some occasions, I was woken up by 15+ notifications from UptimeRobot.com saying that my sites were unavailable (I have a lot of subdomains :sweat_smile:), and before I could properly wake up and start debugging, I got another barrage of notifications saying that everything was fine again. (Just to clarify: during the half hour of downtime, I can see the downtime myself; it’s not false reporting from another service.)

For example, here is a screenshot of my Slack from yesterday. All of these got resolved before I even had a chance to do anything.

Ah, sorry for being unclear!

Basically, during today’s half hour of downtime, I disabled the Cloudflare proxy for one subdomain, and within a few minutes I was able to load the site from my browser over HTTPS. Other subdomains (still proxied via Cloudflare) continued throwing HTTP 525 errors for many more minutes after that. So yeah, from my point of view, it looked like TLS was still working.

  • How many IP addresses do you have configured on Cloudflare? Do you proxy to just one machine or more?
    • My server only has one IPv4 and one IPv6 address associated with it, and the traffic is proxied to one machine only.
  • Is there anything in front of Nginx?
    • No, Nginx is the server that proxies all the traffic to individual containers. It is configured to terminate TLS (is that the correct terminology? Basically, it should handle HTTPS traffic fine, with subdomain-specific Let’s Encrypt certificates).
  • Do you run anything loadbalance-ish directly on your server?
    • No.
  • You said you are running Nginx. Is that all or does Nginx proxy anywhere onwards?
    • The Nginx server is configured to route traffic based on the Host header to various services, each represented by a single Docker container (managed by Dokku).
  • Anything particular about your Nginx TLS configuration?
    • I… think not, I really did not modify much from the standard Nginx installation that came with Dokku.
      In /etc/nginx/nginx.conf, there are the ssl_protocols TLSv1 TLSv1.1 TLSv1.2; and ssl_prefer_server_ciphers on; lines.
      In /etc/nginx/conf.d/dokku.conf, there are the following lines:
      ssl_session_cache shared:SSL:20m;
      ssl_session_timeout 1d;
      ssl_session_tickets off;
      ssl_dhparam /etc/nginx/dhparam.pem;
      ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384;
      And then finally, each subdomain loads its specific config from /home/dokku/<subdomain>/nginx.conf, which contains these lines:
      ssl_certificate /home/dokku/office/tls/server.crt;
      ssl_certificate_key /home/dokku/office/tls/server.key;
      ssl_protocols TLSv1.2 TLSv1.3;
      ssl_prefer_server_ciphers off;
  • Do you have anything along the lines of fail2ban configured?
    • Wow, I definitely did not think of this one – that’s an amazing guess! But I only configured fail2ban to ban SSH connections, and surely such connections could not be coming from the Cloudflare network, right? Anyhow, I checked the fail2ban log. Today, the ~30-minute outage started at around 08:53 AM JST – the only ban around that time happened at 08:41 JST, banning the IP address 182.254.xxx.xxx, which doesn’t seem to match Cloudflare’s published IP ranges (a quick way to double-check that is sketched below)…
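In case it’s useful to anyone, here is a minimal sketch of the check I mean – the address below is a hypothetical placeholder standing in for whatever fail2ban actually banned, and the two URLs are Cloudflare’s published range lists:

    import ipaddress
    import urllib.request

    # Hypothetical placeholder; substitute the address fail2ban actually banned.
    BANNED_IP = "182.254.0.1"

    def cloudflare_networks():
        """Fetch Cloudflare's published edge IP ranges (IPv4 and IPv6)."""
        networks = []
        for url in ("https://www.cloudflare.com/ips-v4", "https://www.cloudflare.com/ips-v6"):
            with urllib.request.urlopen(url, timeout=10) as resp:
                networks += [ipaddress.ip_network(line.strip())
                             for line in resp.read().decode().splitlines() if line.strip()]
        return networks

    ip = ipaddress.ip_address(BANNED_IP)
    matches = [net for net in cloudflare_networks() if ip in net]
    if matches:
        print(f"{BANNED_IP} IS inside Cloudflare range {matches[0]}")
    else:
        print(f"{BANNED_IP} is NOT a Cloudflare edge address")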

Thank you so much for your comment, by the way! I already learned a lot – I did not even think of suspecting fail2ban, although in this case it’s probably not the culprit…

Absolutely :slight_smile:

Do you have both the IPv6 and the IPv4 address configured on Cloudflare?

Based on what you just explained (just one host), we should be able to rule out that Cloudflare occasionally gets routed to another machine. My best guess would still be some rate limiting – is there maybe something else configured in addition to fail2ban?


I’d also open a support ticket and forward a few connection IDs of these 525s. Maybe support can provide more detail on what exactly fails.

I do: a single A record for the IPv4 address and a single AAAA record for the IPv6 address.

It does seem plausible, doesn’t it… The server software itself should not be doing anything of that sort. I only know that Hetzner has some sort of DDoS protection system in place (at the network level), but I don’t think there are any details available, nor is it customizable in any way as far as I can tell; it should really just protect against massive botnet attacks.

I could try asking Hetzner about it, but before that I think I’ll wait for another case of downtime, note a few Ray IDs and ask Cloudflare support about them, as they will surely be able to pinpoint the issue much more precisely, and I can then approach Hetzner better prepared if the issue lies on their side. (I did not know I could contact support while on a free plan – that’s not a problem, right?)

Just a guess: temporarily remove the AAAA record – maybe it is something about your server’s IPv6 configuration. Alternatively, you could remove the A record instead and check whether the issue appears more often or all the time. It could also be the other way round – maybe it is IPv4-specific and IPv6 actually works fine.
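If you want to compare the two paths during an outage window before touching DNS, a rough sketch like the following should work from any test machine – only an approximation, since Cloudflare connects from its own edge, and the hostname is just the one from this thread:

    import socket, ssl

    HOST = "www.milanvit.net"   # hostname from this thread; substitute your own
    PORT = 443

    def handshake(family, label):
        """Attempt a TLS handshake over a specific address family (IPv4 or IPv6)."""
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        try:
            addr = socket.getaddrinfo(HOST, PORT, family, socket.SOCK_STREAM)[0][4]
            with socket.create_connection(addr[:2], timeout=10) as sock:
                with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
                    print(f"{label}: handshake OK via {addr[0]} ({tls.version()})")
        except (OSError, ssl.SSLError) as exc:
            print(f"{label}: handshake FAILED:", exc)

    handshake(socket.AF_INET, "IPv4")
    handshake(socket.AF_INET6, "IPv6")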

Probably the next best step :slight_smile:

Cloudflare does not support dual stack to the origin. Even if you set up both an A and an AAAA record, they can only use one of them. Most probably all requests will go from Cloudflare to the origin over IPv4.

Something to do with Happy Eyeballs: https://tools.ietf.org/html/rfc6555