Intermittent 525 SSL

I’m getting an intermittent 525 (SSL handshake failed) on approximately 0.01% of requests between Cloudflare and our AWS EC2 instance running Windows Server 2016 and IIS, with Let’s Encrypt certificates issued via win-acme running under SYSTEM. We have a ReactJS front end and a .NET Framework 4.7.2 back end.

Things I’ve tried

  • I’ve been through The Community Tip and Debug Docs
  • I’ve raised a support ticket with Cloudflare; they told me they saw “Connection Reset By Peer” during the handshake.
  • I’ve checked the IIS logs: I can see the user interacting successfully, but IIS doesn’t record any failure.
  • IIS Failed Request Tracing is on but not showing anything (it’s not getting as far as IIS).
  • I have SCHANNEL logging switched to verbose: “HKLM:\System\CurrentControlSet\Control\SecurityProviders\SCHANNEL” Value 7 - I only see 36880 (SSL negotiated successfully) status codes in the Event Viewer.
  • 525 occurs irrespective of device, browser, OS, method or endpoint (both POST data and GET of images). All clients are in the UK region.
  • Although most of the failures are against the API, we’ve also seen the Cloudflare error page on our automation.
  • My Cloudflare SSL setup is Full (Strict), with TLS 1.2 and 1.3 switched on.
  • We have all the ciphers available to TLS 1.2 installed on the server (SSL 3, TLS 1.0 and 1.1 are switched off in registry on server), with SNI support.
  • Server CPU, Mem, disk I/O and network I/O are all low at the times of the 525.
  • We do not have an Elastic Load Balancer, the server connects straight to the AWS Gateway.
  • No other processes are happening at the time (no patching, cert renewal or release).
  • I’ve been through the community here without success: 1, 2, 3, 4, 5, 6, etc.
  • We have rate limiting switched off.
  • Request load is very low, the most recent 525 (this morning) we were at about 50 requests/min.
  • I have a second domain running unproxied at the same server with the same settings and am running a canary (AWS GET service) to hit an endpoint. No failures as of 2021/06/09
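
On the SCHANNEL logging bullet above: the value set under that key (the `EventLogging` DWORD) is a bitmask, which is why 7 means "log everything". A minimal Python sketch of the decoding, assuming the documented flag values (1 = errors, 2 = warnings, 4 = informational and success events). The function name is made up for illustration.

```python
# Bit flags of the SCHANNEL "EventLogging" registry value (a DWORD bitmask).
SCHANNEL_LOG_FLAGS = {
    0x1: "errors",
    0x2: "warnings",
    0x4: "informational and success events",
}


def decode_event_logging(value):
    """Return the log levels enabled by an EventLogging bitmask."""
    return [name for bit, name in SCHANNEL_LOG_FLAGS.items() if value & bit]


if __name__ == "__main__":
    print(decode_event_logging(7))
```

With 7 set, errors and warnings should be logged alongside the 36880 success events, so seeing only 36880 suggests SCHANNEL itself did not record a failed handshake at those times.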

Things I can’t easily do

  • Use the Cloudflare Origin CA certificate - moving over requires a fair amount of infrastructure automation change, as we have a large number of multi-level subdomains that would need to be specified individually.
  • Whitelist Cloudflare IPs on AWS by adding them to the Security Group that joins the EC2 instance to the gateway. I’d need a Lambda function to keep the list up to date.
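
For the whitelisting item, the core of the Lambda would be a diff between what is in the Security Group now and what Cloudflare currently publishes. A rough Python sketch of just that step, with no AWS calls. The function names are hypothetical and the CIDRs are sample values only; the real list would have to come from Cloudflare's published ranges.

```python
import ipaddress


def normalise(cidrs):
    """Canonicalise CIDR strings so equal networks compare equal."""
    return {str(ipaddress.ip_network(c.strip(), strict=False)) for c in cidrs}


def plan_sync(current, published):
    """Return (to_add, to_remove) so the group ends up matching `published`."""
    cur, pub = normalise(current), normalise(published)
    return sorted(pub - cur), sorted(cur - pub)


if __name__ == "__main__":
    # Sample values for illustration, not an authoritative list.
    in_group = ["173.245.48.0/20", "198.51.100.0/24"]
    published = ["173.245.48.0/20", "103.21.244.0/22", "2400:cb00::/32"]
    to_add, to_remove = plan_sync(in_group, published)
    print("add:", to_add)
    print("remove:", to_remove)
```

Keeping the diff separate from the code that applies it makes it easy to log or review a change before the ingress rules are touched.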

Things I can’t do

  • Remove Proxying (and see if it goes away). CF provides our DDoS protection, I am unable to turn it off and retain our security accreditation.
  • Install Wireshark on the server - these happen extremely intermittently, I would be generating huge logs. I also have change management restrictions, so I can’t install whatever I like on the production server.

Is there anything I’m missing? If I get a resolution, I’ll post back here to help others.

Companion Server Fault post

Edited 2021/06/09 with feedback from here and Server Fault

Good analysis 👍 Best in months 🙂

Unless Cloudflare has intermittent issues with SSL (which we certainly can’t rule out, but it’s rather unlikely), the most likely explanation is that your origin drops connections from time to time.

Now the question is what setting to tweak to fix that. You mentioned whitelisting, so maybe you have not completely unblocked all Cloudflare addresses but that’s rather a guess of course.

Time to tag the boys → @MVP

Hi Sandro, thanks for the praise and feedback. You can probably tell I’ve read quite a lot of your replies! They were very helpful in getting this far.

As for whitelisting - I’m not doing any at the moment, as our site was open to the public before we had the CF proxy. My thinking was that AWS might be bouncing one of Cloudflare’s IPs from time to time, and if I add a whitelist to the security group, AWS might not do the bounce. I can’t justify that; I’m still reading around that one. It seems like a sensible security addition, although I’m not 100% keen on using Lambda functions to update the ingress list of IPs.

Credit where credit is due. It’s rather rare here on the forum that we get more than “site not working” 😄

So you don’t have anything firewall-esque in place? Anything else that could tamper with network connections? Rate limiting?

Proxied requests can come from any of those networks (well, technically not any because they are still geographically assigned but at least any address from a subset) and if you happened to block one, that could explain why it happens intermittently.

Unfortunately my AWS experience is limited to finding the check out button on Amazon’s website, so I guess I won’t be of much use when it comes to something AWS specific I am afraid.

This is where my lack of AWS expertise becomes obvious, but is that your actual server or some service in front of it? Could that reject connections?

Did you cross-check if these errors maybe occur when there are more requests? You mentioned the load is fine, but I am not after an actual issue on your server, but rather thinking whether something (gateway?) might limit requests when there are too many concurrently.

Thanks for your thoughts! Let me pass on a little AWS knowledge…

By default an AWS Virtual Machine (aka EC2 Instance) does not have any access to anything but its internal subnet (just like a LAN). To give it access to anything (AWS databases or the internet) you have to set up “Security Groups”, which are like virtual firewalls. The AWS Gateway is a network adapter that connects to the wider internet (like a router on a LAN). You join your virtual machine to the gateway and then use security groups to control access in and out. Our AWS setup is tediously simple!

A security group has a number of rules that look similar to:

Inbound:
HTTP, TCP, 80, from all addresses (0.0.0.0/0 and ::/0)
HTTPS, TCP, 443, from all addresses (0.0.0.0/0 and ::/0)

It’s very permissive. A whitelist approach would be to replace “from all addresses” with a list of Cloudflare IPs.
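
To make the whitelist version concrete, here is a small Python sketch that turns a list of CIDRs into per-port inbound rules. The dictionary shape mirrors the `IpPermissions` argument that boto3's EC2 `authorize_security_group_ingress` accepts, but applying it that way is my assumption rather than something set up here. The function name and the sample CIDRs are for illustration only.

```python
import ipaddress


def build_ingress_rules(cidrs, ports=(80, 443)):
    """Build one TCP rule per port, split into IPv4 and IPv6 ranges."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    v4 = [{"CidrIp": str(n)} for n in nets if n.version == 4]
    v6 = [{"CidrIpv6": str(n)} for n in nets if n.version == 6]
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": v4,
            "Ipv6Ranges": v6,
        }
        for port in ports
    ]


if __name__ == "__main__":
    # Sample ranges only; use Cloudflare's published list in practice.
    for rule in build_ingress_rules(["173.245.48.0/20", "2400:cb00::/32"]):
        print(rule)
```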

Some more things I’ve checked (will add to original post) thanks to your questions:

  • We have rate limiting switched off.
  • Request load is very low, the most recent 525 (this morning) we were at about 50 requests/min.

There must be something going on in AWS. There is extra logging you can add in Virtual Private Cloud (VPC) and I’ve got that on my list to check next - I’ve never needed it before as things tended to just work without it!

You did say you cannot take the proxy out of the equation because of the security implications, but could you possibly run your own tests directly against the origin? Would it be reproducible in that case? Can you reproduce it at all, or is it only based on reports from your users?

That’s an interesting idea.

The domain triggers behaviour in our app, so I could try setting up a test domain that isn’t proxied and run a canary (an automated GET request against a known endpoint that happens every 5 minutes) against it. That’s worth a go. It will be next Tuesday at the earliest as I will need some sign off first.
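
A canary of that kind could be as small as a repeated TLS handshake against the unproxied hostname, recording when and how each failure happens. A minimal Python sketch using only the standard library; it is not the AWS canary itself, and the hostname in the usage note is a placeholder.

```python
import socket
import ssl
import time


def probe_handshake(host, port=443, timeout=10.0):
    """Attempt one TLS handshake; return (ok, detail)."""
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            # wrap_socket performs the handshake on an already connected socket.
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                return True, tls.version()
    except OSError as exc:  # covers ssl.SSLError, resets and timeouts
        return False, f"{type(exc).__name__}: {exc}"


def run_canary(host, attempts, interval_s, probe=probe_handshake, sleep=time.sleep):
    """Probe repeatedly; return failures as (UTC timestamp, detail) tuples."""
    failures = []
    for i in range(attempts):
        ok, detail = probe(host)
        if not ok:
            stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
            failures.append((stamp, detail))
        if i < attempts - 1:
            sleep(interval_s)
    return failures
```

Something like `run_canary("origin-test.example.com", attempts=288, interval_s=300)` would cover one day at the 5-minute interval. A `ConnectionResetError` in the results would match what Cloudflare support reported, and the timestamps can be lined up against the SCHANNEL events.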

For anyone following along, I’m currently reading up on AWS VPC Flow Logs, which should give me a better view of what is going on inside the AWS networking layer.
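
Once flow logs are on, the records of interest would be rejected inbound TCP traffic to port 443. A Python sketch of that filter, assuming the default (version 2) space-separated record format; the sample lines are made up and use documentation IP ranges.

```python
# Field order of the default (version 2) VPC flow log record.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def parse_record(line):
    """Parse one default-format flow log line into a dict, or None if malformed."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


def rejected_https(lines):
    """Yield records where TCP (protocol 6) traffic to port 443 was rejected."""
    for line in lines:
        rec = parse_record(line)
        if rec and rec["action"] == "REJECT" and rec["protocol"] == "6" and rec["dstport"] == "443":
            yield rec


if __name__ == "__main__":
    # Made-up sample records for illustration.
    sample = [
        "2 123456789012 eni-0abc1234 203.0.113.10 10.0.0.5 44321 443 6 10 840 1623225600 1623225660 ACCEPT OK",
        "2 123456789012 eni-0abc1234 203.0.113.77 10.0.0.5 50112 443 6 1 40 1623225600 1623225660 REJECT OK",
    ]
    for rec in rejected_https(sample):
        print(rec["srcaddr"], rec["start"])
```

One caveat: flow logs record accept/reject decisions on packet metadata, so a reset sent by the server itself would show up as accepted traffic rather than as a REJECT. An empty result here would therefore point away from the security group and towards the instance.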

Many thanks, Sandro, you’ve been a star.

The same issue happens with our setup. A small portion of requests get a 525.

Re-deployment did not help, support is not responding.

Small update - I’ve added a test domain (also pointing to the live server) that will run unproxied and am running an AWS Systems Manager Canary. The Canary is an AWS Lambda behind the scenes that is a dummy web browser. I have it mimicking what a user would do.

As this is an intermittent bug, I’m uncertain how long I’m going to have to wait to get any results at all. I’ll keep coming back to update here.
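
A rough way to estimate the wait: if each probe fails independently with probability p, the chance of seeing at least one failure in n probes is 1 - (1 - p)^n. The Python sketch below solves that for n. It assumes the roughly 0.01% rate from the original post applies to each canary handshake and that failures are independent, neither of which is confirmed.

```python
import math


def probes_needed(failure_rate, confidence):
    """Probes required to see at least one failure with the given probability."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


def days_at_interval(probes, interval_min):
    """Convert a probe count into days, given minutes between probes."""
    return probes * interval_min / (60 * 24)


if __name__ == "__main__":
    n = probes_needed(failure_rate=0.0001, confidence=0.95)
    print(n, "probes,", round(days_at_interval(n, 5)), "days at one probe every 5 minutes")
```

Under those assumptions it takes roughly 30,000 probes for a 95% chance of catching a single failure, which is on the order of 100 days at one probe every 5 minutes. A clean week of canary results would not say much either way, so a much shorter probe interval would make the test more useful.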

It’s worth also checking if the problem goes away by running under ‘Full’ instead of ‘Strict’.

Are IIS certificates being used from a CCS store or from the computer certificate store (My/Personal or WebHosting)? Is the cert a wildcard or a specific hostname? Is the process that renews the certificates running as Administrator, Local System or a custom service account?

The reason I ask is that some SChannel failures happen because of a failure to access the certificate private key, and depending on the situation it might not reach http.sys for logging (IIS etc.). See: About HTTPS, SChannel, TLS, CAPI, SSL Certificates and their keys - Microsoft Tech Community

You mentioned TLS 1.2+, I’m assuming you haven’t made any registry changes to try to enable TLS 1.3, as TLS 1.3 doesn’t exist on windows server (to my knowledge).

Why would you think this would make a difference? Strict just verifies the certificate, but that’s something that only happens on Cloudflare’s side. If there’s an issue with the certificate on the server side, as you suggested, then that would be a general SSL issue. Unless you are suggesting it might fall back to some other invalid certificate, but then we’d have a 526.

Of course, if there is an issue with accessing the certificate on the server, as mentioned, that would immediately explain the 525, because no proper handshake could be established. In that case @rob10 should be able to reproduce that with the new unproxied setup as well.

@christopher.cook thank you for the very thoughtful help! I’ll try and answer to the best of my knowledge. It’s worth reiterating that the problem is intermittent. Only about 0.01% of SSL handshakes fail.

“It’s worth also checking if the problem goes away by running under ‘Full’ instead of ‘Strict’.”
The problem also happens under Full.

“Are IIS certificates being used from a CCS store or from the computer certificate store (My/Personal or WebHosting)?”
Certificates are in My/Personal, installed automatically using win-acme.

“Is the cert a wildcard or a specific hostname?”
The certificate is a multi-SAN certificate, not a wildcard one. win-acme cycles through all the bindings on all the sites and builds up a certificate request with all of them. I’ve been through the SAN on the ticket and all the domains are there.

“Is the process that renews the certificates running as Administrator, Local System or a custom service account?”
It runs as Administrator.

“I’m assuming you haven’t made any registry changes to try to enable TLS 1.3, as TLS 1.3 doesn’t exist on windows server (to my knowledge).”
You’re absolutely correct, thank you. That’s a bad habit I’ve picked up from our pen testers. I’ll update my post. Only TLS 1.2 is available on the server.

I’ve checked that TLS 1.3 on CF is switched on. That might be causing the issue. A small number of clients might be requesting 1.3.

If CF receives a request from 1.3, does it pass that on to the server or will it switch to 1.2 during the handshake?

Thanks for your help, everyone! Intermittent issues are the worst!

The proxies negotiate this with the origin completely independently from what they themselves received.

Ah, that makes sense, thank you.

I’m guessing that clients connecting via Cloudflare using TLS 1.3 don’t really use TLS1.3 to talk to the origin server, because Cloudflare supports all sorts of stuff that normal servers don’t (like QUIC), so that’s probably fine.

I think win-acme normally uses SYSTEM for the scheduled renewals task (I’m the author of Certify The Web, a completely different GUI that does the same thing, so I’m not a win-acme expert). I’m assuming your scheduled task is SYSTEM as well and not modified to be Administrator.

From the linked article I was wondering if you needed to enable CAPI logging?

It’s an interesting problem. I’m assuming you’ve already rebooted the server. Windows Firewall can be surprisingly temperamental.

My other theory was that perhaps Cloudflare are indeed testing updates, and 0.01% of the time you happen to get a proxy worker that’s different to the usual one, but I guess support would have known that.

How do you see these errors? Are they shown on Cloudflare?

I’d assume the current unproxied setup might be best to reproduce that. If you get intermittent connection failures there as well, you’ll have confirmed that there’s an SSL issue somewhere on your server side. Of course, tracking that down might be tricky if something only occasionally breaks.
