I’m getting an intermittent (approximately 0.01% of requests) 525 (SSL Handshake failed) between Cloudflare and our origin, an AWS EC2 instance running IIS on Windows Server 2016. The certificate is issued by Let’s Encrypt via win-acme, which runs under SYSTEM. Our front end is React and our back end is .NET Framework 4.7.2.
- I’ve been through The Community Tip and Debug Docs
- I’ve raised a support ticket with Cloudflare; they reported seeing “Connection Reset By Peer” during the handshake.
- I’ve checked the IIS logs. I can see the user interacting successfully, but IIS doesn’t record any failure.
- IIS Failed Request Tracing is on but not showing anything (the request isn’t getting as far as IIS).
- I have SCHANNEL logging switched to verbose (“HKLM:\System\CurrentControlSet\Control\SecurityProviders\SCHANNEL”, EventLogging value 7), but I only see event 36880 (SSL negotiated successfully) in the Event Viewer.
- 525 occurs irrespective of device, browser, OS, method or endpoint (both POST data and GET of images). All clients are in the UK region.
- Although most of the failures are against the API, we’ve also seen the Cloudflare error page on our automation.
- My Cloudflare SSL mode is Full (strict), with TLS 1.2 and 1.3 switched on.
- All the cipher suites available to TLS 1.2 are enabled on the server, with SNI support. SSL 3, TLS 1.0 and TLS 1.1 are switched off in the registry on the server.
- Server CPU, memory, disk I/O and network I/O are all low at the times of the 525.
- We do not have an Elastic Load Balancer; the server connects straight to the AWS Gateway.
- No other processes are happening at the time (no patching, cert renewal or release).
- I’ve been through the community here without success (threads 1, 2, 3, 4, 5, 6, etc.).
- We have rate limiting switched off.
- Request load is very low; at the time of the most recent 525 (this morning) we were at about 50 requests/min.
- I have a second domain running unproxied on the same server with the same settings, and I am running a canary (AWS GET service) against one of its endpoints. No failures as of 2021/06/09.
Suggestions I haven’t acted on yet, and why:
- Use the Cloudflare Origin CA certificate. Moving over requires a fair amount of infrastructure automation change, as we have a large number of multi-level subdomains that would need to be specified individually.
- Whitelist Cloudflare IPs on AWS by adding them to the Security Group that joins the EC2 instance to the gateway. I’ll need a Lambda function to keep the list up to date.
- Remove proxying (and see if the error goes away). Cloudflare provides our DDoS protection, so I can’t turn it off and retain our security accreditation.
- Install Wireshark on the server. The failures are so intermittent that I would be generating huge capture logs. I also have change management restrictions, so I can’t install whatever I like on the production server.
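For the Security Group option, the reconciliation step of such a Lambda might look roughly like the sketch below. It is not something I have deployed. It assumes Cloudflare’s published plain-text IPv4 list at https://www.cloudflare.com/ips-v4, a single Security Group whose port 443 rules are reserved for Cloudflare, and the hypothetical helper names `parse_cidrs`, `plan_changes` and `sync_security_group`. IPv6 ranges would need the same treatment via `Ipv6Ranges`.

```python
import ipaddress
import urllib.request

# Cloudflare publishes its edge ranges as plain text, one CIDR per line.
CLOUDFLARE_IPV4_URL = "https://www.cloudflare.com/ips-v4"


def parse_cidrs(text):
    """Return the set of normalised CIDR strings found in a plain-text list."""
    cidrs = set()
    for line in text.splitlines():
        line = line.strip()
        if line:
            # ip_network raises ValueError on malformed input, so a bad
            # download cannot silently end up in the Security Group.
            cidrs.add(str(ipaddress.ip_network(line)))
    return cidrs


def plan_changes(current, desired):
    """Return (to_add, to_remove) as sorted lists of CIDR strings."""
    return sorted(desired - current), sorted(current - desired)


def fetch_cloudflare_cidrs(url=CLOUDFLARE_IPV4_URL, timeout=10):
    """Download and parse the published list."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_cidrs(resp.read().decode("utf-8"))


def sync_security_group(group_id, port=443):
    """Hypothetical Lambda body: reconcile one Security Group with Cloudflare.

    Assumption: every IPv4 rule on `port` in this group belongs to Cloudflare,
    so anything not on the published list is revoked.
    """
    import boto3  # imported lazily; provided by the Lambda Python runtime

    ec2 = boto3.client("ec2")
    group = ec2.describe_security_groups(GroupIds=[group_id])["SecurityGroups"][0]
    current = {
        ip_range["CidrIp"]
        for perm in group["IpPermissions"]
        if perm.get("FromPort") == port and perm.get("ToPort") == port
        for ip_range in perm.get("IpRanges", [])
    }
    to_add, to_remove = plan_changes(current, fetch_cloudflare_cidrs())

    def perms(cidrs):
        return [{
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": cidr} for cidr in cidrs],
        }]

    if to_add:
        ec2.authorize_security_group_ingress(GroupId=group_id, IpPermissions=perms(to_add))
    if to_remove:
        ec2.revoke_security_group_ingress(GroupId=group_id, IpPermissions=perms(to_remove))
    return to_add, to_remove
```

Keeping the diff in `plan_changes` separate from the AWS calls means the logic can be checked without touching a real Security Group, which matters given the change management restrictions on production.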
Is there anything I’m missing? If I get a resolution, I’ll post back here to help others.
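In case it helps anyone reproduce this: since Cloudflare reported a reset during the handshake, a canary that performs a fresh TLS handshake on every probe and records how each failure surfaced (TCP reset, timeout or TLS error) may be more telling than a plain GET. Below is a minimal sketch using only the Python standard library. The names `classify_error`, `probe_once` and `run` are hypothetical, and it will not reproduce Cloudflare’s exact ClientHello, so a clean run does not rule the origin out.

```python
import socket
import ssl
import time


def classify_error(exc):
    """Map a connection exception to a coarse failure category."""
    if isinstance(exc, ConnectionResetError):
        return "reset"      # peer sent a TCP RST, as Cloudflare reported
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"
    if isinstance(exc, ssl.SSLError):
        return "tls"        # alert, EOF mid-handshake, certificate failure, etc.
    return "other"


def probe_once(host, server_name, port=443, timeout=5.0):
    """Do one TCP connect plus TLS handshake; return (outcome, seconds, detail)."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            # wrap_socket performs the handshake; server_hostname sets SNI.
            with ctx.wrap_socket(raw, server_hostname=server_name) as tls:
                detail = f"{tls.version()} {tls.cipher()[0]}"
        outcome = "ok"
    except OSError as exc:  # ssl.SSLError and socket errors derive from OSError
        outcome, detail = classify_error(exc), repr(exc)
    return outcome, time.monotonic() - start, detail


def run(host, server_name, count=100, interval=1.0):
    """Probe repeatedly, print each failure, and return a tally by outcome."""
    tally = {}
    for _ in range(count):
        outcome, seconds, detail = probe_once(host, server_name)
        tally[outcome] = tally.get(outcome, 0) + 1
        if outcome != "ok":
            stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
            print(f"{stamp} {outcome} {seconds:.3f}s {detail}")
        time.sleep(interval)
    return tally
```

At a failure rate of roughly 0.01%, this would need to run for a very large number of handshakes before a single failure is likely, so it is only useful as a long-running background job.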
Edited 2021/06/09 with feedback from here and Server Fault