I understand your sentiment and I did read that you are not brand-new to the platform - still, the certificate validation is one of the things that actually usually works at Cloudflare.
Are you sure that it’s not your server that might be intermittently presenting an invalid certificate? The custom hostname itself should not really be at play here; the error would rather point to a mismatch between the certificate and the configured maindomain.com hostnames.
Also, the 521 would point towards a completely different issue of the server not being reachable at all.
Could you extend your logging to also record which certificate was presented whenever your server logs the SSL issue?
I’m quite sure Nginx is not intermittently presenting an invalid certificate. There’s only one vhost and it is configured to always use that certificate. I mean, it would be unheard of that Nginx would just bug out like that.
Please disregard the error 521 thing, that actually turned out to be something else.
I’m not sure what kind of extended logging could be done, if you have suggestions please let me know.
I shall insist on my theory that CF is slowly rolling out a bogus update to their servers. The error rate is already at 5% and keeps going up every day.
Naturally, I cannot rule it out, but I really doubt that this is an issue on Cloudflare’s side. There are intermittent hiccups on Cloudflare’s side, but so far certificate validation has never been one of them.
If Nginx is the front-server, I would try to enable some sort of SSL debug logging to see which certificate is served for each request and then use that log whenever you run into a 526. Please refer to the Nginx community for details on what is available here and how to enable it.
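As a rough complement to server-side logging, the certificate served for a given SNI can also be probed from the outside. The sketch below is only an illustration using Python's standard `ssl` module; the function names are made up for this example, and it shows what an arbitrary client receives, not necessarily what was sent to Cloudflare on a failing request.

```python
import hashlib
import socket
import ssl


def fingerprint(der_bytes: bytes) -> str:
    """Return the SHA-256 fingerprint of a DER-encoded certificate as lowercase hex."""
    return hashlib.sha256(der_bytes).hexdigest()


def fetch_leaf_certificate(address: str, sni: str, port: int = 443,
                           timeout: float = 5.0) -> bytes:
    """Connect to `address`, send `sni` in the ClientHello and return the
    leaf certificate the server presents, in DER form."""
    ctx = ssl.create_default_context()
    # Verification is switched off on purpose: the goal is to record whatever
    # certificate is served, even one that would not validate.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((address, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=sni) as tls:
            return tls.getpeercert(binary_form=True)
```

Calling `fingerprint(fetch_leaf_certificate(origin_ip, hostname))` repeatedly for a custom hostname and for a subdomain of the main domain, and comparing the results, would show whether the origin ever serves a different certificate. Here `origin_ip` and `hostname` are placeholders for your own values.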
Another thing you can try is to simplify your CNAME chain. Considering you use CNAME entries, maindomain.com should actually work just fine for your certificates, and the custom hostname should only be relevant for the proxy certificate. But maybe there is a glitch somewhere, and simplifying the setup could work around it.
As mentioned, I certainly cannot rule out an issue with the certificate validation, but I would be surprised if this really was an issue on Cloudflare’s side, because that usually works. The issue here should not be the custom hostname, but rather a mismatch between the server certificate and the hostname. With the CNAMEs in place, however, said domain should actually be accepted just fine.
Hi, I have new evidence that this is not a problem on my end. I have decided to investigate the problem with Wireshark. Here are the results.
First of all, I forced Nginx to use only TLS 1.2 and the AES256-GCM-SHA384 cipher. This is because TLS 1.3 encrypts the handshake, making the problem harder to diagnose. This simple cipher also excludes the possibility of a ServerKeyExchange message, since we are using RSA key exchange.
The exchange is as follows:
CF sends a ClientHello with SNI indicating the hostname.
The server responds with a ServerHello, Certificate, ServerHelloDone.
Whether the connection is successful or not, the certificate is always the same. I triple checked this.
Most of the time, the connection is successful.
When the connection fails, the server sends a TLSv1.2 Record Layer: Alert (Level: Fatal, Description: Bad Certificate) and drops the connection (FIN, ACK).
Up next, I ran the capture for ~15 minutes. Then I checked, for each instance of the problem, the hostname indicated in the SNI field of the ClientHello. It’s always the case that it’s a custom hostname. Requests to a subdomain of the main domain (which are explicitly covered by the certificate) never fail. Requests to custom hostnames intermittently fail with the bad certificate error.
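To illustrate why only custom hostnames depend on special handling, here is a simplified sketch of plain certificate hostname matching. It is not Cloudflare's validation logic. The SAN list and the customer hostname are assumptions chosen for the example; the thread only says that subdomains of the main domain are covered by the certificate.

```python
# Assumed SAN entries of the origin certificate (hypothetical).
SANS = ["maindomain.com", "*.maindomain.com"]


def matches(hostname: str, pattern: str) -> bool:
    """Simplified match: a leading '*' covers exactly one leftmost label."""
    host_labels = hostname.lower().rstrip(".").split(".")
    pat_labels = pattern.lower().rstrip(".").split(".")
    if len(host_labels) != len(pat_labels):
        return False
    if pat_labels[0] == "*":
        # Compare everything except the leftmost label.
        return host_labels[1:] == pat_labels[1:]
    return host_labels == pat_labels


def covered(hostname: str, san_entries: list[str]) -> bool:
    """True if any SAN entry matches the hostname."""
    return any(matches(hostname, pattern) for pattern in san_entries)


print(covered("app.maindomain.com", SANS))         # subdomain of the main domain
print(covered("shop.customer-example.com", SANS))  # hypothetical custom hostname
```

Under strict matching, the custom hostname is not covered by the certificate at all, so those requests succeed only when the validator also accepts the certificate for the main domain. That would be consistent with failures appearing only for custom hostnames, although it does not say where the fault lies.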
Stats of the capture:
ClientHello: 2585 packets
Alert (Bad Certificate): 30 packets
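For reference, the failure rate implied by these two counts can be worked out as follows. This is a minimal sketch that assumes every ClientHello is one connection attempt and every alert is one failed handshake; the capture itself may not map that cleanly.

```python
def failure_rate(client_hellos: int, alerts: int) -> float:
    """Fraction of handshakes that ended in an alert."""
    if client_hellos <= 0:
        raise ValueError("client_hellos must be positive")
    return alerts / client_hellos


rate = failure_rate(client_hellos=2585, alerts=30)
print(f"{rate:.2%}")  # 1.16%
```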
But wouldn’t the following suggest that it actually is?
The server seems to drop the connection here and refers to a bad certificate, which is not coming from Cloudflare, but from your server.
As it is to be expected. The whole request is for that particular hostname, the main domain is only used for the CNAME entries (and certificates for this domain will also be accepted).
IMHO, the error would indicate that Nginx has some issues handling the certificate. Just a stab in the dark, but could it be that you need to configure the whole certificate chain and Nginx currently cannot verify that and hence aborts the connection?
My bad. I meant the client. It’s CF that sends the alert and drops the connection, not Nginx.
I do not have client certificate authentication configured.
I have also discovered something new. If you proxy the custom hostname through CF (thus flattening the CNAME chain - such that a DNS lookup returns a simple A record to a CF node) the error goes away. If you do not proxy it (such that the DNS resolves a CNAME record to my main domain), the error happens.
In other words, O2O (orange-to-orange) works perfectly.
This does not lead to a fix, though, because most of our SaaS clients (who bring their own hostname) do not use Cloudflare and have to rely on the CNAME record (in fact this is standard procedure and recommended by the CF SaaS docs).
and certificates for this domain will also be accepted
Yes, that should be the case 100% of the time, but it’s only 99%.
Fair enough, but that seems to be suggested by the 526 anyhow. Did you verify the server actually sent an acceptable certificate? Did you log which certificate was used for that request? If not, we still can only assume it was the right one.
IMHO, this is where the issue will be. I may be wrong, of course.
And you managed to verify this consistently? I am asking, as this might be a hint for Cloudflare as to what is happening. However, I’d still somewhat doubt that to be the issue, as the ultimate IP address will be a Cloudflare proxy in either case, so orange-to-orange should not matter much here.
Fair enough then. Even though I still somewhat doubt that it’s a verification issue (this really does usually work, unlike other things), I understand you have performed the necessary steps to verify that it is not a server configuration issue. At this point, only someone with access to Cloudflare’s infrastructure could provide more insight, I am afraid.
Just to clarify, you checked this server-side, not just client-side, correct? I am asking, as there was recently a similar validation issue, where the server actually kept sending valid certificates to every client but Cloudflare, which also threw a 526 of course. That was impossible to debug from a non-Cloudflare perspective.
Would you have any idea how to draw their attention to this issue?
Yes, the Wireshark capture I mentioned earlier refers to CF IPs, e.g. 162.158.159.184. The certificate was correctly sent and 162.158.159.184 dropped the connection with alert number 42.
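For readers unfamiliar with the numbering: alert 42 is `bad_certificate` in the TLS 1.2 alert protocol (RFC 5246). The small lookup below lists only a handful of the defined codes, and the helper name is made up for this sketch.

```python
# Subset of TLS 1.2 alert descriptions (RFC 5246, section 7.2).
TLS_ALERT_DESCRIPTIONS = {
    0: "close_notify",
    10: "unexpected_message",
    20: "bad_record_mac",
    40: "handshake_failure",
    42: "bad_certificate",
    43: "unsupported_certificate",
    44: "certificate_revoked",
    45: "certificate_expired",
    46: "certificate_unknown",
    47: "illegal_parameter",
    48: "unknown_ca",
}

TLS_ALERT_LEVELS = {1: "warning", 2: "fatal"}


def describe_alert(level: int, description: int) -> str:
    """Render an alert's level and description code in readable form."""
    level_name = TLS_ALERT_LEVELS.get(level, f"unknown({level})")
    desc_name = TLS_ALERT_DESCRIPTIONS.get(description, f"unknown({description})")
    return f"{level_name}: {desc_name}"


print(describe_alert(2, 42))  # fatal: bad_certificate
```

Note that `bad_certificate` is a fairly generic code. It is distinct from `certificate_expired` (45) and `unknown_ca` (48), so the alert alone does not say which check the client applied.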
Sorry for asking again, but you verified this for this particular connection, that it really is the right certificate, right? Not just assuming by the apparent configuration for this virtual hostname?