What feature, service or problem is this related to?
DNS records
What are the steps to reproduce the issue?
You have a subdomain with multiple A records in Round Robin. Say one record is pointing to an online server, one to an offline one.
Every client (browsers, an nginx proxy, etc.) checks the candidate hosts and chooses one that is online.
Cloudflare, on the other hand, selects hosts completely at random. It doesn’t make any check before selecting the host, so if one of the hosts is down, requests will randomly fail.
This is really weird behaviour that only Cloudflare exhibits. Literally no other implementation would select an offline host with Round Robin DNS.
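To make the expected behaviour concrete, here is a minimal Python sketch of the client-side fallback described above: resolve every A record, then try each address in turn and use the first one that accepts a TCP connection. The function names and the 2-second timeout are assumptions for illustration; real browsers and proxies use more elaborate logic.

```python
import socket


def resolve_all(hostname, port):
    """Return every distinct IP address the resolver reports for hostname."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses


def first_reachable(addresses, port, timeout=2.0, connect=socket.create_connection):
    """Try each address in order; return the first one that accepts a TCP
    connection, or None if none of them do.

    `connect` is injectable so the logic can be exercised without a network.
    """
    for ip in addresses:
        try:
            conn = connect((ip, port), timeout)
        except OSError:
            # Refused, timed out, unreachable: move on to the next record.
            continue
        conn.close()
        return ip
    return None
```

For example, `first_reachable(resolve_all("rr-direct.hyperknot.com", 80), 80)` would skip a record whose server is stopped and return the next address that answers.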
Yes, they are. I’ll try to make a reproducible, live scenario.
Also, this Zero Downtime Failover in the docs is really not clear. I’ve looked everywhere, it’s totally missing from the Pricing, totally missing from the Dashboard.
The only reference I found to it was here, but it really doesn’t say whether this is available on every account or only on paid plans, or whether I need to turn it on.
Each server returns a 1x1 pixel PNG image coloured blue, red or green.
Connecting to the direct hostname works perfectly! If one server is offline then the browser connects to the nearest online server. Once it comes back, the browser detects the closest one again after a while.
Meanwhile the CF implementation totally breaks. In my case it’s obsessed with only connecting to the US server. If the US server is down, everything is down in my case.
How is the US server set to be “down”? I’m getting a 520 error which, assuming I’m hitting your US server, means it is returning an invalid response to Cloudflare. As in the link I gave, that will not result in a retry. Only 521, 522, 523, 525 and 526 conditions will do that.
[add]
Seems the 520 is coming as a result of connection to the “up” servers (curl is complaining about an empty response) and the US is not responding on port 443 so should appear as “down”… needs more digging…
https://rr-direct.hyperknot.com doesn’t have a valid SSL certificate (it is for example.com). In Cloudflare you likely have “Always Use HTTPS” enabled, so your test to http://rr-cf.hyperknot.com gets redirected to HTTPS, which is what I see…
So for your 2 “up” servers, Cloudflare sees them as “down” (as the certificate error would be error 526), leaving the US server, which fails to respond. So the round robin is working as it should; it’s just that all your servers are “down” as far as Cloudflare is concerned.
Likely if you turn off “Always use HTTPS” then the HTTP only test should work as you expect (ensure your SSL/TLS mode is “Full (strict)”).
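The retry rule stated in this reply (only 521, 522, 523, 525 and 526 lead to a retry, while a 520 does not) can be written as a small check. This is just a sketch of the rule as described in the thread, not Cloudflare’s implementation; the names are made up for illustration.

```python
# Error codes that, according to the reply above, cause a retry against
# another origin. 520 (invalid response from the origin) is deliberately absent.
RETRYABLE_ORIGIN_ERRORS = frozenset({521, 522, 523, 525, 526})


def triggers_failover(status_code):
    """Return True if this error code would lead to a retry on another host."""
    return status_code in RETRYABLE_ORIGIN_ERRORS
```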
I haven’t configured HTTPS yet, as it’s a pain in Round Robin, but I will. Yes, testing is currently on HTTP for the direct and HTTPS for the cf, using the Flexible settings. I don’t think it should matter at all.
By “down”, I mean running service nginx stop. This is to simulate a failure of the server, network, datacenter, etc.
Returns the geographically closest server as it should.
curl https://rr-cf.hyperknot.com/server
Returns a server which is pinned to each client’s IP, so each client gets a random one of US/EU/SG. If I switch from wifi to mobile data, it changes from US to SG, for example.
Behaviour when one server is offline
curl https://rr-direct.hyperknot.com/server
Detects the offline server and returns the geographically closest online server, as it should.
curl https://rr-cf.hyperknot.com/server
If your client was pinned to the offline server, you get error code: 521. For everyone else it works.
So I can conclude that Cloudflare Round Robin handling:
Does not perform Zero Downtime Failover; it keeps sending requests to the offline server
Does not select the geographically closest server (this one wasn’t promised, just noting it).
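The curl comparison above can be repeated automatically. The sketch below requests a URL several times and tallies the answers, so a pinned or failing origin shows up as a skewed count or as error entries. It assumes the /server endpoint returns a short text body naming the server, which the thread implies but does not show; the function names and defaults are hypothetical.

```python
from collections import Counter
import urllib.error
import urllib.request


def fetch_server(url, timeout=5.0):
    """Return the response body, or an 'error ...' string if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", "replace").strip()
    except urllib.error.HTTPError as exc:
        # HTTP-level failure, e.g. a Cloudflare 521 page.
        return f"error {exc.code}"
    except OSError as exc:
        # DNS failure, refused connection, timeout (URLError is an OSError).
        return f"error {exc}"


def tally(url, attempts=20, fetch=fetch_server):
    """Request url `attempts` times and count each distinct answer.

    `fetch` is injectable so the counting can be tested without a network.
    """
    return Counter(fetch(url) for _ in range(attempts))
```

Running `tally("https://rr-cf.hyperknot.com/server")` and `tally("https://rr-direct.hyperknot.com/server")` from a few different networks would give the two distributions to compare. From a single IP, the Cloudflare result is expected to show only one server, because of the pinning described above.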
We have a 3 server cluster using round-robin DNS with Cloudflare and it works correctly. While we normally remove the DNS record during maintenance, I’ve just tried without doing that and we see no requests to the server that’s down in our tests.
Testing your rr-cf link from various sites and countries, both curl and browser, I’m now only seeing replies from test-sg and test-eu, no errors, so it seems to be working ok for me.
Are you using a free or paid plan? The introduction post for Zero-Downtime Failover mentioned that it was available on paid plans, though I have no idea if that information is still current.