Proxied Round Robin choosing offline servers

What is the name of the domain?

What is the issue you’re encountering

Proxied Round Robin choosing offline servers

What steps have you taken to resolve the issue?

Tried turning proxy on and off

What feature, service or problem is this related to?

DNS records

What are the steps to reproduce the issue?

You have a subdomain with multiple A records in Round Robin. Say one record is pointing to an online server, one to an offline one.
Every client (browsers, an nginx proxy, etc.) checks the candidate hosts and chooses one that is online.
Cloudflare, on the other hand, selects hosts completely at random. It makes no check at all before selecting a host, so if one of the hosts is down, requests randomly fail.
This is really weird behaviour that only Cloudflare exhibits. Literally no other implementation would select an offline host with Round Robin DNS.
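
A minimal sketch of the client-side behaviour described above: resolve every A record for a name, then try each address until one accepts a connection. This only illustrates the idea; real browsers and proxies use more elaborate logic. The helper names `resolve_ipv4` and `first_reachable` are hypothetical and do not come from any tool in this thread.

```python
import socket


def resolve_ipv4(host, port):
    """Return the distinct IPv4 addresses (A records) for host, in resolver order."""
    infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    addresses = []
    for info in infos:
        ip = info[4][0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses


def first_reachable(addresses, port, timeout=3.0, connect=socket.create_connection):
    """Try each address in turn; return (ip, connection) for the first that accepts."""
    last_error = None
    for ip in addresses:
        try:
            return ip, connect((ip, port), timeout)
        except OSError as exc:
            # Refused, timed out or unreachable: move on to the next record.
            last_error = exc
    raise ConnectionError("no reachable address") from last_error


# Example use (performs real DNS and TCP traffic, so it is left commented out):
# ip, conn = first_reachable(resolve_ipv4("rr-direct.hyperknot.com", 80), 80)
```

The `connect` parameter exists only so the loop can be exercised without a network; by default it is the standard library's `socket.create_connection`.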

Are your A records all proxied?

Note that only host states that would result in specific Cloudflare error codes will trigger a retry…

Yes, they are. I’ll try to make a reproducible, live scenario.

Also, the Zero Downtime Failover feature in the docs is really not clear. I’ve looked everywhere: it’s missing from the Pricing page and missing from the Dashboard.

The only reference I found to it was here, but it doesn’t say whether this is available on every account or only on paid plans, or whether I need to turn it on.

OK, I made a reproducible test case.

Records on hyperknot.com

Servers:

  • test-us - 5.161.84.115 - green
  • test-eu - 167.235.77.115 - blue
  • test-sg - 5.223.46.55 - red

Each server returns a 1x1 pixel PNG image in its own color (green, blue or red).

Connecting to the direct hostname (rr-direct.hyperknot.com) works perfectly! If one server is offline, the browser connects to the nearest online server. Once the server comes back, the browser detects the closest one again after a while.

Meanwhile the CF implementation totally breaks. In my case it’s obsessed with only connecting to the US server, so if the US server is down, everything is down.

Here is a little script for checking this:

<!DOCTYPE html>
<html>
<head>
    <style>
        body, html {
            margin: 0;
            padding: 0;
        }
        .grid-container {
            display: flex;
            flex-wrap: wrap;
            width: 200px;
            height: 200px;
        }
        img {
            width: 20px;
            height: 20px;
            display: block;
        }
    </style>
</head>
<body>
    <div class="grid-container" id="gridContainer"></div>

    <script>
        const container = document.getElementById('gridContainer');

        // Load 100 images; the random path makes every request unique so none is served from cache.
        for (let i = 0; i < 100; i++) {
            const img = document.createElement('img');
            img.src = `http://rr-direct.hyperknot.com/${Math.floor(Math.random() * 1000000)}/${Math.floor(Math.random() * 1000000)}`;
            container.appendChild(img);
        }
    </script>
</body>
</html>

I’m leaving the US server down for the moment.

How is the US server set to be “down”? I’m getting a 520 error which, assuming I’m hitting your US server, means it is returning an invalid response to Cloudflare. As in the link I gave, that will not result in a retry. Only 521, 522, 523, 525 and 526 conditions will do that.
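
The retry rule stated above can be written down as a small model. This is only a sketch of the rule as described in this reply, not Cloudflare's actual implementation; `should_retry`, `serve_with_failover` and the `fetch` callback are hypothetical names used for illustration.

```python
# Status codes that, per the reply above, cause a retry against another origin.
RETRYABLE = frozenset({521, 522, 523, 525, 526})


def should_retry(status):
    """True if this status would cause the next origin to be tried."""
    return status in RETRYABLE


def serve_with_failover(origins, fetch):
    """Hypothetical model: try origins in order, moving on only for retryable errors.

    fetch(origin) must return an integer status code.
    Returns (origin, status) for the first non-retryable result, or
    (None, last_status) when every origin produced a retryable error.
    """
    status = None
    for origin in origins:
        status = fetch(origin)
        if not should_retry(status):
            return origin, status
    return None, status
```

Under this model a 520 from the first origin is returned to the client as-is, while a 521 moves on to the next origin, which matches the distinction drawn above.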

[add]
Seems the 520 is coming as a result of connecting to the “up” servers (curl is complaining about an empty response), and the US server is not responding on port 443, so it should appear as “down”… needs more digging…

https://rr-direct.hyperknot.com doesn’t have a valid SSL certificate (it is for example.com). In Cloudflare you likely have “Always Use HTTPS” enabled, so your test to http://rr-cf.hyperknot.com gets redirected to HTTPS, which is what I see…

curl -i http://rr-cf.hyperknot.com/
HTTP/1.1 301 Moved Permanently
Date: Fri, 25 Oct 2024 06:34:35 GMT
Content-Type: text/html
Content-Length: 167
Connection: keep-alive
Cache-Control: max-age=3600
Expires: Fri, 25 Oct 2024 07:34:35 GMT
Location: https://rr-cf.hyperknot.com/
Report-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v4?s=LIRBQG4V6s53H1CcDLrlnyNwJfff20lQIk9OL7svv7RVkknnirZHn2pkrZf%2B1FXghiXdMyPgmBEhL7VrTV5MaxUqGLuX0lvrnTV9%2FX5YYItLeYJbGJtCGhH%2FzxYZ%2Fb72doUzOgOs96NTojQT3FZk0O%2FF"}],"group":"cf-nel","max_age":604800}
NEL: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
Server: cloudflare
CF-RAY: 8d802e82292263ef-LHR
alt-svc: h3=":443"; ma=86400
server-timing: cfL4;desc="?proto=TCP&rtt=0&sent=0&recv=0&lost=0&retrans=0&sent_bytes=0&recv_bytes=0&delivery_rate=0&cwnd=0&unsent_bytes=0&cid=0000000000000000&ts=0&x=0"
Cf-Team: 2363fb6554000063efdae75400000001

<html>
<head><title>301 Moved Permanently</title></head>
<body>
<center><h1>301 Moved Permanently</h1></center>
<hr><center>cloudflare</center>
</body>
</html>

So for your 2 “up” servers, Cloudflare sees them as “down” (as the certificate error would be error 526), leaving the US server, which fails to respond. So the round robin is working as it should; it’s just that all your servers are “down” as far as Cloudflare is concerned.

Likely if you turn off “Always use HTTPS” then the HTTP only test should work as you expect (ensure your SSL/TLS mode is “Full (strict)”).

I haven’t configured HTTPS yet, as it’s a pain with Round Robin, but I will. Yes, testing is currently on HTTP for the direct hostname and HTTPS for the CF one, using the Flexible setting. I don’t think it should matter at all.

By “down”, I mean service nginx stop. This is to simulate an error with the server, network, datacenter, etc.

OK, I fixed everything with HTTPS, here are the configs (+ certbot creates the duplicate for 443)

server {
    server_name rr-direct.hyperknot.com rr-cf.hyperknot.com;

    access_log /data/access.log;
    error_log /data/error.log;

    location / {
        root /data;
        rewrite ^ /color.png break;
    }

    location /server {
        alias /etc/hostname;
        default_type text/plain;
    }

    listen 80;
}

Behaviour when everything is online

curl https://rr-direct.hyperknot.com/server

Returns the geographically closest server as it should.

curl https://rr-cf.hyperknot.com/server

Returns a server which is pinned to each client’s IP, so each client randomly gets one of US/EU/SG. If I switch from Wi-Fi to mobile data, it changes from US to SG, for example.

Behaviour when one server is offline

curl https://rr-direct.hyperknot.com/server

Detects the offline server and returns the geographically closest online server, as it should.

curl https://rr-cf.hyperknot.com/server

If your client was pinned to the offline server, you get error code: 521; for everyone else it works.
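
The curl checks above can be repeated in a loop to count which server answers. The following is a minimal sketch using only the Python standard library; `fetch_server` and `tally` are hypothetical helper names. Since the server appears to be pinned per client IP, a single machine will probably show one value only, so it would need to be run from several networks to see the spread.

```python
import collections
import urllib.error
import urllib.request


def fetch_server(url, timeout=5.0):
    """Return the response body stripped, or 'HTTP <code>' / 'error: ...' on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", "replace").strip()
    except urllib.error.HTTPError as exc:
        # Non-2xx answers such as 521 arrive here.
        return f"HTTP {exc.code}"
    except OSError as exc:
        # DNS failures, refused connections and timeouts (URLError is an OSError).
        return f"error: {exc}"


def tally(url, attempts=20, fetch=fetch_server):
    """Request url repeatedly and count each distinct outcome."""
    counts = collections.Counter()
    for _ in range(attempts):
        counts[fetch(url)] += 1
    return counts


# Example use (performs real HTTP requests, so it is left commented out):
# print(tally("https://rr-cf.hyperknot.com/server"))
```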


So I can conclude that Cloudflare Round Robin handling:

  • Does not perform Zero Downtime Failover; it keeps sending requests to the offline server.
  • Does not select the geographically closest server (this one wasn’t promised, just noting it).

Please fix the Zero Downtime Failover part.

We have a 3 server cluster using round-robin DNS with Cloudflare and it works correctly. While we normally remove the DNS record during maintenance, I’ve just tried without doing that and we see no requests to the server that’s down in our tests.

Testing your rr-cf link from various sites and countries, both curl and browser, I’m now only seeing replies from test-sg and test-eu, no errors, so it seems to be working ok for me.

It’s definitely doing the following for me on a few IPs I’m testing from. Of course it works in 2/3 of the cases, but not all.

curl https://rr-cf.hyperknot.com/server
error code: 521

I’m now seeing more random behaviour so something has changed again, this time switching between connecting to Cloudflare via IPv4 or IPv6.

As our round robin setup is working OK and there are no other reports coming in, whatever the issue is, it seems to be limited to you.

Just to confirm, you don’t have any AAAA records for those DNS records? Are you specifically allowing/blocking any Cloudflare IP ranges?

Others might be able to have a poke and report what they see or have other ideas.

Only A records like on the screenshot, nothing else.

I wrote an article about the tests and posted this on HN:

Interesting.

Are you using a free or paid plan? The introduction post for Zero-Downtime Failover mentioned that it was available on paid plans, though I have no idea if that information is still current.

Free plan. From what I understand this has been deprecated and I guess it should be on all plans now, but it might have been forgotten?

jgrahamc will answer in the HN thread.

Our round-robin DNS runs on several Ent zones and works as it should, so looks like maybe someone broke something for free plans.

If I get time today I’ll connect the same 3 servers through a free zone in the same and a different account and see what happens.
