Persisting connection drops between Madrid (MAD) and Amsterdam (AMS)

What is the name of the domain?

vandenboom.es

What is the error number?

522

What is the error message?

Origin Connection Time-out

What is the issue you’re encountering

About 1% of our Spanish traffic can’t reach the origin server. This only happens with traffic from Spain; from other countries the number of 522s is negligible.

What steps have you taken to resolve the issue?

  • Checked website performance (plenty of resources)
  • Checked ufw firewall rules, all Cloudflare IPs are allowlisted
  • Checked ufw logs for blocked packets, no Cloudflare IPs
  • Ran MTR between ISP and Cloudflare, and between Cloudflare and Origin Server. Also ran direct MTRs from ISP to Origin Server. We see some packet loss at one of the routers (be2325.ccr32.bio02.atlas.cogentco.com)
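For anyone repeating the firewall check above, here is a minimal Python sketch of one way to scan ufw log lines for blocked packets from Cloudflare addresses. The range list is only a partial snapshot used for illustration (the current list is published at https://www.cloudflare.com/ips/), and the function names and the assumed log layout (`[UFW BLOCK]` followed by a `SRC=` field) are assumptions, not something taken from this thread.

```python
import ipaddress
import re

# Assumption: partial snapshot of Cloudflare's published IPv4 ranges.
# Always refresh from https://www.cloudflare.com/ips/ before relying on it.
CLOUDFLARE_V4 = [
    "173.245.48.0/20",
    "103.21.244.0/22",
    "141.101.64.0/18",
    "108.162.192.0/18",
    "162.158.0.0/15",
    "104.16.0.0/13",
    "172.64.0.0/13",
]
NETWORKS = [ipaddress.ip_network(n) for n in CLOUDFLARE_V4]

# Assumption: ufw block entries contain "[UFW BLOCK]" and a "SRC=<ip>" field.
SRC_RE = re.compile(r"\[UFW BLOCK\].*?\bSRC=(\S+)")


def is_cloudflare(ip: str) -> bool:
    """Return True if ip falls inside one of the listed ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in NETWORKS)


def blocked_cloudflare_sources(log_lines):
    """Return the sorted set of blocked source IPs that belong to Cloudflare."""
    hits = set()
    for line in log_lines:
        match = SRC_RE.search(line)
        if match and is_cloudflare(match.group(1)):
            hits.add(match.group(1))
    return sorted(hits)
```

An empty result from `blocked_cloudflare_sources(open("/var/log/ufw.log"))` would be consistent with the firewall not being the cause. The log path is also an assumption and differs between systems.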

What are the steps to reproduce the issue?

Opening our website https://www.vandenboom.es (hosted in Amsterdam) from a Spanish client

Screenshot of the error

We’re experiencing the same issue. A few URLs get stuck and fail to reach the origin server. After testing with VPNs, we confirmed this problem only occurs with the Madrid node, while all other nodes we tested function properly. Changing the origin to IPv6 resolves the issue.

That’s very helpful. How can I change the origin to IPv6? Is this an easy fix?

We’re experiencing the same issue. Connections from our hosts to some of the MAD IPs are not getting through either (tested with ICMP requests: one host can reach certain IPs that another cannot, and vice versa). Not resolved yet.

Hi Eveline,

In our case, it was very easy because our origin supports both IPv4 and IPv6. We simply changed the proxied A record to an AAAA record with our IPv6 address.
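If you would rather script the record change than use the dashboard, the sketch below builds (but does not send) a request to create a proxied AAAA record through the Cloudflare API's DNS records endpoint. The zone ID, token, hostname and address are placeholders, and `build_aaaa_request` is a hypothetical helper name. Treat it as a rough outline under those assumptions rather than a tested migration script.

```python
import ipaddress
import json
import urllib.request

API = "https://api.cloudflare.com/client/v4"


def build_aaaa_request(zone_id, token, name, ipv6, proxied=True):
    """Build a POST request that would create a proxied AAAA record.

    All arguments are placeholders supplied by the caller; nothing is sent.
    """
    addr = ipaddress.ip_address(ipv6)
    if addr.version != 6:
        raise ValueError(f"{ipv6} is not an IPv6 address")
    body = {
        "type": "AAAA",
        "name": name,
        "content": str(addr),
        "proxied": proxied,
        "ttl": 1,  # 1 means "automatic" in Cloudflare's API
    }
    return urllib.request.Request(
        f"{API}/zones/{zone_id}/dns_records",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Sending it would be `urllib.request.urlopen(build_aaaa_request(...))`. Removing the old A record is a separate step, and it is worth writing down the IPv4 address first in case support asks for it.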

Our hypothesis is that the Madrid node uses a different path for IPv6, which works around the underlying problem.

Hope this helps,

Eveline,

This is what our record looks like now:

Thanks! It really seems to help for Madrid traffic. I replaced the A record with an AAAA record. Did you keep the IPv4 record?

Because now it looks like I’ve replaced one problem with another…

Hi Eveline,

I removed the A records and left only the AAAA record. This hack has kept our affected sites running fine since then. I probably have 15 sites impacted.

What problem are you experiencing because of that?

It seems to work fine for 2 of our servers. There I have implemented your fix. With a third one (see the graph) I have reverted to the old A records, because the IPv6 would cause even more 522s. Not from Spain specifically, but from a list of different countries.

Oh, I misinterpreted the graph (missed the filter)—my apologies.

I’ll keep you updated if I notice any improvements. For now, I’ve disabled Cloudflare on many sites and have a few others running with IPv6. This issue is causing significant trouble for your Spanish customers.

Eveline, how did you run the MTR between Cloudflare and Origin? I will do the same and open a support request.

Using Linux on the origin?

Helpful article:

Not sure if it’s accurate. But I ran a couple of pings from my desktop to my website and noted the IPs it resolved to. It alternated between 2 IPs.
Then I ran an MTR to these IPs from the origin. My hosting company was looking with me, and at first we didn’t see packet loss. But I forgot to end the MTR, and a few hours later I found there was a consistent 1% packet loss over all hops up until the router mentioned in my opening post.
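Since loss around 1% only shows up after many probes, a long report run (for example `mtr --report --report-cycles 1000 <ip>`) is easier to read with a small script. The Python sketch below parses the usual report layout and finds the hop from which loss persists all the way to the destination. Loss at a single middle hop that disappears again further along is often just ICMP rate limiting on that router. The sample report uses made-up documentation addresses and numbers, and the function names are hypothetical.

```python
import re

# Matches report lines such as "  2.|-- 198.51.100.9   5.0%   100   1.1 ..."
HOP_RE = re.compile(r"^\s*(\d+)\.\|--\s+(\S+)\s+([\d.]+)%\s+(\d+)")

# Illustrative sample only: addresses and figures are invented.
SAMPLE_REPORT = """\
HOST: origin                      Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.0.2.1                  0.0%   100    0.4   0.5   0.3   1.2   0.1
  2.|-- 198.51.100.9               5.0%   100    1.1   1.3   0.9   3.0   0.3
  3.|-- 198.51.100.20              0.0%   100    8.2   8.4   8.0   9.9   0.2
  4.|-- 203.0.113.7                1.0%   100   24.1  24.3  23.9  27.0   0.4
  5.|-- 203.0.113.80               1.0%   100   24.6  24.8  24.2  28.1   0.5
"""


def parse_mtr_report(text):
    """Return one dict per hop: hop number, host, loss percent, packets sent."""
    hops = []
    for line in text.splitlines():
        match = HOP_RE.match(line)
        if match:
            hops.append({
                "hop": int(match.group(1)),
                "host": match.group(2),
                "loss": float(match.group(3)),
                "sent": int(match.group(4)),
            })
    return hops


def persistent_loss_start(hops):
    """Return the earliest hop of the trailing run of lossy hops, or None.

    Walks backwards from the destination and stops at the first clean hop,
    so isolated loss in the middle of the path is ignored.
    """
    start = None
    for hop in reversed(hops):
        if hop["loss"] > 0:
            start = hop
        else:
            break
    return start
```

With the sample above, `persistent_loss_start(parse_mtr_report(SAMPLE_REPORT))` points at hop 4, while the 5% at hop 2 is ignored because hop 3 is clean.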

Yes, that would give a traceroute between origin (source) and destination (Cloudflare).

I needed the reverse to see if the Madrid node is having packet loss when going to my origin.

Hi all.

I can see the origin has already been moved to IPv6 only. Can you please DM me the previous IPv4 address and the results you captured with the MTR?

Thanks

@bwalters if it helps, this is what I noticed:

Some websites experience issues loading static files from the origin. On the browser side, these resources appear stalled in Cloudflare’s cache and eventually time out after 19.2 seconds, returning a 522 error.

Observations:

• Once a static file stalls, it remains stalled until the cache is cleared.

• Clearing the cache temporarily resolves the issue, but after a few hours, another random static file starts returning a 522 error.

• Adding a query string to the static file (modifying the URL slightly) makes it work again, and the origin becomes reachable.

• The issue occurs only when the Cloudflare node is Madrid and the origins are in London or Ireland. Origins in Paris work fine.

• No issues occur with any other Cloudflare node.

• The problem persists on both Pro and Free accounts.

• Extensive testing using VPNs and checking the node with RayID confirms this behavior in our case.

• Switching from IPv4 to IPv6 resolved the issue permanently in our case. But I have tens of potentially affected sites where IPv6 is not an option.
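Two of the checks above are easy to script. The Python sketch below reads the data center code from a `CF-RAY` response header (the value ends in a hyphen followed by the airport code of the serving node, such as MAD) and appends a query-string parameter to a URL so that it is treated as a different cache key. The helper names, the `cb` parameter and the sample values are illustrative assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def colo_from_ray(cf_ray):
    """Return the data center code from a CF-RAY header value, or None.

    Example shape (made-up ID): "8c1a2b3c4d5e6f70-MAD" gives "MAD".
    """
    _ray_id, sep, colo = cf_ray.strip().rpartition("-")
    return colo.upper() if sep and colo else None


def with_cache_buster(url, value, param="cb"):
    """Return url with an extra query parameter appended.

    Existing query parameters are kept in their original order.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, value))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For instance, requesting `with_cache_buster("https://example.com/static/app.css?v=2", "1")` and comparing the outcome with the plain URL, while logging `colo_from_ray(...)` for each response, gives a repeatable version of the manual test described in the list.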

Would appreciate any insights or guidance on resolving this issue.

Our problem is similar to dani1’s. But we have our origin in Amsterdam.
I have 2 out of 3 servers moved to IPv6. We tried for the third server but it made the problems even worse: we got 522s from around Europe instead of only from Spain (Madrid node).

IPv6 vs. IPv4 should not be the issue. It’s just a way dani1 found to be effective to circumvent the actual problem. There is a bad hop in the connection and I’m baffled that Cloudflare’s monitoring doesn’t notice this and reroute traffic, even after weeks.

I don’t see where I can send a DM. There is no link in your profile?

Hi @dani1, thanks.

Can you please DM me some of the IPv4 origin IPs and site names?

Hi @eveline, thanks for the response.

Just to clarify here, we need the IPv4 addresses of the origins so we can trace them to diagnose where the reachability issues are, and ensure the egress path is shifted if necessary.

The IPv4 and IPv6 prefixes have different routing policies applied.

The zones mentioned currently only resolve to an IPv6 origin when I check (because of the workaround applied) so I kindly ask you to DM me the old IPv4 address and any of the MTRs you took.

We have extensive metrics and automated systems to detect, alert, and automatically route around general connectivity & origin reachability issues (where possible).

There is no on-going loss on the path that is significant enough to register in our metrics.

This is not to say there isn’t an issue; however, I suspect it’s isolated to traffic taking a certain intermediate hop on the return path, which may only affect a very limited number of total flows on it.

You should be able to message me from my community profile https://community.cloudflare.com/u/bwalters/summary