Argo Tunnel Outage

Hi,

We experienced an outage with Argo tunnels yesterday afternoon Pacific/Auckland time. Sydney and Melbourne were unreachable, and I was wondering whether anyone else had issues with Argo tunnels around this period.
I did see that the Argo service was reported as degraded, and I am wondering if this was related.
I also noticed that our tunnels did not try to connect to region1 infrastructure, even though region2 infrastructure seemed unreachable.
Or was this an issue with the Auckland node? I also wonder whether it routes all Argo tunnel traffic through to Australia.

Thanks,

You are presumably talking about https://community.cloudflare.com/t/argo-tunnel-availability-issues/246514, which was a general Argo issue and not Australia-specific.

Also, you might have to update your cloudflared binary.

We have numerous customers running Argo Tunnel service. Why was this required update not announced earlier? We had over 30 customers we had to contact to have them manually update their servers. It’s fine that a new cloudflared.exe is needed but this requires notice and planning with customers.

For that you will need to open a support ticket.

Thanks for the reply. How often has this happened? I have production systems running through Argo tunnels, and an outage of approximately 11 hours doesn’t instill a lot of confidence.

Should not be a regular thing.

Hello @online4 ,

The incident linked by @sandro was very likely the culprit for what you observed.
This was an unfortunate problem that should not have happened and we apologize for that.

For what it’s worth, since November last year we have been emailing owners of accounts with tunnels running on old enough versions about the need to update. That is by no means a justification for the problem found yesterday, but it is a good reminder to avoid getting stuck on versions that are more than 1 year old. In this case, versions 2019.11.2 or older need to update. More recent ones should not have had any problems yesterday or, in cases where connections were lost, they were re-established after we rolled the fix through our edge.

As to the occurrence of such issues, I can tell you that the past 12 months have been quite peaceful, with almost no incidents in Argo Tunnel at all. The product has really caught up with a lot of reliability improvements recently, and it’s really worth using the most recent versions.

Hi,

Thanks for the response. We were using the newer version 2021.2.2 and still lost the connections to our tunnels.
I did want to know whether Argo fails over from region2 to region1 if region2 is down in my part of the world, or whether that is only available on the Enterprise plan.

Thanks.

There was a general issue with Argo. I’d assume the fix for it is one of the factors that now makes it necessary to upgrade to a newer version.

Losing connections is an expected reality of persistent connections over the internet. That is why cloudflared establishes 4 connections to 2 different Cloudflare colos, so that the likelihood of all 4 going down is very, very small. As long as 1 is still up, Cloudflare will still route incoming requests to your origins.
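The redundancy argument above can be put into rough numbers. The following is only a back-of-the-envelope sketch: the function name, the failure probabilities, and the assumption that colos and connections fail independently are all illustrative and do not come from Cloudflare.

```python
def p_all_connections_down(p_colo: float, p_conn: float,
                           conns_per_colo: int = 2, colos: int = 2) -> float:
    """Probability that every tunnel connection is down at the same time.

    Assumed model (hypothetical, not Cloudflare's): each colo is down with
    probability p_colo, and while a colo is up each connection to it is
    down independently with probability p_conn.
    """
    # All connections to one colo are lost if the colo itself is down,
    # or if it is up and every connection to it failed on its own.
    p_one_colo_dark = p_colo + (1.0 - p_colo) * p_conn ** conns_per_colo
    # The tunnel is unreachable only if this happens at every colo.
    return p_one_colo_dark ** colos


if __name__ == "__main__":
    # Illustrative inputs only: 1% colo outage, 5% per-connection loss.
    print(p_all_connections_down(p_colo=0.01, p_conn=0.05))
```

The independence assumption is exactly what breaks down when a single shared cause affects every connection at once, so the result is best read as an optimistic bound.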

What happened yesterday during that incident was that connections could not be (re-)established because an embedded certificate in cloudflared had expired. That was the unfortunate problem that had been lying in there for several years.
We could roll a fix across the edge, but only for sufficiently recent versions (2019.11.3 onwards).

During that period, cloudflareds would still receive traffic normally; they just would not be able to re-establish their connections if lost. Most cloudflareds had at least 1 connection still up and therefore did not notice the incident.

I hope this helps you understand what happened. Having said that, we’ve obviously learnt a hard lesson and will roll out new checks across our automations to check for embedded certificates, so that we can update them well in time and releases always ship with expiration dates far in the future.
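A check of the kind described above could look roughly like the sketch below. It is not Cloudflare's automation: the function names, the 365-day threshold, and the idea of feeding in a certificate's notAfter string are assumptions made for illustration. It only relies on `ssl.cert_time_to_seconds` from the Python standard library, which parses the notAfter format used by OpenSSL.

```python
import ssl
from datetime import datetime, timedelta, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the OpenSSL text format, e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now) / timedelta(days=1)


def needs_renewal(not_after: str, now: datetime, min_days: float = 365) -> bool:
    """True if the certificate expires within `min_days` (assumed threshold)."""
    return days_until_expiry(not_after, now) < min_days


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical notAfter value; a release pipeline would read it from the
    # certificate that is about to be embedded in the binary.
    print(needs_renewal("Jan  5 09:34:43 2018 GMT", now))
```

Running such a check at release time, with a generous threshold, would fail the build long before an embedded certificate gets close to expiring.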

I understand that what the OP specifically referred to here was whether Cloudflare chooses different PoPs in case the two default ones cannot be reached, and from your answer I’d deduce that would not be the case.

That is correct. Cloudflared will try various PoPs.

The way this is set up is simple: the whole world is split into 2 anycast regions, and cloudflared establishes 2 connections to each of those anycast regions. Each anycast region resolves to various IPs, and cloudflared cycles through them when non-recoverable errors happen while connecting.
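The setup described above can be sketched as follows. This is not cloudflared's actual code: the function names, the `dial` callback, the attempt limit, and the documentation-range IP addresses are all hypothetical and only illustrate "2 connections per region, rotate through that region's IPs on failure".

```python
from itertools import cycle
from typing import Callable, Dict, Iterator, List, Optional, Tuple


def connect_with_rotation(pool: Iterator[str], dial: Callable[[str], bool],
                          max_attempts: int) -> Optional[str]:
    """Try successive addresses from `pool` until one dial succeeds."""
    for _ in range(max_attempts):
        addr = next(pool)
        if dial(addr):  # dial() returns False on a non-recoverable error
            return addr
    return None


def establish_tunnel(regions: Dict[str, List[str]], dial: Callable[[str], bool],
                     per_region: int = 2,
                     max_attempts: int = 6) -> List[Tuple[str, Optional[str]]]:
    """Open `per_region` connections in every region, rotating through its IPs."""
    connections = []
    for name, addresses in regions.items():
        pool = cycle(addresses)  # shared per region, so rotation carries over
        for _ in range(per_region):
            addr = connect_with_rotation(pool, dial, max_attempts) if addresses else None
            connections.append((name, addr))
    return connections


if __name__ == "__main__":
    # Hypothetical addresses from the RFC 5737 documentation ranges.
    regions = {
        "region1": ["192.0.2.1", "192.0.2.2"],
        "region2": ["198.51.100.1", "198.51.100.2"],
    }
    # Pretend one address is unreachable to show the rotation.
    print(establish_tunnel(regions, dial=lambda addr: addr != "192.0.2.1"))
```

Note that the rotation never leaves its own region: if every address in a region fails, that region's connections stay down in this sketch, and reaching another PoP depends on which PoP answers for the anycast IP.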

Fair enough, that should address the OP’s question then. cloudflared does connect to two PoPs, but in case they are both not reachable, it won’t try others.

I would not draw that conclusion, @sandro.

Cloudflared attempts to establish 4 connections. For each one, it can connect to half of Cloudflare’s PoPs (since they are split into 2 anycast regions). For each of those connections, it will try the closest PoP (due to the nature of anycast). Naturally, if Cloudflare takes a PoP offline, anycast will route that connection to another PoP.

Does this help?

How come? With anycast, cloudflared is limited to what’s in the vicinity, and if Argo does not work in these two datacentres, for whatever reason, cloudflared can’t and won’t connect elsewhere.

In this case it was MEL and SYD and if there are issues in these two PoPs, it won’t go to e.g. SIN, unless the routing is changed but that was not the case yesterday AFAIK.

Now should the routing generally be changed, then yes, of course, it will connect to other PoPs, but that’s a routing question. The OP was referring to connectivity issues with Argo.

That’s why I said that “if Cloudflare takes a PoP offline, anycast will route that connection to another PoP.”

I.e. if Cloudflare takes a PoP offline, it will stop advertising the anycast IPs. In that case, cloudflareds will connect to other, farther away PoPs.

Thanks for the explanation. That explains why the Connections output of tunnel list looks like this:

cloudflared tunnel list | awk '{print $4,$5,$6}'
CONNECTIONS  
2xMCI, 2xYYZ 
2xIAD, 2xMIA 

Which means 2xMCI is one anycast group and 2xYYZ is the other anycast group?

I have also seen this, though:

2xXXX, 1xXXX, 1xXXX

Does that mean there’s a 3rd location? I am on the CF Enterprise plan.

Which means 2xMCI is one anycast group and 2xYYZ is the other anycast group?

Correct.

Does that mean there’s a 3rd location?

No, there are only 2 anycast groups, that’s for sure.
Note that anycast does not guarantee that you will always connect to the same nearest PoP. It is very likely, but many factors in the network can cause you to connect to another PoP (still in that anycast region). That may be what happened when you observed connections to 3 PoPs (1 in one anycast region, 2 in the other).
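To spot the 3-PoP case automatically, the CONNECTIONS column can be tallied with a few lines of Python. This is a minimal sketch that assumes the column always has the `NxPOP, NxPOP` shape shown earlier in the thread; `parse_connections` is a hypothetical helper, and the sample input with three PoPs is made up for illustration.

```python
import re
from typing import Dict


def parse_connections(column: str) -> Dict[str, int]:
    """Turn a CONNECTIONS value such as '2xMCI, 2xYYZ' into {'MCI': 2, 'YYZ': 2}."""
    counts: Dict[str, int] = {}
    # Each entry is assumed to be '<count>x<three-letter PoP code>'.
    for count, pop in re.findall(r"(\d+)x([A-Za-z]{3})", column):
        pop = pop.upper()
        counts[pop] = counts.get(pop, 0) + int(count)
    return counts


if __name__ == "__main__":
    for column in ("2xMCI, 2xYYZ", "2xIAD, 1xMIA, 1xATL"):  # second value is invented
        counts = parse_connections(column)
        print(column, "->", counts, "| PoPs:", len(counts),
              "| connections:", sum(counts.values()))
```

Three PoPs with 4 connections in total is consistent with the explanation above: still 2 connections per anycast region, with one region's pair split across two PoPs.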

Thanks for the explanation :slight_smile:

Reading the discussion has helped me understand more about how Argo works and also the reasons behind the outage (as I was asked for a specific cause), so thank you for the explanation.
