Argo Tunnel is a single point of failure

It appears that Argo Tunnel is a single point of failure. When Cloudflare was having issues today, we were unable to access several of our servers. Fortunately, most of them were development servers.

I was considering using Argo Tunnel for our production servers, but I’ll have to rethink that. If our production systems had been on Argo Tunnel, they would have been inaccessible for several hours.

I’ve been using Argo Tunnel on our dev systems for several months and this is not the first time I’ve noticed problems (though this is the longest downtime I’ve seen). I tried using the tunnel for SSH connections, but it keeps dropping the connection (despite having TCP keep-alive enabled in my iTerm client on my Mac). I can keep a direct SSH connection open for days or weeks, but I’m lucky if Argo Tunnel keeps an SSH session open for more than an hour or two. Argo Tunnel may be fine for web connections that are constantly renewed, but for us it’s unusable for SSH.
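For anyone wondering about the setup: the client side of SSH over the tunnel is just cloudflared as an SSH ProxyCommand plus keep-alives in ~/.ssh/config, roughly like this (the hostname and cloudflared path are placeholders, and the keep-alive values are only what I’d suggest trying):

# ~/.ssh/config entry for SSH over Argo Tunnel (hostname and path are placeholders)
Host dev.example.com
    # cloudflared proxies the SSH connection through the tunnel
    ProxyCommand /usr/local/bin/cloudflared access ssh --hostname %h
    # application-level keep-alives on top of TCP keep-alive
    ServerAliveInterval 30
    ServerAliveCountMax 5
    TCPKeepAlive yes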

Is Cloudflare doing anything to address the SPOF problem with Argo Tunnel?

Given my experience over the past few months, I’m not confident using it on our production websites. Even with production servers in multiple geographic locations, had I been using Argo Tunnel to connect, all of them would have been down. I have developers working in multiple locations and all of them had problems accessing our systems over Argo Tunnel.

Not sure it’s designed for SSH, but we have been using it in prod for some sites for more than 6 months. This is the 3rd significant outage since then. Most have been related to API this or API that. Given the per-GB pricing, I definitely expect better than a SPOF API service. Hopefully this outage results in some significant rearchitecting.

I don’t know what apps or systems you’re using; in my case I use it for a payment solution. The nodes have failover queues that are only triggered in situations like these. I believe the issue with Argo was the API handshake.

True:

My logs:

info {"connectionID":2,"level":"info","msg":"Muxer shutdown","time":"2020-04-15T16:28:18Z"}
info {"connectionID":2,"level":"info","msg":"Retrying in 1s seconds","time":"2020-04-15T16:28:18Z"}
info {"connectionID":2,"level":"info","msg":"Connected to AMS","time":"2020-04-15T16:28:19Z"}

All our tunnels are still down. It has been 5 hours for us. It seems like bad design for API access to be critical to keeping tunnels up when there is no underlying network interruption. Obviously API access would be critical for tunnel setup.

All my nodes are up now:

info {"connectionID":0,"level":"info","msg":"Route propagating, it may take up to 1 minute for your new route to become functional","time":"2020-04-15T21:34:21Z"}
info {"connectionID":1,"level":"info","msg":"Connected to FRA","time":"2020-04-15T21:34:21Z"}
info {"connectionID":2,"level":"info","msg":"Connected to AMS","time":"2020-04-15T21:34:22Z"}
info {"connectionID":3,"level":"info","msg":"Connected to FRA","time":"2020-04-15T21:34:23Z"}

Mine have been endlessly retrying to connect with no success, sadly.

Can you force a cloudflared restart?
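If it’s installed as a system service, something like this should do it (assuming the default systemd service name from cloudflared service install; adjust for launchd or other init systems):

sudo systemctl restart cloudflared
journalctl -u cloudflared -f    # watch the reconnect attempts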

I’ve restarted cloudflared, I’ve restarted the hosts, nothing. One of them “connects” but the DNS record never gets created, and the other is stuck retrying its connection to the API.

Same for me. I restarted everything and also updated the daemons. The Cloudflare dashboard shows all my tunnels have been up for 2 days, but none of my daemons connect. They are timing out to ORD.

Possibly regions are coming back online slowly.

Finally 3/4 of my tunnels are back up… Hopefully the last one will come back up soon.

100% of my tunnels are back now. 05:42 TTR

According to my monitoring, my tunnels were down from 12:31-18:12 EST

Have you checked the machine logs? What are they showing?

Pretty much exactly the same outage window for us too.

Looks like our tunnels are back now. You’re right, Cloudflare needs to redo their architecture to prevent this kind of failure from happening again. Argo Tunnel appears to retry if there are issues connecting to a particular POP, but it looks like the tunnel authentication is a SPOF. It doesn’t do much good to retry if all the tunnel endpoints depend on a single authentication source that can fail.

Just to inform everyone, here is the post-mortem.

My entire app is based on Argo Tunnels, so yesterday I was 100% offline for my users. I really hope this is the kind of thing that happens once every 10 years, but reading this doesn’t help.

Until CF figures out how to move toward a fault-tolerant Argo Tunnel, maybe everybody should have a backup plan. I was thinking of using another tunnel service like https://ngrok.com/. It works great, but it’s super slow for production; maybe they have a faster plan, or do you know any other reliable tunnel service we can use as a backup?

I should probably also stop using Cloudflare Access and any other services that are hard to replicate in a backup solution. Maybe the key is moving most things to the application level and just using Argo as a tunnel: if it’s up, great, we get the extra security that CF offers; otherwise, with a backup plan in place, you’re at least still reachable, but the additional security checks have to be implemented outside of CF.
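To make the backup plan concrete, what I have in mind is a dumb cron’d health check that probes the service through the tunnel and, if the tunnel path is down while the app itself is still healthy, triggers whatever the fallback is. The hostnames, port, and fallback script below are placeholders, not anything CF provides:

#!/bin/sh
# Rough failover sketch: probe the app through the tunnel hostname, and if
# that fails while the origin still answers locally, trigger the backup path.
TUNNEL_URL="https://app.example.com/healthz"   # placeholder tunnel hostname
ORIGIN_URL="http://127.0.0.1:8080/healthz"     # placeholder direct origin

if ! curl -fsS --max-time 10 "$TUNNEL_URL" >/dev/null; then
  if curl -fsS --max-time 5 "$ORIGIN_URL" >/dev/null; then
    # Tunnel path is down but the app is fine: bring up the fallback
    # (start ngrok, repoint DNS to a VPN endpoint, etc.)
    logger -t tunnel-failover "tunnel down, origin healthy: failing over"
    /usr/local/bin/start-backup-tunnel.sh      # placeholder fallback script
  fi
fi

One caveat: yesterday the Cloudflare API itself was affected, so a fallback that depends on the CF API (e.g. updating DNS records through it) would share the same SPOF.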

WDYT?

Guys, what’s your backup plan?

Luca

From what I gather from the post-mortem, this was mostly due to human error, with the wrong cables being unplugged in a patch panel. It was a long time ago, but I once made a similar mistake and knocked an American Express server offline for over an hour. I was a new contractor and they sent me down to the server room to fix a problem. I was unfamiliar with how everything was configured and accidentally disconnected the wrong system.

The human element can never be totally eliminated and complex systems can fail in unexpected ways, but hopefully CF will redo things to reduce this possibility in the future.