Argo Tunnel as Kubernetes sidecar has multi-minute downtime during deploys

Hi, we are running Argo Tunnel as a sidecar in our Kubernetes cluster.

We are running a deployment of size 2. The Argo Tunnel sidecar runs with the following arguments:

cloudflared tunnel \
  --url=http://127.0.0.1:8080 \
  --hostname=${REDACTED} \
  --lb-pool=${REDACTED} \
  --origincert=/etc/cloudflared/cert.pem \
  --no-autoupdate
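Roughly, the sidecar container entry in our Deployment's pod spec looks like this (container name, image tag, and volume names are illustrative, not our exact manifest; the redacted values are supplied as in the command above):

```yaml
# Illustrative sidecar container in the Deployment pod spec.
# Image tag and volume names are placeholders.
- name: cloudflared
  image: cloudflare/cloudflared:2020.8.0
  args:
    - tunnel
    - --url=http://127.0.0.1:8080
    - --hostname=${REDACTED}
    - --lb-pool=${REDACTED}
    - --origincert=/etc/cloudflared/cert.pem
    - --no-autoupdate
  volumeMounts:
    - name: cloudflared-cert   # Secret holding cert.pem
      mountPath: /etc/cloudflared
      readOnly: true
```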

During deploys (via updating Kubernetes deployment), we notice heavily degraded performance for multiple minutes. Below is a period of time where two deploys occurred in quick succession.

Note that from Kubernetes perspective, our deploys take 30 - 40 seconds which includes new pods coming up and passing their health checks.

I expect that updating our backend pods would not be so disruptive and would certainly not last for multiple minutes after the deploy completes.

I am wondering if this is related to something I noticed in Cloudflare's dashboard. When I examine the load balancer referenced above, I see a total of 8 origins, all of rank 1, each receiving 13% of traffic. My understanding is that I should only see 2 origins, one for each currently running pod.

Thanks for your time!

New findings:

  • Each pod counts as an origin
  • Origins continue to count against your usage quota for multiple minutes after they go offline (based on the graph above, we believe this window to be 10 minutes)
  • An origin that cannot come online because of usage quotas does not crash or otherwise fail Kubernetes health checks (i.e. nothing prevents a rolling deploy from proceeding)

In our case, we were at our origin quota.

What we believe happened is:

  1. 2 Kubernetes pods are running; 0 origin slots are available because we are at quota
  2. Rolling deploy starts. A new pod comes online but never becomes an active origin, since no slots are available. Kubernetes does not realize this pod is in an unhealthy state
  3. Rolling deploy completes. None of the currently alive pods are active origins. The old origins are detected as dead and their timeout starts. No traffic is being served at this time
  4. 10 minutes later, the old origins time out and 2 slots free up. The existing pods become active origins and traffic starts being served
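One way to make step 2 visible to Kubernetes (a sketch we have not tested against our exact setup): cloudflared can expose a metrics listener via --metrics, and newer builds serve a /ready endpoint on it that fails until the tunnel has registered. Assuming your build has that endpoint, a readiness probe on the sidecar would hold up the rolling deploy instead of letting it proceed silently:

```yaml
# Sketch: gate pod readiness on the tunnel actually registering,
# so an origin stuck at the quota fails the rolling deploy
# instead of passing health checks silently.
- name: cloudflared
  args:
    - tunnel
    - --metrics=0.0.0.0:2000   # serves /ready on newer cloudflared builds
    # ...same tunnel flags as above...
  readinessProbe:
    httpGet:
      path: /ready
      port: 2000
    initialDelaySeconds: 5
    periodSeconds: 10
```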

The Solution

We increased our origin quota to double what we require on an ongoing basis.
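As a sanity check on the 2x figure, here is the back-of-envelope arithmetic under our assumptions (dead origins linger against the quota for ~10 minutes, and a rolling deploy finishes well inside that window):

```python
def peak_origins(replicas: int, deploy_minutes: float, linger_minutes: float = 10) -> int:
    """Worst-case origin slots consumed during one rolling deploy."""
    if deploy_minutes < linger_minutes:
        # Every old origin still counts against the quota while all the
        # replacement pods try to register their new origins.
        return 2 * replicas
    return replicas

# With 2 replicas and a ~40-second deploy, the quota must cover 4 origins.
print(peak_origins(replicas=2, deploy_minutes=0.7))
```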

Since then, our deploys have been zero-downtime and we can't see any degradation on our dashboard during deploy windows.

Hi @sam.myers, I'm glad you found a solution. Please check out the Argo Tunnel "Named Tunnel" beta; we have a new model that keeps your load balancer origins persistent across deployments. We are planning to launch load balancer support with this new model in late September or early October.