Apparent Memory Leak In Argo Tunnel since 2021.1.3

I have observed that the cloudflared process memory utilisation has not been stable since cloudflared upgraded itself to 2021.1.3 on 2021-01-15 16:30 UTC. We were previously using 2020.11.11. The attached screen capture shows the memory utilisation metric. Prior to 2021.1.3, memory utilisation was stable, but after 2021.1.3, it grows over a period of 6 days until it saturates and causes performance issues that need to be addressed by restarting the task that contains the cloudflared process.

This may be a different issue, but errors like the following are happening extremely frequently now, but used not to be so frequent.

| 1:40AM ERR Serve tunnel error error="Application error: 3004 timeout waiting for initial response headers" connIndex=3                                  |
| 1:40AM INF Retrying connection in 16s seconds error="Application error: 3004 timeout waiting for initial response headers" connIndex=3                          |
| 1:40AM INF mux - metrics: Stopping mux metrics updater                                                                                                                                     |
| 1:40AM ERR Serve tunnel error error="Application error: 3004 timeout waiting for initial response headers" connIndex=0                                  |
| 1:40AM INF Retrying connection in 2s seconds error="Application error: 3004 timeout waiting for initial response headers" connIndex=0                           |
| 1:40AM ERR supervisor: Tunnel disconnected error="Application error: 3004 timeout waiting for initial response headers"                                          |
| 1:40AM INF mux - metrics: Stopping mux metrics updater                                                                                                                                     |
| 1:40AM ERR Serve tunnel error error="tls: use of closed connection" connIndex=2                                                                         |
| 1:40AM INF Retrying connection in 1s seconds error="tls: use of closed connection" connIndex=2                                                                  |
| 1:40AM ERR supervisor: Tunnel disconnected error="already connected to this server, trying another address"                                                      |
| 1:40AM INF mux - metrics: Stopping mux metrics updater                                                                                                                                     |
| 1:42AM ERR Serve tunnel error error="Application error: 3004 timeout waiting for initial response headers" connIndex=2                                  |
| 1:42AM INF Retrying connection in 2s seconds error="Application error: 3004 timeout waiting for initial response headers" connIndex=2                           |
| 1:43AM INF mux - metrics: Stopping mux metrics updater                                                                                                                                     |
| 1:43AM INF Retrying connection in 1s seconds error="DialContext error: dial tcp 198.41.200.193:7844: i/o timeout" connIndex=0                                   |
| 1:43AM INF Retrying connection in 1s seconds error="DialContext error: dial tcp 198.41.192.47:7844: i/o timeout" connIndex=1                                    |
| 1:43AM INF Retrying connection in 1s seconds error="DialContext error: dial tcp 198.41.192.107:7844: i/o timeout" connIndex=3                                   |
| 1:44AM INF Retrying connection in 4s seconds error="DialContext error: dial tcp 198.41.200.13:7844: i/o timeout" connIndex=2                                    |
| 1:44AM INF Retrying connection in 2s seconds error="DialContext error: dial tcp 198.41.200.193:7844: i/o timeout" connIndex=0                                   |
| 1:44AM INF Retrying connection in 2s seconds error="DialContext error: dial tcp 198.41.192.47:7844: i/o timeout" connIndex=1                                    |
| 1:44AM ERR Connection handshake with edge server failed error="Handshake error: 1000 handshake timeout" connIndex=2                                     |
| 1:44AM INF Retrying connection in 8s seconds error="Handshake error: 1000 handshake timeout" connIndex=2                                                        |
| 1:44AM ERR Connection handshake with edge server failed error="Handshake error: 1000 handshake timeout" connIndex=0                                     |
| 1:44AM INF Retrying connection in 4s seconds error="Handshake error: 1000 handshake timeout" connIndex=0                                                        |
| 1:44AM ERR Connection handshake with edge server failed error="Handshake error: 1000 handshake timeout" connIndex=1                                     |
| 1:44AM INF Retrying connection in 4s seconds error="Handshake error: 1000 handshake timeout" connIndex=1                                                        |
| 1:45AM INF Retrying connection in 8s seconds error="TLS handshake with edge error: read tcp 172.17.10.146:43334->198.41.192.47:7844: i/o timeout" connIndex=1   |
| 1:46AM INF mux - read: EOF: connection closed unexpectedly                                                                                                                                 |
| 1:46AM INF Retrying connection in 16s seconds error="DialContext error: dial tcp 198.41.200.13:7844: i/o timeout" connIndex=2                                   |
| 1:46AM ERR mux - write: error writing window update: tls: use of closed connection: streamID: 1                                                                                    |
| 1:46AM ERR writing finish: EOF                                                                                                                                                     |
| 1:46AM ERR writing finish: EOF                                                                                                                                                     |
| 1:46AM ERR mux - write: error writing data: write tcp 172.17.10.146:56618->198.41.192.107:7844: write: connection reset by peer: streamID: 1                                       |
| 1:46AM INF mux - metrics: Stopping mux metrics updater                                                                                                                                     |
| 1:46AM ERR supervisor: Tunnel disconnected error="DialContext error: dial tcp 198.41.200.13:7844: i/o timeout"                                                   |
| 1:46AM ERR Serve tunnel error error="Application error: 3004 timeout waiting for initial response headers" connIndex=3                                  |
| 1:46AM INF Retrying connection in 2s seconds error="Application error: 3004 timeout waiting for initial response headers" connIndex=3                           |
| 1:46AM ERR writing release: EOF                                                                                                                                                    |
| 1:46AM ERR Serve tunnel error error="tls: use of closed connection" connIndex=0                                                                         |
| 1:46AM INF Retrying connection in 8s seconds error="tls: use of closed connection" connIndex=0                                                                  |
| 1:46AM INF Retrying connection in 16s seconds error="DialContext error: dial tcp 198.41.192.47:7844: i/o timeout" connIndex=1                                   |
| 1:46AM INF mux - read: EOF: connection closed unexpectedly                                                                                                                                 |
| 1:46AM ERR supervisor: Tunnel disconnected error="DialContext error: dial tcp 198.41.192.47:7844: i/o timeout"                                                   |
| 1:47AM INF mux - metrics: Stopping mux metrics updater                                                                                                                                     |
| 1:47AM INF Retrying connection in 16s seconds error="DialContext error: dial tcp 198.41.200.193:7844: i/o timeout" connIndex=0                                  |
| 1:47AM ERR supervisor: Tunnel disconnected error="DialContext error: dial tcp 198.41.200.193:7844: i/o timeout"                                                  |
| 1:47AM ERR Connection handshake with edge server failed error="Handshake error: 1000 handshake timeout" connIndex=3                                     |
| 1:47AM INF Retrying connection in 4s seconds error="Handshake error: 1000 handshake timeout" connIndex=3                                                        |
| 1:48AM ERR Serve tunnel error error="failed to register connection with backend" connIndex=3                                                            |
| 1:48AM INF Retrying connection in 8s seconds error="failed to register connection with backend" connIndex=3                                                     |
| 1:48AM INF mux - metrics: Stopping mux metrics updater                                                                                                                                     |
| 1:48AM INF Connection cfcbd672-0153-4e59-8cd5-c7b081eaed39 registered connIndex=3 location=SYD                                                                           |
| 1:49AM INF Retrying connection in 1s seconds error="DialContext error: dial tcp 198.41.200.233:7844: i/o timeout" connIndex=0                                   |
| 1:49AM INF Retrying connection in 1s seconds error="DialContext error: dial tcp 198.41.200.113:7844: i/o timeout" connIndex=2                                   |
| 1:49AM INF mux - metrics: Stopping mux metrics updater                                                                                                                                     |
| 1:49AM INF Retrying connection in 1s seconds error="DialContext error: dial tcp 198.41.192.167:7844: i/o timeout" connIndex=1                                   |
| 1:49AM ERR Serve tunnel error error="muxer shutdown" connIndex=3                                                                                        |
| 1:49AM INF Retrying connection in 1s seconds error="muxer shutdown" connIndex=3                                                                                 |
| 1:50AM INF Retrying connection in 2s seconds error="DialContext error: dial tcp 198.41.192.167:7844: i/o timeout" connIndex=1                                   |
| 1:50AM INF Retrying connection in 2s seconds error="TLS handshake with edge error: read tcp 172.17.10.146:33086->198.41.200.233:7844: i/o timeout" connIndex=0  |
| 1:50AM INF Retrying connection in 2s seconds error="DialContext error: dial tcp 198.41.192.107:7844: i/o timeout" connIndex=3                                   |
| 1:50AM ERR Connection handshake with edge server failed error="Handshake error: 1000 handshake timeout" connIndex=1                                     |
| 1:50AM INF Retrying connection in 4s seconds error="Handshake error: 1000 handshake timeout" connIndex=1                                                        |
| 1:50AM ERR Serve tunnel error error="failed to register connection with backend" connIndex=0                                                            |
| 1:50AM INF Retrying connection in 4s seconds error="failed to register connection with backend" connIndex=0                                                     |
| 1:50AM INF mux - metrics: Stopping mux metrics updater                                                                                                                                     |
| 1:50AM ERR Serve tunnel error error="cannot replace existing connection" connIndex=2                                                                    |
| 1:50AM INF Retrying connection in 2s seconds error="cannot replace existing connection" connIndex=2                                                             |
| 1:50AM INF mux - metrics: Stopping mux metrics updater                                                                                                                                     |
| 1:50AM ERR Serve tunnel error error="failed to register connection with backend" connIndex=3                                                            |
| 1:50AM INF Retrying connection in 4s seconds error="failed to register connection with backend" connIndex=3                                                     |
| 1:50AM INF mux - metrics: Stopping mux metrics updater                                                                                                                                     |
| 1:51AM INF Retrying connection in 8s seconds error="TLS handshake with edge error: read tcp 172.17.10.146:33188->198.41.200.233:7844: i/o timeout" connIndex=0  |
| 1:52AM ERR Serve tunnel error error="Application error: 3004 timeout waiting for initial response headers" connIndex=1                                  |
| 1:52AM INF mux - metrics: Stopping mux metrics updater                                                                                                                                     |
| 1:52AM ERR Serve tunnel error error="write tcp 172.17.10.146:50756->198.41.200.113:7844: write: broken pipe" connIndex=2                                |
| 1:52AM INF Retrying connection in 4s seconds error="write tcp 172.17.10.146:50756->198.41.200.113:7844: write: broken pipe" connIndex=2                         |
| 1:52AM INF mux - metrics: Stopping mux metrics updater                                                                                                                                     |
| 1:52AM INF Retrying connection in 16s seconds error="DialContext error: dial tcp 198.41.200.233:7844: i/o timeout" connIndex=0                                  |
| 1:52AM INF Retrying connection in 8s seconds error="Application error: 3004 timeout waiting for initial response headers" connIndex=1                           |
| 1:52AM INF mux - read: read tcp 172.17.10.146:57018->198.41.192.107:7844: read: connection reset by peer: frame read error                                                                 |
| 1:52AM INF mux - metrics: Stopping mux metrics updater                                                                                                                                     |
| 1:52AM ERR supervisor: Tunnel disconnected error="DialContext error: dial tcp 198.41.200.233:7844: i/o timeout"                                                  |
| 1:52AM INF Retrying connection in 8s seconds error="DialContext error: dial tcp 198.41.200.113:7844: i/o timeout" connIndex=2                                   |
| 1:52AM ERR Serve tunnel error error="Application error: 3005 connection closed while waiting for initial response headers" connIndex=3                  |
| 1:52AM INF Retrying connection in 8s seconds error="Application error: 3005 connection closed while waiting for initial response headers" connIndex=3           |
| 1:53AM INF Retrying connection in 16s seconds error="DialContext error: dial tcp 198.41.192.167:7844: i/o timeout" connIndex=1                                  |
| 1:53AM ERR supervisor: Tunnel disconnected error="DialContext error: dial tcp 198.41.192.167:7844: i/o timeout"                                                  |
| 1:53AM INF Retrying connection in 16s seconds error="TLS handshake with edge error: read tcp 172.17.10.146:57138->198.41.192.107:7844: i/o timeout" connIndex=3 |
| 1:53AM ERR supervisor: Tunnel disconnected error="TLS handshake with edge error: read tcp 172.17.10.146:57138->198.41.192.107:7844: i/o timeout"                 |
| 1:54AM INF mux - metrics: Stopping mux metrics updater                                                                                                                                     |
| 1:55AM ERR Serve tunnel error error="cannot replace existing connection" connIndex=2                                                                    |
| 1:55AM INF Retrying connection in 16s seconds error="cannot replace existing connection" connIndex=2                                                            |
| 1:55AM ERR supervisor: Tunnel disconnected error="cannot replace existing connection"                                                                            |
| 1:55AM INF Retrying connection in 1s seconds error="DialContext error: dial tcp 198.41.192.7:7844: i/o timeout" connIndex=3                                     |
| 1:55AM INF Retrying connection in 1s seconds error="DialContext error: dial tcp 198.41.200.53:7844: i/o timeout" connIndex=0                                    |
| 1:56AM INF Retrying connection in 1s seconds error="DialContext error: dial tcp 198.41.192.227:7844: i/o timeout" connIndex=1                                   |
| 1:56AM ERR Connection failed error=EOF connIndex=3                                                                                                      |
| 1:56AM ERR supervisor: Tunnel disconnected error=EOF                                                                                                             |
| 1:56AM INF Retrying connection in 2s seconds error="TLS handshake with edge error: read tcp 172.17.10.146:58252->198.41.192.227:7844: i/o timeout" connIndex=1  |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Try update to 2021.1.5.

Blockquote
Try update to 2021.1.5.

Thanks. Actually, it has been running 2021.1.5 since the containing task was restarted at around 2021-01-21 05:00 UTC, but the same trend has been evident since then (as seen in original graph)

1 Like

I think I should put @SamRhea in the loop?

1 Like

Hi @jon14,

The error dial tcp 198.41.200.233:7844: i/o timeout suggests there is a connectivity problem between your origin and our network. Have you tried running traceroute to our server IPs? You can also try our new protocol implementation. To opt in to this feature, add protocol: http2 to your config.

2 Likes

Hi,

I can’t explain why the connectivity issue existed since nothing has changed about the network in that regard for a while and since restarting the tunnel there have been no errors of that kind.

Since restarting the tunnels on 2021-01-25 the log messages have not reappeared and the memory utilisation now appears stable as per the updated graph

Some additional notes: the tunnel is running inside a docker image. Previously the image had been built with 2020.11.11 and so when it restarted the tunnel would update itself automatically to the current cloudflared version. Now that image has 2021.1.5 it does not do that (for now, at least).

The other thing is, I run with two origin end points, but I don’t have a deterministic way to assign a particular tunnel id to each end point, so I list all the tunnel ids and choose one that is not in use. This may lead to some race conditions that contributed to the issue, particularly in the software upgrade case. When I started the servers this time, I shutdown both origins completely, then restarted both simultaneously. If race conditions were the cause (and that is big if), It appears that the race conditions may have been avoided in this case, so I got a clean restart.

I will see what happens next time cloudflared updates itself automatically since this will more closely resemble the original circumstance.

Have you seen https://developers.cloudflare.com/argo-tunnel/configuration/ingress? You might be able to use our new multi origin support to serve both endpoints with just one tunnel.

Thanks Chungting.

Actually, I have an nginx proxy which acts as the origin for each tunnel. I need two tunnel instances to provide availability of the tunnel service itself across multiple AWS AZ’s - each tunnel needs to run in its own AZ so that if I lose an AZ, I don’t lose access to the tunnel service.