WARP Reconnecting Loop After Entering Modern Sleep

It appears that after the machine enters modern sleep for some time, WARP will stop working when the machine comes back from sleep. Here’s the current configuration:

  • Cloudflare Zero Trust
  • Split Tunnel (Include only with CIDR ranges with DNS resolving to specific zones)
  • Windows 11 23H2 22631.3447
  • WARP Version 2024.3.409.0

When this happens, any sustained data transfer over a few kilobytes will cause the TUN driver to crash and reconnect. If the tray icon is visible, the cloud will go from orange to gray, and orange again. The WARP device in Network Connections will also disappear and reappear momentarily, causing any existing connections to be lost.

It will stay connected until sustained data transfer goes through again, which will cause the tunnel to die and reconnect. Small packets like DNS will go through without any issues, but things like TLS handshakes will generally cause the tunnel to die off (hence the tunnel in this state is almost unusable.)

The only way to stop the tunnel from dying this way is either to:

  • Under Preferences → click Connection → and Reset Encryption Keys (this appears to work most of the time, although I have had times when this will fail to work)
  • Go to Windows Services and restart the Cloudflare WARP service (this seems to always work even when the rekeying does not)
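For anyone who wants to script the second workaround, here is a minimal sketch. It assumes the service is registered under the name CloudflareWARP (an assumption; check with `Get-Service *warp*`) and that the script runs from an elevated prompt. `Restart-Service` is the standard PowerShell cmdlet; the helper names are made up for illustration.

```python
import subprocess
import sys

# Assumption: the WARP service is registered under this name on your machine.
SERVICE_NAME = "CloudflareWARP"


def restart_command(service=SERVICE_NAME):
    """Build the PowerShell invocation that restarts the given service."""
    return [
        "powershell",
        "-NoProfile",
        "-Command",
        f"Restart-Service -Name '{service}' -Force",
    ]


def restart_warp_service(service=SERVICE_NAME):
    """Run the restart; returns the PowerShell exit code (Windows only)."""
    if sys.platform != "win32":
        raise RuntimeError("This workaround only applies to Windows.")
    return subprocess.run(restart_command(service), check=False).returncode
```

Calling `restart_warp_service()` after resume does the same thing as restarting the service by hand in the Services console.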

This looks like a bug, possibly related to data stuck in a buffer that is not being handled properly.

When the tunnel disconnects, messages like these appear in the daemon log:

WARN main_loop: warp::warp::util: Cancelled Tunnel task experienced error task_name="Tunnel in/out loop" err=TunnelError(UdpSend(UdpFailContext { inner: Os { code: 10040, kind: Uncategorized, message: "A message sent on a datagram socket was larger than the internal message buffer or some other network limit, or the buffer used to receive a datagram into was smaller than the datagram itself." }, chunk_sizes: Some([16, 1280, 16]), total_size: 1312 }))
2024-04-20T17:10:18.002Z DEBUG main_loop: warp::warp_service: Reconnecting on connection error error=TunnelError(TunDriverStopped)
power_state=None disconnect_reason=Some(InternalError(Inflight(TunnelError(TunDriverStopped))))
DEBUG tunnel_loop{protocol="wireguard"}: warp_tun::win: Shutting down the wintun tunnel
DEBUG warp_tun::win: Stopping drive_read_wait_handle due to shutdown
DEBUG main_loop: warp::warp_service: Entering main loop arm arm="tunnel_taskset_errors_fut"

When the tunnel disconnects, the Some values will change slightly, but otherwise most of the disconnect/reconnection loop appears to be the same.

Here were the chunk sizes that appeared in the log:
[Image: chunk sizes]

Noting that the MTU of the CloudflareWARP device is 1280, I am guessing that any packet larger than 1248 bytes will cause the tunnel to crash. I am not quite sure where the extra 16-byte chunks are coming from, but I am guessing they have to do with some headers or pings that were stuck in the queue and couldn’t be processed properly.
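To compare chunk sizes across many disconnects, the relevant fields can be pulled out of the log with a few regular expressions. This is a rough sketch: `LOG_LINE` is an abbreviated copy of the WARN line above and `parse_udp_fail` is a made-up helper name. Two side notes: OS error 10040 is Winsock's WSAEMSGSIZE, and a 16 / payload / 16 layout also matches the shape of a WireGuard data message (16-byte header, payload, 16-byte authentication tag). That is only one possible reading of the extra chunks; the log does not confirm it.

```python
import re

# Abbreviated excerpt of the WARN line from the daemon log.
LOG_LINE = (
    "err=TunnelError(UdpSend(UdpFailContext { inner: Os { code: 10040, "
    "kind: Uncategorized }, chunk_sizes: Some([16, 1280, 16]), "
    "total_size: 1312 }))"
)


def parse_udp_fail(line):
    """Extract the OS error code, chunk sizes and total size, or None."""
    code = re.search(r"code: (\d+)", line)
    chunks = re.search(r"chunk_sizes: Some\(\[([\d, ]+)\]\)", line)
    total = re.search(r"total_size: (\d+)", line)
    if not (code and chunks and total):
        return None
    return {
        "code": int(code.group(1)),
        "chunks": [int(c) for c in chunks.group(1).split(",")],
        "total": int(total.group(1)),
    }


info = parse_udp_fail(LOG_LINE)
# The chunks should add up to the reported total size.
print(info["chunks"], sum(info["chunks"]) == info["total"])
```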

I was playing around with this further, and it seems like the packet size was even smaller than expected. Using scapy under WSL2, I was able to craft up SYN packets of various sizes:

sr1(IP(dst="9.99.[ IP(dst=“9.99.0.1”)/TCP(dport=3389)/Raw(‘\x00’*x) for x in range(1000,1280) ], timeout=0.5, inter=0.5)

In Wireshark, I can see that the packet size stopped at around 1161 bytes (+40-byte header) = 1201 bytes overall before the TUN adapter crashes. Interestingly, this overall size is still less than the 1280-byte MTU.

It also doesn’t take that long for this problem to manifest itself. Putting the machine to sleep for more than an hour or two should be able to reproduce the issue. Not sure if that matters, but the DHCP lease is also quite short (~1 hour on IPv4 and 30 mins on IPv6).

Don’t know why the command was mangled here, but it should look like:

sr1([ IP(dst="9.99.0.1")/TCP(dport=3389)/Raw(b'\x00'*x) for x in range(1000,1280) ], timeout=0.5, inter=0.5)
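Every failing probe knocks the tunnel over, so a linear sweep from 1000 to 1279 can trigger a lot of reconnects. A bisection finds the same boundary in under ten probes. The sketch below is only an illustration: `largest_passing` and `fake_probe` are made-up names, and the stand-in probe simply simulates the 1160-byte boundary reported in this thread. A real probe would send one padded SYN (for example with Scapy) and return whether a reply came back. It also assumes the boundary stays put between probes, which may not hold if the tunnel recovers after repeated reconnects.

```python
def largest_passing(probe, lo, hi):
    """Return the largest n in [lo, hi] with probe(n) true, or None.

    Assumes probe is monotonic: true up to some threshold, false above it.
    """
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if probe(mid):
            best = mid      # mid got through; look for something larger
            lo = mid + 1
        else:
            hi = mid - 1    # mid failed; the boundary is below it
    return best


def fake_probe(padding):
    """Stand-in for a real SYN probe: simulates the observed boundary."""
    return padding <= 1160


# Same range as the Scapy sweep: padding sizes 1000..1279.
print(largest_passing(fake_probe, 1000, 1279))
```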

If I rekey the tunnel or restart the service, or sometimes seemingly get these errors to happen multiple times in a short period of time, the errors will go away until the next time the machine goes to sleep (i.e. I could send up to the MTU again without the tunnel crashing).

Here’s an example of a session with Scapy:

The first SYN request, which has 1160 bytes of null padding, receives a SYN, ACK response back. On the 2nd attempt, where 1161 bytes of null padding was added, the tunnel dies and the SYN packet does not receive any response. When this happens, the cloud goes from orange to grey and then back to orange. Each subsequent reconnect results in the same behavior. After 10 retries in a short period of time (within the same minute or so), the SYN request again receives a reply, and any subsequent packets of that size do not disrupt the connection until the machine is put back to sleep.

I’m experiencing the same behavior with the same messages in cfwarp_service_log. The only difference from your config is that I’m running Split Tunnel in Exclude Mode.

When WARP is reconnecting, two events are logged in the System event log from source NetBT with Event ID 4311:

Initialization failed because the driver device could not be created. Use the string "000000000100320000000000D71000C013010000250200C006000000000000000000000000000000" to identify the interface for which initialization failed. It represents the MAC address of the failed interface or the  Globally Unique Interface Identifier (GUID) if NetBT was unable to  map from GUID to MAC address. If neither the MAC address nor the GUID were  available, the string represents a cluster device name. 

and 

Initialization failed because the driver device could not be created. Use the string "000000000100320000000000D71000C011010000250200C007000000000000000000000000000000" to identify the interface for which initialization failed. It represents the MAC address of the failed interface or the  Globally Unique Interface Identifier (GUID) if NetBT was unable to  map from GUID to MAC address. If neither the MAC address nor the GUID were  available, the string represents a cluster device name. 

After restarting the Cloudflare WARP service, WARP connects successfully.

I believe the cfwarp_service_log is the same log that is exported when the warp-diag tool is used. That being said, I don’t think the NetBT stuff has any relevance here. You (and I) are getting these messages in the System log since the Wintun interface created during the connection isn’t designed for Layer 2 addressing (e.g. broadcasts) and NetBT couldn’t be initialized. I believe this message would appear regardless of whether or not the connection is working.

The most likely culprit seems to be that whenever the connection is paused for a reconnect, something causes data to be stuck in the buffer, which chews up some of the available buffer space. When a big enough packet tries to consume the entire MTU, it cannot send the data on top of whatever was stuck in the buffer, so the TUN driver crashes and WARP tries to reconnect.

The issue is that it doesn’t seem like the reconnection itself flushes the buffer, so the connection is stuck in a connection loop until somehow that buffer is flushed. I am guessing rekeying also had the side effect of flushing the buffer queues, and of course, restarting the service will do so as well.

Yup, that’s the same log.

I downgraded to 2024.2.187.0 yesterday. This version works fine for me and connects without issues when my computer comes back from sleep.

I’ll open a support request with Cloudflare tomorrow

Interesting that you were able to find a version that works. Do you happen to have a list of older versions somewhere?

sure, I got this version from App Center

Awesome - thanks for the tip.

I haven’t tried the old version myself yet, but if you are right, I am guessing the bug probably lies within this particular change:

If I had to venture a guess, the change probably introduced a bug where old TCP connection states persisted in the buffer, but in the case of a long system sleep where the original connections are no longer valid, it caused certain data structures not to be freed and therefore stuck in the buffer. This has the side effect of causing packets smaller than the 1280 WARP MTU to cause the wintun driver to crash and reconnect.

Hopefully, the Cloudflare team could figure this out and fix the issue. Thanks for testing this on your side.

good catch!

The issue just happened a few minutes ago on my machine. I gathered a fresh warp-diag set and sent this over to Cloudflare. I’m referencing this thread in the support request, too.

Are you seeing it on the old version too? Or do you still have machines on the new one?

So it appears I am seeing the same issue on the old version as well. I think the difference between the old and the new is that the old version flushes the buffer when reconnecting, so it will crash only once and then the subsequent connection will work, whereas the new client will just get stuck in a loop.

Support pointed out that the MTU on the Ethernet interface was set to 1280 for IPv6, while for IPv4 it is 1500.
I changed it to 1500 for AddressFamily IPv6 and WARP connects fine.

However, I’m still figuring out why it was set to 1280. On the Wi-Fi adapter both MTU values are 1500, and when using Wi-Fi, WARP connects fine when coming out of sleep.
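A quick back-of-the-envelope check suggests why a 1280-byte IPv6 MTU on the physical interface could matter. The sketch assumes the tunnel was carried over IPv6 on that Ethernet interface, and that each tunnelled packet gains a 16-byte WireGuard header plus a 16-byte authentication tag (which matches the 16 / 1280 / 16 chunk sizes in the log), an 8-byte UDP header, and a 40-byte IPv6 header. Neither assumption is confirmed in this thread. If they hold, the numbers line up with the boundary seen in the Scapy test: 1200 bytes got through, 1201 did not.

```python
# Per-packet overhead, in bytes (standard header sizes).
WG_HEADER = 16    # WireGuard data message header
WG_TAG = 16       # authentication tag appended to the encrypted payload
UDP_HEADER = 8
IPV6_HEADER = 40  # assumption: the outer transport is IPv6


def outer_size(inner_len, ip_header=IPV6_HEADER):
    """Size of the outer IP packet that carries an inner packet."""
    return inner_len + WG_HEADER + WG_TAG + UDP_HEADER + ip_header


def fits(inner_len, link_mtu, ip_header=IPV6_HEADER):
    """True if the encapsulated packet fits within the link MTU."""
    return outer_size(inner_len, ip_header) <= link_mtu


# 1200 and 1201 are the inner sizes on either side of the boundary;
# 1280 is the MTU of the CloudflareWARP adapter.
for inner in (1200, 1201, 1280):
    print(inner, outer_size(inner), fits(inner, 1280), fits(inner, 1500))
```

With a 1500-byte link MTU all three sizes fit, which is consistent with the report that raising the IPv6 MTU to 1500 made the problem go away.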

The MTU was probably set to 1280 to reduce the likelihood of IP fragmentation issues, since some networks block IP fragments. Since WireGuard is stateless, this will cause the tunnel to stall even when the display may show it’s connected. I’m not sure if Cloudflare does some kind of MTU discovery to set this value, or if it was arbitrarily picked out of the blue…