The following plot shows what we are seeing. At approximately 9:26 we began proxying 5% of our traffic through Cloudflare (the other 95% is DNS only). At approximately 9:58 we disabled IPv6 support in the Cloudflare dashboard. This greatly reduced the number of errors, but did not completely fix the problem.
We know that most IPv4 and IPv6 traffic goes through just fine. But for some reason we are getting quite a few 499s (the client closed the connection before the server responded) – roughly 0.25% of all proxied requests.
This seems to be predominantly an IPv6 issue, but even with IPv6 disabled we see 499s far more often than expected – they were extremely rare before we orange-clouded 5% of our traffic.
Here is a sample pair of nginx error log entries, both for the same request:
2021/07/27 16:53:45 [warn] 2217#2217: *106430290 upstream server temporarily disabled while reading response header from upstream, client: 10.8.0.38, server: localhost, request: "PUT /api/some/endpoint HTTP/1.1", upstream: "http://127.0.0.1:2993/api/some/endpoint", host: "proxied.sub.domain", referrer: "https://domain.com/path"
2021/07/27 16:53:45 [error] 2217#2217: *106430290 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.8.0.38, server: localhost, request: "PUT /api/some/endpoint HTTP/1.1", upstream: "http://127.0.0.1:2993/api/some/endpoint", host: "proxied.sub.domain", referrer: "https://domain.com/path"
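For context on how these two lines relate: the [error] is nginx hitting its read timeout while waiting for the upstream's response header, and the [warn] is nginx's passive health check (max_fails/fail_timeout) marking that upstream as temporarily unavailable as a result. The snippet below is a simplified sketch, not our exact config, showing the directives (with nginx's default values) that govern this behaviour; the two-server upstream block is inferred from the ua= field in the access log entry further down.

# Simplified sketch using nginx defaults, not the real configuration.
upstream api_backend {
    # Matches ua="127.0.0.1:2993, 127.0.0.1:2994" in the access log below.
    # max_fails/fail_timeout are the passive health check: after one failed
    # attempt the server is "temporarily disabled" for 10s, which is what
    # the [warn] line reports.
    server 127.0.0.1:2993 max_fails=1 fail_timeout=10s;
    server 127.0.0.1:2994 max_fails=1 fail_timeout=10s;
}

server {
    server_name localhost;

    location /api/ {
        proxy_pass http://api_backend;

        # 60s is the default; it matches the "(110: Connection timed out)"
        # [error] line and the 60-second upstream time in the access log.
        proxy_read_timeout 60s;

        # Default policy: on an error or timeout, retry the request on the
        # next server in the upstream group.
        proxy_next_upstream error timeout;
    }
}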
Here is the corresponding nginx access log entry:
10.8.0.38 - - [27/Jul/2021:16:53:45 +0000] "PUT /api/some/endpoint HTTP/1.1" 499 0 "https://domain.com/path" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) GSA/170.0.386351093 Mobile/15E148 Safari/604.1" "174.210.XXX.11, 162.158.XXX.16" "174.210.XXX.11" "174.210.XXX.11" "proxied.sub.domain" sn="localhost" rt=60.000 ua="127.0.0.1:2993, 127.0.0.1:2994" us="504, -" ut="60.003, 0.000" ul="0, 0" cs=-
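For reference when reading the trailing key=value fields: the access log uses a custom log_format along the lines of the NGINX Amplify "main_ext" format. The sketch below is that base format; it omits the two extra client-IP fields that appear in our line before the host, since their variables cannot be read off the output alone.

log_format main_ext '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for" '
                    '"$host" sn="$server_name" '
                    'rt=$request_time '
                    'ua="$upstream_addr" us="$upstream_status" '
                    'ut="$upstream_response_time" ul="$upstream_response_length" '
                    'cs=$upstream_cache_status';

Reading the sample line with that in mind: rt=60.000 is the total request time, ua shows that nginx tried both local upstreams, us="504, -" means the first attempt ended in a gateway timeout while the retry never got a status back, ut="60.003, 0.000" shows that essentially all of the time was spent waiting on the first upstream, and the final 499 is logged because the client gave up before any of that finished.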