Increased NGINX 499s when proxying IPv6 (and IPv4) through CloudFlare

The following plot describes what we are seeing. At approximately 9:26 we began proxying 5% of our traffic through CloudFlare (95% is DNS only). At approximately 9:58 we disabled IPv6 support in the Cloudflare dashboard. This greatly reduces the quantity of errors, but does not completely fix the problem.

We know that most IPv4/IPv6 traffic resolves just fine. But for some reason we are getting quite a few 499s (client closed the connection before the server answered) – roughly 0.25% of all proxied requests.

This seems to predominantly be an IPv6 issue, but even with IPv6 disabled, we see 499s much more frequently than expected – they were extremely rare before we orange-clouded 5% of our traffic.

Here is a sample nginx error log entry:

2021/07/27 16:53:45 [warn] 2217#2217: *106430290 upstream server temporarily disabled while reading response header from upstream, client: 10.8.0.38, server: localhost, request: "PUT /api/some/endpoint HTTP/1.1", upstream: "http://127.0.0.1:2993/api/some/endpoint", host: "proxied.sub.domain", referrer: "https://domain.com/path"
2021/07/27 16:53:45 [error] 2217#2217: *106430290 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.8.0.38, server: localhost, request: "PUT /api/some/endpoint HTTP/1.1", upstream: "http://127.0.0.1:2993/api/some/endpoint", host: "proxied.sub.domain", referrer: "https://domain.com/path"

Here is the corresponding nginx access log entry:

10.8.0.38 - - [27/Jul/2021:16:53:45 +0000] "PUT /api/some/endpoint HTTP/1.1" 499 0 "https://domain.com/path" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) GSA/170.0.386351093 Mobile/15E148 Safari/604.1" "174.210.XXX.11, 162.158.XXX.16" "174.210.XXX.11" "174.210.XXX.11" "proxied.sub.domain" sn="localhost" rt=60.000 ua="127.0.0.1:2993, 127.0.0.1:2994" us="504, -" ut="60.003, 0.000" ul="0, 0" cs=-
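To make the tail of that access-log line easier to read, here is a minimal Python sketch that splits the key=value fields and pairs up each upstream attempt. The log_format itself is not shown in this thread, so treating `us` and `ut` as the upstream status and upstream response time is an assumption (only `ua` is described above as the upstream address); the helper names `parse_fields` and `upstream_attempts` are hypothetical.

```python
import re

# Key=value tail copied from the sample access-log entry above.
LOG_TAIL = (
    'sn="localhost" rt=60.000 ua="127.0.0.1:2993, 127.0.0.1:2994" '
    'us="504, -" ut="60.003, 0.000" ul="0, 0" cs=-'
)


def parse_fields(tail):
    """Return {key: value} for key=value and key="value" pairs, quotes stripped."""
    pairs = re.findall(r'(\w+)=("([^"]*)"|\S+)', tail)
    return {key: (inner if raw.startswith('"') else raw) for key, raw, inner in pairs}


def upstream_attempts(fields):
    """Pair each upstream address with its status and time (assumed meanings)."""
    addrs = [a.strip() for a in fields["ua"].split(",")]
    statuses = [s.strip() for s in fields["us"].split(",")]
    # A "-" means no value was recorded for that attempt.
    times = [None if t.strip() == "-" else float(t) for t in fields["ut"].split(",")]
    return list(zip(addrs, statuses, times))


for addr, status, seconds in upstream_attempts(parse_fields(LOG_TAIL)):
    print(addr, status, seconds)
```

Read this way, the first attempt (port 2993) ran for about 60 seconds and recorded a 504, and a second upstream (port 2994) was listed with no status and zero time. That would be consistent with nginx moving on to the next upstream after a timeout, which is its default proxy_next_upstream behavior, just as the client gave up, but this is only one possible reading of the entry.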

May I ask whether these requests to the Nginx upstreams on ports 2993 and/or 2994 are proxied from the inside to the outside network over one of the ports that Cloudflare supports?

I am not 100% sure, but I suppose the “PUT” request should be available through the orange cloud.

Furthermore, 499 is an Nginx error.
Do you have some kind of “submit” form where the user may have double-clicked?

May I also ask whether you are using a JS framework like Angular, or maybe Python?

May I ask whether you have only A, only AAAA, or both A and AAAA records set up in the DNS tab of the Cloudflare dashboard for your domain?

Maybe you would need to check if the requests can be passed in and out via IPv6 on the origin host/server, if so, over the correct port? (like IPv6 forwarding, etc.)

What timeout value do you have set up?
Maybe the request ran for longer than 100 seconds, since Cloudflare typically waits 100 seconds for an HTTP response?
Or, between the upstream ports, maybe the request timed out on one, moved on to the second, and so on, meaning it should be fixed on the origin host/server rather than on Cloudflare?

How about specifying proxy_pass http://127.0.0.1:the_port_here; or proxy_read_timeout 120;?

May I also ask whether IPv4 is enabled alongside IPv6 on your server?
If so, maybe there is some confusion because the IPs count as a few different servers instead of a single one? (Or maybe you really are using a few different servers on different IPs.)

Another thought regarding a “submit form”: the client may send the data without caring what happens to it or what the response will be, while the application still needs to process that data. So maybe the data simply does not have time to reach your application?
It is not the best way to go, but have you tried proxy_ignore_client_abort on;?

I am just guessing.

Yeah, to describe the system a bit more: we have a web-based client (Angular 10) that makes XHR requests to proxied.sub.domain, where proxied.sub.domain is a CNAME DNS record that points to an AWS Load Balancer. The load balancer provides TLS termination and forwards the request to a server running NGINX, configured as a reverse proxy to multiple node instances running on ports 2993, 2994, etc.

So:
Angular → CF → AWS ALB → EC2 NGINX → localhost:(2993, 2994, etc)

We use a cookie for sticky session management – so I am not sure why I am seeing multiple upstream nodes in the “upstream address” log (ua=“127.0.0.1:2993, 127.0.0.1:2994”)

I should mention that the setup without errors works like:
Angular → AWS ALB → EC2 NGINX → localhost:(2993, 2994, etc)

May I ask, have you already tried looking into this?:

Not sure I’m going to figure this one out. But here is an update: with about 50% of our traffic orange-clouded through CloudFlare, we consistently have a few 499s per minute where we used to have 0.

For perspective, we are doing about 14k requests/minute through CF, which puts our 499 error rate at about 0.02%. This is probably fine, and given the retry/failover capabilities we have in place, impact to the user should be minimal, but naturally, it would feel a lot better if we stayed at 0 errors.
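As a quick check of that figure, here is a small Python sketch of the arithmetic. The exact count behind “a few 499s per minute” is not given, so the value 3 is an assumption used purely for illustration, and `error_rate_percent` is a hypothetical helper name.

```python
def error_rate_percent(errors_per_minute, requests_per_minute):
    """Return the share of requests that ended in an error, as a percentage."""
    return errors_per_minute / requests_per_minute * 100


requests_per_minute = 14_000  # "about 14k requests/minute" from the post
errors_per_minute = 3         # assumption: "a few" taken as 3

rate = error_rate_percent(errors_per_minute, requests_per_minute)
print(f"{rate:.3f}%")  # prints 0.021%
```

With those inputs the rate comes out at roughly 0.02%, which matches the estimate above.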
