Argo Kubernetes ingress intermittent 502 and connection refused


#1

Hi,

I’ve been testing Argo Ingress for Kubernetes on and off for a day or so on one of our nginx pods, and I’m seeing frequent 502 CF error pages which clear after a browser refresh or two in quick succession. Digging into the ingress and nginx logs, it seems that the 502s correspond to the connection refused entries, which in turn come after a keepalive connection is closed.

Argo ingress is 0.5.2, installed using Helm following the docs on the Cloudflare developers site. The Ingress definition is as per the example in the docs, with alterations to match the nginx service name, port, etc. I’ve run a port forward from the nginx pod to my local machine and see no similar issues in the browser after a connection is closed, so I don’t think there are any issues with the nginx pod itself - plus it was also happening with the httpbin pod deployed as per the docs.

Logs below from the Argo pod and the nginx pod. What I don’t have is an accurate record of me hitting refresh in the browser, but you can see where I eventually get a successful response.

time="2018-08-24T20:49:01Z" level=error msg="HTTP request error" error="dial tcp 10.3.0.167:8080: connect: connection refused"
time="2018-08-24T20:51:48Z" level=error msg="HTTP request error" error="dial tcp 10.3.0.167:8080: connect: connection refused"
time="2018-08-24T20:51:50Z" level=error msg="HTTP request error" error="dial tcp 10.3.0.167:8080: connect: connection refused"
time="2018-08-24T20:58:37Z" level=error msg="HTTP request error" error="dial tcp 10.3.0.167:8080: connect: connection refused"
time="2018-08-24T20:58:39Z" level=error msg="HTTP request error" error="dial tcp 10.3.0.167:8080: connect: connection refused"
time="2018-08-24T21:06:29Z" level=error msg="HTTP request error" error="dial tcp 10.3.0.167:8080: connect: connection refused"

10.2.1.14 - - [24/Aug/2018:20:49:01 +0000] "GET /robots.txt HTTP/1.1" 304 0 "-" "[REDACTED]"
10.2.1.14 - - [24/Aug/2018:20:49:02 +0000] "GET / HTTP/1.1" 302 5 "-" "[BROWSER REDACTED]"
10.2.1.14 - - [24/Aug/2018:20:49:03 +0000] "GET /login HTTP/1.1" 200 3244 "-" "[BROWSER REDACTED]"
10.2.1.14 - - [24/Aug/2018:20:49:04 +0000] "GET /login HTTP/1.1" 200 3244 "-" "[BROWSER REDACTED]"
10.2.1.14 - - [24/Aug/2018:20:49:06 +0000] "GET /login HTTP/1.1" 200 3244 "-" "[BROWSER REDACTED]"
10.2.1.14 - - [24/Aug/2018:20:49:07 +0000] "GET /login HTTP/1.1" 200 3244 "-" "[BROWSER REDACTED]"
10.2.1.14 - - [24/Aug/2018:20:49:21 +0000] "GET /login HTTP/1.1" 200 3244 "-" "[BROWSER REDACTED]"
2018/08/24 20:50:33 [info] 30#30: *379 client 10.2.1.14 closed keepalive connection
2018/08/24 20:50:51 [info] 30#30: *382 client 10.2.1.14 closed keepalive connection
10.2.1.14 - - [24/Aug/2018:20:51:51 +0000] "GET /login HTTP/1.1" 200 3244 "-" "[BROWSER REDACTED]"
10.2.1.14 - - [24/Aug/2018:20:51:52 +0000] "GET /login HTTP/1.1" 200 3244 "-" "[BROWSER REDACTED]"
10.2.1.14 - - [24/Aug/2018:20:51:55 +0000] "GET /login HTTP/1.1" 200 3244 "-" "[BROWSER REDACTED]"
2018/08/24 20:53:25 [info] 30#30: *387 client 10.2.1.14 closed keepalive connection
10.2.1.14 - - [24/Aug/2018:20:58:41 +0000] "GET /login HTTP/1.1" 200 3244 "-" "[BROWSER REDACTED]"
2018/08/24 21:00:11 [info] 30#30: *391 client 10.2.1.14 closed keepalive connection
10.2.1.14 - - [24/Aug/2018:21:03:33 +0000] "GET /login HTTP/1.1" 200 3244 "-" "[BROWSER REDACTED]"
10.2.1.14 - - [24/Aug/2018:21:03:34 +0000] "GET /login HTTP/1.1" 200 3244 "-" "[BROWSER REDACTED]"
10.2.1.14 - - [24/Aug/2018:21:04:02 +0000] "GET /login HTTP/1.1" 200 3244 "-" "[BROWSER REDACTED]"
2018/08/24 21:05:32 [info] 30#30: *393 client 10.2.1.14 closed keepalive connection
10.2.1.14 - - [24/Aug/2018:21:06:31 +0000] "GET /login HTTP/1.1" 200 3244 "-" "[BROWSER REDACTED]"

Any help appreciated.

Thanks,
Alex


#2

I think I’ve figured it out - the keepalive closing was a red herring. A bit more digging and a look at the IPs revealed we had a single pod deployed, but the service had two endpoints - and with no health check or readiness probe, I’m assuming the service was routing to the stale IP some of the time, which produced the intermittent connection refused errors.
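For anyone hitting something similar, the mismatch showed up in the service’s Endpoints object - two addresses behind the service with only one live pod. Roughly what `kubectl get endpoints <service> -o yaml` surfaced (the second IP below is illustrative, not from my cluster):

```yaml
# Illustrative sketch of the Endpoints mismatch - not a verbatim dump.
# 10.3.0.167 is the address the Argo logs failed to dial; the other
# address here is a made-up stand-in for the live pod's IP.
subsets:
  - addresses:
      - ip: 10.3.0.167   # stale entry - connection refused in the Argo logs
      - ip: 10.3.0.201   # live nginx pod (illustrative IP)
    ports:
      - port: 8080
        protocol: TCP
```

Comparing that against `kubectl get pods -o wide` (one pod, one IP) is what gave it away.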

Have fixed that now, so far so good.
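In case it helps anyone else, the fix was along these lines: a readiness probe on the nginx container so the service only routes to endpoints that are actually accepting connections. The path, port, and timings below are illustrative - match them to whatever your pod actually serves:

```yaml
# Illustrative snippet - goes in the container spec of the nginx Deployment.
# Port 8080 matches the service port from the logs above; /login is just
# an example of a path the pod responds to.
readinessProbe:
  httpGet:
    path: /login
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```

With the probe in place, a pod (or a stale endpoint) that refuses connections gets pulled out of the service’s endpoint list instead of receiving traffic.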