Argo Tunnel Kubernetes ingress: “Server error: Cannot authenticate user credentials”


#1

New error today; all tunnels are down:

time=“2018-09-11T18:02:40Z” level=info msg=“Connected to CDG”
time=“2018-09-11T18:02:40Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-11T18:02:40Z” level=error msg=“Register tunnel error from server side” connectionID=0 error=“Server error: Cannot authenticate user credentials”
time=“2018-09-11T18:02:40Z” level=info msg=“Connected to CDG”
time=“2018-09-11T18:02:40Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-11T18:02:40Z” level=error msg=“Register tunnel error from server side” connectionID=0 error=“Server error: Cannot authenticate user credentials”
time=“2018-09-11T18:02:41Z” level=info msg=“Connected to CDG”
time=“2018-09-11T18:02:41Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-11T18:02:41Z” level=error msg=“Register tunnel error from server side” connectionID=0 error=“Server error: Cannot authenticate user credentials”
time=“2018-09-11T18:02:44Z” level=info msg=“Connected to CDG”
time=“2018-09-11T18:02:44Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-11T18:02:44Z” level=error msg=“Register tunnel error from server side” connectionID=0 error=“Server error: Cannot register internal hostname”


#2

We saw similar problems around this time connecting to LHR. We have two k8s clusters: one died, then a little over an hour later the other went down with all sorts of errors. It’s late here now, but I’ll dig out the logs in the morning and paste them to this thread.

Argo Tunnel via cloudflared into a Docker container with no load balancers has been rock solid. Ingress with load balancers - seemingly quite unstable. It would be interesting to compare notes offline, or to find out who at CF can help with this (we already tried support but didn’t get very far).


#3

PS: are you also seeing random disconnects with no reconnect, as in this thread: https://github.com/cloudflare/cloudflare-ingress-controller/issues/40


#4

We saw this from LHR a little earlier than your logs:

time=“2018-09-11T17:10:25Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-11T17:10:25Z” level=info msg=“Muxer shutdown” connectionID=0
time=“2018-09-11T17:10:25Z” level=info msg=“Retrying in 1s seconds”
time=“2018-09-11T17:10:36Z” level=error msg=“Unable to dial edge” error=“Handshake with edge error: read tcp 10.2.2.13:39330->198.41.192.227:7844: read: connection reset by peer”
time=“2018-09-11T17:10:36Z” level=info msg=“Retrying in 2s seconds”
time=“2018-09-11T17:10:43Z” level=info msg=“Connected to LHR”
time=“2018-09-11T17:10:44Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-11T17:10:44Z” level=error msg=“Register tunnel error from server side” connectionID=0 error=“Server error: Cannot authenticate user credentials”
time=“2018-09-11T17:10:44Z” level=info msg=“Retrying in 4s seconds”
time=“2018-09-11T17:10:53Z” level=info msg=“Connected to LHR”
time=“2018-09-11T17:10:53Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-11T17:10:53Z” level=error msg=“Register tunnel error from server side” connectionID=0 error=“Server error: Cannot authenticate user credentials”
time=“2018-09-11T17:10:53Z” level=info msg=“Retrying in 8s seconds”
time=“2018-09-11T17:11:07Z” level=info msg=“Connected to LHR”
time=“2018-09-11T17:11:07Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-11T17:11:07Z” level=error msg=“Register tunnel error from server side” connectionID=0 error=“Server error: Cannot authenticate user credentials”
time=“2018-09-11T17:11:07Z” level=info msg=“Retrying in 16s seconds”
time=“2018-09-11T17:11:29Z” level=info msg=“Connected to LHR”
time=“2018-09-11T17:11:29Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-11T17:11:29Z” level=error msg=“Register tunnel error from server side” connectionID=0 error=“Server error: Cannot authenticate user credentials”

That ingress controller then gave up trying to connect, consistent with reports of retries failing after disconnects. Later, a second ingress controller on a different k8s cluster, but in the same pool, logged:

time=“2018-09-11T18:58:45Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-11T18:58:45Z” level=info msg=“Muxer shutdown” connectionID=0
time=“2018-09-11T18:58:45Z” level=info msg=“Retrying in 1s seconds”
time=“2018-09-11T18:59:57Z” level=info msg=“Connected to LHR”
time=“2018-09-11T18:59:57Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-11T18:59:57Z” level=error msg=“Register tunnel error from server side” connectionID=0 error=“Server error: error adding origin to existing pool XXXXX: authentication error: response: {\n “result”: null,\n “success”: false,\n “errors”: [\n {\n “code”: 1002,\n “message”: “no DNS records returned: validation failed”\n }\n ],\n “messages”: []\n}\n”
time=“2018-09-11T18:59:57Z” level=info msg=“Retrying in 2s seconds”
time=“2018-09-11T19:01:03Z” level=info msg=“Connected to LHR”
time=“2018-09-11T19:01:03Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-11T19:01:03Z” level=error msg=“Register tunnel error from server side” connectionID=0 error=“Server error: error adding origin to existing pool XXXXX: authentication error: response: {\n “result”: null,\n “success”: false,\n “errors”: [\n {\n “code”: 1002,\n “message”: “no DNS records returned: validation failed”\n }\n ],\n “messages”: []\n}\n”
time=“2018-09-11T19:01:03Z” level=info msg=“Retrying in 4s seconds”
time=“2018-09-11T19:01:23Z” level=info msg=“Connected to LHR”
time=“2018-09-11T19:01:23Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-11T19:01:23Z” level=error msg=“Register tunnel error from server side” connectionID=0 error=“Server error: Cannot register internal hostname”
time=“2018-09-11T19:01:23Z” level=info msg=“Retrying in 8s seconds”
time=“2018-09-11T19:02:43Z” level=info msg=“Connected to LHR”
time=“2018-09-11T19:02:43Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-11T19:02:43Z” level=error msg=“Register tunnel error from server side” connectionID=0 error=“Server error: error adding origin to existing pool XXXXX: authentication error: response: {\n “result”: null,\n “success”: false,\n “errors”: [\n {\n “code”: 1002,\n “message”: “no DNS records returned: validation failed”\n }\n ],\n “messages”: []\n}\n”
time=“2018-09-11T19:02:43Z” level=info msg=“Retrying in 16s seconds”
time=“2018-09-11T19:04:04Z” level=info msg=“Connected to LHR”
time=“2018-09-11T19:04:04Z” level=info msg=“Tunnel ID: XXXXX”
time=“2018-09-11T19:04:04Z” level=info msg=“Route propagating, it may take up to 1 minute for your new route to become functional”

There seem to be two different sets of issues here. The first is that the ingress controller gives up too quickly on retries after a tunnel drops (which appears to be a known issue with 0.5.2). The second is related to Argo and load balancers when creating an entry in a pool after a disconnect. Last week we saw entries being created with a weight of zero, which apparently can result in the load balancer not sending any traffic to the pool entry.
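For reference, the retry pattern visible in the logs above (1s, 2s, 4s, 8s, 16s, then giving up) looks like exponential backoff with a retry cap. Below is a minimal Python sketch of that pattern. It is not cloudflared’s actual implementation; the function names and the default cap of 5 retries are assumptions made for illustration.

```python
import time
from typing import Callable, List


def backoff_schedule(max_retries: int, base_seconds: float = 1.0) -> List[float]:
    """Delays before each retry: base, 2*base, 4*base, ..."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_retries)]


def register_with_retries(
    register: Callable[[], bool],
    max_retries: int = 5,  # assumed cap, chosen to match the 1s..16s seen in the logs
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call register() until it succeeds or the retries are exhausted."""
    if register():
        return True
    for delay in backoff_schedule(max_retries):
        sleep(delay)
        if register():
            return True
    # Retries exhausted: the caller has to restart the process or raise an alert,
    # otherwise the tunnel stays down even after the server side recovers.
    return False
```

With a cap like this, a server-side outage longer than the sum of the delays (31 seconds here) leaves the client permanently disconnected unless something restarts it, which would match the behaviour we saw.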

Happy to provide further logs if someone from CF wants to reach out via email.


#5

“Ingress with load balancers - seemingly quite unstable”

Yes. In my experience running ingress without Cloudflare load balancing, it is also quite unstable. I have a watcher script that deletes the pod when it detects that my service is unavailable. It’s brutal, but it improves uptime by reducing outages to less than a minute.

In my case it does not drop all tunnels at the same time, so I’m considering running an Argo instance per service to reduce each service’s exposure to the restarts.
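The watcher approach mentioned above can be sketched roughly as follows. This is a minimal illustration and not the actual script: the health-check URL, namespace, label selector, probe interval, and the three-consecutive-failures threshold are all hypothetical, and it assumes `kubectl` is on the PATH and that a Deployment recreates the deleted pod.

```python
import subprocess
import time
import urllib.error
import urllib.request
from typing import Callable

# Hypothetical values for illustration only; substitute your own.
SERVICE_URL = "https://service.example.com/healthz"
NAMESPACE = "default"
POD_SELECTOR = "app=argo-tunnel-ingress"


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False


def delete_ingress_pods() -> None:
    """Delete the controller pod(s); the Deployment is expected to recreate them."""
    subprocess.run(
        ["kubectl", "delete", "pod", "-n", NAMESPACE, "-l", POD_SELECTOR],
        check=False,
    )


def watch_once(check: Callable[[], bool], restart: Callable[[], None],
               failures: int, threshold: int = 3) -> int:
    """Run one probe and restart after `threshold` consecutive failures.

    Returns the updated consecutive-failure count.
    """
    if check():
        return 0
    failures += 1
    if failures >= threshold:
        restart()
        return 0
    return failures


def run_forever(interval: float = 10.0) -> None:
    """Probe the service every `interval` seconds until interrupted."""
    failures = 0
    while True:
        failures = watch_once(
            lambda: is_healthy(SERVICE_URL), delete_ingress_pods, failures
        )
        time.sleep(interval)
```

Calling `run_forever()` starts the loop. Requiring several consecutive failures before deleting the pod is a design choice to avoid restarting on a single slow response; with a 10 second interval and a threshold of 3, detection takes roughly 30 seconds.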


#6

Interesting - so ingress is unstable, period? Not great. Is this a supported product yet?


#7

Hi, PM for Argo Tunnels here! The original issue in this thread corresponds to an outage of our API reported here: https://www.cloudflarestatus.com/incidents/twq5h7r9p0b3

We know that customers depend on the reliability of both our API and Argo Tunnels. Both teams are working to improve reliability and ensure the same issue never occurs twice.


#8

Thanks Zack!


#9

Just had another major outage; see logs below. There is no mention of any issues on the status page, but the dash was throwing all sorts of errors suggesting a major API outage around billing and LB configuration. We’re also seeing intermittent loss of service on our regular (non-load-balanced, non-ingress, backup) Argo tunnel.

time=“2018-09-17T18:18:28Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-17T18:18:28Z” level=info msg=“Muxer shutdown” connectionID=0
time=“2018-09-17T18:18:28Z” level=info msg=“Retrying in 1s seconds”
time=“2018-09-17T18:19:59Z” level=info msg=“Connected to LHR”
time=“2018-09-17T18:19:59Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-17T18:19:59Z” level=error msg=“Register tunnel error from server side” connectionID=0 error=“Server error: registration error”
time=“2018-09-17T18:19:59Z” level=info msg=“Retrying in 2s seconds”
time=“2018-09-17T18:21:31Z” level=info msg=“Connected to LHR”
time=“2018-09-17T18:21:31Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-17T18:21:31Z” level=error msg=“Register tunnel error from server side” connectionID=0 error=“Server error: registration error”
time=“2018-09-17T18:21:31Z” level=info msg=“Retrying in 4s seconds”
time=“2018-09-17T18:23:05Z” level=info msg=“Connected to LHR”
time=“2018-09-17T18:23:05Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-17T18:23:05Z” level=error msg=“Register tunnel error from server side” connectionID=0 error=“Server error: registration error”
time=“2018-09-17T18:23:05Z” level=info msg=“Retrying in 8s seconds”
time=“2018-09-17T18:24:43Z” level=info msg=“Connected to LHR”
time=“2018-09-17T18:24:43Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-17T18:24:43Z” level=error msg=“Register tunnel error from server side” connectionID=0 error=“Server error: registration error”
time=“2018-09-17T18:24:44Z” level=info msg=“Retrying in 16s seconds”

[deleted pod]

time=“2018-09-17T18:28:14Z” level=info msg=“Connected to LHR”
time=“2018-09-17T18:28:14Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-17T18:28:14Z” level=error msg=“Register tunnel error from server side” connectionID=0 error=“Server error: Cannot register internal hostname”
time=“2018-09-17T18:28:14Z” level=info msg=“Retrying in 1s seconds”
time=“2018-09-17T18:29:45Z” level=info msg=“Connected to LHR”
time=“2018-09-17T18:29:45Z” level=info msg=“Stopping mux metrics updater” dir=metrics subsystem=mux
time=“2018-09-17T18:29:45Z” level=error msg=“Register tunnel error from server side” connectionID=0 .=info msg=“Retrying in 2s seconds”

rinse, repeat, etc.

@zack any updates on this? Still no mention on the status page in the 20 minutes since this started, which is pretty poor.


#10

Hi Alex, you’re absolutely right: there is an API issue at the moment that we are working to resolve. You can find the status page for it here: https://www.cloudflarestatus.com/incidents/q746ybtyb6q0


#11

Thanks Zack, yep I saw the status update was posted about an hour after we started seeing issues.