IPv6 timeouts appear to be racy



Hello,

Unfortunately I don’t have pcaps or permission to post the domain for which this was happening, but I believe that the 1.1.1.1 resolver isn’t handling network timeouts to nameservers properly.

For some period of time these two nameservers had AAAA records pointing to addresses that were not responsive to traffic:

ns1.syd2.hostingplatform.net.au
ns2.syd2.hostingplatform.net.au

Though the nameservers were not bound to their IPv6 interfaces, IPv4 was working fine at all times.

What was happening was that 1.1.1.1 was spuriously returning SERVFAIL responses for domains hosted with those nameservers. I observed this happening both in “sin” and “mel” POPs.

What should be happening is that 1.1.1.1 falls back to the IPv4 addresses in order to get an answer (or the query goes out in parallel to both IPv4 and IPv6 and the first response to arrive is used).

What is actually happening is SERVFAIL is being returned, which suggests that the IPv6 timeout is causing a context timeout for the entire query operation, not allowing an opportunity for the IPv4 query to succeed.
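
To make the suspected difference concrete, here is a minimal sketch using simulated queries. I don't know how 1.1.1.1 is actually implemented, so this is only an illustration of the hypothesis. The function names, the addresses (documentation ranges) and the timeout values are all made up for the example.

```python
import asyncio


async def query(address, delay, answer):
    """Stand-in for one upstream DNS query; `delay` simulates the network."""
    await asyncio.sleep(delay)
    return answer


async def resolve_shared_deadline(attempts, total_timeout):
    """Suspected behaviour: one deadline covers all attempts in sequence."""
    async def run_all():
        for address, delay, answer in attempts:
            return await query(address, delay, answer)
        return "SERVFAIL"

    try:
        return await asyncio.wait_for(run_all(), total_timeout)
    except asyncio.TimeoutError:
        # The dead IPv6 attempt used up the whole budget; IPv4 never ran.
        return "SERVFAIL"


async def resolve_per_attempt(attempts, attempt_timeout):
    """Expected behaviour: each address gets its own timeout, then fall back."""
    for address, delay, answer in attempts:
        try:
            return await asyncio.wait_for(
                query(address, delay, answer), attempt_timeout
            )
        except asyncio.TimeoutError:
            continue  # try the next address (here: IPv4)
    return "SERVFAIL"


async def resolve_parallel(attempts, timeout):
    """Alternative: query every address at once and use the first reply."""
    tasks = [asyncio.create_task(query(*attempt)) for attempt in attempts]
    done, pending = await asyncio.wait(
        tasks, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    return next(iter(done)).result() if done else "SERVFAIL"


if __name__ == "__main__":
    # Hypothetical nameserver: IPv6 address never answers in time, IPv4 does.
    attempts = [
        ("2001:db8::53", 1.0, "NOERROR"),
        ("192.0.2.53", 0.01, "NOERROR"),
    ]
    print("shared deadline:", asyncio.run(resolve_shared_deadline(attempts, 0.05)))
    print("per attempt:    ", asyncio.run(resolve_per_attempt(attempts, 0.05)))
    print("parallel:       ", asyncio.run(resolve_parallel(attempts, 0.05)))
```

With a shared deadline the unresponsive IPv6 address consumes the whole budget, which would match the SERVFAIL I was seeing. The other two variants still get an answer over IPv4.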

The problem of unresponsive IPv6 addresses with those nameservers is now fixed (as of less than an hour ago) but you should hopefully be able to replicate the basic scenario:

  1. Two nameservers
  2. Each allocated IPv4 and IPv6 addresses, both connected on the network
  3. Nameservers bound to the IPv4 interfaces
  4. Nameservers not bound to the IPv6 interfaces
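
To confirm that a test setup matches the scenario above, a probe along these lines could send one UDP query directly to every address of a nameserver and report which ones answer. This is a rough sketch using only the Python standard library. The helper names `build_query` and `probe` and the 2 second timeout are my own choices, and the query name is a placeholder since I can't post the real domain.

```python
import socket
import struct


def build_query(name, qtype=1, query_id=0x1234):
    """Encode a minimal DNS query (class IN, recursion not requested)."""
    # Header: id, flags=0, 1 question, no answer/authority/additional records.
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # qtype 1 = A record, class 1 = IN.
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def probe(server_name, qname, timeout=2.0):
    """Query each A/AAAA address of `server_name` once over UDP port 53."""
    results = {}
    infos = socket.getaddrinfo(server_name, 53, type=socket.SOCK_DGRAM)
    for family, _type, _proto, _canon, sockaddr in infos:
        with socket.socket(family, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            try:
                sock.sendto(build_query(qname), sockaddr)
                sock.recvfrom(512)
                results[sockaddr[0]] = "answered"
            except socket.timeout:
                results[sockaddr[0]] = "timed out"
            except OSError as exc:
                results[sockaddr[0]] = f"error: {exc}"
    return results
```

For example, `probe("ns1.syd2.hostingplatform.net.au", "example.com")` returns a dict keyed by address. In the broken state I would expect the IPv4 address to show "answered" and the IPv6 address to show "timed out" or an error.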
