Unfortunately I don’t have pcaps or permission to post the domain for which this was happening, but I believe that the 18.104.22.168 resolver isn’t properly dealing with network timeouts of nameservers properly.
For some period of time these two nameservers had AAAA records that were not responsive to traffic:
Though the nameservers were not bound to their IPv6 interfaces, IPv4 was working fine at all times.
What was happening was that 22.214.171.124 was spuriously returning SERVFAIL responses for domains hosted with those nameservers. I observed this happening both in “sin” and “mel” POPs.
What should be happening is that 126.96.36.199 should be falling back to the IPv4 addresses in order to get an answer (or the query should go out in parallel to both IPv4 and IPv6 and the first response to arrive is used).
What is actually happening is SERVFAIL is being returned, which suggests that the IPv6 timeout is causing a context timeout for the entire query operation, not allowing an opportunity for the IPv4 query to succeed.
The problem of unresponsive IPv6 addresses with those nameservers is now fixed (as of less than an hour ago) but you should hopefully be able to replicate the basic scenario:
- Two nameservers
- Each allocated IPv4 and IPv6 addresses, both connected on the network
- Nameservers bound to the IPv4 interfaces
- Nameservers not bound to the IPv6 interfaces