NXDOMAIN response on valid records

Hi,

For about a day we have been experiencing failures when using the public Cloudflare DNS recursors for our domains.

We have a test running to make sure we are not experiencing a general problem in our authoritative nameservers (ns1.worldstream.nl, ns2.worldstream.com and ns3.worldstream.net). In this test we query the following public recursors:
Cloudflare
Google
Quad9
Freenom
OpenDNS

Cloudflare is the only one returning an NXDOMAIN answer. The problem occurs only intermittently.
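A check like the one described above could be sketched as follows. This is a minimal sketch and not the actual monitoring code: the function names dig_status, check_resolver and check_all are made up for illustration, and the resolver addresses in the example line are the well-known public addresses of Cloudflare, Google, Quad9 and OpenDNS.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative, not from the original test.

# Read dig output on stdin and print the response status
# (e.g. NOERROR, NXDOMAIN) taken from the ->>HEADER<<- line.
dig_status() {
    sed -n 's/^;; ->>HEADER<<-.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Query one resolver and print "<resolver> <status>".
# $1 = resolver address, $2 = record name, $3 = record type
check_resolver() {
    rc_name=$(dig +tries=1 +time=2 "$3" "$2" "@$1" | dig_status)
    # An empty status means dig printed no header (timeout, network error).
    printf '%s %s\n' "$1" "${rc_name:-NO_RESPONSE}"
}

# Run the same query against every resolver given after name and type.
# $1 = record name, $2 = record type, remaining arguments = resolvers
check_all() {
    name=$1
    rtype=$2
    shift 2
    for resolver in "$@"; do
        check_resolver "$resolver" "$name" "$rtype"
    done
}

# Example:
#   check_all test.auth.worldstream.nl TXT 1.1.1.1 1.0.0.1 8.8.8.8 9.9.9.9 208.67.222.222
```

Printing one line per resolver makes it easy to spot the case where a single recursor disagrees with the others.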

Any help to find the cause would be appreciated! Please let me know if I can provide more information about this issue.

Remi Frenay

$ dig TXT test.auth.worldstream.nl @1.1.1.1

; <<>> DiG 9.9.5-9+deb8u15-Debian <<>> TXT test.auth.worldstream.nl @1.1.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 26964
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1452
;; QUESTION SECTION:
;test.auth.worldstream.nl.      IN      TXT

;; AUTHORITY SECTION:
worldstream.nl.         543     IN      SOA     ns1.worldstream.nl. hostmaster.worldstream.nl. 1574169763 10800 3600 604800 3600

;; Query time: 2 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Tue Nov 19 16:04:03 CET 2019
;; MSG SIZE  rcvd: 104

$ dig TXT test.auth.worldstream.nl @1.0.0.1

; <<>> DiG 9.9.5-9+deb8u15-Debian <<>> TXT test.auth.worldstream.nl @1.0.0.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 64069
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1452
;; QUESTION SECTION:
;test.auth.worldstream.nl.      IN      TXT

;; ANSWER SECTION:
test.auth.worldstream.nl. 58    IN      TXT     "WorldStream"

;; Query time: 2 msec
;; SERVER: 1.0.0.1#53(1.0.0.1)
;; WHEN: Tue Nov 19 16:04:03 CET 2019
;; MSG SIZE  rcvd: 77

$ dig TXT test.auth.worldstream.nl @8.8.8.8

; <<>> DiG 9.9.5-9+deb8u15-Debian <<>> TXT test.auth.worldstream.nl @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 60651
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;test.auth.worldstream.nl.      IN      TXT

;; ANSWER SECTION:
test.auth.worldstream.nl. 59    IN      TXT     "WorldStream"

;; Query time: 21 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Tue Nov 19 16:04:03 CET 2019
;; MSG SIZE  rcvd: 77

$ dig +short CHAOS TXT id.server @1.1.1.1
"AMS"

$ dig +short CHAOS TXT id.server @1.0.0.1
"AMS"

$ dig @ns3.cloudflare.com whoami.cloudflare.com txt +short
"2a00:7c80::3"

$ traceroute 1.1.1.1
traceroute to 1.1.1.1 (1.1.1.1), 30 hops max, 60 byte packets
 1  93.190.136.30 (93.190.136.30)  0.235 ms 93.190.136.29 (93.190.136.29)  0.129 ms 93.190.136.30 (93.190.136.30)  0.273 ms
 2  109.236.95.226 (109.236.95.226)  1.029 ms 109.236.95.230 (109.236.95.230)  0.176 ms  0.236 ms
 3  109.236.95.106 (109.236.95.106)  1.284 ms  1.304 ms 109.236.95.167 (109.236.95.167)  1.152 ms
 4  ams-ix.as13335.net (80.249.211.140)  2.628 ms  2.682 ms  2.608 ms
 5  one.one.one.one (1.1.1.1)  1.681 ms  1.570 ms  1.704 ms

$ traceroute 1.0.0.1
traceroute to 1.0.0.1 (1.0.0.1), 30 hops max, 60 byte packets
 1  93.190.136.30 (93.190.136.30)  0.169 ms 93.190.136.29 (93.190.136.29)  0.193 ms  0.254 ms
 2  109.236.95.224 (109.236.95.224)  0.154 ms 109.236.95.230 (109.236.95.230)  0.145 ms  0.165 ms
 3  109.236.95.167 (109.236.95.167)  1.099 ms 109.236.95.106 (109.236.95.106)  1.289 ms 109.236.95.108 (109.236.95.108)  1.313 ms
 4  ams-ix.as13335.net (80.249.211.140)  2.705 ms  4.575 ms  4.561 ms
 5  one.one.one.one (1.0.0.1)  1.846 ms  1.854 ms  1.809 ms

I just attempted to purge the cache at https://1.1.1.1/purge-cache/; however, a subsequent request did return the expected TXT record.

Can you reproduce it consistently?

@dane @irtefa

Hi,

No, it can’t be reproduced consistently. We execute the query every 30 seconds in our monitoring system, and the last NXDOMAIN response was at 16:08:53 (GMT+1). All other responses returned the TXT record.

The number of Cloudflare request failures has already decreased, but the problem is not solved.

These are the last failures:
2019-11-20 20:29:07(GMT+1) Server: 2606:4700:4700::6400, Status: NXDOMAIN
2019-11-20 15:38:34(GMT+1) Server: 1.0.0.1, Status: NXDOMAIN
2019-11-20 08:28:34(GMT+1) Server: 1.0.0.1, Status: NXDOMAIN
2019-11-19 16:08:23(GMT+1) Server: 1.1.1.1, Status: NXDOMAIN
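
To see whether the failures cluster on one resolver address, log lines in the format above can be tallied per server. A minimal sketch, assuming the exact "Server: <address>, Status: <status>" layout shown here; the function name count_failures is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count NXDOMAIN lines per resolver address.
# Reads log lines on stdin, prints "<count> <server>", highest count first.
count_failures() {
    awk '$3 == "Server:" && $6 == "NXDOMAIN" {
             sub(/,$/, "", $4)    # strip the comma after the address
             n[$4]++
         }
         END { for (s in n) print n[s], s }' | sort -k1,1nr -k2,2
}

# Example:
#   count_failures < failures.log
```

For the four lines listed above this gives 2 failures for 1.0.0.1 and 1 each for 1.1.1.1 and 2606:4700:4700::6400, so the problem is not tied to a single Cloudflare address.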

We are running tcpdump on all nameservers to investigate this, and found that we don’t receive any request from Cloudflare on these servers at the moment the NXDOMAIN answer is returned.

See the example below. We captured the queries for 2606:4700:4700::6400 before and after the failure, but at the time of the NXDOMAIN result we didn’t receive a query.

20:28:37.835117 IP6 2400:cb00:20:1024::8d65:40d9.15414 > 2a01:7c8:d005:42a::1.53: 32800% [1au] TXT? cloudflare-4.auth.worldstream.nl. (61)

20:29:36.648838 IP6 2400:cb00:20:1024::a29e:6d44.24843 > 2a01:7c8:d005:42a::1.53: 46766% [1au] TXT? cloudflare-4.auth.worldstream.nl. (61)
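
The gap between a failure and the nearest captured query can be computed from the tcpdump timestamps. A minimal sketch, assuming the capture and the monitoring log use the same timezone and that both fall on the same day (no handling of midnight); the function name nearest_query_gap is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the distance in seconds between a failure
# time and the closest tcpdump line read from stdin.
# $1 = failure time as HH:MM:SS
nearest_query_gap() {
    LC_ALL=C awk -v fail="$1" '
        # Convert HH:MM:SS[.ffffff] to seconds since midnight.
        function secs(t,    p) {
            split(t, p, ":")
            return p[1] * 3600 + p[2] * 60 + p[3]
        }
        BEGIN { f = secs(fail); best = -1 }
        # Only lines that start with a tcpdump timestamp are considered.
        /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
            d = secs($1) - f
            if (d < 0) d = -d
            if (best < 0 || d < best) best = d
        }
        END { if (best >= 0) printf "%.3f\n", best }'
}

# Example:
#   nearest_query_gap 20:29:07 < capture.txt
```

For the two captured queries above and the failure at 20:29:07, the nearest query is roughly 29 seconds away. With a 30-second monitoring interval, that means the captured queries belong to the previous and the next run, not to the failing one.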

I can’t explain this behaviour. If Cloudflare were unable to reach our authoritative nameservers, I wouldn’t expect an NXDOMAIN result. And if there were a configuration issue on our side, I would also expect failures using other recursors like Google.

Regards,
Remi
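
One detail in the NXDOMAIN response from 1.1.1.1 earlier in the thread fits the observation that no query reached the authoritative servers: the SOA record in the AUTHORITY section has a remaining TTL of 543, while the SOA minimum field is 3600. A negative answer is cached for at most the lower of the SOA TTL and the SOA minimum, so a lower remaining TTL suggests the answer came from the resolver's cache. The sketch below estimates how long it may have been cached. It assumes the resolver started counting down from the SOA minimum, which is not confirmed here, so the result is an upper bound; the function name negative_cache_age is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read dig output on stdin, find the first SOA
# record and print (SOA minimum - remaining TTL) in seconds.
# Assumption: the negative TTL started at the SOA minimum (last field).
negative_cache_age() {
    awk '$3 == "IN" && $4 == "SOA" && NF >= 11 {
             age = $11 - $2
             print (age > 0 ? age : 0)
             exit
         }'
}

# Example:
#   dig TXT test.auth.worldstream.nl @1.1.1.1 | negative_cache_age
```

For the SOA line in that response this gives 3057 seconds, so under the stated assumption the negative answer could have been in the cache for up to about 51 minutes.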