SERVFAIL / No reachable authority errors for common domains

As of this morning, I’m finding that Cloudflare DNS returns SERVFAIL for a range of well-known domains, for example:

  • mail.google.com
  • cloud.feedly.com

I get SERVFAIL for both plain port 53 queries and TLS queries (port 853). I’m in Hobart, Australia (Telstra). Interestingly, if I connect over cellular, the SERVFAIL errors go away. My cellular connection exits in Melbourne, Australia, so it could be hitting different Cloudflare DNS servers.

Example queries:

mail.google.com:

% dig mail.google.com @1.1.1.1      

; <<>> DiG 9.16.44-Debian <<>> mail.google.com @1.1.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 13229
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; EDE: 22 (No Reachable Authority)
;; QUESTION SECTION:
;mail.google.com.		IN	A

;; Query time: 8 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Fri Oct 20 10:40:00 AEDT 2023
;; MSG SIZE  rcvd: 50

cloud.feedly.com:

dig cloud.feedly.com @1.1.1.1  

; <<>> DiG 9.16.44-Debian <<>> cloud.feedly.com @1.1.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 30379
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; EDE: 22 (No Reachable Authority)
;; QUESTION SECTION:
;cloud.feedly.com.		IN	A

;; Query time: 12 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Fri Oct 20 10:47:42 AEDT 2023
;; MSG SIZE  rcvd: 51

But google.com is fine:

dig google.com @1.1.1.1

; <<>> DiG 9.16.44-Debian <<>> google.com @1.1.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 22066
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;google.com.			IN	A

;; ANSWER SECTION:
google.com.		61	IN	A	142.250.70.238

;; Query time: 8 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Fri Oct 20 10:48:18 AEDT 2023
;; MSG SIZE  rcvd: 55

Normally, Cloudflare DNS is bulletproof. I can’t see any reports of issues. Is there something that needs investigating?

Your cellular connection could very well be hitting a different Cloudflare location.

You could try including “+nsid” with your dig query, which would also return a unique ID for the deployment you reach.

Alternatively, you can also dig for “CH TXT id.server” to get the location code you’re reaching.

E.g. dig +nsid CH TXT id.server:

$ dig +nsid CH TXT id.server @1.1.1.1

; <<>> DiG 9.16.42-Debian <<>> +nsid CH TXT id.server @1.1.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44517
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; NSID: 36 35 6d 38 31 ("65m81")
;; QUESTION SECTION:
;id.server.                     CH      TXT

;; ANSWER SECTION:
id.server.              0       CH      TXT     "CPH"

;; Query time: 4 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Fri Oct 20 02:12:22 CEST 2023
;; MSG SIZE  rcvd: 63

In this case, the query reaches CPH (Copenhagen/DK), with the NSID “65m81”.
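
The NSID line in the output above carries the identifier twice: once as hex bytes and once as quoted text. As a minimal sketch (the function name decode_nsid is made up for illustration, and it assumes dig prints the line in the same layout as shown in this thread), the hex bytes can be decoded from saved dig output like this:

```python
import re
from typing import Optional

# Matches the NSID line dig prints in the OPT pseudosection, e.g.
# '; NSID: 36 35 6d 38 31 ("65m81")', and captures only the hex bytes.
NSID_LINE = re.compile(r"^; NSID:[ \t]*((?:[0-9a-fA-F]{2}[ \t]*)+)", re.MULTILINE)


def decode_nsid(dig_output: str) -> Optional[str]:
    """Return the NSID from captured dig output as text, or None if absent."""
    match = NSID_LINE.search(dig_output)
    if match is None:
        return None
    hex_bytes = match.group(1).split()
    # Each two-character group is one byte of the identifier.
    raw = bytes(int(b, 16) for b in hex_bytes)
    return raw.decode("ascii", errors="replace")
```

Feeding it the output above should give the same text dig already shows in quotes, which makes it a handy cross-check when only the hex part has been copied.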

$ dig +nsid CH TXT id.server @1.0.0.1

; <<>> DiG 9.16.42-Debian <<>> +nsid CH TXT id.server @1.0.0.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47746
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; NSID: 33 37 36 6d 35 33 ("376m53")
;; QUESTION SECTION:
;id.server.                     CH      TXT

;; ANSWER SECTION:
id.server.              0       CH      TXT     "LHR"

;; Query time: 44 msec
;; SERVER: 1.0.0.1#53(1.0.0.1)
;; WHEN: Fri Oct 20 02:12:36 CEST 2023
;; MSG SIZE  rcvd: 64

Here, by routing 1.0.0.1 through a VPN connection, the query reaches LHR (London/UK), with the NSID “376m53”.

That would help settle whether or not you’re reaching multiple different locations, and whether they are within or outside Australia.

It would also allow you to see whether there is a consistent pattern across specific deployments.

To go further into your outputs, regarding the EDE code 22 (No Reachable Authority) you see there:

If e.g. Google has four name servers (ns1.google.com, ns2.google.com, ns3.google.com, and ns4.google.com), it typically means, in a very generic way, that the Cloudflare resolver was not able to reach, or otherwise get any kind of response from, any of Google’s four name servers.
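
For anyone collecting these outputs, here is a rough sketch of pulling the status and the EDE code out of captured dig text. The EDE codes are defined in RFC 8914; the table below lists only a handful of them, and the function names are invented for this example. It assumes the “; EDE:” line looks the way it does in the outputs in this thread.

```python
import re
from typing import Optional, Tuple

# A small subset of the Extended DNS Error codes from RFC 8914 (not the full list).
EDE_MEANINGS = {
    3: "Stale Answer",
    6: "DNSSEC Bogus",
    13: "Cached Error",
    22: "No Reachable Authority",
    23: "Network Error",
}

STATUS_RE = re.compile(r"status: ([A-Z]+)")
EDE_RE = re.compile(r"^; EDE: (\d+)", re.MULTILINE)


def status_and_ede(dig_output: str) -> Tuple[Optional[str], Optional[int]]:
    """Extract the header status and the EDE code (if any) from dig output."""
    status = STATUS_RE.search(dig_output)
    ede = EDE_RE.search(dig_output)
    return (
        status.group(1) if status else None,
        int(ede.group(1)) if ede else None,
    )


def describe(dig_output: str) -> str:
    """Produce a one-line summary of a captured dig response."""
    status, ede = status_and_ede(dig_output)
    if ede is None:
        return f"{status}: no extended error reported"
    meaning = EDE_MEANINGS.get(ede, "not in this sketch's table")
    return f"{status}: EDE {ede} ({meaning})"
```

The summary only restates what the resolver reported; it does not say which authoritative server was unreachable or why.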

I would therefore start by checking, using the above information, whether there is a consistent pattern in which Cloudflare location you end up on that appears to have issues, and then share the location codes / NSIDs for the problematic ones.
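
One way to look for such a pattern is to save the output of several “dig +nsid” runs and count the statuses per NSID. The following is only a sketch under that assumption: the name tally_by_nsid is hypothetical, and outputs captured without “+nsid” are lumped together under "unknown".

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable

STATUS_RE = re.compile(r"status: ([A-Z]+)")
# dig prints the NSID as hex bytes followed by the text form in quotes.
NSID_RE = re.compile(r'^; NSID: .*\("([^"]*)"\)', re.MULTILINE)


def tally_by_nsid(outputs: Iterable[str]) -> Dict[str, Counter]:
    """Count response statuses per NSID across several captured dig outputs."""
    tally: Dict[str, Counter] = defaultdict(Counter)
    for text in outputs:
        status = STATUS_RE.search(text)
        nsid = NSID_RE.search(text)
        # Outputs captured without +nsid have no NSID line; group them separately.
        key = nsid.group(1) if nsid else "unknown"
        tally[key][status.group(1) if status else "NO_STATUS"] += 1
    return dict(tally)
```

If the SERVFAIL counts pile up under one NSID (or a few NSIDs from one location) while others show only NOERROR, those are the identifiers worth sharing.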

I see no ongoing (or recent) activity in the Oceania area according to the Cloudflare Status Page. However, it isn’t impossible that your ISP takes you to a completely different country, or even continent, although given the quite low latency (e.g. “Query time” in your output), I would doubt that in this specific situation.

Many thanks @DarkDeviL for your quick and helpful reply. This confirms my hypothesis: the SERVFAIL issues I am seeing appear to be isolated to the Hobart Cloudflare DNS server.

Connected by my fixed connection, I appear to be reaching a Hobart DNS server (I continue to get SERVFAIL errors):

% dig +nsid CH TXT id.server @1.1.1.1

; <<>> DiG 9.10.6 <<>> +nsid CH TXT id.server @1.1.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 51136
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; NSID: 34 36 39 6d 32 ("469m2")
;; QUESTION SECTION:
;id.server.			CH	TXT

;; ANSWER SECTION:
id.server.		0	CH	TXT	"HBA"

;; Query time: 40 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Fri Oct 20 13:42:46 AEDT 2023
;; MSG SIZE  rcvd: 63

But when connected over cellular, I am reaching a Melbourne server (from which I’m not getting any SERVFAILs):

% dig +nsid CH TXT id.server @1.1.1.1

; <<>> DiG 9.10.6 <<>> +nsid CH TXT id.server @1.1.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48066
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; NSID: 34 37 6d 31 35 36 ("47m156")
;; QUESTION SECTION:
;id.server.			CH	TXT

;; ANSWER SECTION:
id.server.		0	CH	TXT	"MEL"

;; Query time: 55 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Fri Oct 20 13:43:25 AEDT 2023
;; MSG SIZE  rcvd: 64

Like you, I can’t see any reported issues, but it seems pretty likely to me that something is broken in the Hobart DNS server.

If you repeat the queries you mentioned, but with “+nsid” included, are you always seeing the same (or a very similar) NSID?

E.g.

dig +nsid mail.google.com @1.1.1.1
dig +nsid cloud.feedly.com @1.1.1.1  
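
To repeat those two queries several times and keep the raw output for comparison, something along these lines could work. It is a sketch only: it assumes dig is installed and on the PATH, and the helper names (dig_command, run_dig, collect) are made up for this example.

```python
import subprocess
from typing import Dict, Iterable, List


def dig_command(name: str, server: str = "1.1.1.1") -> List[str]:
    """Build the same command line as above: dig +nsid <name> @<server>."""
    return ["dig", "+nsid", name, f"@{server}"]


def run_dig(name: str, server: str = "1.1.1.1", timeout: float = 10.0) -> str:
    """Run dig once and return its standard output as text."""
    result = subprocess.run(
        dig_command(name, server),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,  # keep the output even when dig exits non-zero
    )
    return result.stdout


def collect(names: Iterable[str], repeats: int = 5) -> Dict[str, List[str]]:
    """Query each name several times and keep every raw output."""
    return {name: [run_dig(name) for _ in range(repeats)] for name in names}
```

For example, collect(["mail.google.com", "cloud.feedly.com"]) would gather five outputs per name, whose NSID lines can then be compared by eye.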

So it looks like the Hobart DNS server is working again. I can’t find any domains that return SERVFAILs anymore. For posterity, I can confirm that I was reaching a Hobart server (at least I think so). See examples below.

It’s a pity not to know whether there was a fault / known issue. But fixed = good.

cloud.feedly.com flags the "469m2" server:

% dig +nsid cloud.feedly.com @1.1.1.1

; <<>> DiG 9.10.6 <<>> +nsid cloud.feedly.com @1.1.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55133
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; NSID: 34 36 39 6d 32 ("469m2")
;; QUESTION SECTION:
;cloud.feedly.com.		IN	A

;; ANSWER SECTION:
cloud.feedly.com.	260	IN	A	104.20.60.241
cloud.feedly.com.	260	IN	A	104.20.59.241

;; Query time: 51 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Sat Oct 21 19:41:13 AEDT 2023
;; MSG SIZE  rcvd: 86

As does mail.google.com:

dig +nsid mail.google.com @1.1.1.1 

; <<>> DiG 9.10.6 <<>> +nsid mail.google.com @1.1.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 25269
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; NSID: 34 36 39 6d 32 ("469m2")
;; QUESTION SECTION:
;mail.google.com.		IN	A

;; ANSWER SECTION:
mail.google.com.	90	IN	A	142.250.70.133

;; Query time: 59 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Sat Oct 21 19:40:58 AEDT 2023
;; MSG SIZE  rcvd: 69

The id.server query flags "469m3", which admittedly is not the same, but suspiciously similar. Tasmania being an island, no other location could give me similar latencies (13 ms between Hobart and Melbourne, the next closest hop), so I’m reasonably confident that I was always getting a Hobart server when the issue was present.

dig +nsid CH TXT id.server @1.1.1.1                                   

; <<>> DiG 9.10.6 <<>> +nsid CH TXT id.server @1.1.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52851
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; NSID: 34 36 39 6d 33 ("469m3")
;; QUESTION SECTION:
;id.server.			CH	TXT

;; ANSWER SECTION:
id.server.		0	CH	TXT	"HBA"

;; Query time: 57 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Sat Oct 21 19:41:39 AEDT 2023
;; MSG SIZE  rcvd: 63
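
The “suspiciously similar” comparison can be made a little more explicit. Both "469m2" and "469m3" came back alongside the "HBA" location code in the outputs in this thread, which hints that the part before the “m” is shared by servers at one location. That reading is an assumption drawn from these few samples, not a documented format, and the helper names below are invented for the sketch.

```python
def nsid_prefix(nsid: str) -> str:
    """Return the part of an NSID before the first 'm' (whole string if none)."""
    return nsid.split("m", 1)[0]


def same_site_guess(a: str, b: str) -> bool:
    """Guess whether two NSIDs belong to one location.

    ASSUMPTION: the prefix before 'm' identifies the location. This is
    inferred from the samples in this thread, not from any specification.
    """
    return nsid_prefix(a) == nsid_prefix(b)
```

Under that assumption, same_site_guess("469m2", "469m3") is true, while the Melbourne NSID "47m156" does not match either of them.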
