DNS issue with domain

Ongoing problem resolving some of my domains through Cloudflare's DNS servers: more than half of the attempts result in a SERVFAIL response.


; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.8 <<>> @1.1.1.1 docker-registry.eastwood.com.ru
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 49635
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; OPT=15: 00 16 ("..")
;; QUESTION SECTION:
;docker-registry.eastwood.com.ru. IN    A

;; Query time: 6 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Tue Dec 21 21:19:20 UTC 2021
;; MSG SIZE  rcvd: 66

And another one:

# dig @1.1.1.1 gitlab.eastwood.com.ru

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.8 <<>> @1.1.1.1 gitlab.eastwood.com.ru
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 60573
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; OPT=15: 00 16 ("..")
;; QUESTION SECTION:
;gitlab.eastwood.com.ru.                IN      A

;; Query time: 5 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Tue Dec 21 21:30:22 UTC 2021
;; MSG SIZE  rcvd: 57

# dig @1.1.1.1 docker-registry.eastwood.com.ru

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.8 <<>> @1.1.1.1 docker-registry.eastwood.com.ru
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 51928
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; OPT=15: 00 16 ("..")
;; QUESTION SECTION:
;docker-registry.eastwood.com.ru. IN    A

;; Query time: 5 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Tue Dec 21 21:41:01 UTC 2021
;; MSG SIZE  rcvd: 66

# dig @1.0.0.1 docker-registry.eastwood.com.ru

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.8 <<>> @1.0.0.1 docker-registry.eastwood.com.ru
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 14009
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; OPT=15: 00 16 ("..")
;; QUESTION SECTION:
;docker-registry.eastwood.com.ru. IN    A

;; Query time: 5 msec
;; SERVER: 1.0.0.1#53(1.0.0.1)
;; WHEN: Tue Dec 21 21:41:31 UTC 2021
;; MSG SIZE  rcvd: 66

# dig @8.8.8.8 docker-registry.eastwood.com.ru

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.8 <<>> @8.8.8.8 docker-registry.eastwood.com.ru
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14118
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;docker-registry.eastwood.com.ru. IN    A

;; ANSWER SECTION:
docker-registry.eastwood.com.ru. 3600 IN A      116.203.147.97

;; Query time: 6 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Tue Dec 21 21:41:57 UTC 2021
;; MSG SIZE  rcvd: 76

# dig +short CHAOS TXT id.server @1.1.1.1
"FRA"
# dig +short CHAOS TXT id.server @1.0.0.1
"FRA"

That hostname isn’t resolving in a lot of places.

This is because the nameservers of com.ru. are very unresponsive.

~> dig ns com.ru. +short
ns3-com.nic.ru.
ns4-com.nic.ru.
ns8-com.nic.ru.

I tried some queries. Sometimes they take around 50 ms, other times they take up to 12 seconds or time out:

; <<>> DiG 9.16.22-Debian <<>> @ns8-com.nic.ru. eastwood.com.ru.
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
~> dig @ns3-com.nic.ru. eastwood.com.ru.

; <<>> DiG 9.16.22-Debian <<>> @ns3-com.nic.ru. eastwood.com.ru.
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36982
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 4, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;eastwood.com.ru.		IN	A

;; AUTHORITY SECTION:
eastwood.com.ru.	345600	IN	NS	ns-148.awsdns-18.com.
eastwood.com.ru.	345600	IN	NS	ns-1425.awsdns-50.org.
eastwood.com.ru.	345600	IN	NS	ns-1978.awsdns-55.co.uk.
eastwood.com.ru.	345600	IN	NS	ns-958.awsdns-55.net.

;; Query time: 55 msec
;; SERVER: 193.232.146.170#53(193.232.146.170)
;; WHEN: Tue Dec 21 22:53:14 CET 2021
;; MSG SIZE  rcvd: 184

I believe Cloudflare has a timeout of around 4000 ms, which is why it returns SERVFAIL if it does not get a reply in time. I also see queries failing on Quad9 and OpenDNS.

Cloudflare has multiple servers that make DNS requests, and you are served round-robin by one of them. One server might get a successful response, while a different one might hit a timeout and serve a SERVFAIL. That particular server then keeps using its cached response, even if it’s a SERVFAIL.

Google’s DNS will never reply with a SERVFAIL (unless queries fail 100% of the time), because it uses the cached result of a previous successful response, no matter which one of their hundred servers got it. On Cloudflare you always get the response of one particular server. Perhaps Cloudflare should also implement this for badly behaving nameservers like the ones of com.ru. @mvavrusa
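
To make the difference concrete, here is a toy model of the behaviour described in this post. It is only a sketch: the 4000 ms timeout is my guess from above, and `resolve_outcome`, `per_backend_answers` and `shared_cache_answers` are hypothetical names, not anything Cloudflare or Google actually run.

```python
from typing import List, Optional

TIMEOUT_MS = 4000  # assumption: the resolver timeout guessed in this post


def resolve_outcome(upstream_latency_ms: Optional[int],
                    timeout_ms: int = TIMEOUT_MS) -> str:
    """Status a single backend would cache after one upstream lookup.

    None stands for an upstream query that never got a reply.
    """
    if upstream_latency_ms is None or upstream_latency_ms > timeout_ms:
        return "SERVFAIL"
    return "NOERROR"


def per_backend_answers(latencies: List[Optional[int]]) -> List[str]:
    """Each backend keeps its own cached result (the per-server model)."""
    return [resolve_outcome(ms) for ms in latencies]


def shared_cache_answers(latencies: List[Optional[int]]) -> List[str]:
    """Any backend's success is reused by all of them (the shared model)."""
    outcomes = per_backend_answers(latencies)
    if "NOERROR" in outcomes:
        return ["NOERROR"] * len(outcomes)
    return outcomes


# Latencies like the ones I measured: ~50 ms, 12 s, and a timeout.
observed = [50, 12000, None, 55]
print(per_backend_answers(observed))
print(shared_cache_answers(observed))
```

In the per-server model, whether you get an answer depends on which backend you happen to hit. In the shared model a single success is enough for everyone.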

I’ve seen this discussion come up before, when Facebook went down. Initially their nameservers only went down sporadically. facebook.com still worked on Google DNS but failed on most other providers. This was because Google was still serving the last known successful response, even when some or all of Facebook’s nameservers were no longer responding. Other DNS providers returned what Facebook gave them: nothing. One might argue that this does not follow the spirit of DNS neutrality and that Google is injecting its own responses. I personally think it’s better for the end user to get an actual response than none at all. The question is how long Google should keep serving these “last known” responses: 5 minutes? 1 hour? A whole day? That’s up for discussion.
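
As a rough illustration of that trade-off, here is a minimal sketch of a cache that serves a “last known” answer for a limited time once the upstream nameservers stop replying. `StaleCache` and `max_stale_seconds` are names I made up for this example. It is not how Google or any other resolver actually implements this.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CacheEntry:
    answer: str
    expires_at: float  # moment at which the record's TTL runs out


class StaleCache:
    """Toy cache that may serve expired answers while upstream is failing."""

    def __init__(self, max_stale_seconds: float) -> None:
        # How long past TTL expiry an answer may still be served.
        self.max_stale_seconds = max_stale_seconds
        self._entries: Dict[str, CacheEntry] = {}

    def store(self, name: str, answer: str, ttl: float, now: float) -> None:
        self._entries[name] = CacheEntry(answer, now + ttl)

    def lookup(self, name: str, now: float, upstream_ok: bool) -> Optional[str]:
        """Return a usable cached answer, or None if there is none."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        if now < entry.expires_at:
            return entry.answer  # still fresh
        stale_deadline = entry.expires_at + self.max_stale_seconds
        if not upstream_ok and now < stale_deadline:
            return entry.answer  # expired, but upstream is down: serve stale
        return None  # too old, or upstream works and should be asked again


cache = StaleCache(max_stale_seconds=3600)
cache.store("docker-registry.eastwood.com.ru.", "116.203.147.97", ttl=3600, now=0)
print(cache.lookup("docker-registry.eastwood.com.ru.", now=5000, upstream_ok=False))
print(cache.lookup("docker-registry.eastwood.com.ru.", now=8000, upstream_ok=False))
```

The whole “5 minutes, 1 hour, a whole day” question comes down to the value of `max_stale_seconds`: the first lookup is still inside the stale window, the second one is not.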

Even though this is not a problem with the public DNS provider, but rather a patch for the faulty upstream nameservers of that particular domain, I think it’s important to have this “last known valid response” feature for the end user. Situations where a domain resolves only 50% of the time can arise and leave users puzzled, ultimately blaming the error on the public DNS provider, in this case Cloudflare.

For this particular issue, the nameservers are flaky, as mentioned above. I tried to flag some of them as unreliable, so it should be a bit better, but it seems the set isn’t the same everywhere.

As for the rest, we’re working on two things separately:

  1. Improving the returned EDE codes. Here an EDE code is returned and it makes sense to me, but without prior knowledge it’s not obvious what the issue is. This should help end-user visibility.
  2. 1.1.1.1 does support “serve stale” as well, but it’s not always as effective. There are several reasons for this: the number of backends and PoPs, and the size allowance for the cache.
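
For anyone wondering where the EDE code is in the output above: it is the `OPT=15: 00 16` line, which this version of dig prints as raw bytes. Option code 15 is Extended DNS Errors (RFC 8914), and the first two bytes of the option data are the info code. A small sketch of decoding it by hand follows. `decode_ede` is a hypothetical helper, and the table is only a small subset of the registry.

```python
import struct
from typing import Tuple

EDE_OPTION_CODE = 15  # EDNS option code for Extended DNS Errors (RFC 8914)

# Subset of the RFC 8914 info codes, just enough for this thread.
EDE_INFO_CODES = {
    3: "Stale Answer",
    13: "Cached Error",
    22: "No Reachable Authority",
    23: "Network Error",
}


def decode_ede(option_data: bytes) -> Tuple[int, str, str]:
    """Split EDE option data into (info code, meaning, extra text)."""
    if len(option_data) < 2:
        raise ValueError("EDE option data must be at least 2 bytes")
    # The info code is a 16-bit unsigned integer in network byte order.
    (info_code,) = struct.unpack("!H", option_data[:2])
    # Anything after the info code is optional UTF-8 extra text.
    extra_text = option_data[2:].decode("utf-8", errors="replace")
    meaning = EDE_INFO_CODES.get(info_code, "unknown")
    return info_code, meaning, extra_text


# The bytes dig printed after "OPT=15:" in the SERVFAIL responses.
print(decode_ede(bytes.fromhex("0016")))
```

So `00 16` is info code 22, “No Reachable Authority”, which matches the unresponsive com.ru. nameservers.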

But why use the nameservers of com.ru if the domains gitlab.eastwood.com.ru and docker-registry.eastwood.com.ru have their nameservers on the eastwood.com.ru domain?

dig +short eastwood.com.ru SOA
ns-148.awsdns-18.com. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400

DNS is hierarchical. We need to use all the nameservers in the chain.

I want to know the IP of gitlab.eastwood.com.ru., so I will ask the nameservers one level higher up the chain: eastwood.com.ru., but I don’t know what the nameservers of eastwood.com.ru. are, so I need to ask com.ru.’s nameservers what the nameservers of eastwood.com.ru. are, but I don’t know com.ru.’s nameservers… so I ask ru.’s nameservers what com.ru.’s nameservers are, but I don’t know what ru.’s nameservers are, so I ask .’s nameservers, but I don’t know wh– oh wait, that one we do: we can ask the root nameservers, as their IP addresses are well known and almost never change.

This is what a recursive DNS server does, but it starts from the top: first the root (.), and then it relies on the replies of the nameservers further down the chain. You cannot just ask the nameservers of eastwood.com.ru, because we simply do not know what they are yet. If one nameserver in the chain does not reply, we will not have an answer (unless you “serve stale” like Google’s DNS).
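
The same walk can be written down as a small sketch. `delegation_chain` and `first_broken_link` are hypothetical helpers that only manipulate names and do no real DNS queries. The sketch also assumes there is a delegation at every label, which happens to be true for this domain but not for every name.

```python
from typing import List, Optional, Set


def delegation_chain(name: str) -> List[str]:
    """List the zones a resolver asks, from the root down, to resolve name."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    chain = ["."]  # always start at the root
    # Build "ru.", "com.ru.", ... stopping one level above the name itself.
    for i in range(len(labels) - 1, 0, -1):
        chain.append(".".join(labels[i:]) + ".")
    return chain


def first_broken_link(chain: List[str], unreachable: Set[str]) -> Optional[str]:
    """Return the first zone whose nameservers do not answer, if any."""
    for zone in chain:
        if zone in unreachable:
            return zone
    return None


chain = delegation_chain("gitlab.eastwood.com.ru.")
print(chain)
print(first_broken_link(chain, unreachable={"com.ru."}))
```

The nameservers of eastwood.com.ru. can be perfectly healthy and the lookup still fails, because the resolver never gets past com.ru. to find out who they are.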

I tested today and com.ru.’s nameservers seem to be responding fine again without any timeouts.

I contacted nic.ru support on 22 December and told them about the problem with the NS server and that it was causing problems for the entire com.ru zone.
On 23 December I got a response saying the problem had been found and resolved on their side.

So thanks to you all for diagnosing this case and helping to resolve the problem.
