I have customers that have switched to your 18.104.22.168 and 22.214.171.124 for their DNS needs and now they are complaining to me that certain sites are unreachable.
I looked into this and determined that Cloudflare’s DNS service has issues working with authoritative DNS servers that have a moderate amount of network latency (>30ms). I looked further and found an interesting article that discusses DNS recursion timeout vulnerabilities. See https://pdfs.semanticscholar.org/e1a2/d5d279a3238f5a52052318c3179253c28260.pdf for details.
My tests show Google’s DNS at 126.96.36.199 never gave a false SERVFAIL or NXDOMAIN, whereas Cloudflare’s 188.8.131.52 did from time to time. Based on the above article, it appears that Cloudflare’s resolving DNS servers are suffering from one or more of the following:
- Insufficient memory
- Insufficient pending recursive queue depth
- Active Denial-of-Service (DoS) or Distributed Denial-of-Service (DDoS) attack
So using Cloudflare’s DNS at 184.108.40.206 and/or 220.127.116.11 may subject the end user to an occasional SERVFAIL or NXDOMAIN DNS query response if the domain they are reaching out to exists on an authoritative DNS server with network latency that exceeds Cloudflare’s response parameters.
Hopefully Cloudflare is aware of this issue and is working on a solution.
Hi @dbarker. 18.104.22.168 has a timeout of 3000ms for slow NSs. It can produce a SERVFAIL if some of the upstream NSs are unreachable or of poor quality. Typically, some of the nameservers are unresponsive in certain locations and don’t support TCP (which can exceed the client-configured timeout), or the nameservers are non-compliant for some queries, which eliminates them from NS election.
Can you share which domains you see resolution failures for and which Cloudflare PoP you are hitting?
See Have problems with 22.214.171.124? *Read Me First*
Can Knot serve cached, albeit expired, responses during a grace period instead of returning a SERVFAIL when this happens?
It can, but it’s not yet enabled. We’re looking into it in a week or two.
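For readers unfamiliar with the idea, here is a minimal sketch of serving an expired answer during a grace period. It is not Knot Resolver's implementation; the class name `StaleCache`, the `grace_seconds` parameter, and the `upstream` callable are all hypothetical and exist only for illustration.

```python
import time


class StaleCache:
    """Toy cache that serves expired answers for a grace period
    when the upstream lookup fails. Illustrative only."""

    def __init__(self, grace_seconds=60.0, clock=time.monotonic):
        self.grace = grace_seconds
        self.clock = clock
        self.entries = {}  # name -> (answer, expires_at)

    def store(self, name, answer, ttl):
        self.entries[name] = (answer, self.clock() + ttl)

    def resolve(self, name, upstream):
        """upstream(name) returns (answer, ttl) or raises OSError."""
        now = self.clock()
        cached = self.entries.get(name)
        if cached and now < cached[1]:
            return cached[0], "fresh"
        try:
            answer, ttl = upstream(name)
        except OSError:
            # Upstream failed: fall back to the expired answer if it
            # is still inside the grace window, otherwise give up.
            if cached and now < cached[1] + self.grace:
                return cached[0], "stale"
            return None, "SERVFAIL"
        self.store(name, answer, ttl)
        return answer, "refreshed"
```

The trade-off is that clients may briefly receive outdated records, which is usually preferable to a hard failure while the authoritative servers are slow or unreachable.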
Please look at support case #1521382 for screen shots and additional details. Your support ticket response stated that I needed to post here to enable others to answer my question as if I was asking a question.
This is a real issue with your service responding with a negative response before your servers even queried my DNS servers. I ran a packet trace and could see that your server responded very quickly with either an NXDOMAIN or a SERVFAIL before it actually queried my DNS servers.
Interestingly enough, if I moved the domain in question to a very low latency authoritative DNS server, your service worked each and every time.
You state that you are using a 3000ms timeout, but test queries fail much sooner.
Based on the PDF file listed above (DNS Recursion Timeout Vulnerability), I can see the possibility that your resolver dumps the query too soon if your pending query queue depth is saturated.
All I can tell you is that no resolver should be returning negative results unless the DNS servers are truly offline or exceed the 3000ms (three-second) timeout window you have established.
Please let me know if you need more details.
Thanks, I looked at #1521382. The problem with your nameserver isn’t the timeout, but its noncompliance. It doesn’t support EDNS0 (and it either drops incoming messages or returns FORMERR). The resolver will work around this eventually, but you’ll still see timeouts while the resolver is stuck waiting for the nameserver to respond before flagging it as unresponsive (then it will be eliminated from NS selection for a few seconds).
You can either fix the nameserver, or I can add overrides to downgrade to basic DNS mode for this zone, but honestly there’s not much the resolver can do if the nameserver doesn’t respond to queries.
The nameservers in question are 126.96.36.199 and 188.8.131.52 (dig with EDNS hangs, dig with +noedns works).
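For anyone who wants to reproduce the EDNS versus non-EDNS comparison without dig, here is a rough sketch using only the Python standard library. It builds the same A query twice, once with an EDNS0 OPT record and once without, and reports whether the server answers. The helper names (`encode_name`, `build_query`, `probe`) and the 1232-byte advertised UDP size are assumptions made for illustration, not anything taken from the resolver.

```python
import socket
import struct


def encode_name(name):
    """Encode a domain name in DNS wire format (length-prefixed labels)."""
    out = b""
    for label in (part for part in name.strip(".").split(".") if part):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name, qtype=1, edns=False, query_id=0x1234, udp_size=1232):
    """Build a recursion-desired query; optionally append an EDNS0 OPT RR."""
    arcount = 1 if edns else 0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class IN
    packet = header + question
    if edns:
        # OPT pseudo-record: root name, TYPE 41, CLASS = UDP payload size,
        # TTL = 0 (extended RCODE, version 0, no flags), RDLENGTH = 0.
        packet += b"\x00" + struct.pack("!HHIH", 41, udp_size, 0, 0)
    return packet


def probe(server, name, edns, timeout=3.0):
    """Send one UDP query and return 'timeout' or the response RCODE."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, edns=edns), (server, 53))
        try:
            data, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "timeout"
    if len(data) < 12:
        return "malformed"
    return "rcode=%d" % (data[3] & 0x0F)  # 0 = NOERROR, 1 = FORMERR
```

Calling `probe(server, "example.com", edns=True)` and then again with `edns=False` against the same authoritative server (the server address and domain here are placeholders you supply) should show the difference described above: a timeout or `rcode=1` with EDNS, and `rcode=0` without it.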