I’ve been struggling with this issue for at least a month. Initially I thought dnscrypt-proxy was implicated, but I’ve since been able to reproduce it using just basic DNS queries against 1.1.1.1. I believe I now have enough evidence of a bug (somewhere) to warrant a post here.
Here is the pattern I’m seeing:
1. My system sends a query to 1.1.1.1 for example.org/A.
2. The 1.1.1.1 resolver I hit, not having a cache entry for the name in question, incorrectly returns a NOERROR response with no answer RRs, plus referrals to the authoritative nameservers (in my case, to Cloudflare’s will and jean nameservers).
3. I send an additional query to 1.1.1.1 (repeat step 1).
4. Cloudflare responds correctly with a response containing answer RRs (as a recursive resolver should).
5. Additional queries hit the cached response and are correct.
6. After some time (presumably after the cache entry drops from the 1.1.1.1 resolver), we’re back to the behavior in steps 1 and 2.
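For anyone who wants to check for the same pattern, here is a minimal sketch of a probe, assuming plain UDP to port 53 and the Python standard library only. The function names (build_query, classify_response, probe) are made up for illustration. It only inspects the header counts and the type of the first authority record, which is enough to tell an NS referral apart from an ordinary NODATA answer that carries an SOA.

```python
import random
import socket
import struct

TYPE_NS, TYPE_SOA = 2, 6

def build_query(name, qtype=1, msg_id=None):
    """Build a minimal DNS query: RD flag set, one question, class IN."""
    if msg_id is None:
        msg_id = random.randrange(0x10000)
    header = struct.pack("!HHHHHH", msg_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)

def _skip_name(packet, offset):
    """Return the offset just past a (possibly compressed) domain name."""
    while True:
        length = packet[offset]
        if length == 0:
            return offset + 1
        if length & 0xC0 == 0xC0:  # compression pointer is two bytes long
            return offset + 2
        offset += 1 + length

def classify_response(packet):
    """Classify a response as error, answer, empty, referral, nodata or other."""
    _id, flags, qdcount, ancount, nscount, _ar = struct.unpack(
        "!HHHHHH", packet[:12]
    )
    if flags & 0x000F != 0:
        return "error"
    if ancount > 0:
        return "answer"
    if nscount == 0:
        return "empty"
    offset = 12
    for _ in range(qdcount):  # skip each question: name + type + class
        offset = _skip_name(packet, offset) + 4
    offset = _skip_name(packet, offset)  # owner name of first authority RR
    (rtype,) = struct.unpack("!H", packet[offset:offset + 2])
    if rtype == TYPE_NS:
        return "referral"  # NOERROR, no answers, NS in authority: the bad case
    if rtype == TYPE_SOA:
        return "nodata"
    return "other"

def probe(name, server="1.1.1.1", timeout=3.0):
    """Send one A query over UDP and classify whatever comes back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        packet, _addr = sock.recvfrom(4096)
    return classify_response(packet)
```

Calling probe("example.org") twice in a row against a name that is not cached should, if the pattern above reproduces, give "referral" followed by "answer". The sketch does not check that the response ID matches the query, so treat it as a diagnostic aid only.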
I have also observed that every name this has happened to uses Cloudflare’s CNAME flattening feature. I’ve never seen it affect names that resolve directly to address records, without any CNAME indirection.
Please tell me I’m not insane!
Edit: just to preempt one thing: the queries are not being intercepted. Same thing happens with DoH, and I’ve reproduced this from a few vantage points across the world.
FWIW, someone reported a similar issue a week ago, with one non-Cloudflare domain. Their DNS may have been intercepted, though, because they said 1.1.1.1 failed and 1.0.0.1 worked.
I managed to get another reproduction with a domain that wasn’t using CNAME flattening, so that’s probably not the determining factor (it was just a regular CNAME to a cloudfront.net distribution).
Thanks for the packet captures @_az. I’m trying to trace this in the query logs now. I suspect it might be a collision with a late-arriving answer from the delegation lookup.
Thanks for the detailed report, it’s super helpful! I suspect the issue was the timing of the delegation lookup query (the resolver retransmits a query if the answer doesn’t arrive within roughly the usual RTT). When the answer to the first query arrived, the resolver sent one more query to the final authority with the same query name and type (NS, because query name minimization is enabled), but it received the answer to the previous query instead if the open port number matched. I added one more step to give the upstream slightly more time to respond, and the resolver now re-randomizes the message ID and the query name letter case to make collisions much less likely.
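As a rough illustration of why re-randomizing helps (this is a sketch under assumptions, not the resolver’s actual code), the idea can be modelled as a matching key that a response must reproduce exactly. The names randomize_case, new_query_key and response_matches are hypothetical. If a retransmit reuses the same ID, name, type and port, a late answer to the first query is indistinguishable from an answer to the second. A fresh ID and fresh letter case on each send make a stale answer very unlikely to match.

```python
import secrets

def randomize_case(name):
    """Return the name with each letter's case chosen at random (0x20-style)."""
    return "".join(
        ch.upper() if secrets.randbits(1) else ch.lower() for ch in name
    )

def new_query_key(name, qtype, local_port):
    """Draw a fresh message ID and letter case for every transmission."""
    return (secrets.randbelow(0x10000), randomize_case(name), qtype, local_port)

def response_matches(key, msg_id, qname, qtype, local_port):
    """Accept a response only if every field of the key matches exactly.

    The name comparison is deliberately case-sensitive: an answer to an
    earlier transmission carries that transmission's letter case.
    """
    expected_id, expected_name, expected_type, expected_port = key
    return (
        msg_id == expected_id
        and qname == expected_name
        and qtype == expected_type
        and local_port == expected_port
    )
```

With a 16-bit ID plus one random bit per letter in the name, two independently drawn keys for the same name, type and port rarely coincide, which is the "much less likely" part of the fix described above.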
@mnordhoff that sounds vaguely like a viable theory of mechanism for the qname minimization + capsforid fallback problems Let’s Encrypt was having. Collision between the A and CAA/whatever responses.
It’s not. I don’t think this is directly exploitable (at least not any more than the inherent risk for every unsigned record), as the second answer is still in the same bailiwick. It could cause the first uncached response to be empty under certain conditions, because in DNS a referral and a negative answer are ambiguous if the zone cut doesn’t change. It wouldn’t poison cached records though.
I don’t know the code, and I was partly wondering if this might have been a visible facet of a worse bug.
Edit: Ugh, I forgot about the “certain criteria which we decided not to disclose at the moment” part of the CVE description, and now I’m embarrassed I even asked.