1.1.1.1 giving answer-less responses on first (uncached) query, correct on second query

Hello,

I’ve been struggling with this issue for at least a month. Initially I thought dnscrypt-proxy was implicated, but I’ve since been able to reproduce it using just basic DNS queries against 1.1.1.1. I believe I now have enough evidence of a bug (somewhere) to warrant a post here.

What is happening is the following pattern.

  1. My system sends a query to 1.1.1.1 for example.org/A
  2. The 1.1.1.1 resolver I hit, not having a cache entry for the name in question, incorrectly returns a NOERROR response with no answer RRs, plus a referral to the authoritative nameservers (in my case, Cloudflare’s will and jean NS).
  3. I send an additional query to 1.1.1.1 (repeat step 1)
  4. Cloudflare will respond correctly with a response containing answer RRs (as a recursive resolver should).
  5. Additional queries will hit the cached response and be correct.
  6. After some time (presumably after the cache entry drops from the 1.1.1.1 resolver), we’re back to the behavior in step 1/2.
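
The bad case in step 2 can be spotted mechanically from the 12-byte DNS header alone. Here is a minimal Python sketch, assuming the raw response bytes are already in hand. The function name `classify_response` and its labels are illustrative and do not come from any tool mentioned in this thread.

```python
import struct

def classify_response(message: bytes) -> str:
    """Label a raw DNS response using only its 12-byte header (RFC 1035, section 4.1.1)."""
    if len(message) < 12:
        raise ValueError("DNS message shorter than its header")
    # Header layout: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT (all 16-bit, big-endian).
    _txid, flags, _qdcount, ancount, nscount, _arcount = struct.unpack("!6H", message[:12])
    if not flags & 0x8000:      # QR bit clear: this is a query, not a response
        return "not-a-response"
    if flags & 0x000F != 0:     # non-zero RCODE
        return "error"
    if ancount > 0:
        return "answer"
    if nscount > 0:             # NOERROR, no answers, but authority records present
        return "empty-with-authority"
    return "empty"
```

A header with flags 0x8180, zero answers, and two authority records is labelled `empty-with-authority`. Note that a legitimate NODATA reply (SOA in the authority section) gets the same label, so telling the two apart would require parsing the authority records themselves.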

I have a pcap of this behavior happening for two domains (and the domain hosting this file is one of the domains that I’ve observed to be affected).

I have also observed that every affected name uses Cloudflare’s CNAME flattening feature. I’ve never seen this affect names that resolve directly, without a CNAME in between.

Please tell me I’m not insane!

Edit: just to preempt one thing: the queries are not being intercepted. Same thing happens with DoH, and I’ve reproduced this from a few vantage points across the world.

FWIW, someone reported a similar issue a week ago, with one non-Cloudflare domain. Their DNS may have been intercepted, because they said 1.1.1.1 failed and 1.0.0.1 worked.

I managed to get another reproduction with a domain that wasn’t using CNAME flattening, so that’s probably not the determining factor (it was just a regular CNAME to a cloudfront.net distribution).

Bump. Some further investigation: I’ve run a test against 1.0.0.1 from 3 diverse vantage points for about 12 hours.

Every minute it would perform two queries:

  1. One for the A record of my domain (DNS hosted with Cloudflare)
  2. One for id.server/TXT (class CH, over TCP)

/usr/bin/dig @1.0.0.1 <domain>
/usr/bin/dig +tcp @1.0.0.1 id.server CH TXT
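
For anyone who wants to run a similar probe without dig, the first query could be approximated with the Python standard library. This is a sketch under stated assumptions: `build_query` and `probe` are hypothetical helper names, the query carries no EDNS0 option (unlike dig's default), and only UDP is handled.

```python
import secrets
import socket
import struct
from typing import Optional, Tuple

def build_query(name: str, qtype: int = 1, qclass: int = 1,
                txid: Optional[int] = None) -> bytes:
    """Encode a single-question DNS query with the RD (recursion desired) bit set."""
    if txid is None:
        txid = secrets.randbits(16)
    header = struct.pack("!6H", txid, 0x0100, 1, 0, 0, 0)
    labels = [label for label in name.split(".") if label]
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!2H", qtype, qclass)

def probe(server: str, name: str, timeout: float = 3.0) -> Tuple[int, int]:
    """Send one A query over UDP and return (rcode, answer count) from the reply header."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.connect((server, 53))   # only accept datagrams from this server
        sock.send(query)
        reply = sock.recv(4096)
    txid, flags, _qd, ancount, _ns, _ar = struct.unpack("!6H", reply[:12])
    if txid != struct.unpack("!H", query[:2])[0]:
        raise ValueError("transaction ID mismatch")
    return flags & 0x000F, ancount
```

With this sketch, `probe("1.0.0.1", "<domain>")` returning `(0, 0)` would correspond to a NOERROR reply with no answer records. The id.server lookup would use `build_query("id.server", qtype=16, qclass=3)` (TXT, class CH), sent over TCP in the test above.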

From all three vantage points, the issue occurred regularly over the test period.

The three vantage points being hit:

| id.server | Host | # Bad Responses | Session Pcap |
|---|---|---|---|
| ewr01 | Linode Newark | 24 | pcap |
| iad02 | OVH Canada | 23 | pcap |
| syd01 | Binary Lane | 15 | pcap |

Here is an example of a decoded bad response for your convenience (unredacted domain in pcaps):

Frame 8695: 143 bytes on wire (1144 bits), 143 bytes captured (1144 bits)
    Encapsulation type: Ethernet (1)
    Arrival Time: Jul 26, 2018 09:25:01.417845000 AEST
    [Time shift for this packet: 0.000000000 seconds]
    Epoch Time: 1532561101.417845000 seconds
    [Time delta from previous captured frame: 0.024120000 seconds]
    [Time delta from previous displayed frame: 0.024120000 seconds]
    [Time since reference or first frame: 43440.186141000 seconds]
    Frame Number: 8695
    Frame Length: 143 bytes (1144 bits)
    Capture Length: 143 bytes (1144 bits)
    [Frame is marked: False]
    [Frame is ignored: False]
    [Protocols in frame: eth:ip:udp:dns]
Ethernet II, Src: SuperMic_45:89:57 (0c:c4:7a:45:89:57), Dst: Xensourc_e0:c1:9a (00:16:3e:e0:c1:9a)
    Destination: Xensourc_e0:c1:9a (00:16:3e:e0:c1:9a)
        Address: Xensourc_e0:c1:9a (00:16:3e:e0:c1:9a)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Source: SuperMic_45:89:57 (0c:c4:7a:45:89:57)
        Address: SuperMic_45:89:57 (0c:c4:7a:45:89:57)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Type: IP (0x0800)
Internet Protocol Version 4, Src: 1.0.0.1 (1.0.0.1), Dst: 43.229.63.55 (43.229.63.55)
    Version: 4
    Header length: 20 bytes
    Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00: Not-ECT (Not ECN-Capable Transport))
        0000 00.. = Differentiated Services Codepoint: Default (0x00)
        .... ..00 = Explicit Congestion Notification: Not-ECT (Not ECN-Capable Transport) (0x00)
    Total Length: 129
    Identification: 0xb250 (45648)
    Flags: 0x02 (Don't Fragment)
        0... .... = Reserved bit: Not set
        .1.. .... = Don't fragment: Set
        ..0. .... = More fragments: Not set
    Fragment offset: 0
    Time to live: 62
    Protocol: UDP (17)
    Header checksum: 0x1dff [validation disabled]
        [Good: False]
        [Bad: False]
    Source: 1.0.0.1 (1.0.0.1)
    Destination: 43.229.63.55 (43.229.63.55)
User Datagram Protocol, Src Port: domain (53), Dst Port: 53716 (53716)
    Source port: domain (53)
    Destination port: 53716 (53716)
    Length: 109
    Checksum: 0x8af0 [validation disabled]
        [Good Checksum: False]
        [Bad Checksum: False]
Domain Name System (response)
    [Request In: 8694]
    [Time: 0.024120000 seconds]
    Transaction ID: 0xcbc8
    Flags: 0x8180 Standard query response, No error
        1... .... .... .... = Response: Message is a response
        .000 0... .... .... = Opcode: Standard query (0)
        .... .0.. .... .... = Authoritative: Server is not an authority for domain
        .... ..0. .... .... = Truncated: Message is not truncated
        .... ...1 .... .... = Recursion desired: Do query recursively
        .... .... 1... .... = Recursion available: Server can do recursive queries
        .... .... .0.. .... = Z: reserved (0)
        .... .... ..0. .... = Answer authenticated: Answer/authority portion was not authenticated by the server
        .... .... ...0 .... = Non-authenticated data: Unacceptable
        .... .... .... 0000 = Reply code: No error (0)
    Questions: 1
    Answer RRs: 0
    Authority RRs: 2
    Additional RRs: 1
    Queries
        <mydomain.redacted>: type A, class IN
            Name: <mydomain.redacted>
            Type: A (Host address)
            Class: IN (0x0001)
    Authoritative nameservers
        <mydomain.redacted>: type NS, class IN, ns jean.ns.cloudflare.com
            Name: <mydomain.redacted>
            Type: NS (Authoritative name server)
            Class: IN (0x0001)
            Time to live: 15 minutes
            Data length: 24
            Name Server: jean.ns.cloudflare.com
        <mydomain.redacted>: type NS, class IN, ns will.ns.cloudflare.com
            Name: <mydomain.redacted>
            Type: NS (Authoritative name server)
            Class: IN (0x0001)
            Time to live: 15 minutes
            Data length: 7
            Name Server: will.ns.cloudflare.com
    Additional records
        <Root>: type OPT
            Name: <Root>
            Type: OPT (EDNS0 option)
            UDP payload size: 1452
            Higher bits in extended RCODE: 0x0
            EDNS0 version: 0
            Z: 0x0
            Data length: 0
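
The flag word 0x8180 from the decode can be unpacked by hand to confirm what Wireshark reports. Here is a small Python sketch; `decode_flags` is an illustrative helper name, and the bit positions follow the RFC 1035 header layout plus the AD and CD bits added by DNSSEC.

```python
def decode_flags(flags: int) -> dict:
    """Split the 16-bit DNS header flag word into its named fields."""
    return {
        "qr": bool(flags & 0x8000),      # 1 = response
        "opcode": (flags >> 11) & 0xF,   # 0 = standard query
        "aa": bool(flags & 0x0400),      # authoritative answer
        "tc": bool(flags & 0x0200),      # truncated
        "rd": bool(flags & 0x0100),      # recursion desired
        "ra": bool(flags & 0x0080),      # recursion available
        "ad": bool(flags & 0x0020),      # authenticated data
        "cd": bool(flags & 0x0010),      # checking disabled
        "rcode": flags & 0x000F,         # 0 = NOERROR
    }
```

For 0x8180 this gives a non-authoritative, non-truncated response with RD and RA set and RCODE 0, which matches a recursive resolver's reply rather than one from an authoritative server.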

What’s the best way I can go about figuring this one out?

This doesn’t help, and I don’t have solid information or pcaps, but I think I may have just hit this too.

2018-07-31 05:58:20.286767 IP 127.0.0.1.43718 > 127.0.0.1.53: 12712+ A? d348hmg1ylg9cx.cloudfront.net. (47)
2018-07-31 05:58:20.286879 IP 172.31.0.204.26951 > 1.1.1.1.53: 41305+ A? d348hmg1ylg9cx.cloudfront.net. (47)
2018-07-31 05:58:20.409850 IP 1.1.1.1.53 > 172.31.0.204.26951: 41305 0/4/0 (187)
2018-07-31 05:58:20.409969 IP 127.0.0.1.53 > 127.0.0.1.43718: 12712 0/4/0 (187)

2018-07-31 05:58:24.180572 IP 127.0.0.1.43459 > 127.0.0.1.53: 59581+ A? d348hmg1ylg9cx.cloudfront.net. (47)
2018-07-31 05:58:24.180666 IP 172.31.0.204.56927 > 1.1.1.1.53: 58726+ A? d348hmg1ylg9cx.cloudfront.net. (47)
2018-07-31 05:58:24.224963 IP 1.1.1.1.53 > 172.31.0.204.56927: 58726 4/0/0 A 13.35.112.31, A 13.35.112.50, A 13.35.112.93, A 13.35.112.153 (111)
2018-07-31 05:58:24.225074 IP 127.0.0.1.53 > 127.0.0.1.43459: 59581 4/0/0 A 13.35.112.31, A 13.35.112.50, A 13.35.112.93, A 13.35.112.153 (111)
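
In tcpdump's summary format, a reply shows its record counts as answers/authority/additional after the transaction ID, so `0/4/0` is the answer-less case and `4/0/0` is the good one. Here is a rough Python sketch for pulling those out of a capture log; `parse_reply` and `answerless` are hypothetical names, and the regex is only checked against lines shaped like the ones above.

```python
import re

# A reply line looks like: "... IP <src> > <dst>: <id><flags> <an>/<ns>/<ar> ...".
REPLY = re.compile(r"IP (\S+) > (\S+): (\d+)\S* (\d+)/(\d+)/(\d+)\s")

def parse_reply(line: str):
    """Return (txid, answers, authority, additional) for a reply line, else None."""
    match = REPLY.search(line)
    if match is None:
        return None   # query lines carry no a/b/c counts and do not match
    txid, answers, authority, additional = (int(g) for g in match.groups()[2:])
    return txid, answers, authority, additional

def answerless(lines):
    """Yield the transaction IDs of replies that carry no answer records."""
    for line in lines:
        parsed = parse_reply(line)
        if parsed is not None and parsed[1] == 0:
            yield parsed[0]
```

This only looks at counts, so an RCODE check would still be needed to separate these replies from ordinary error responses.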

Yay. :slightly_frowning_face:

(MIA, probably.)

Edit: Yeah, I was able to reproduce it with dig later. I did hit this issue. Yay? :slightly_smiling_face:

Passed this along to the team, will update if I get an answer, but really appreciate the detailed capture/info.


Thanks for the packet captures @_az. I’m trying to trace this in the query logs now. I suspect it might be a collision with a late-arriving answer from a delegation lookup.


@_az let me know if you still encounter this on your test domains


Thank you.

The last occurrences I observed were uniformly ~6 hours ago on ewr01, syd01 and iad02, which is a much larger interval than ever observed before.

I’ll keep the monitors running for a few days further, but hopefully this is fixed for good! I deeply appreciate your help.

Thanks for a detailed report, it’s super helpful! I suspect the issue was with the timing of the delegation lookup query (the resolver retransmits a query if the answer doesn’t arrive within roughly the usual RTT). When the answer to the first query arrived, the resolver sent one more query to the final authority with the same query name and type (NS, because query name minimization is enabled), and could then accept the late answer to the previous query if the open port number matched. I added one more step to give the upstream slightly more time to respond, and now re-randomize the message ID and query name letter case to make collisions much less likely.
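
The mitigation relies on a late reply no longer matching the outstanding query. Here is a toy Python sketch of that matching logic, not the resolver's actual code: `Outstanding`, `randomize_case`, and `accepts` are hypothetical names, and the model ignores everything except the source port, the message ID, and the case-sensitive query name.

```python
import secrets
from dataclasses import dataclass

def randomize_case(name: str) -> str:
    """Flip each letter to upper or lower case at random (the "0x20" technique)."""
    return "".join(
        ch.upper() if secrets.randbits(1) else ch.lower()
        for ch in name
    )

@dataclass(frozen=True)
class Outstanding:
    """What the resolver remembers about a query it is still waiting on."""
    port: int    # local UDP port the query was sent from
    txid: int    # 16-bit message ID
    qname: str   # query name with the exact letter case that was sent

def accepts(query: Outstanding, reply_port: int, reply_txid: int, reply_qname: str) -> bool:
    """A reply is accepted only if port, message ID, and exact-case name all match."""
    return (query.port, query.txid, query.qname) == (reply_port, reply_txid, reply_qname)
```

In this model, a follow-up query that reuses the port, message ID, and name of an earlier one would accept that earlier query's late reply. Giving the follow-up a fresh random ID and a freshly randomized name case means the late reply almost certainly fails the comparison.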


@mnordhoff that sounds vaguely like a viable theory of mechanism for the qname minimization + capsforid fallback problems Let’s Encrypt was having. Collision between the A and CAA/whatever responses.

@mvavrusa I’m just wondering, was this CVE-2018-10920?

It’s not. I don’t think this is directly exploitable (at least not any more than the inherent risk for every unsigned record), as the second answer is still in the same bailiwick. It could cause the first uncached response to be empty under certain conditions, because in DNS referrals and negative answers are ambiguous if the zone cut doesn’t change. It wouldn’t poison cached records though.

Alright. Thanks for replying! :smile:

I don’t know the code, and I was partly wondering if this might have been a visible facet of a worse bug.

Edit: Ugh, I forgot about the “certain criteria which we decided not to disclose at the moment” part of the CVE description, and now I’m embarrassed I even asked. :sweat: