Unbound forwarding to 1.1.1.1#853 stopped working

I have had a working configuration of Unbound forwarding DNS over TLS to 1.1.1.1 for the past few months but things started failing today. Not all domains/websites, but most of them started failing in resolving.

[Edit: This is from an AWS EC2 host. I have a similar setup at home on a raspberry pi, which is still working fine; no resolution issues]

An example below.

➜  ubuntu ~  dig m.schwab.com @127.0.0.1 -p5335                                                            

; <<>> DiG 9.16.1-Ubuntu <<>> m.schwab.com @127.0.0.1 -p5335
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 57228
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1472
;; QUESTION SECTION:
;m.schwab.com.                  IN      A

;; Query time: 420 msec
;; SERVER: 127.0.0.1#5335(127.0.0.1)
;; WHEN: Mon Jul 06 22:39:29 UTC 2020
;; MSG SIZE  rcvd: 41

Unbound logs

unbound[9737]: [1594074834] unbound[9737:0] info: reply from <.> 1.1.1.1#853
unbound[9737]: [1594074834] unbound[9737:0] info: query response was CNAME
unbound[9737]: [1594074834] unbound[9737:0] info: resolving lms-auth.schwab.com. A IN
unbound[9737]: [1594074834] unbound[9737:0] info: response for lms-auth.schwab.com. AAAA IN
unbound[9737]: [1594074834] unbound[9737:0] info: reply from <.> 1.1.1.1#853
unbound[9737]: [1594074834] unbound[9737:0] info: query response was CNAME
unbound[9737]: [1594074834] unbound[9737:0] info: resolving lms-auth.schwab.com. AAAA IN
unbound[9737]: [1594074834] unbound[9737:0] info: response for lms-auth.schwab.com. A IN
unbound[9737]: [1594074834] unbound[9737:0] info: reply from <.> 1.1.1.1#853
unbound[9737]: [1594074834] unbound[9737:0] info: query response was THROWAWAY
unbound[9737]: [1594074834] unbound[9737:0] info: response for lms-auth.schwab.com. A IN
unbound[9737]: [1594074834] unbound[9737:0] info: reply from <.> 1.0.0.1#853
unbound[9737]: [1594074834] unbound[9737:0] info: query response was THROWAWAY
unbound[9737]: [1594074834] unbound[9737:0] info: response for lms-auth.schwab.com. A IN
unbound[9737]: [1594074834] unbound[9737:0] info: reply from <.> 1.1.1.1#853
unbound[9737]: [1594074834] unbound[9737:0] info: query response was THROWAWAY
unbound[9737]: [1594074834] unbound[9737:0] info: response for lms-auth.schwab.com. A IN
unbound[9737]: [1594074834] unbound[9737:0] info: reply from <.> 1.1.1.1#853
unbound[9737]: [1594074834] unbound[9737:0] info: query response was THROWAWAY
unbound[9737]: [1594074834] unbound[9737:0] info: response for lms-auth.schwab.com. A IN
unbound[9737]: [1594074834] unbound[9737:0] info: reply from <.> 1.1.1.1#853
unbound[9737]: [1594074834] unbound[9737:0] info: query response was THROWAWAY
unbound[9737]: [1594074834] unbound[9737:0] info: response for lms-auth.schwab.com. A IN
unbound[9737]: [1594074834] unbound[9737:0] info: reply from <.> 1.0.0.1#853
unbound[9737]: [1594074834] unbound[9737:0] info: query response was THROWAWAY
unbound[9737]: [1594074834] unbound[9737:0] info: response for lms-auth.schwab.com. AAAA IN
unbound[9737]: [1594074834] unbound[9737:0] info: reply from <.> 1.1.1.1#853
unbound[9737]: [1594074834] unbound[9737:0] info: query response was THROWAWAY
unbound[9737]: [1594074834] unbound[9737:0] info: response for lms-auth.schwab.com. AAAA IN
unbound[9737]: [1594074834] unbound[9737:0] info: reply from <.> 1.0.0.1#853
unbound[9737]: [1594074834] unbound[9737:0] info: query response was THROWAWAY
unbound[9737]: [1594074834] unbound[9737:0] info: response for lms-auth.schwab.com. AAAA IN
unbound[9737]: [1594074834] unbound[9737:0] info: reply from <.> 1.1.1.1#853
unbound[9737]: [1594074834] unbound[9737:0] info: query response was THROWAWAY
unbound[9737]: [1594074834] unbound[9737:0] info: response for lms-auth.schwab.com. AAAA IN
unbound[9737]: [1594074834] unbound[9737:0] info: reply from <.> 1.0.0.1#853
unbound[9737]: [1594074834] unbound[9737:0] info: query response was THROWAWAY
unbound[9737]: [1594074834] unbound[9737:0] info: response for lms-auth.schwab.com. AAAA IN
unbound[9737]: [1594074834] unbound[9737:0] info: reply from <.> 1.0.0.1#853
unbound[9737]: [1594074834] unbound[9737:0] info: query response was THROWAWAY
unbound[9737]: [1594074834] unbound[9737:0] info: response for lms-auth.schwab.com. AAAA IN
unbound[9737]: [1594074834] unbound[9737:0] info: reply from <.> 1.1.1.1#853
unbound[9737]: [1594074834] unbound[9737:0] info: query response was THROWAWAY
unbound[9737]: [1594074834] unbound[9737:0] info: response for lms-auth.schwab.com. AAAA IN
unbound[9737]: [1594074834] unbound[9737:0] info: reply from <.> 1.0.0.1#853
unbound[9737]: [1594074834] unbound[9737:0] info: query response was THROWAWAY

Relevant unbound config

forward-zone:
    name: "."
    forward-addr: [email protected]#cloudflare-dns.com
    forward-addr: [email protected]#cloudflare-dns.com
    forward-ssl-upstream: yes

I can reach both 1.1.1.1 & 1.0.0.1 consistently (ping/traceroute), and I can telnet both the address at port 853.

Hi,

Can you increase the log verbosity, see what you can find there? I’m not able to reproduce it locally.

Better to follow the readme-first, to grab more information, especially the data center unbound connects to, by “dig CHAOS TXT id.server @1.1.1.1”.

Thank you for trying to reproduce this. I switched back to clouldflare (had pointed it to a different resolver) and can’t reproduce it at my end either anymore. FWIW, the issue persisted for more than 3 hours across 2020-07-06/2020-07-07 UTC. I might still have more verbose logs from unbound during that time, if you are still interested.

Also,

➜  ubuntu ~  dig CHAOS TXT id.server @1.1.1.1                           

; <<>> DiG 9.16.1-Ubuntu <<>> CHAOS TXT id.server @1.1.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12277
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;id.server.                     CH      TXT

;; ANSWER SECTION:
id.server.              0       CH      TXT     "IAD"

;; Query time: 0 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Thu Jul 09 00:00:05 UTC 2020
;; MSG SIZE  rcvd: 52

Detailed unbound log would be helpful to know what error it saw from 1.1.1.1. Unfortunately, we don’t know the root cause at the moment. Maybe you could try to collect some logs next time when it happen again(I wish not).

Happening again. Few more logs this time around

➜  ubuntu ~  dig lms-auth.schwab.com   @127.0.0.1 -p 5335                              

; <<>> DiG 9.16.1-Ubuntu <<>> lms-auth.schwab.com @127.0.0.1 -p 5335
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 23649
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1472
;; QUESTION SECTION:
;lms-auth.schwab.com.           IN      A

;; Query time: 343 msec
;; SERVER: 127.0.0.1#5335(127.0.0.1)
;; WHEN: Tue Aug 18 20:49:15 UTC 2020
;; MSG SIZE  rcvd: 48
➜  ubuntu ~  dig CHAOS TXT id.server @1.1.1.1                           

; <<>> DiG 9.16.1-Ubuntu <<>> CHAOS TXT id.server @1.1.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61514
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;id.server.                     CH      TXT

;; ANSWER SECTION:
id.server.              0       CH      TXT     "IAD"

;; Query time: 3 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Tue Aug 18 20:52:27 UTC 2020
;; MSG SIZE  rcvd: 52

unbound logs

unbound[114295]: [1597783755] unbound[114295:0] debug: SSL connection to cloudflare-dns.com authenticated ip4 1.0.0.1 port 853 (len 16)
unbound[114295]: [1597783755] unbound[114295:0] debug: comm point listen_for_rw 10 1
unbound[114295]: [1597783755] unbound[114295:0] debug: comm point stop listening 10
unbound[114295]: [1597783755] unbound[114295:0] debug: comm point start listening 10 (-1 msec)
unbound[114295]: [1597783755] unbound[114295:0] debug: Reading ssl tcp query of length 468
unbound[114295]: [1597783755] unbound[114295:0] debug: comm point stop listening 10
unbound[114295]: [1597783755] unbound[114295:0] debug: outnettcp cb
unbound[114295]: [1597783755] unbound[114295:0] debug: measured TCP-time at 8 msec
unbound[114295]: [1597783755] unbound[114295:0] debug: svcd callbacks start
unbound[114295]: [1597783755] unbound[114295:0] debug: worker svcd callback for qstate 0x55d337513520
unbound[114295]: [1597783755] unbound[114295:0] debug: mesh_run: start
unbound[114295]: [1597783755] unbound[114295:0] debug: iterator[module 2] operate: extstate:module_wait_reply event:module_event_reply
unbound[114295]: [1597783755] unbound[114295:0] info: iterator operate: query lms-auth.schwab.com. A IN
unbound[114295]: [1597783755] unbound[114295:0] info: iterator operate: chased to lms-auth.schwab.com.edgekey.net. A IN
unbound[114295]: [1597783755] unbound[114295:0] debug: process_response: new external response event
unbound[114295]: [1597783755] unbound[114295:0] info: scrub for . NS IN
unbound[114295]: [1597783755] unbound[114295:0] info: response for lms-auth.schwab.com. A IN
unbound[114295]: [1597783755] unbound[114295:0] info: reply from <.> 1.0.0.1#853
unbound[114295]: [1597783755] unbound[114295:0] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: SERVFAIL, id: 0
unbound[114295]: ;; flags: qr rd ra ; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
unbound[114295]: ;; QUESTION SECTION:
unbound[114295]: lms-auth.schwab.com.edgekey.net.#011IN#011A
unbound[114295]: ;; ANSWER SECTION:
unbound[114295]: ;; AUTHORITY SECTION:
unbound[114295]: ;; ADDITIONAL SECTION:
unbound[114295]: ;; MSG SIZE  rcvd: 49
unbound[114295]: [1597783755] unbound[114295:0] debug: iter_handle processing q with state QUERY RESPONSE STATE
unbound[114295]: [1597783755] unbound[114295:0] info: query response was THROWAWAY
unbound[114295]: [1597783755] unbound[114295:0] debug: iter_handle processing q with state QUERY TARGETS STATE
unbound[114295]: [1597783755] unbound[114295:0] info: processQueryTargets: lms-auth.schwab.com. A IN
unbound[114295]: [1597783755] unbound[114295:0] debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 10
unbound[114295]: [1597783755] unbound[114295:0] info: DelegationPoint<.>: 0 names (0 missing), 2 addrs (0 result, 0 avail) parentNS
unbound[114295]: [1597783755] unbound[114295:0] debug:   [cloudflare-dns.com] ip4 1.0.0.1 port 853 (len 16)
unbound[114295]: [1597783755] unbound[114295:0] debug:   [cloudflare-dns.com] ip4 1.1.1.1 port 853 (len 16)
unbound[114295]: [1597783755] unbound[114295:0] debug: No more query targets, attempting last resort
unbound[114295]: [1597783755] unbound[114295:0] debug: configured stub or forward servers failed -- returning SERVFAIL
unbound[114295]: [1597783755] unbound[114295:0] debug: store error response in message cache
unbound[114295]: [1597783755] unbound[114295:0] debug: return error response SERVFAIL
unbound[114295]: [1597783755] unbound[114295:0] debug: mesh_run: iterator module exit state is module_finished
unbound[114295]: [1597783755] unbound[114295:0] debug: validator[module 1] operate: extstate:module_wait_module event:module_event_moddone
unbound[114295]: [1597783755] unbound[114295:0] info: validator operate: query lms-auth.schwab.com. A IN
unbound[114295]: [1597783755] unbound[114295:0] debug: validator: nextmodule returned
unbound[114295]: [1597783755] unbound[114295:0] debug: cannot validate non-answer, rcode SERVFAIL
unbound[114295]: [1597783755] unbound[114295:0] debug: mesh_run: validator module exit state is module_finished
unbound[114295]: [1597783755] unbound[114295:0] debug: subnet[module 0] operate: extstate:module_wait_module event:module_event_moddone
unbound[114295]: [1597783755] unbound[114295:0] info: subnet operate: query lms-auth.schwab.com. A IN
unbound[114295]: [1597783755] unbound[114295:0] debug: mesh_run: subnet module exit state is module_finished
unbound[114295]: [1597783755] unbound[114295:0] debug: query took 0.344303 sec
unbound[114295]: [1597783755] unbound[114295:0] info: mesh_run: end 0 recursion states (0 with reply, 0 detached), 0 waiting replies, 28 recursion replies sent, 0 replies dropped, 0 states jostled out
unbound[114295]: [1597783755] unbound[114295:0] info: average recursion processing time 0.115852 sec
unbound[114295]: [1597783755] unbound[114295:0] info: histogram of recursion processing times
unbound[114295]: [1597783755] unbound[114295:0] info: [25%]=0.01536 median[50%]=0.0600747 [75%]=0.196608
unbound[114295]: [1597783755] unbound[114295:0] info: lower(secs) upper(secs) recursions
unbound[114295]: [1597783755] unbound[114295:0] info:    0.008192    0.016384 8
unbound[114295]: [1597783755] unbound[114295:0] info:    0.016384    0.032768 1
unbound[114295]: [1597783755] unbound[114295:0] info:    0.032768    0.065536 6
unbound[114295]: [1597783755] unbound[114295:0] info:    0.065536    0.131072 4
unbound[114295]: [1597783755] unbound[114295:0] info:    0.131072    0.262144 4
unbound[114295]: [1597783755] unbound[114295:0] info:    0.262144    0.524288 5
unbound[114295]: [1597783755] unbound[114295:0] debug: cache memory msg=80993 rrset=95878 infra=8402 val=75794 subnet=74504
unbound[114295]: [1597783755] unbound[114295:0] debug: svcd callbacks end
unbound[114295]: [1597783755] unbound[114295:0] debug: close fd 10

Hi @sometalk,

Thanks for providing the debug log. With my limited unbound knowledge, it seems unbound is not configured to forward all the requests to cloudflare-dns.com. I do see it shows there’s a SERVFAIL from upstream, but I don’t know what is the upstream. What’s your forward-zone setting like? I don’t see any error on this zone from our side though.

unbound[114295]: [1597783755] unbound[114295:0] info: iterator operate: query lms-auth.schwab.com. A IN
unbound[114295]: [1597783755] unbound[114295:0] info: iterator operate: chased to lms-auth.schwab.com.edgekey.net. A IN
...
unbound[114295]: [1597783755] unbound[114295:0] info: reply from <.> 1.0.0.1#853
unbound[114295]: [1597783755] unbound[114295:0] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: SERVFAIL, id: 0

Thank you for looking into this. I have pasted the forward-zone configuration being used.

forward-zone:
    name: "."
    forward-addr: [email protected]853#cloudflare-dns.com
    forward-addr: [email protected]#cloudflare-dns.com
    forward-ssl-upstream: yes

Couple of things to note. One, as before the behavior lasted for only a few hours. I could also reproduce the same behavior from another EC2 instance in the same region (IAD) during the same time window. However, I could not reproduce the problem from an instance in a different region (LHR) and my local Raspberry Pi setup was also working fine (again, during the same time window).

Your config looks good. Unbound will scrub CNAME chain:

CNAMEs are chased by unbound itself, asking the
remote server for every name in the indirection chain, to protect the
local cache from illegal indirect referenced items.

I’ll try to figure out why the lms-auth.schwab.com.edgekey.net query failed. If it bother you, maybe you could add forward-first: yes under the forward-zone, let unbound resolve it by itself when upstream failed.