Intermittent SERVFAILs: "failed to verify signatures for [DOMAIN]. opt-out proof"

Hello,

Sometime within the past week I started seeing some weird DNS issues. I tracked it down to an issue with 1.1.1.1 (and 1.0.0.1).

https://one.one.one.one/help/#eyJpc0NmIjoiTm8iLCJpc0RvdCI6Ik5vIiwiaXNEb2giOiJObyIsInJlc29sdmVySXAtMS4xLjEuMSI6IlllcyIsInJlc29sdmVySXAtMS4wLjAuMSI6IlllcyIsInJlc29sdmVySXAtMjYwNjo0NzAwOjQ3MDA6OjExMTEiOiJZZXMiLCJyZXNvbHZlcklwLTI2MDY6NDcwMDo0NzAwOjoxMDAxIjoiWWVzIiwiZGF0YWNlbnRlckxvY2F0aW9uIjoiTVNQIiwiaXNXYXJwIjoiTm8iLCJpc3BOYW1lIjoiR29vZ2xlIiwiaXNwQXNuIjoiMTUxNjkifQ==

Here’s how to reproduce this reliably (using the MSP data center anyway):

for i in `seq 1 1000`; do dig @1.1.1.1 ab$i.newgrounds.com; sleep .25; done

For roughly 1–2 out of every 100 queries, this is returned:

; <<>> DiG 9.10.6 <<>> @1.1.1.1 ab45.newgrounds.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 50317
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; OPT=15: 00 0a 66 61 69 6c 65 64 20 74 6f 20 76 65 72 69 66 79 20 73 69 67 6e 61 74 75 72 65 73 20 66 6f 72 20 6e 65 77 67 72 6f 75 6e 64 73 2e 63 6f 6d 2e 20 6f 70 74 2d 6f 75 74 20 70 72 6f 6f 66 ("..failed to verify signatures for newgrounds.com. opt-out proof")
;; QUESTION SECTION:
;ab45.newgrounds.com.		IN	A

;; Query time: 128 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Mon Jul 24 13:17:45 CDT 2023
;; MSG SIZE  rcvd: 115
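For anyone decoding that raw `OPT=15` line: dig builds that predate EDE support print the option as hex instead of a friendly `EDE:` line. The payload is an Extended DNS Error (RFC 8914): a 16-bit info-code followed by optional UTF-8 text. A minimal sketch to decode it by hand (the function name is mine, not from any library):

```python
def decode_ede(hex_bytes: str) -> tuple[int, str]:
    """Decode an EDNS OPT=15 (Extended DNS Error, RFC 8914) payload:
    2-byte big-endian info-code, then optional UTF-8 extra text."""
    data = bytes.fromhex(hex_bytes.replace(" ", ""))
    info_code = int.from_bytes(data[:2], "big")
    extra_text = data[2:].decode("utf-8")
    return info_code, extra_text

# The hex payload from the dig output above
code, text = decode_ede(
    "00 0a 66 61 69 6c 65 64 20 74 6f 20 76 65 72 69 66 79 20 73 69 67 6e"
    " 61 74 75 72 65 73 20 66 6f 72 20 6e 65 77 67 72 6f 75 6e 64 73 2e 63"
    " 6f 6d 2e 20 6f 70 74 2d 6f 75 74 20 70 72 6f 6f 66"
)
print(code, text)  # info-code 10 is "RRSIGs Missing" in the RFC 8914 registry
```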

This caused intermittent DNS resolution errors on my end. It happened with several sites, but I manage newgrounds.com and related domains, so I noticed it there first. Switching my upstream to Google DNS resolved this.

This only appears to be happening with the MSP data center. I tested YYZ and could not replicate it there.


I’ve seen the exact same error in the “BNE” area.

❯ dig @1.1.1.1 links.ebw.ebgames.com.au

; <<>> DiG 9.10.6 <<>> @1.1.1.1 links.ebw.ebgames.com.au
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 46335
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; OPT=15: 00 0a 66 61 69 6c 65 64 20 74 6f 20 76 65 72 69 66 79 20 73 69 67 6e 61 74 75 72 65 73 20 66 6f 72 20 6d 6b 74 34 31 2e 6e 65 74 2e 20 6f 70 74 2d 6f 75 74 20 70 72 6f 6f 66 ("..failed to verify signatures for mkt41.net. opt-out proof")
;; QUESTION SECTION:
;links.ebw.ebgames.com.au.	IN	A

;; Query time: 62 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Tue Jul 25 10:14:55 AEST 2023
;; MSG SIZE  rcvd: 115

Then, a few moments later:

❯ dig @1.1.1.1 links.ebw.ebgames.com.au

; <<>> DiG 9.10.6 <<>> @1.1.1.1 links.ebw.ebgames.com.au
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40034
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;links.ebw.ebgames.com.au.	IN	A

;; ANSWER SECTION:
links.ebw.ebgames.com.au. 300	IN	CNAME	recp.mkt41.net.
recp.mkt41.net.		300	IN	A	52.206.57.53

;; Query time: 365 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Tue Jul 25 10:14:58 AEST 2023
;; MSG SIZE  rcvd: 97
❯ dig +short CHAOS TXT id.server @1.1.1.1
"BNE"

We are experiencing similar issues. About two weeks ago, a vendor started reporting very intermittent “connectivity” issues with one of our services.

Over the past few days, our internal alerts have intermittently flagged errors with other services as well. We have traced it all to this specific issue.

Note these two responses, issued within a second of each other.

The SERVFAILs are random: a query might work 1,000 times and then return a SERVFAIL.

Working

$ dig track.lucit.app

; <<>> DiG 9.11.20-RedHat-9.11.20-5.el8 <<>> track.lucit.app
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 43961
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;track.lucit.app.               IN      A

;; ANSWER SECTION:
track.lucit.app.        213     IN      A       159.89.253.42

;; Query time: 15 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Mon Jul 24 23:01:40 UTC 2023
;; MSG SIZE  rcvd: 60

Not a failure, but note the extra information:

$ dig track.lucit.app

; <<>> DiG 9.11.20-RedHat-9.11.20-5.el8 <<>> track.lucit.app
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 396
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; EDE: 3 (Stale Answer)
; EDE: 10 (RRSIGs Missing): (failed to verify signatures for lucit.app. opt-out proof)
;; QUESTION SECTION:
;track.lucit.app.               IN      A

;; ANSWER SECTION:
track.lucit.app.        0       IN      A       159.89.253.42

;; Query time: 71 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Mon Jul 24 23:01:40 UTC 2023
;; MSG SIZE  rcvd: 128

Failure


; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.3 <<>> @1.1.1.1 track.lucit.app
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 34240
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; OPT=15: 00 0a 66 61 69 6c 65 64 20 74 6f 20 76 65 72 69 66 79 20 73 69 67 6e 61 74 75 72 65 73 20 66 6f 72 20 6c 75 63 69 74 2e 61 70 70 2e 20 6f 70 74 2d 6f 75 74 20 70 72 6f 6f 66 ("..failed to verify signatures for lucit.app. opt-out proof")
;; QUESTION SECTION:
;track.lucit.app.               IN      A

;; Query time: 10 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Mon Jul 24 22:33:41 UTC 2023
;; MSG SIZE  rcvd: 106

I found that the error returned once the TTL expired: if the first query failed, a second query a second later would work. But with a TTL of 300, waiting 5 minutes and querying again produced the same fail-then-work pattern.

Intermittently I would get the error alongside a successful response; I’m not sure what caused that.

I noticed it as well for some of our hostnames between sites. I had also been having a problem with email on my mobile: the first attempt to open a page would fail, and clicking reload would make it work. I tracked that down to the same issue.

It’s been happening since at least the 21st of July, starting around 12-4am local time.

I would also add the following:

  • We saw the issue from “EWR” and “FSD”
  • We do NOT use DNSSEC
  • It is definitely intermittent, but over the course of a few hours I can make it happen with:
for i in `seq 1 1000`; do dig @1.1.1.1 track.lucit.app | grep failed; sleep .25; done
  • Our registrar is GoDaddy (if that matters)
  • Our nameservers are at Digital Ocean (if that matters)
  • Has happened for domains lucit.app and lucit.cc (any of their various subdomains can trigger it)
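One caveat with grepping for `failed`, as in the loop above: it misses the case where the EDE note rides along with a successful (stale) answer, as in the “Not a failure” transcript earlier. A small sketch that classifies a captured dig transcript instead — the function and field names are mine, not from any library:

```python
import re

def classify_dig_output(transcript: str) -> dict:
    """Classify one dig transcript: the response code, plus any
    'failed to verify signatures' EDE text, which can accompany
    both SERVFAIL and (stale) NOERROR answers."""
    status = re.search(r"status: ([A-Z]+)", transcript)
    ede = re.search(r"failed to verify signatures for (\S+) opt-out proof",
                    transcript)
    return {
        "status": status.group(1) if status else None,
        "bad_zone": ede.group(1) if ede else None,  # e.g. "lucit.app."
    }

sample = '''
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 34240
; OPT=15: ... ("..failed to verify signatures for lucit.app. opt-out proof")
'''
print(classify_dig_output(sample))
```

Feeding each loop iteration’s output through something like this would catch both failure modes.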

Have you noticed it when the TTL expires?
Does it seem to happen about every hour?

track.lucit.app worked for me, and it seemed to be a fresh query (it returned a TTL of 3600); subsequent requests returned a decreasing TTL, so I’ll need to wait an hour to re-test.

It could be. One thing I notice is that if I run the loop I posted above, it might output, say, 15 fails in a row, and then no fails at all.

I will try to be more mindful of TTL times when running my tests.

Linking to related report in another thread: CNAME DNSSEC RRSIGs Missing

TTLs appear to have little effect in my testing; I have observed this on .co.uk and .net domains.

They’re looking into this now; the reports were helpful.


Can everyone confirm that the issue is resolved?

It seems to be resolved for our domains as of today.

All my domains are resolving again as of today.


Thanks all for reporting the issue. We had a bad software release that caused this error on specific domains that use NSEC3 with opt-out.

The problem should be gone after we applied a fix (see the status page posted by @Chaika for reference). Sorry for the trouble it caused.
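For anyone curious what “NSEC3 with opt-out” means here: opt-out is the least-significant bit of the flags field in a zone’s NSEC3 records (RFC 5155), which show up in the authority section of a `dig +dnssec` response for a nonexistent name. A sketch of reading that flag from a record in presentation format (the function name and example record are illustrative):

```python
def nsec3_uses_opt_out(rdata: str) -> bool:
    """NSEC3 RDATA in presentation format begins with:
    hash-algorithm flags iterations salt ...
    Opt-out is bit 0 of the flags field (RFC 5155)."""
    fields = rdata.split()
    flags = int(fields[1])
    return bool(flags & 0x01)

# Illustrative record: hash alg 1 (SHA-1), flags 1 (opt-out set),
# 0 extra iterations, no salt
print(nsec3_uses_opt_out("1 1 0 - CK0POJMG874LJREF7EFN8430QVIT8BSM NS SOA RRSIG"))
```

Zones whose parents answer with flags 0 in those records were not affected by this class of failure.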


Appears to be fixed. Thanks.

Thanks. Do you know when this release started to have the issue? I just want to tidy up the dates against some other issues we were having, to see if they match.

Thanks!

Eric

Hi Eric,

The release process was done in several stages. The rollout of the bad version to the first small set of servers started at around 2023-07-19 18:32 UTC.

We let it sit in a subset of our data centers over the weekend and started releasing it to the rest of the world at around 2023-07-24 18:39 UTC.

Hope that’s helpful to you :pray:


This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.