Intermittent 524 origin timeout on 1% of requests

I'm receiving intermittent 524 (origin timeout) errors on roughly 1% of requests, even though my application logs show that it sent a response well within the 100-second timeout.

Recent example:

Client call (curl):
curl -I https://****/someurl

HTTP/2 524
date: Wed, 27 May 2020 23:25:12 GMT
content-type: text/html
set-cookie: ****; HttpOnly; SameSite=Lax; Secure
cache-control: no-store, no-cache
cf-cache-status: MISS
cf-request-id: 02fa0bceae0000002a00002200000001
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
cf-ray: 59a37bf77d02002a-LHR
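
Next time I catch one of these live, I'll also grab curl's write-out timings so I can see exactly how long Cloudflare held the request before answering with the 524. Something like this (same redacted URL as above; the flags are just a sketch):

# Discard the body, dump response headers, and print client-side timings
curl -sS -o /dev/null -D - \
  -w 'code=%{http_code} connect=%{time_connect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
  https://****/someurl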

My server log shows a 200 response after 43 seconds:

May 27 23:24:16 nginx default[api-dosfo-5bb79d4d48-22b8j] 10.244.8.59 - - [27/May/2020:23:24:16 +0000] "GET /someurl HTTP/1.1" 200 30499 "-" "curl/7.64.1" "2a01:4b00:864d:7a00:2cc4:5fe7:f6ac:9313" "59a37bf77d02002a-LHR" "GB" 44.031 43.796 : 0.228
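
For reference, the trailing "44.031 43.796 : 0.228" are the timing fields from our access log. The log_format is roughly the sketch below (the variable names are my best guess at how those numbers map, so treat it as an approximation rather than the exact config). If I'm reading the nginx docs right, $upstream_response_time prints several values separated by commas or colons when a request passes through more than one upstream, which would explain the "43.796 : 0.228" pair; a $request_time of ~44s is still well under the 100-second limit.

# Rough sketch of the access log format (declared in the http{} block); the last
# two variables are what produce "44.031 43.796 : 0.228" in the line above.
log_format timed '$remote_addr - $remote_user [$time_local] "$request" '
                 '$status $body_bytes_sent "$http_referer" "$http_user_agent" '
                 '"$http_x_forwarded_for" "$http_cf_ray" "$http_cf_ipcountry" '
                 '$request_time $upstream_response_time';
access_log /var/log/nginx/access.log timed;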

Other ray IDs with the same issue (the client gets a 524, but the server logs show the request was processed before the timeout):

59a37bf77d02002a-LHR
59a351f82b3a002a-LHR
59a344ff4ade002a-LHR
59a33e581e20002a-LHR
599a03195ad7e0a2-IAD

What could be the issue here?

We're running DigitalOcean Kubernetes behind Cloudflare, and we only see these errors when the Cloudflare proxy is enabled.
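
One test I can run to confirm that is to take the proxy out of the path while keeping the same hostname, SNI, and Host header, by pinning DNS resolution to the DigitalOcean load balancer with curl's --resolve (the hostname and IP below are placeholders, not our real values):

# Hit the DigitalOcean load balancer directly, bypassing Cloudflare, while keeping
# the original Host header and SNI. example.com and 203.0.113.10 are placeholders.
curl -sv -o /dev/null \
  --resolve example.com:443:203.0.113.10 \
  https://example.com/someurl

(I may need to add -k if the origin is using a Cloudflare Origin CA certificate that my local curl doesn't trust.)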

It's very difficult to reproduce the issue on demand at the moment, as I don't know what's causing it: there doesn't seem to be any pattern to the errors, and only around 1% of requests are affected.

The request flow is Cloudflare => DigitalOcean load balancer => Kubernetes HAProxy ingress. The log above is from a Kubernetes pod at the "edge" of our cluster. The cluster is running Cilium and CoreDNS.

If it is a networking issue such as packet loss between Cloudflare and the origin, how can I diagnose it further?
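
My rough plan, in case it helps anyone suggest something better: run mtr over TCP/443 from a droplet in the same region as the cluster towards the edge IP the proxied hostname resolves to, and leave a packet capture running on the ingress node so I can look for retransmissions or resets around the time of a failing ray ID. The hostname, interface name, and output path below are placeholders:

# Resolve the proxied hostname to one of Cloudflare's edge IPs (hostname is a placeholder)
EDGE_IP=$(dig +short example.com A | head -n 1)

# Probe the path to the edge over TCP/443: 200 cycles, report mode
mtr --report --report-wide --report-cycles 200 --tcp --port 443 "$EDGE_IP"

# Capture TCP/443 traffic on the ingress node for later inspection of retransmits/RSTs
sudo tcpdump -i eth0 -s 128 -w /tmp/ingress-443.pcap 'tcp port 443'

I realise mtr from a droplet only approximates the path Cloudflare itself takes back to the origin, so a more direct way to test that leg would be very welcome.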

Thanks for any help!