Issue
We are using a Google Cloud Platform VM instance with an SSD (us-central1-f) and the Cloudflare free plan.
Since April 15, 2020, we have been getting a lot of Error 524 responses.
First, we noticed that social media platforms could not fetch the Open Graph (OG) metadata of shared content.
Later, Google Search Console reported many 5xx errors, along with a list of de-indexed pages.
We analyzed the logs. The error times looked random, but we noticed outage spikes around the times content was shared on social media.
Therefore, we tested website performance right after sharing a post, and received Error 524. We also tested the website using sitemeer.com, see screenshots below.
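The link between social shares and outages can be checked against the access log. The following is a minimal Python sketch, not the script we actually used: it assumes the log is in nginx's "combined" format and counts requests per minute from link-preview crawlers. The list of user-agent markers and the function name are assumptions chosen for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# nginx "combined" format:
# ip - user [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Assumed, non-exhaustive list of link-preview crawler markers.
CRAWLER_MARKERS = ("facebookexternalhit", "twitterbot", "linkedinbot")


def crawler_hits_per_minute(lines):
    """Count requests per minute whose user agent matches a crawler marker."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        agent = match["agent"].lower()
        if not any(marker in agent for marker in CRAWLER_MARKERS):
            continue
        stamp = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
        counts[stamp.strftime("%Y-%m-%d %H:%M")] += 1
    return counts
```

Passing an open log file to crawler_hits_per_minute() gives per-minute totals that can be compared with the times posts were shared and with the times the 524 errors were reported.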
Nginx
According to Cloudflare, Error 524 indicates that Cloudflare made a successful TCP connection to the origin web server, but the origin did not reply with an HTTP response before the connection timed out.
We checked the website logs and saw that the requests from Cloudflare did arrive and were processed, but Cloudflare did not seem to accept the response.
At first we thought the website did not respond quickly enough, so we optimized the website's local cache to serve cached content after the first request.
However, it did not help.
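To rule out a slow origin, one can time how long the origin takes to return response headers and compare that with the window Cloudflare allows before it raises Error 524 (100 seconds by default, according to Cloudflare's documentation). Below is a minimal Python sketch under those assumptions. The function names are hypothetical, and the host passed in should be the origin itself, bypassing the Cloudflare proxy.

```python
import http.client
import time

# Cloudflare's documented default wait before returning Error 524.
CLOUDFLARE_TIMEOUT_S = 100.0


def time_to_first_byte(host, path="/", port=443, timeout=120.0):
    """Return (seconds until response headers arrived, HTTP status).

    The measured time includes the TCP and TLS handshakes, because
    http.client connects lazily inside request().
    """
    conn = http.client.HTTPSConnection(host, port, timeout=timeout)
    start = time.monotonic()
    try:
        conn.request("GET", path, headers={"User-Agent": "ttfb-check"})
        response = conn.getresponse()  # returns once status and headers are read
        return time.monotonic() - start, response.status
    finally:
        conn.close()


def would_trigger_524(elapsed_s, limit_s=CLOUDFLARE_TIMEOUT_S):
    """True if a response this slow would exceed the proxy's wait limit."""
    return elapsed_s >= limit_s
```

For example, calling time_to_first_byte("origin.example.com") (a placeholder hostname) and passing the elapsed time to would_trigger_524() shows whether the origin alone is slow enough to explain the error. In our case the origin answered quickly, which is why we kept looking elsewhere.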
Nginx Proxy Cache
Our second thought was that nginx, for some other reason, could not send the response. So we additionally installed an nginx proxy in front of the web server, with caching enabled and the ‘proxy_cache_background_update’ option turned on. This option allows starting a background subrequest to update an expired cache item while a stale cached response is returned to the client. That improved the website's response time even more.
Unfortunately, new tests showed that the web server worked fine, and the timeout now occurred at the nginx proxy server.
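The background-update behaviour described above (serve the stale copy immediately, refresh it out of band) can be illustrated with a small Python model. This is only a sketch of the idea, not nginx's implementation. The class and method names are hypothetical, and `fetch` stands in for the request to the upstream web server.

```python
import threading
import time


class StaleWhileRevalidateCache:
    """Serve cached values at once; refresh expired ones in the background."""

    def __init__(self, fetch, ttl_s, clock=time.monotonic):
        self._fetch = fetch          # callable(key) -> value, the "upstream"
        self._ttl_s = ttl_s
        self._clock = clock
        self._lock = threading.Lock()
        self._entries = {}           # key -> (value, stored_at)
        self._refreshing = {}        # key -> running refresh thread

    def get(self, key):
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None:
                value, stored_at = entry
                expired = self._clock() - stored_at >= self._ttl_s
                if expired and key not in self._refreshing:
                    worker = threading.Thread(
                        target=self._refresh, args=(key,), daemon=True
                    )
                    self._refreshing[key] = worker
                    worker.start()
                return value         # fresh or stale, the client never waits
        # Cache miss: there is nothing stale to serve, so fetch synchronously.
        value = self._fetch(key)
        with self._lock:
            self._entries[key] = (value, self._clock())
        return value

    def _refresh(self, key):
        try:
            value = self._fetch(key)
            with self._lock:
                self._entries[key] = (value, self._clock())
        finally:
            with self._lock:
                self._refreshing.pop(key, None)

    def wait_for_refresh(self, key):
        """Block until any running background refresh for key has finished."""
        with self._lock:
            worker = self._refreshing.get(key)
        if worker is not None:
            worker.join()
```

With this pattern only the very first request for a page pays the upstream latency. Every later request is answered from the cache, which is why a slow web server should no longer have been able to cause timeouts at this layer.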
Traefik
To make sure it was not an Nginx problem, we decided to install a non-nginx proxy in front of the current setup.
We chose Traefik. It is written in Go (Nginx is written in C), is fast, and has built-in support for Let’s Encrypt wildcard certificates.
Alas, new tests showed that the web server and the nginx proxy worked fine, and the timeout now occurred at Traefik.
Conclusions
Finally, on Friday, May 8, 2020, we disabled the Cloudflare proxy. Soon after, Traefik was able to obtain a Let’s Encrypt certificate via tlsChallenge.
Since then, the website has worked flawlessly, with no more connection or response issues.
Therefore, we suspect this might be a Cloudflare bug that prevents the web server from delivering its response.
P.S.
In addition, we noticed a significant drop in non-human traffic, though this may be just a coincidence.
Website: https://watchward.com/