Issue
Some of our website visitors occasionally enter a state where most of their requests to our API (which is protected by Cloudflare) see around 3 seconds of latency on every request. When loading a webpage in our SPA, which takes 15 API calls, every single call takes > 3 seconds. Simultaneously, another user loading the same page from the same office incurs no additional latency.
Diagnostics
We have run a variety of diagnostic tools to pinpoint the issue as being in the Cloudflare layer. Some notes:
Long running ping tests by these users do not show any additional latency (100-300ms response).
Traceroutes by these users show no additional latency (< 15ms per node)
Random HTTP requests they make using Curl occasionally incur this latency. Running back-to-back curl requests, sometimes 1 of 5 requests will see a response time spike to > 3 seconds. See the below Curl log and note the the timestamp difference of 3.5 seconds between when the request is complete, and the response starts to be received:
21:27:23.255094 == Info: We are completely uploaded and fine
21:27:23.262965 == Info: Connection state changed (MAX_CONCURRENT_STREAMS == 256)!
21:27:26.999045 <= Recv header, 13 bytes (0xd)
0000: HTTP/2 401
21:27:26.999391 <= Recv header, 37 bytes (0x25)
0000: date: Fri, 10 Feb 2023 05:27:27 GMT
Our service is hosted on Heroku. Reviewing Heroku logs, indicates that the Heroku router receives the request ~3 seconds after it’s made by the client, and from Heroku’s perspective the total response time is < 300ms. This indicates that the latency is being incurred upstream. Similarly, Datadog traces that stitch together website and server traces indicate the latency as being entirely incurred on the path into the API server.
Any help diagnosing would be greatly appreciated. Thank you!
Using Cloudflare to protect your dynamic (non-cache) request is definitely going to add latency. However, I don’t think Cloudflare will add up to 3 seconds in latency, though it depends on the server vs Cloudflare edge locations.
If you have a test URL, I can try to run some tests.
I completely agree that putting our API behind Cloudflare is going to add latency. That latency has, however, been reasonable until January when this started happening.
Interestingly, the post that spbb quotes summarizes exactly what we’ve experienced. This problem started in mid January, but it affected far more than 0.05% of our requests. Every request to render a page was sometimes affected. We don’t have Argo smart routing enabled, though, as this particular service sits behind the free tier of Cloudflare (though we plan to upgrade to Business shortly).
Additionally, after almost a month of facing the issue, mysteriously, nobody here seems to be able to reproduce it as of this morning. I hesitate to close this post in case the issue persists, but I will if it seems to be resolved.