We’re running this code in a Cloudflare Worker. It retrieves the page from our origin server and then makes some small modifications to contentResponse.
const date1 = new Date();
console.log("fetching", url, "on", date1);
const contentResponse = await fetch(request);
const date2 = new Date();
console.log("got fetch - took", date2 - date1);
The problem is that around 20% of requests take over 60 seconds to respond, despite being cache HITs:
fetching / on Mon Mar 18 2019 04:00:15 GMT+0000 (Coordinated Universal Time)
VM31 worker.js:133 got fetch - took 60035
Note that the data is returned successfully on this request; the page content loads fine, it just takes a long time.
Here’s a normal fetch that ran without the delay:
fetching / on Mon Mar 18 2019 03:59:18 GMT+0000 (Coordinated Universal Time)
VM31 worker.js:133 got fetch - took 39
So, interestingly, the fetch has a 60s delay. It looks like the fetch is a cache hit (I’ve confirmed this by checking my server logs; the fetch never hits the server) that took around 35ms, but the promise then didn’t resolve for an extra 60,000ms.
Here’s a dump of the details from the Cloudflare Workers debugger. I’ve put the one-minute request at the top, followed by the response headers and request headers.
I wonder whether that could be a network timeout. You say it does not hit the origin, but could it be that it tries to but cannot connect (for whatever reason) and runs into a timeout?
I would expect a timeout to raise an exception, which would then never reach your second log statement. But I’m not familiar enough with Workers to make a definite statement, so this is mostly speculation, I’m afraid.
That was my thought too, but the request is actually succeeding: the content is being returned. I also adjusted my origin’s timeout down to 30s, and the request still takes 60s to complete. Finally, the logs in the attached image show that the CF-Cache-Status response header is “HIT”, which supports my belief that the server is not being accessed.
Definitely a possibility, though as I say, I can see the server logs and they don’t show anything. It could be that the request never makes it to the server, but that would be unusual given the intermittent nature of the problem and the fact that I can reach the server with no issues when I bypass the CDN.
Another interesting aspect of the problem: if I make a request to the endpoint after clearing the cache, the request takes around 61,500ms to complete and is a cache MISS; with a valid cache, it takes around 60,050ms. That’s consistent, as my origin server does take around 1.5s to respond. It seems that 60s is simply being added to the round trip for some reason.
I’ve checked when the server request happens in the logs, and it’s at the start of the fetch. So the Worker makes the fetch, gets the result back, then waits 60s before continuing.
The Cloudflare team told me that the debugger runs on a totally different stack than production. The Cache API isn’t expected to work properly there, and all requests come from the ORD (Chicago) colo, as opposed to a colo closer to your origin.
It sounds like somewhere in the system, something is getting stuck, and then after 60s the whole thing times out.
My guess is that some component of the system is failing to finalize an HTTP request or response.
Is it possible that your origin server is generating a chunked response and is not actually sending the final zero-sized chunk? For example, if you use node.js, this could happen if you forgot to call response.end() to finish the response. It’s possible that in this case our cache actually caches the response with the final chunk missing, so the same problem occurs even when responding from cache.
This doesn’t totally make sense, though: if your code is exactly as you wrote above, then the await fetch(request) should only wait for headers, not for the response body.
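The headers-versus-body distinction can be demonstrated directly (a runnable illustration for Node 18+ or Workers, using a constructed Response rather than a real fetch): the status and headers are usable as soon as the promise resolves, while reading the body is a separate, independently-awaited step.

```javascript
// A Response's headers are available immediately; consuming the body
// is a separate await. This is why `await fetch()` alone should not
// block on a slow or unterminated body.
(async () => {
  const response = new Response("page content", {
    headers: { "CF-Cache-Status": "HIT" },
  });
  console.log(response.headers.get("CF-Cache-Status")); // HIT
  const body = await response.text(); // only this step waits on the body
  console.log(body); // page content
})();
```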
Could you possibly go to cloudflareworkers.com and write a script that reproduces this problem, then share it with us (copy/paste the URL after saving)? It would be a lot easier to debug then.
Note that staging-web.mamamia.com.au has a page rule set to “Cache Everything” with an edge TTL of 30 minutes. The issue doesn’t happen when accessing the server directly, only when visiting through the staging-web CDN-enabled access point.
I suspect the problem is in your code, but not where you think it is.
Here’s the thing: When you use the preview service, and it does a fetch() to your origin, that fetch() request actually goes into Cloudflare’s “front door”, treated like any request coming from the internet. And because you have a worker deployed in production, that worker runs. In other words, two copies of your worker end up running on the request – one in preview and one in production.
(This is one of the problems with our preview setup that we’d like to address, but fixing it is complicated.)
So, in preview, you’ve narrowed down the delay to the fetch() call. But that fetch() is itself actually calling the production version of your worker, which means the delay could actually be happening anywhere in that worker. We don’t actually know that the delay comes from the point where your production worker fetch()es to its own origin.
It looks like your production worker is quite complicated, doing many other fetches before and after the main fetch. I’d recommend modifying the code to return timing information in headers, so that you can narrow down where exactly the delay happens (while running in production).
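A sketch of that suggestion (the function and header names here are my own, not from the thread): time the header arrival and the body read separately, and surface both in a response header so the delay can be located from the client side while running in production.

```javascript
// Hypothetical sketch: wrap the origin fetch, timing the headers phase
// and the body phase independently, and report both in a header.
// `fetcher` is parameterized only so the sketch is testable.
async function handleWithTiming(request, fetcher = fetch) {
  const start = Date.now();
  const originResponse = await fetcher(request); // resolves once headers arrive
  const afterHeaders = Date.now();
  const body = await originResponse.text();      // resolves after the full body
  const afterBody = Date.now();

  // Responses from fetch() are immutable in Workers, so build a fresh one.
  const response = new Response(body, {
    status: originResponse.status,
    headers: originResponse.headers,
  });
  response.headers.set(
    "X-Timing",
    `headers=${afterHeaders - start}ms body=${afterBody - afterHeaders}ms`
  );
  return response;
}
```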
@KentonVarda Heads up that I’ve rewritten the Worker to modify the response in chunks rather than after a .text() call, which appears to have fixed the problem. It’s possible that the last chunk wasn’t completing, but I’m sceptical of that, as I can’t reproduce any issues with the chunks when accessing the server directly.
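For reference, the chunked approach might look something like this (an assumed shape, not the author’s actual code): pipe the origin body through a TransformStream instead of buffering it with .text(), so the Worker never blocks waiting for a body that fails to terminate.

```javascript
// Hedged sketch of streaming per-chunk modification.
// `fetcher` is parameterized only so the sketch is testable.
async function handleStreaming(request, fetcher = fetch) {
  const originResponse = await fetcher(request);
  const { readable, writable } = new TransformStream({
    transform(chunk, controller) {
      // Per-chunk modifications would go here; passed through unchanged.
      controller.enqueue(chunk);
    },
  });
  // Deliberately not awaited: the body streams to the client in the
  // background while the Response is returned immediately.
  originResponse.body.pipeTo(writable);
  return new Response(readable, {
    status: originResponse.status,
    headers: originResponse.headers,
  });
}
```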
Either way, the issue is resolved, though if you’re interested in trying to find what was wrong, let me know how I can help.