Fetch takes 60s in cloud worker

We’re running this code in a cloud worker. It retrieves the page from our server and then makes some small modifications to contentResponse.

const date1 = new Date();
console.log("fetching", url, " on ", date1);
contentResponse = await fetch(request);
const date2 = new Date();
console.log("got fetch - took", date2 - date1); // elapsed milliseconds

The problem is that around 20% of the requests end like this, taking over 60 seconds to respond despite being a cache HIT:

fetching /  on  Mon Mar 18 2019 04:00:15 GMT+0000 (Coordinated Universal Time)
VM31 worker.js:133 got fetch - took 60035

Note that the data is returned successfully on this request; the page content loads fine, it just takes a long time.

Here’s a normal fetch that ran without the delay:

fetching /  on  Mon Mar 18 2019 03:59:18 GMT+0000 (Coordinated Universal Time)
VM31 worker.js:133 got fetch - took 39

So interestingly, the fetch seems to have a 60s delay. It looks like the fetch is a cache hit (I’ve confirmed this in my server logs; the fetch never hits the server), where the fetch itself took ~35ms, but the promise then didn’t resolve for an extra 60,000ms.

Here’s a dump of the details from the cloudworker debugger - I’ve put the 1min request at the top, then the response headers and request headers

You can see there that it’s a CF-Cache-Status: Hit which somehow takes 60s. Very strange.

Appreciate any help or suggestions!

I wonder whether this could be a network timeout. You say it does not hit the origin, but could it be that it tries to, cannot connect (for whatever reason), and runs into a timeout?

I would expect a timeout to raise an exception, which should then never reach your second log statement, but I am not familiar enough with workers to make a definitive statement here, so this is mostly speculation, I am afraid.
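
For illustration, that is how a timeout would normally surface: as a rejected promise that skips the second log and jumps to the catch block. This is only a sketch - `fetchWithTimeout` and `slowFetch` are made-up names, not part of the Workers API:

```javascript
// Sketch only: `slowFetch` stands in for a fetch that never responds.
// A timeout surfaces as a rejected promise, so the "got fetch" log
// would never run -- control jumps straight to the catch block.
function fetchWithTimeout(doFetch, ms) {
  const timeout = new Promise((_, reject) =>
    setTimeout(() => reject(new Error("timed out")), ms)
  );
  return Promise.race([doFetch(), timeout]);
}

async function demo() {
  const slowFetch = () => new Promise(() => {}); // never settles
  try {
    const contentResponse = await fetchWithTimeout(slowFetch, 100);
    console.log("got fetch"); // never reached on timeout
    return contentResponse;
  } catch (err) {
    console.log("fetch failed:", err.message);
    return null;
  }
}
```

Since the second log statement *is* being reached here, whatever is stalling for 60s is evidently not raising an exception.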

Tagging @KentonVarda

Thanks for the suggestion!

That was my thought too, but the request is actually succeeding - the content is being returned. I also adjusted my origin’s timeout to 30s, and it still takes 60s to complete the request. Finally, the logs I attached in the image show that the CF-Cache-Status response header is “HIT” - which supports my belief that the server is not being accessed.

It might be that the worker defaults to the cache in case of a timeout. I am afraid I am really just speculating at this point.

I would try with a support ticket or - possibly more promising - wait until @KentonVarda can chime in :slight_smile:

Have got a ticket open but no responses yet.

Definitely a possibility, though as I say, I can see the server logs and they aren’t seeing anything. It could be that the request never makes it to the server, though that would be most unusual given the intermittent nature of the problem and the fact that I can reach the server with no issues if I bypass the CDN

That is precisely what I was referring to with the timeout.

Another interesting aspect of the problem: if I make a request to the endpoint after clearing the cache, the request takes around 61,500ms to complete and is a cache MISS; if I make the request with a valid cache, it takes around 60,050ms. This is consistent, as my origin server does take around 1.5s to respond. It seems like 60s is just being added to the round trip for some reason.

I’ve checked in the logs when the server request happens, and it’s at the start of the fetch: it makes the fetch, comes back with the result, then waits 60s before continuing.

Is it possible that the cloud worker isn’t running linearly? Do cloud workers split code execution into multiple threads or something? I know this shouldn’t be possible in JavaScript unless you’re using Web Workers, but I have run into red herrings with logging before, where the logs don’t match the order of execution

The Cloudflare team told me that the Debugger is a totally different stack than Production.
The Cache API may not work properly there.
All requests come from the ORD (Chicago) colo, as opposed to a colo closer to your origin.

Have you tested fully in production?

I haven’t seen this problem before.

It sounds like somewhere in the system, something is getting stuck, and then after 60s the whole thing times out.

My guess is that some component of the system is failing to finalize an HTTP request or response.

Is it possible that your origin server is generating a chunked response and is not actually sending the final zero-sized chunk? For example, if you use node.js, this could happen if you forgot to call response.end() to finish the response. It’s possible that in this case our cache actually caches the response with the final chunk missing, so the same problem occurs even when responding from cache.
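
To illustrate that failure mode, here are two hypothetical node.js handlers (not your actual origin code) - the only difference is the missing response.end():

```javascript
// Hypothetical node.js origin handlers illustrating the missing
// response.end() failure mode described above.

// BROKEN: a chunk is written, but the terminating zero-sized chunk is
// never sent, so clients keep waiting for more body until they time out.
function brokenHandler(req, res) {
  res.writeHead(200, { "Content-Type": "text/html" });
  res.write("<html><head></head><body>hello</body></html>");
  // res.end() is missing -- the chunked response is never finalized.
}

// FIXED: res.end() finalizes the response, sending the final chunk.
function fixedHandler(req, res) {
  res.writeHead(200, { "Content-Type": "text/html" });
  res.end("<html><head></head><body>hello</body></html>");
}
```

In both cases the client receives headers and content, so the page would appear to load - the broken variant just never signals that the body is complete.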

This doesn’t totally make sense, though: if your code is exactly as you wrote above, then the await fetch(request) should only wait for headers, not for the response body.

Could you possibly go to Cloudflareworkers.com and write a script that reproduces this problem, then share it with us (copy/paste the URL after saving)? It would be a lot easier to debug then.


Thanks @KentonVarda ! I’ve reproduced the problem here:


It seems to happen much more regularly on a new worker (i.e. after clicking the “Update” button)

You might need to click update a few times to see it.

Regarding chunked mode: you can hit our server directly, skipping the CDN, at http://mamamia-pwa-staging.firebaseapp.com - everything looks to be performing normally to me

Note that staging-web.mamamia.com.au has a page rule to “Cache Everything” and an edge TTL of 30 mins, and the issue doesn’t happen when accessing the server direct, only when visiting through the staging-web CDN enabled access point

@adaptive Thanks for the reply - I’m afraid the issue happens with normal requests that don’t use the debugger as well

I temporarily reproduced it, but the problem seems to come and go, and it went away before I could narrow it down… :confused:


Thanks for looking @KentonVarda - FYI there’s a ticket #1658199 for this - they’re currently stumped too by the sound of it though

Hi @tech45,

I suspect the problem is in your code, but not where you think it is.

Here’s the thing: When you use the preview service, and it does a fetch() to your origin, that fetch() request actually goes into Cloudflare’s “front door”, treated like any request coming from the internet. And because you have a worker deployed in production, that worker runs. In other words, two copies of your worker end up running on the request – one in preview and one in production.

(This is one of the problems with our preview setup that we’d like to address, but fixing it is complicated.)

So, in preview, you’ve narrowed down the delay to the fetch() call. But that fetch() is itself actually calling the production version of your worker, which means the delay could actually be happening anywhere in that worker. We don’t actually know that the delay comes from the point where your production worker fetch()es to its own origin.

It looks like your production worker is quite complicated, doing many other fetches before and after the main fetch. I’d recommend modifying the code to return timing information in headers, so that you can narrow down where exactly the delay happens (while running in production).
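
A minimal way to do that would be something like this hypothetical helper (`timed` and `timings` are made-up names, not part of the Workers API - just a sketch):

```javascript
// Sketch: wrap each awaited step so its duration can be collected and
// later reported in a response header. `timed` is a hypothetical helper.
async function timed(timings, label, promise) {
  const start = Date.now();
  const result = await promise;
  timings.push(`${label} - ${Date.now() - start}`);
  return result;
}

// Usage inside the worker (illustrative):
//   const timings = [];
//   const contentResponse = await timed(timings, "main fetch", fetch(request));
//   const html = await timed(timings, "contentResponse.text()", contentResponse.text());
//   ...
//   response.headers.set("X-Timings", timings.join("; "));
```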

Thanks @KentonVarda, have done as you requested and got this response in the headers:

https://pwa.mamamia.com.au/getRedirects - 11
https://staging-web.mamamia.com.au/ - 8
https://fonts.googleapis.com/css?family=Lora:700|Montserrat:500,600 - 21
Promises - 0
contentResponse.text() - 59979

So it was the contentResponse.text() function which took 60s! Any ideas why this would take 60s?

This is the line of code

const html = await contentResponse.text();

NOTE: We are modifying the HTML before sending this response (to add the Google Fonts CSS to the HEAD), so we can’t just stream the response (as I understand it)
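
For reference, the buffered approach boils down to something like this - a hypothetical minimal version, not our actual worker code (the fonts URL is the one from the timing dump above):

```javascript
// Hypothetical minimal version of the buffered rewrite: read the whole
// body, splice a stylesheet <link> into <head>, rebuild the response.
const FONTS_LINK =
  '<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Lora:700|Montserrat:500,600">';

function injectFonts(html) {
  // Insert immediately after the opening <head> tag.
  return html.replace("<head>", "<head>" + FONTS_LINK);
}

// In the worker this would be used roughly as:
//   const html = await contentResponse.text();  // the line that takes 60s
//   return new Response(injectFonts(html), contentResponse);
```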


Note that our Google Fonts code is based on this repo

EDIT: It looks like this repo has become significantly more advanced since we implemented our version. I’ll see if we can reproduce the chunked modification - it would be faster and might fix our issue.

I’m still interested in finding out why the .text() call takes 60s, though - it should be a very short wait

@KentonVarda Heads up that I’ve rewritten the cloud worker to modify the response in chunks rather than after a .text() call, which appears to have fixed the problem. It’s possible that the last chunk wasn’t completing, but I’m sceptical of that, as I’m unable to reproduce any issues with the chunks when accessing the server direct.
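
For anyone hitting the same thing, the chunked rewrite looks roughly like this. It’s a sketch that assumes “<head>” arrives within a single chunk; a real implementation has to handle the tag spanning a chunk boundary:

```javascript
// Sketch of the chunked approach: transform the body stream instead of
// buffering it with .text(). Assumes "<head>" is not split across chunks.
function createHeadInjector(snippet) {
  const decoder = new TextDecoder();
  const encoder = new TextEncoder();
  let injected = false;
  return new TransformStream({
    transform(chunk, controller) {
      let text = decoder.decode(chunk, { stream: true });
      if (!injected && text.includes("<head>")) {
        text = text.replace("<head>", "<head>" + snippet);
        injected = true;
      }
      controller.enqueue(encoder.encode(text));
    },
  });
}

// In the worker, roughly:
//   const body = contentResponse.body.pipeThrough(createHeadInjector(FONTS_LINK));
//   return new Response(body, contentResponse);
```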

Either way, the issue is resolved - though if you’re interested in tracking down what was wrong, let me know if I can help