How to debug random CPU spikes?

In my first year with Workers I had the same goal: everything in Workers. But it’s still too limited, so I’ve moved quite a lot of the workload to AWS Lambda. It’s easy to keep execution under 100 ms there (since I had already optimized for Workers), so costs can be close to Workers. Just keep in mind that their API Gateway incurs a cost for every request, and bandwidth has to be factored in as well. I expect around 100–200% higher cost.

If you consider Lambda, use the Serverless Framework to manage it or you’ll be pulling your hair out in no time…


Yeah I think this spike is definitely caused by workers waking up.

You can create a random worker ID and store it in a global variable with a start time, and log that with each request. That will let you see if it’s a fresh worker or not… We do this with Logflare so you can actually see the invocation of each instance of a worker.
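Something like this, roughly (the log endpoint and field names are just placeholders for wherever you ship logs, e.g. a Logflare HTTP source):

```js
// Globals live for the lifetime of this worker instance (isolate), so a fresh
// ID plus start time tells you whether a request hit a cold instance.
const WORKER_ID = Math.random().toString(36).slice(2);
const WORKER_STARTED_AT = Date.now();

// Placeholder: point this at wherever you ship logs (e.g. a Logflare HTTP source).
const LOG_ENDPOINT = 'https://example.com/log';

addEventListener('fetch', event => {
  event.respondWith(handle(event));
});

async function handle(event) {
  const response = await fetch(event.request); // your real logic goes here

  // Fire-and-forget log line so it doesn't add latency to the response.
  event.waitUntil(fetch(LOG_ENDPOINT, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({
      worker_id: WORKER_ID,
      worker_started_at: WORKER_STARTED_AT,
      url: event.request.url,
      timestamp: Date.now(),
    }),
  }));

  return response;
}
```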

And even chart that:

And then even chart the avg origin_time (which I think would include your processing time):


To compare, I opened the CPU metrics of another worker and, lo and behold, I get similar spikes:

The difference is that this worker does practically nothing. It’s about 20 lines of code: it reads from the Cache API, and if there is no cached response it gets the response from KV.
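Roughly this, give or take (the KV binding name MY_KV and the caching headers are placeholders, not the actual code):

```js
addEventListener('fetch', event => {
  event.respondWith(handle(event));
});

async function handle(event) {
  // Try the Cache API first.
  const cache = caches.default;
  const cached = await cache.match(event.request);
  if (cached) return cached;

  // Fall back to KV. MY_KV is a placeholder for the actual namespace binding.
  const key = new URL(event.request.url).pathname;
  const body = await MY_KV.get(key, 'stream');
  if (body === null) return new Response('Not found', { status: 404 });

  const response = new Response(body, {
    headers: { 'cache-control': 'public, max-age=3600' },
  });
  event.waitUntil(cache.put(event.request, response.clone()));
  return response;
}
```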

The same goes for this one, which is a Workers Site doing the same thing, although with lower spikes.

I have no idea what’s going on, but this seems endemic to the workers runtime and not related to my code.

What worries me is: what if, down the road, my API and HTML renderer workers end up causing spikes higher than 50 ms?

Thanks for the idea, @chasers.

I guess I could then compare that start time with the spikes and see if there is a correlation.

Yeah, and it should be easy enough to wrap a timer around functions and log that too. I don’t think Workers supports the Performance API yet. We just use Date.now, which I think is good enough.
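For example, something along these lines (render() is just a stand-in for whatever you want to measure):

```js
// Wrap an async function so each call logs how long it took. Note that inside
// Workers the clock may only advance at I/O boundaries, so this mostly captures
// wall time across awaits rather than pure CPU time.
function timed(name, fn) {
  return async (...args) => {
    const start = Date.now();
    try {
      return await fn(...args);
    } finally {
      console.log(`${name} took ${Date.now() - start} ms`);
    }
  };
}

// Hypothetical function to measure.
async function render(request) {
  return new Response('ok');
}

const timedRender = timed('render', render);
```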


I think there might be some reporting shenanigans going on too…

In the “last 30 mins” view:

But in the “last 6 hours” view:

So yes, there is definitely a correlation between cold starts and these spikes.

These are the start times as recorded to KV in GMT time:

And these are the latest spikes triggered at different data centers (these ones are in my local time):

My current hypothesis is that this only happens on the first start after a deploy at each data center.

Can someone from Cloudflare comment on this?

If these initialization spikes go beyond 50 ms, are these requests going to return an error?

Is there a way to prevent those spikes? I mean, these even happen with very minimalistic workers, as shown in a previous post.

So in this comment @KentonVarda says there is a grace period for exceeding the CPU time limit.

Would these initialization spikes fall within that grace period?

(if they exceeded 50 ms that is)

Yes, they will, which is why, in my discussions about JS crypto vs. native Crypto, you can see that it works for the first 20 requests and then fails.
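For reference, the native path is essentially this (SHA-256 is just an example algorithm); crypto.subtle runs in native code, so it is far cheaper than a pure-JS crypto implementation:

```js
// Hash with the runtime's native Web Crypto (crypto.subtle) instead of a
// pure-JS crypto library; the digest itself happens in fast native code.
async function sha256Hex(message) {
  const data = new TextEncoder().encode(message);
  const digest = await crypto.subtle.digest('SHA-256', data);
  return [...new Uint8Array(digest)]
    .map(b => b.toString(16).padStart(2, '0'))
    .join('');
}
```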


Would be great to have an official answer :slight_smile:

I think only @KentonVarda can answer that.


I understand your worry. Cloudflare’s monitoring system might throw errors when the overage is persistent, and for that reason the percentile view is there.

The two graphs below show a big problem: the first graph shows the problem being killed, the second shows a developer introducing it back.

The problem was using the luxon dependency for date/time manipulation instead of what is available natively in V8.
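In other words, something like this instead of pulling in luxon (the locale, time zone, and format options are just examples, not the actual requirements):

```js
// Native alternative to a luxon-style DateTime.now().setZone(...).toFormat(...):
// Intl.DateTimeFormat ships with V8 and handles time zones and localization.
const formatter = new Intl.DateTimeFormat('en-GB', {
  timeZone: 'Europe/Stockholm', // example zone, purely illustrative
  year: 'numeric', month: '2-digit', day: '2-digit',
  hour: '2-digit', minute: '2-digit', second: '2-digit',
});

console.log(formatter.format(new Date())); // e.g. "05/02/2021, 14:03:59"
```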


Our CPU limit enforcement is designed to be lenient with random spikes – in part because, practically speaking, there may be no way to avoid them. On a machine that is doing lots and lots of other things at the same time, random noise can cause the same computation to take much more CPU time sometimes.

Basically, you are only at risk of errors if your running average CPU usage is over the limit, or if a single request goes way over, like 10x.

I hesitate to explain the mechanism in more detail because we’re likely to change it soon. But, in short, you don’t need to worry about random single-request spikes.

As for why you’re seeing a spike, I’m not entirely sure. We don’t actually count script startup time in the first request time, so it’s not that. But I think another possibility is lazy parsing. V8 tries to avoid parsing function bodies until the function is first called. If you have a fairly large code footprint, a lot of which gets called on the first request, that could explain the first request running a bit slower, I think. That said, I’m speculating here; it’s hard to say without doing some profiling (which, unfortunately, at present, you wouldn’t be able to do yourself – we’d like to fix that).
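If lazy parsing does turn out to be the culprit, one classic (and admittedly hacky) mitigation is the parenthesis hint that tools like optimize-js apply; whether it actually helps on Workers is an assumption you would have to profile:

```js
// A plain declaration like this is only pre-parsed; the body is compiled the
// first time it is called.
function renderPage(data) {
  // ...imagine a large body here...
  return data;
}

// Wrapping the function expression in parentheses is the hint (popularized by
// optimize-js) that it will be invoked soon, so V8 parses it eagerly.
const renderPageEager = (function (data) {
  // ...imagine a large body here...
  return data;
});
```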


Thanks a lot for your explanation, @KentonVarda, so there is nothing to worry about after all!

It’s not a very large worker. The minified JS file is 153 kB, which should be reduced to less than 100 kB with the final compression, way below the 1 MB limit.

It may be way below the limit, but 153kB of JavaScript is still quite a bit of code and may indeed take several milliseconds for V8 to compile. So I think that’s a likely explanation here.


Duly noted!

Ohh, sweet lovely V8. Reminds me of https://web.archive.org/web/20181006134205/https://top.fse.guru/nodejs-a-quick-optimization-advice-7353b820c92e?gi=9146aa8d76d0

JavaScript insanity :smile:


Lazy parsing is great, though, otherwise cold start time would be much longer!

I get your point; nonetheless, JavaScript (or rather V8) is doing some hacks that would be unheard of on other platforms.