How to debug random CPU spikes?

I have a worker that is rendering some HTML using SSR.

Usually it takes less than 5ms to render the HTML, but I’ve seen some spikes of up to 13ms reported by the dashboard metrics.

Neither the data used for the dynamic rendering nor the code has changed.

I’m worried that if I’m already using 13ms, which is more than 20% of my CPU budget, I might run into CPU issues down the road.

Is there a way to debug this?

In general this kind of problem is solved with an APM solution that profiles each function and shows what takes the longest, but CF doesn’t offer that and I’m not aware of any existing APM solutions that work with Workers.

But if it were something related to the worker code, wouldn’t it be more consistent under the exact same conditions (same DB data, same JS logic)?

Jumping from 3ms to even 18ms is a 6x jump.


Could it be a reporting error?

Is it possible this happens when the worker is not hot, and the spike is some initialization CPU cost rather than request processing?

How large is the HTML?

I’ve been hitting that limit ever since I started with Workers. The only solution has been to optimize and use native features as much as possible, even skipping WASM in some cases.

I first started seeing the spikes with pretty small HTML documents (approx. 1kB).

My first hunch was that these spikes were related to the size of the HTML (more HTML = more CPU work), so I modified the worker to produce a much larger HTML document (approx. 250kB), but the behavior is very similar: only a very small percentage of requests go above 10ms.

In fact, since I started using the worker, the median CPU time has been 1.7ms.


I don’t know; I’m starting to think Workers are not a good fit for this use case and I should probably move the HTML rendering somewhere else. It’s a shame, since I would have preferred to keep all my infra on Workers.

My first year with Workers I had the same goal: everything in Workers. But it’s still too limited, so I’ve moved quite a lot of the workload to AWS Lambda. It’s easy to keep execution under 100ms there (since I had already optimized for Workers), so costs can be closer to Workers. Just keep in mind that their API Gateway will incur costs for every request, and bandwidth has to be factored in as well. I expect around 100-200% higher costs.

If you consider Lambdas, use the Serverless Framework to manage them or you’ll be pulling your hair out in no time…


Yeah I think this spike is definitely caused by workers waking up.

You can create a random worker ID and store it in a global variable along with a start time, and log both with each request. That will let you see whether it’s a fresh worker or not… We do this with Logflare, so you can actually see each invocation of each worker instance.
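Something along these lines, for example (a rough sketch; the names and the log shape are just illustrative):

```js
// Module scope runs once per worker instance, so these values identify the instance.
const WORKER_ID = Math.random().toString(36).slice(2); // random per-instance ID
const WORKER_STARTED_AT = Date.now();                  // when this instance spun up

addEventListener('fetch', (event) => {
  event.respondWith(handleRequest(event));
});

async function handleRequest(event) {
  const response = await render(event.request); // your existing SSR / proxy logic

  // Log the instance ID and its age; a fresh (cold) worker shows an age near zero.
  // Replace console.log with a push to Logflare or any other log sink.
  console.log(JSON.stringify({
    workerId: WORKER_ID,
    workerAgeMs: Date.now() - WORKER_STARTED_AT,
    url: event.request.url,
  }));

  return response;
}

async function render(request) {
  // stand-in for the real worker logic
  return new Response('ok');
}
```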

And even chart that:

And then even chart the avg origin_time (which I think would include your processing time):


To compare, I opened the CPU metrics of another worker and, lo and behold, I get similar spikes:

The difference is that this worker does practically nothing; it’s like 20 lines of code. It reads the Cache API and, if there is no cached response, it gets the response from KV.
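Roughly this pattern, to give an idea (a sketch rather than the exact code; STATIC_KV is a made-up KV binding name):

```js
addEventListener('fetch', (event) => {
  event.respondWith(handle(event));
});

async function handle(event) {
  const cache = caches.default;

  // Try the Cache API first.
  let response = await cache.match(event.request);
  if (response) return response;

  // On a miss, fall back to KV (STATIC_KV is a hypothetical KV namespace binding).
  const key = new URL(event.request.url).pathname;
  const body = await STATIC_KV.get(key, 'stream');
  if (body === null) return new Response('Not found', { status: 404 });

  response = new Response(body, {
    headers: { 'content-type': 'text/html;charset=UTF-8' },
  });

  // Store a copy in the cache for next time, without delaying the response.
  event.waitUntil(cache.put(event.request, response.clone()));
  return response;
}
```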

Also this one, which is a Workers Site doing the same thing, although with lower spikes.

I have no idea what’s going on, but this seems endemic to the Workers runtime and not related to my code.

What worries me is: what if, down the road, my API and HTML-renderer workers end up causing spikes higher than 50ms?

Thanks for the idea, @chasers.

I guess I could then compare that start time with the spikes and see if there is a correlation.

Yeah, and it should be easy enough to wrap your functions in a timer and log that too. I don’t think Workers support the Performance API yet. We just use Date.now, which I think is good enough.
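For example, something like this (just a sketch; keep in mind the clock inside a request only advances at I/O boundaries, so this measures elapsed time around awaited calls rather than pure CPU time):

```js
// Wrap any (possibly async) function in a Date.now() timer and log how long it took.
async function timed(label, fn) {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    console.log(`${label}: ${Date.now() - start}ms`);
  }
}

// Usage (renderPage and data are placeholders for the real SSR call):
// const html = await timed('renderPage', () => renderPage(data));
```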


I think there might be some reporting shenanigans going on too…

In the “last 30 mins” view:

But in the “last 6 hours” view:

So yes, there is definitely a correlation between cold starts and these spikes.

These are the start times as recorded in KV (in GMT):

And these are the latest spikes triggered at different data centers (these ones are in my local time):

My current hypothesis is that this only happens on the first start after a deploy in each data center.

Can someone from Cloudflare comment on this?

If these initialization spikes go beyond 50ms, will those requests return an error?

Is there a way to prevent those spikes? I mean, they even happen with very minimalistic workers, as shown in a previous post.

So in this comment @KentonVarda says there is a grace time for exceeding CPU time.

Would these initialization spikes fall within that grace time?

(if they exceeded 50 ms that is)

Yes, they will, which is what you can see in my discussions about JS crypto vs. native Crypto: it works for the first 20 requests and then fails.
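For context, “native Crypto” here is the Web Crypto API built into the runtime (crypto.subtle), which keeps the heavy lifting out of JS. A minimal example:

```js
// Hash a string with the runtime's native Web Crypto instead of a pure-JS library.
async function sha256Hex(text) {
  const data = new TextEncoder().encode(text);
  const digest = await crypto.subtle.digest('SHA-256', data);
  return [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, '0'))
    .join('');
}

// Usage: const hash = await sha256Hex('hello');
```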


It would be great to have an official answer 🙂

I think only @KentonVarda can answer that.


I understand your worry. Cloudflare’s monitoring system might throw errors when the high CPU usage is persistent, and for that reason the percentile metrics are there.

The two graphs below show a big problem: the first graph shows the problem being killed, the second shows a developer reintroducing it.

The problem was using the Luxon dependency for date/time manipulation instead of what is available natively in V8.
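As an illustration of leaning on what V8 ships natively (my own example, not the code behind those graphs):

```js
// Formatting a date with the built-in Intl API instead of pulling in a date library.
const formatter = new Intl.DateTimeFormat('en-GB', {
  timeZone: 'UTC',
  dateStyle: 'medium',
  timeStyle: 'short',
});

console.log(formatter.format(new Date())); // e.g. "12 Mar 2021, 14:05"
```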


Our CPU limit enforcement is designed to be lenient with random spikes – in part because, practically speaking, there may be no way to avoid them. On a machine that is doing lots and lots of other things at the same time, random noise can cause the same computation to take much more CPU time sometimes.

Basically, you are only at risk of errors if your running average CPU usage is over the limit, or if a single request goes way over, like 10x.

I hesitate to explain the mechanism in more detail because we’re likely to change it soon. But, in short, you don’t need to worry about random single-request spikes.

As for why you’re seeing a spike, I’m not entirely sure. We don’t actually count script startup time in the first request time, so it’s not that. But I think another possibility is lazy parsing. V8 tries to avoid parsing function bodies until the function is first called. If you have a fairly large code footprint, a lot of which gets called on the first request, that could explain the first request running a bit slower, I think. That said, I’m speculating here; it’s hard to say without doing some profiling (which, unfortunately, at present, you wouldn’t be able to do yourself – we’d like to fix that).
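To illustrate the lazy-parsing idea, here is a toy sketch (speculative, as noted above, and not an official recommendation): since script startup time isn’t counted against the first request, calling the hot code path once at module scope would move the one-time parse/compile work out of the request.

```js
function renderPage(props) {
  // stand-in for the real (large) SSR function whose body V8 would otherwise
  // parse lazily on the first call
  return `<html><body>Hello, ${props.name}</body></html>`;
}

// Warm-up call during script startup; any error from the dummy input is ignored.
try {
  renderPage({ name: 'warm-up' });
} catch (_) {}

addEventListener('fetch', (event) => {
  event.respondWith(
    new Response(renderPage({ name: 'visitor' }), {
      headers: { 'content-type': 'text/html;charset=UTF-8' },
    })
  );
});
```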
