We built a Worker that works just fine under low traffic, but as soon as we put even a bit of production load on it we start seeing requests fail with "exceeded CPU" in the Worker status dashboard. Is the CPU allowance per request, or is there a certain amount of CPU per Worker?
It should be per request as far as I am aware… When you say it works just fine under low traffic, is that in the Worker preview in the dashboard or on a deployed instance?
We could see the exceeded CPU errors in the Worker status graph, and when testing manually it failed to respond to the client. Once the load was removed it started responding again.
The edge Worker we're running logs to our log service, but we can't see any errors in that log, so I assume it didn't make it that far. We also can't see any errors in the Worker status code graph, so I guess the backend services the Worker calls are working fine.
Are there any other logs we can look at to pin down what is happening?
I’d have thought it was per request. The docs say it is per Worker, per request: Limits · Cloudflare Workers docs. Limits vary by plan type.
An individual Worker script may consume up to:
5-50 milliseconds CPU time per request, depending on Cloudflare Plan
15 seconds real time per request
128MB memory at any given time
Note that the time quotas are per request, while the memory quota is per worker instance. Cloudflare runs worker instances on many machines in many regions, automatically scaling them as necessary.
With very little load on the service we get about 50% of requests failing with exceeded CPU errors according to the dashboard. I can't reproduce the error from my side, so I'm not sure how to go about troubleshooting it. If it were a memory issue, would it still be labelled as exceeding CPU in the dashboard?
Is there any way to know which requests failed?
One thing I could imagine causing this is an edge case where the origin server hangs. Could something like that manifest itself as an exceeded CPU error in the dashboard? We see more or less no errors in the status code graph for the Workers.
Well, theoretically no, since the real-time limit per request is effectively unlimited. The limit is on CPU time, i.e. compute time (JSON parsing, image resizing…); there is a 15-second window from the original request in which to make new external requests (that would show up as a timeout, I suppose), and no hard limit on when the actual response must be produced. The 128MB of RAM is shared though; I'm not sure whether that would cause CPU time issues and errors (it should restart the process, so maybe it could give that error…).
We do handle pretty hefty documents and currently don't stream them, so I could imagine we would bump into the 128MB limit at some point. Does each request start with a clean slate, like I understand it works in nginScript, or do we need to ensure that all memory is GC'ed?
So, it seems possible that we hit the memory limit. Is there any way of telling for sure before we invest too much time optimizing the code? Is it possible to do something like process.memoryUsage() as in Node?
@markus Can you give us more information on what your code is doing and how? Unless you are using global variables, everything allocated during your request should be fully garbage collected. It can be easy to inadvertently forget a let or var and make something global in JavaScript.
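A minimal sketch of what that can look like (not your code, just an illustration): in a non-strict, non-module Worker script, assigning to an undeclared name creates an implicit global that survives across requests and is never collected between them.

```js
addEventListener('fetch', event => {
  event.respondWith(handle(event.request))
})

async function handle(request) {
  // BUG: missing `let`/`const`, so `body` becomes an implicit global.
  // It outlives this request and keeps the whole document in memory.
  body = await request.text()
  return new Response(`received ${body.length} bytes`)
}
```

Declaring it with `const body = await request.text()` keeps it scoped to the handler, so it can be collected as soon as the request finishes. (In strict mode or a module Worker the undeclared assignment would throw instead of silently leaking.)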
The “exceeded CPU” error really means “exceeded resource limits”, so it could be that you went over the memory limit. It's sometimes hard to distinguish the two: when you get close to the memory limit, the V8 garbage collector starts working really hard, and that tends to exceed the CPU limit even though memory was the real problem.
Does your script load large response bodies into memory? It could be that one response fits into memory OK, but multiple simultaneous responses are enough to go over the limit.
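For what it's worth, here's a hedged sketch of the difference (the URL is made up): calling `text()` or `arrayBuffer()` materializes the whole body in Worker memory, while handing `response.body` straight to the `Response` constructor streams it through.

```js
addEventListener('fetch', event => {
  event.respondWith(handle(event.request))
})

async function handle(request) {
  const originResponse = await fetch('https://origin.example.com/big-document')

  // Buffered: the entire body lives in memory at once, so several large
  // responses in flight can add up toward the 128MB limit.
  // const body = await originResponse.text()
  // return new Response(body, originResponse)

  // Streamed: the body is piped through without being fully materialized.
  return new Response(originResponse.body, originResponse)
}
```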
This is quite worrying. It seems to me that critical projects using Workers need to be load-tested on each deployment in case they run into resource issues…
It's easy to reason about parsing JSON and XML, but now we have to take the CPU time of the garbage collector into account too.
The Worker we have runs regex replaces on documents up to about 2MB in size. It typically completes in a few ms, but as it's not streaming I guess it could create a few copies of the entire document in memory before finishing. We're expecting a pretty high load on this service in production, so it's important to know that it will scale and whether we're close to hitting any limits.
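For illustration, a rough sketch of that pattern (the patterns and names are made up, not our actual rules): each String.prototype.replace() returns a new string, so with a ~2MB document and a chain of replaces, several full copies can be live at the same time until the garbage collector catches up.

```js
async function rewrite(request) {
  const originResponse = await fetch(request)
  const original = await originResponse.text()                         // copy #1: full ~2MB document
  const step1 = original.replace(/http:/g, 'https:')                   // copy #2
  const step2 = step1.replace(/old\.example\.com/g, 'new.example.com') // copy #3
  // All three strings may still be reachable here, and that is per
  // in-flight request, so concurrent requests multiply the footprint.
  return new Response(step2, originResponse)
}
```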
Is there anything we can do to get an idea of how much memory we're using? Could we, for instance, get some data through the Worker preview?