Provide Performance Metrics within Durable Objects

tl;dr: Implement the option to request average CPU and memory utilization over the last few seconds to minutes in Durable Objects.

Workers are a flexible product. Yet they come with their own set of limitations.
In particular, due to security reasons, timings and the like are more or less unavailable.

These limitations are both reasonable and understandable.

Yet for Durable Objects (DO), these limitations pose a whole different set of challenges.

DO are persistent, and in many ways behave much more like actors.

Sadly, as great as the Workers runtime is, it’s not BEAM. And the single-source-of-truth runtime model doesn’t mesh well with auto-scaling the way regular Workers do, either.

And that’s also where the issue comes from: DO are single-threaded, and when they saturate, they saturate hard, as there is currently no reliable way to react to getting close to CPU and/or memory limits.
Proven strategies for these situations include applying backpressure, signalling saturation early (429/503), shedding load, and so on. But right now the best one can hope for is interpreting timeouts correctly at the call site, and the situation is even more complicated for managed WebSockets.
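To make the signalling idea concrete, here is a minimal sketch of what shedding load early could look like. The decision helper is plain TypeScript; the commented-out `getAverageCpuUtilization()` / `getAverageMemoryUtilization()` calls are invented names for the proposed API and do not exist in the runtime today.

```typescript
// Pure decision helper: given averaged utilization figures (0..1),
// decide whether to shed load before accepting more work.
// Thresholds are illustrative, not recommendations.
function shouldShedLoad(
  cpuAvg: number,
  memAvg: number,
  cpuLimit = 0.85,
  memLimit = 0.9,
): boolean {
  return cpuAvg >= cpuLimit || memAvg >= memLimit;
}

// Inside a DO's fetch handler this might look like (hypothetical API):
//
// async fetch(request: Request): Promise<Response> {
//   const cpu = await this.ctx.getAverageCpuUtilization(30);    // invented
//   const mem = await this.ctx.getAverageMemoryUtilization(30); // invented
//   if (shouldShedLoad(cpu, mem)) {
//     return new Response("saturated, retry later", {
//       status: 503,
//       headers: { "Retry-After": "5" },
//     });
//   }
//   // ... normal handling ...
// }
```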

To illustrate this with a practical example, let’s suppose we provide a chatroom for small and large audiences as part of a larger livestreaming solution. Even with this rather specific set of requirements, we may not know when a given DO saturates.

  • For a conference style situation, even 5000 participants could easily be handled.
  • Yet creators could create cases where even 1,000 participants would be too much, if the chat reaches Twitch levels of… engagement.

Now, if we knew early enough that we were about to saturate, we could employ a number of dynamic mitigations that slightly degrade the experience yet keep the system as a whole stable.

  • Rate limit user messages
  • Use slightly delayed bulk message delivery
  • Etc etc…
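The rate-limiting mitigation could be driven directly by the proposed metric. A sketch, assuming the averaged CPU figure is available: derive a per-user message budget from it, so the chatroom degrades gracefully instead of falling over. The thresholds and the linear ramp are illustrative choices, not part of any real API.

```typescript
// Map averaged CPU utilization (0..1) to a per-user messages-per-second
// budget. Plenty of headroom: full budget. Near saturation: minimum.
// In between: linear ramp-down between 50% and 95% utilization.
function messageBudget(cpuAvg: number, maxPerSecond = 10): number {
  if (cpuAvg < 0.5) return maxPerSecond;
  if (cpuAvg >= 0.95) return 1;
  const headroom = (0.95 - cpuAvg) / 0.45;
  return Math.max(1, Math.round(maxPerSecond * headroom));
}
```

A DO would periodically refresh the budget from the metrics API and reject (or queue) user messages beyond it.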

To enable this, I propose that a DO be able to request the average CPU and memory utilization over a given period of time (on the order of seconds to minutes). Since only averages over time frames that are long by CPU standards would be exposed, even Spectre should not pose an issue here.
Access to those metrics could be provided via simple, direct function calls, via callbacks configured in wrangler.toml, or in any other reasonable way.
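For the direct-function-call variant, one possible shape of the API, sketched as a TypeScript interface. All names here (`getAverageCpuUtilization` and so on) are invented for illustration; nothing in this block exists in the runtime today. The fake implementation is only there to show how saturation logic could be tested locally.

```typescript
// Hypothetical shape of the proposed metrics API.
interface UtilizationMetrics {
  // Average CPU utilization (0..1) over the trailing window, in seconds.
  getAverageCpuUtilization(windowSeconds: number): Promise<number>;
  // Average memory utilization (0..1) over the same kind of window.
  getAverageMemoryUtilization(windowSeconds: number): Promise<number>;
}

// Trivial in-memory stand-in, useful for unit-testing saturation logic
// before any real API exists.
class FakeMetrics implements UtilizationMetrics {
  constructor(private cpu: number, private mem: number) {}
  async getAverageCpuUtilization(_windowSeconds: number): Promise<number> {
    return this.cpu;
  }
  async getAverageMemoryUtilization(_windowSeconds: number): Promise<number> {
    return this.mem;
  }
}
```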

Furthermore, such metrics would also provide a solution for a few adjacent issues.

  • Allow better optimization of DO and the system at large by smartly logging those metrics.
  • Delaying compute-intensive tasks to opportune times, similar to what requestIdleCallback() does in a browser.