I’d like to use Cloudflare Workers with WebSockets, but I have a problem: I may need to broadcast to large numbers of WebSocket clients, but these clients are normally dormant. It’d be much easier for me to scale them in a model similar to API Gateway’s:
- WebSocket connections are exposed via a global connection ID.
- On connect, disconnect, and message, an HTTP request is sent to an upstream server (or maybe a worker), containing the related connection ID.
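To make that model concrete, here’s a rough sketch of what the upstream handler might look like, loosely modeled on API Gateway’s connect/disconnect/message routes. The event shape and names here are my own assumptions for illustration, not an existing Cloudflare or AWS API:

```typescript
// Hypothetical payload the platform would POST upstream per socket event.
type GatewayEvent =
  | { type: "connect"; connectionId: string }
  | { type: "disconnect"; connectionId: string }
  | { type: "message"; connectionId: string; data: string };

// Upstream handler: all socket state lives behind the connection ID,
// so this can be an ordinary stateless HTTP endpoint (or a worker).
function handleGatewayEvent(event: GatewayEvent): string {
  switch (event.type) {
    case "connect":
      return `registered ${event.connectionId}`;
    case "disconnect":
      return `cleaned up ${event.connectionId}`;
    case "message":
      return `echo to ${event.connectionId}: ${event.data}`;
  }
}
```

The appeal is that the upstream side never holds a socket; it only ever sees connection IDs, which is what makes swapping providers underneath relatively painless.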
If I were to attempt this with durable objects, it’d get complicated very quickly, and messaging back would require a KV lookup plus a durable object fetch. It’d also make messaging considerably more expensive. If this were implemented natively within Cloudflare’s platform, I could see major cost savings and latency reductions all around:
- The sockets could be terminated at the edge, eliminating a number of possible network failures.
- Connect/disconnect/message requests could be issued similarly to standard requests.
- The hard part would be mapping a connection ID to (and from) an edge location + socket, both to route messages back and to ensure disconnect requests still fire in case of node failure. Even that is fairly straightforward, though (ex: using a distributed hash table).
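For illustration, the simplest version of that mapping would just embed the edge location in the connection ID itself. The format below is entirely invented; a production scheme would presumably want opaque, unguessable IDs, with the DHT handling location lookup instead:

```typescript
// Naive connection ID: "<colo>:<local socket id in base 36>".
// Purely a sketch; real IDs should not leak routing information.
function encodeConnectionId(colo: string, socketId: number): string {
  return `${colo}:${socketId.toString(36)}`;
}

// Reverse the mapping so a send-to-connection request can be routed
// straight to the edge location holding the socket.
function decodeConnectionId(id: string): { colo: string; socketId: number } {
  const [colo, raw] = id.split(":");
  return { colo, socketId: parseInt(raw, 36) };
}
```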
It’s not technically an outright blocker, but it would be cost-prohibitive when starting out, so I’d have to start on AWS and only move to Cloudflare later. (Fortunately, this model does allow for relatively easy switching between services.)
Here’s what I meant when I said creating such a gateway using durable objects would get complicated quickly:
- I’d want to group connections within durable objects for cost efficiency. Each durable object costs about $0.135 per 24-hour day it’s active (for contrast, API Gateway charges $0.648 for a 24-hour day’s worth of connection minutes), which can get pricey pretty quickly.
- Those connection groups necessitate an additional durable object to track how many group durable objects exist and, more importantly, when a new one needs to be created vs. when an existing one has an open slot for a new connection. Alternatively, I could handle this at the origin (or maybe a cheap cloud server), but that would add even more latency to an already long connect sequence. The actual state storage, if kept in a durable object, could just use storage keys and some caching. (The costs here would be relatively low.)
- Since durable objects can be reclaimed at any time, a reaper needs to exist to periodically scan the connection group allocation table for dead links and remove them. This could technically be done via a cron worker, but it has to live wherever the allocation table is managed regardless.
- I’d have to shard these durable object namespaces for scalability and to reduce latency. The allocators introduce significant latency when creating a connection, and that latency scales both with the number of durable objects (and indirectly, the number of connections) in each shard and with the distance between the receiving worker and the allocator durable object (or server). These shards would also need to use geolocation info to keep the related durable objects in regions close to the client.
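As a toy illustration of the allocator and reaper logic above (the table shape, per-group capacity, and TTL here are all assumptions of mine, not measured values):

```typescript
// One row per connection-group durable object in the allocation table.
interface Group {
  id: string;
  connections: number; // current occupancy
  lastSeen: number;    // last heartbeat, ms since epoch
}

const CAPACITY = 1000; // hypothetical per-group connection cap

// Allocator: reuse a group with a free slot, or signal that the
// caller must spin up a new durable object (the slow path).
function pickGroup(table: Group[]): string | null {
  for (const g of table) {
    if (g.connections < CAPACITY) return g.id;
  }
  return null;
}

// Reaper pass: drop groups whose heartbeat is older than ttlMs,
// i.e. durable objects presumed reclaimed ("dead links").
function reap(table: Group[], now: number, ttlMs: number): Group[] {
  return table.filter((g) => now - g.lastSeen <= ttlMs);
}
```

Even in this toy form, you can see why the allocator sits on the connect critical path, and why it has to be sharded and kept geographically close to the receiving worker.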