D1 Timeout Errors on Small Database

Hi Cole,

David from the D1 team here.

Thanks for your detailed timeline. I was able to match up your incident 3 with our logs and get to the bottom of what’s happening with your D1 database.

Root cause: we think your database was scheduled on a machine with a noisy neighbour problem. It is currently possible for expensive storage operations on one durable object to starve the storage operations of other durable objects on the same machine of CPU.

We are in the process of changing the threading model for our durable object storage operations so that this can’t happen, but this is quite an involved change. In the meantime, I have moved your database to a different machine.

Answers to your questions:

  1. What causes the D1 internal “object reset”? The error message mentions an “object” being reset. Is this D1’s backing Durable Object? Is it being migrated, evicted, or hitting an internal storage timeout?

Yes, the “object” is your D1 database’s backing durable object. If a storage operation takes more than 30s, we reset the durable object and reject all pending requests with that same error.

  2. Why does the stall last 60–80 seconds? The documented query timeout is 30 seconds, but the total D1 unavailability window is 60–80 seconds. What internal timeout governs this?

The 30s timeout is enforced by the same process that is being starved of CPU, so there is a lag between a request being submitted and it being picked up and rejected. That lag is why the window you see is longer than 30s.
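
As a back-of-the-envelope illustration of that lag, here is a toy model. It assumes the stall you see is simply the time the enforcing process waited for CPU plus the 30s limit. The function names are made up for this note, and the real scheduling is more complicated than this.

```typescript
// Toy model only: names and the additive assumption are illustrative,
// not a description of the actual D1 scheduler.
const STORAGE_TIMEOUT_S = 30; // the reset threshold mentioned above

// Caller-visible stall if the enforcing process was starved for `lagS` seconds
// before it could pick up the request and apply the 30s limit.
function observedStallSeconds(lagS: number): number {
  return lagS + STORAGE_TIMEOUT_S;
}

// Inverse: the lag implied by a stall of `observedS` seconds (never negative).
function impliedLagSeconds(observedS: number): number {
  return Math.max(0, observedS - STORAGE_TIMEOUT_S);
}
```

Under that assumption, a 60–80 second window corresponds to roughly 30–50 seconds of lag before the timeout was applied.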

  3. Would enabling read replication help? If the D1 primary is being reset, would read replicas remain available? Most of our D1 queries during the stall are reads that don’t require read-your-own-writes semantics.

The D1 read replicas are unavailable while the primary is reset, so they wouldn’t help in this case.

  4. Is there any known issue with D1 primary DO placement/migration causing transient stalls? The spontaneous nature and low-traffic context suggest an infrastructure-level event rather than application-side contention.

Your durable object has been on the same machine all week, so placement or migration isn’t the cause in this case.

  5. Is there recommended retry/resilience guidance for this error? The D1 error reference recommends “Optimize the queries, send fewer requests, or shard the queries”, but this doesn’t apply when the failing queries are trivial indexed reads on a tiny database. Is there a retry strategy that would help with the underlying storage-layer timeout?

We generally don’t recommend retrying overload errors, because it can make things worse.
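
If it helps, here is a minimal sketch of what “fail fast instead of retrying” could look like on the application side. It is an illustration only, not an official pattern: `OverloadGuard`, `isOverload`, `cooldownMs` and the fallback value are all hypothetical names, and you would supply your own check for how the overload error shows up in your code.

```typescript
// Sketch only: OverloadGuard is a hypothetical helper, not part of the D1 API.
class OverloadGuard {
  private pausedUntil = 0;
  private readonly isOverload: (err: unknown) => boolean;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    isOverload: (err: unknown) => boolean, // caller-supplied error check (assumption)
    cooldownMs: number,                    // how long to shed load after an overload error
    now: () => number = () => Date.now(),  // injectable clock, mainly for tests
  ) {
    this.isOverload = isOverload;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Runs `op` at most once. There is deliberately no retry loop.
  async run<T>(op: () => Promise<T>, fallback: T): Promise<T> {
    if (this.now() < this.pausedUntil) {
      return fallback; // still cooling down: do not add more load
    }
    try {
      return await op();
    } catch (err) {
      if (this.isOverload(err)) {
        // Back off for a while rather than re-sending the same query.
        this.pausedUntil = this.now() + this.cooldownMs;
        return fallback;
      }
      throw err; // unrelated errors propagate unchanged
    }
  }
}
```

You would wrap a read in something like `guard.run(() => runMyQuery(), cachedValue)`, where `runMyQuery` and `cachedValue` stand in for your own query call and whatever degraded response makes sense for your application.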

  6. Is the error description accurate for our case? The docs describe this error as “A query is trying to write a large amount of information (e.g. GBs)”. Can this error also be triggered by D1 internal events (DO migration, storage backend issues) unrelated to query size?

All of your queries are pretty cheap, so I don’t think that’s what’s happening in this case.

Thanks again for the detailed report.