Workers KV .get() failures?

Is anyone else seeing intermittent but reproducible failures for code similar to

    let v = await KVSTORE.get(`key_that_we_want/${other_variable}`)
    if (v === null) {
        /* report unexpected badness */
    }

By intermittent, I mean that not all lookups are failing, only those for certain keys.

This has been happening to us for at least today, possibly a few days longer (our “report” was clearly not prominent enough).

[edit] Also meant to add: retrieving the value from KV via the curl API works, same key, expected value. :-/

And here’s a fun wrinkle, a .put() failure:

        INFO(rayid, `New device ${device.guid}`)

        /* store the new device record */
        var device_s = JSON.stringify(device, null, '\t') /* stringify is synchronous, no await needed */
        e.waitUntil(CACHE.put(`registeredDeviceID/${device.guid}`, device_s))

This code has been running consistently for months, and now it starts failing without any change from us… By failing, I mean that there is no error generated, but this put() never happened. I’m unable to pull up the record via curl, and subsequent Worker invocations that look up this record always fail to .get() it.
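One cheap way to surface this class of silent failure at write time is a read-back check right after the put. A minimal sketch: `MockKV` and `putAndVerify` are hypothetical names, and the in-memory store only exists so the sketch runs outside a Worker; in a real Worker you would pass the bound namespace (e.g. `CACHE`). Note that KV is eventually consistent, so a read-back is a strong hint rather than proof.

```javascript
// Hypothetical in-memory stand-in for a KV namespace binding, so this
// sketch is runnable outside a Worker. In a Worker, use the real binding
// (e.g. CACHE) instead.
class MockKV {
  constructor() { this.store = new Map(); }
  async put(key, value) { this.store.set(key, value); }
  async get(key) { return this.store.has(key) ? this.store.get(key) : null; }
}

// Write, then immediately read back, so a dropped write is reported at
// write time instead of surfacing a minute later as "Unknown device".
// Caveat: KV is eventually consistent, so treat the result as indicative.
async function putAndVerify(kv, key, value) {
  await kv.put(key, value);
  return (await kv.get(key)) === value;
}
```

Had the failing put been followed by a check like this, the first log line would already have flagged the missing record instead of the downstream statistics Worker.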

WTF is going on?

Here’s the relevant log messages for two of the missing KVs:

2020-01-30T20:00:03.905Z EWR lnsubd-prod[55d607a85ec3eff5] INFO Version: v0.13-1-g49bb10fae1ce
2020-01-30T20:00:03.905Z EWR lnsubd-prod[55d607a85ec3eff5] INFO POST called by: 2600:IPv6_omitted
2020-01-30T20:00:04.168Z EWR lnsubd-prod[55d607a85ec3eff5] INFO New device UUID_omitted
2020-01-30T20:00:04.279Z EWR lnsubd-prod[55d607a85ec3eff5] INFO BAIL (200) OK
2020-01-30T20:01:03.735Z EWR statistics-prod[55d6091e3c0f91d4] INFO Version: v0.13
2020-01-30T20:01:03.843Z EWR statistics-prod[55d6091e3c0f91d4] INFO BAIL (415) ERROR: Unknown device: UUID_omitted


2020-01-30T20:13:10.987Z LAX lnsubd-prod[55d61adf9c57d392] INFO Version: v0.13-1-g49bb10fae1ce-dirty
2020-01-30T20:13:10.987Z LAX lnsubd-prod[55d61adf9c57d392] INFO POST called by: 2600:IPv6_omitted
2020-01-30T20:13:11.230Z LAX lnsubd-prod[55d61adf9c57d392] INFO New device UUID_omitted
2020-01-30T20:13:11.375Z LAX lnsubd-prod[55d61adf9c57d392] INFO BAIL (200) OK
2020-01-30T20:14:10.928Z LAX statistics-prod[55d61c563a09d35a] INFO Version: v0.13
2020-01-30T20:14:11.074Z LAX statistics-prod[55d61c563a09d35a] INFO BAIL (415) ERROR: Unknown device: UUID_omitted

Haven’t seen this yet; which region are you in? Maybe the sync from the central KV to that region isn’t working properly? I don’t see any related issues on the cloudflarestatus site, though.

Yeah, I thought about that, but as you can see in the logs, the failed .get() occurs at the same colo, at least a minute after the .put(). So, I don’t think that’s the problem here. Even if it is, it wouldn’t explain why it suddenly stopped working.

I’m beginning to wonder if CF changed the way they kill processes. Since I get the BAIL notification for each rayid, that tells me the script ran to completion. However, that is when we respond to the requestor. There are still async tasks running, e.g. e.waitUntil(CACHE.put(...)) that might be getting killed. Although I’m at a loss to explain why that started happening now…

Well, waitUntil is explicitly not guaranteed, so if you’re dependent on it, maybe do something different.
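For example, if the write must not be lost, keep it on the critical path and await it before building the response, rather than handing it to waitUntil. A sketch only: `handleRegistration` is a hypothetical handler name, and `kv` stands in for the bound namespace.

```javascript
// Hypothetical handler: the device record write is awaited before the
// response is returned, so the invocation cannot end with the put()
// still in flight. `kv` stands in for the bound namespace (e.g. CACHE).
async function handleRegistration(kv, device) {
  const key = `registeredDeviceID/${device.guid}`;
  const body = JSON.stringify(device, null, '\t'); // synchronous, no await
  await kv.put(key, body); // the response now waits for the write
  return { status: 200, body: 'OK' };
}
```

The trade-off is added latency on the response, but for a registration record that later requests depend on, that is usually the right trade.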

Do you have a source for that? I’ve always understood that the process would be killed if any part of it (including e.waitUntil()) exceeded the processor usage threshold. Are there any other reasons it might not be guaranteed?

Regardless, why the sudden change in behavior? It was rock-solid, no failures, until yesterday.

Bah! I found it. Somehow, when I uploaded a new version of the script yesterday (only one small, unrelated change), it attached the development KV stores to the production version of the script (?!). Clearly, PEBKAC.

I found the missing writes in the development version of the KV store… now to do a careful migration / reconstruction…
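For anyone who hits the same failure mode: binding the namespace IDs explicitly per environment in wrangler.toml makes this kind of mix-up harder to do by accident. A sketch with placeholder IDs (field names as in wrangler v1; double-check against your wrangler version):

```toml
# Default (development) binding
kv-namespaces = [
  { binding = "CACHE", id = "DEV_NAMESPACE_ID_PLACEHOLDER" }
]

# Production gets its own explicit binding, so an upload can't silently
# carry the development namespace into production.
[env.production]
kv-namespaces = [
  { binding = "CACHE", id = "PROD_NAMESPACE_ID_PLACEHOLDER" }
]
```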