Worker and streaming JSON


#1

Hi,
I'm currently considering Workers instead of a classic server, but I'm not sure if the scenario I have is a good fit.
I'd appreciate your feedback on whether it makes sense to use a Worker for this kind of work.
I have a lot of JSON files stored on S3 which I'd like to expose via an API, but the JSON in those files may optionally need to be filtered before being returned to the client.
Simplified flow:

JWT validation -> fetch file from origin or edge cache -> stream the JSON array & filter it by the filters defined in the request -> return the filtered JSON array
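
Roughly, I imagine the Worker looking something like this. It's only a sketch: verifyJWT() and matchesFilters() are placeholders I haven't written yet, the bucket URL is made up, and it assumes I switch to newline-delimited JSON as mentioned below.

```js
// Sketch of the flow above (service-worker syntax). verifyJWT() and
// matchesFilters() are placeholder helpers; the bucket URL is made up.
addEventListener('fetch', event => {
  event.respondWith(handle(event.request))
})

async function handle(request) {
  if (!(await verifyJWT(request.headers.get('Authorization')))) {
    return new Response('Unauthorized', { status: 401 })
  }

  // Fetch from the origin; Cloudflare may serve this from the edge cache.
  const origin = await fetch('https://my-bucket.s3.amazonaws.com/data.ndjson')

  // Pipe through an identity TransformStream so the response can start
  // streaming while filtering is still in progress.
  const { readable, writable } = new TransformStream()
  filterLines(origin.body, writable, request)
  return new Response(readable, {
    headers: { 'Content-Type': 'application/x-ndjson' },
  })
}

async function filterLines(input, writable, request) {
  const reader = input.getReader()
  const writer = writable.getWriter()
  const decoder = new TextDecoder()
  const encoder = new TextEncoder()
  let buffered = ''

  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    buffered += decoder.decode(value, { stream: true })
    const lines = buffered.split('\n')
    buffered = lines.pop() // keep any trailing partial line for the next chunk
    for (const line of lines) {
      if (line && matchesFilters(JSON.parse(line), request)) {
        await writer.write(encoder.encode(line + '\n'))
      }
    }
  }
  if (buffered && matchesFilters(JSON.parse(buffered), request)) {
    await writer.write(encoder.encode(buffered + '\n'))
  }
  await writer.close()
}
```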

Things I’d like to handle in worker:

  1. Authentication - this would be done via JWT, and I've seen examples of that already, so I assume it won't be a problem.
  2. Filtering JSON arrays (I could switch to newline-delimited JSON for easier stream filtering and use something like https://canjs.com/doc/can-ndjson-stream.html)
  • Does the Worker support automatic decompression of origin files when they are compressed with brotli, or is only gzip handled this way? I prefer brotli and it's what I currently use, but I could potentially give it up and use gzip.
  • Is decompression counted as part of CPU time? My JSON files can be relatively large, 25–35 MB decompressed, but highly compressible (down to about 1/20th of that with brotli), and I wonder how CPU time would be calculated in that situation. I guess another bottleneck would be calling JSON.parse on each JSON line, but perhaps I could figure out how to do that differently.
  • If the response I return from the Worker has ‘Content-Encoding’: ‘br’, will Cloudflare apply brotli compression before returning it to the client?
  • Do you plan to expose the Cache API so I could store the filtered response in cache and avoid filtering it again in the future? I'd assume it could work like this: first validate the JWT, then check the cache; if a response for the cache key is already there, return it, otherwise run the streamed filtering (roughly as sketched below).
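
This is purely hypothetical until the Cache API ships, but the caching part I have in mind looks something like the following. cacheKeyFor() and runStreamedFiltering() are made-up helpers (the key would be derived from the JWT subject plus the requested filters), and it assumes the standard caches.default interface.

```js
// Hypothetical sketch only; the Cache API isn't available yet.
// cacheKeyFor() and runStreamedFiltering() are placeholders.
async function handleWithCache(event) {
  const request = event.request
  if (!(await verifyJWT(request.headers.get('Authorization')))) {
    return new Response('Unauthorized', { status: 401 })
  }

  const cache = caches.default
  const cacheKey = new Request(cacheKeyFor(request)) // e.g. a URL encoding the filters

  const cached = await cache.match(cacheKey)
  if (cached) return cached

  const response = await runStreamedFiltering(request)
  // Store a copy for next time while the original streams to the client.
  event.waitUntil(cache.put(cacheKey, response.clone()))
  return response
}
```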

#2

I can't really answer any of your other questions, but Cloudflare has stated that they plan to add support for the Cache API in the near future (no specific dates have been announced, as far as I can tell).


#3

I'd try simulating an 800 KB payload of what you're parsing in the playground first and see if you hit the CPU consumption error. I bet payloads of that size will never be possible given how much CPU time it would take just to parse them.


#4

@thomas4, @user751 thanks!
I've run some tests in the playground with a 10 MB uncompressed file and it worked fine. I was able to use a TransformStream to stream that request back as a response while decoding it along the way (TextDecoder.decode), but I have no idea what restrictions the playground has compared to normal Workers.


#5

Hi @tadeuszwojcik,

This definitely sounds like a good use case for a Worker, but I think the biggest stumbling block will be the JSON parsing.

Only gzip. We don’t currently have plans to support brotli decompression, though I believe this would be feasible.

Decompression is part of CPU time, but I don't expect it would be a bottleneck. I expect JSON.parse() would hit the CPU limit for JSON data of this size. You might hit the memory limit too, or at least drive the garbage collector hard enough to further impact CPU time. Using NDJSON to stream-parse would help with the memory limit issue and might help a little with CPU time, though I'd still be concerned. I'd recommend trying it out with some realistic test data in production, with both gzip and identity Content-Encoding.

A separate issue is that Cloudflare Workers does not yet support custom ReadableStreams, which I suspect can-ndjson-stream requires. While we plan to implement custom ReadableStreams, I don’t have a timeline for it yet.

No, Cloudflare Workers will only do this for gzip encoding. While we could conceivably support brotli decompression, brotli compression is likely too CPU-intensive for us to provide right now.

Yes, we plan to open the Cache API beta next month. Look for an announcement on https://blog.cloudflare.com/.

Harris


#6

We’ve seen manual piping using TransformStreams hit the CPU limit after around 7MB in production, but around 35MB in the preview service / playground. However, these tests used 4KB intermediate buffers – using larger buffers with a “byob” reader could reduce CPU consumption significantly, though I haven’t tested this yet.


#7

@harris I really appreciate your response, very informative, thank you!
I can work with gzip on the origin server, so it's not a deal-breaker. Great news about the Cache API.

I've run a few experiments with a production Worker (not the playground).
Here's my Worker code: https://gist.github.com/tadeuszwojcik/17651bc6359f4a67b84115a5ea2e88b3 (nothing fancy, just transform streaming with encoding/decoding along the way; in the real world I would filter it somehow, but I think I could do that without JSON.parse, for example by encoding the filter name as the first characters of each line of the response). I ran these on the Pro plan (10 ms CPU time).

  • For a 3.5 MB JSON file (already stored compressed in S3, compressed size 330 KB)
    https://workers.codefather.io/3.5MB.json.gz

    • it works 90% of the time, but from time to time returns invalid JSON, as if the response got cut off.

    • it fails more often when I request 10MB.json.gz before it

As a note to myself, it's better not to use Chrome dev tools to check large files like that, it's slooow :slight_smile:

I presume all of those errors are because I'm exceeding the CPU quota, even though my Worker's 99th percentile shows 2.2 ms. I think that's because when streaming I send the headers first, which basically say the response is fine (status 200), and then start streaming; when I exceed the CPU quota the streaming simply ends, and it's not observable in any way other than the response JSON being invalid. Am I correct, is that how it works? And is reporting based on status codes? Another issue is that after trying 10 MB and then 3.5 MB, the smaller response also fails more often; is that because the previous stream wasn't closed properly, so it sort of hangs there? I'm not sure.

You've mentioned that decompression is part of CPU time. Is that only the first time, to produce the decompressed file, or every time the file is taken from the cache as well? What I mean is: does Cloudflare store two copies, compressed and decompressed, or does it produce the decompressed one every time for Worker consumption?

I've also noticed that when Content-Encoding is not specified, Cloudflare applies gzip to the response by default, and when brotli is turned on in the settings it applies brotli compression. Is that how it works? Is this 'implicit' output compression part of CPU time?

Sorry, it's a bit of a long one. I generally love the service and how fast I'm able to hack something together; I'm fully aware I'm trying to push it to its limits and it's probably not the best use case for it right now.


#8

Hi @tadeuszwojcik,

Correct. Request status and CPU time are both recorded at the time the script returns the response object, which is why the analytics says the worker is only using 2.2ms at the 99th percentile. This is a bug that we plan to fix. Note that the request status in the analytics is not actually related to the response status code, but merely records whether or not an exception was thrown or the CPU time limit was exceeded (or some other error) before the response object was returned. So a worker manually returning a 500 would still count as a successful request.

The invalid, truncated JSON is indeed caused by the CPU time limit being exceeded. The fact that smaller responses fail more often when the previous response was larger is an artifact of our implementation: you should be able to observe that response capacity averages out to the same amount over time, given enough requests.

Cloudflare caches the gzipped version of the response, so the decompression in the worker happens every time. When the Cache API becomes available, that will help reduce that overhead, since you’ll be able to manually cache the decompressed version of the JSON. Alternatively, you might consider serving the files in identity encoding from S3, and closely monitoring your cache hit rate to make sure your AWS bill doesn’t get too high. Setting a custom cache TTL on the fetch() call might help: https://developers.cloudflare.com/workers/reference/cloudflare-features/#override-cache-ttl
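
For example, overriding the cache TTL on the subrequest would look roughly like this (the bucket URL is a placeholder):

```js
// Override the edge cache TTL for the subrequest so repeated requests for the
// same file hit Cloudflare's cache instead of going back to S3.
const origin = await fetch('https://my-bucket.s3.amazonaws.com/data.json.gz', {
  cf: {
    cacheTtl: 86400,        // keep the origin response in the edge cache for a day
    cacheEverything: true,  // cache it even if the origin doesn't send cache headers
  },
})
```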

I had forgotten about this feature. This is applied at a different stage in our pipeline, and thus does not count against the Workers CPU time limit. It looks like you can disable this behavior by setting Cache-Control: no-transform in the worker (https://support.cloudflare.com/hc/en-us/articles/200168396), but it would clearly be in your interest to leave it enabled and stream output from the worker in identity encoding.

To save a bit more CPU time, you might consider removing the use of TextEncoder/TextDecoder, and trying to do the filtering with raw Uint8Arrays.
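
For instance, if the filter can be expressed as a check on the raw bytes of each line (say, the filter-name-as-prefix idea you mentioned), something along these lines avoids decoding entirely. lineMatches() is a placeholder for whatever byte-level check you need.

```js
// Sketch: filter NDJSON lines on raw bytes, without TextDecoder/TextEncoder.
// lineMatches() is a placeholder; 0x0a is '\n'.
async function filterBytes(input, writable) {
  const reader = input.getReader()
  const writer = writable.getWriter()
  let leftover = new Uint8Array(0)

  while (true) {
    const { done, value } = await reader.read()
    if (done) break

    // Prepend any partial line left over from the previous chunk.
    const chunk = new Uint8Array(leftover.length + value.length)
    chunk.set(leftover)
    chunk.set(value, leftover.length)

    let start = 0
    for (let i = 0; i < chunk.length; i++) {
      if (chunk[i] === 0x0a) {
        const line = chunk.subarray(start, i + 1) // includes the newline
        if (lineMatches(line)) await writer.write(line)
        start = i + 1
      }
    }
    leftover = chunk.slice(start) // copy the partial tail for the next iteration
  }

  if (leftover.length && lineMatches(leftover)) await writer.write(leftover)
  await writer.close()
}
```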

Lastly, I’d also recommend experimenting with removing gzip encoding from the file served from the origin (by storing it uncompressed and removing any Accept-Encoding header on the way to the origin), and also experimenting with different sized buffers when reading. I.e., try readable.getReader({mode: "byob"}) and reader.read(new Uint8Array(1 << 13)).
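
Putting those last two suggestions together, a rough sketch (the bucket URL is a placeholder, and 8 KiB is just one buffer size to try):

```js
// Sketch: request identity encoding from the origin and pump with a BYOB reader
// using an 8 KiB intermediate buffer instead of the default chunk size.
async function pumpByob(request, writable) {
  // Re-issue the request without Accept-Encoding so the origin (storing the
  // file uncompressed) returns identity-encoded bytes.
  const originRequest = new Request('https://my-bucket.s3.amazonaws.com/data.ndjson', request)
  originRequest.headers.delete('Accept-Encoding')
  const response = await fetch(originRequest)

  const reader = response.body.getReader({ mode: 'byob' })
  const writer = writable.getWriter()

  let buffer = new Uint8Array(1 << 13) // 8 KiB
  while (true) {
    const { done, value } = await reader.read(buffer)
    if (done) break
    await writer.write(value)              // 'value' is a view over the bytes just read
    buffer = new Uint8Array(value.buffer)  // reuse the underlying buffer next time
  }
  await writer.close()
}
```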

Harris


#9

Hi @harris, thanks! I really appreciate your detailed responses. All of that makes sense, I’ll experiment a little bit more given your suggestions.