[FIXED] Cloudflare Workers slow with moderate-sized WebAssembly bindings

(Edit 2: see the bottom of this thread, the problem has been mostly fixed by cloudflare. Amazing response, thank you.)

(Edit 1: this is a much better demonstration of the problem)

I have a worker (code here) that renders html templates.

Much of the logic is written in Rust, mainly to allow the templates to be rendered using the tera template-rendering library.

Everything seemed to be going well…

However there’s a big problem: running the worker is VERY slow, adding roughly 2.5 to 3s per request. That’s after trying everything to reduce the size of the compiled wasm; before that, responses were ~5s!

That might not sound like much, but when Cloudflare Workers advertises adding only a few milliseconds to the response time, 3 seconds is ~1000x slower than expected.

These slow responses occur when the worker is not “hot”, i.e. not already loaded in memory (I’ve added a header to show when this is the case). When the worker is hot, the response time drops to a more reasonable few tens of milliseconds. However, when making requests continuously, only around 1 in 20 hits an in-memory worker - I would estimate you’d need to be making ~100 requests/s to each data centre to have a good chance that most requests hit a hot worker.

What am I doing wrong? How can this be fixed? Is the slowness here in loading the worker code from disk, or running some initialisation code that is run before the worker is executed?

If this is just “how it is”, Cloudflare Workers are effectively useless for running WebAssembly.

A few things to note:

  • In this PR I’ve tried everything to reduce the size, hence the marginally improved response time discussed above. The build no longer fails with “Your built project has grown past the 1MiB size limit...”; instead I get “Built successfully, built project size is 604 KiB”. (Though ls -lh worker shows module.wasm is 2.9M - this is also weird.)
  • I’ve tried every combination of the following to reduce size, nothing has made a significant difference:
    • opt-level options
    • wasm-opt options
    • other compile time options like lto = true
    • using wee_alloc
    • running wasm-snip on the generated module.wasm
  • The slow response time is not in executing the rust code, or even running await import('../pkg') - I’ve inserted a short circuit here and the response time when hitting the short circuit is basically the same
  • This worker is implemented using type = "webpack" in wrangler.toml with the wasm-pack-plugin webpack plugin to compile the wasm, but that’s not the problem either - I tried type = "rust", see this branch, but the performance is the same
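For reference, the size-related options from the list above typically combine into a release profile along these lines (a hypothetical sketch - the exact values here are illustrative, not taken from the actual repo):

```toml
# Hypothetical release profile combining the size-reduction options above.
[profile.release]
opt-level = "z"     # optimise for size rather than speed
lto = true          # link-time optimisation across crates
codegen-units = 1   # a single codegen unit gives the optimiser more scope
panic = "abort"     # drop the unwinding machinery
```

This would usually be followed by a post-build `wasm-opt -Oz` pass over the generated module.wasm.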

I am running into the same issue. I believe the reason is that on each cold start the Wasm needs to be compiled into native code.

The following links talk more about compiling and caching to IndexedDB (now deprecated), and implicit caching in V8.

Maybe Workers KV could be used to cache the compiled Wasm if it’s under 10 MB (it should only be one read when each worker instance is instantiated, so shouldn’t be too expensive). However, I didn’t want to try to modify the generated JS for initialising the Wasm.

I am hoping that Cloudflare can provide an out of box solution for this.

Hi @nesh, thanks for the reply that’s interesting, particularly the v8 implicit caching link.

Unfortunately that doesn’t work. I tried modifying the generated code to load the raw wasm from somewhere other than the binding object:

  • WebAssembly.instantiateStreaming is not available
  • WebAssembly.instantiate with an array buffer is blocked (I guess for the same reasons eval() is blocked) - you get CompileError: WebAssembly.instantiate(): Wasm code generation disallowed by embedder
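For reference, this is roughly the shape of the manual instantiation the second bullet describes (a sketch - the 8 bytes here are just the header of an empty Wasm module, standing in for the real module.wasm contents):

```javascript
// Minimal stand-in for real module bytes: the header of an empty Wasm module.
const wasmBytes = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);

async function manualInstantiate(bytes) {
  // In Node or a browser this succeeds; in the Workers runtime this exact
  // call is rejected with "Wasm code generation disallowed by embedder".
  const { module, instance } = await WebAssembly.instantiate(bytes);
  return instance;
}

manualInstantiate(wasmBytes).then((instance) => {
  console.log(instance instanceof WebAssembly.Instance); // true outside Workers
});
```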

I also modified the generated JS of my slow worker and inserted lots of console.log() statements. Unfortunately the slow bit happens before we get to WebAssembly.instantiate or WebAssembly.compile. The slow component (presumably getting the wasm object off disk) happens before any JS is executed.

It looks to me like there’s absolutely no workaround possible at the moment - the only way to run wasm is using Cloudflare’s binding system, and that binding system is jaw-achingly slow.

Very sad.

Ah, it could be something that can only be solved by Cloudflare themselves, if we don’t have permission.

I leaned away from it being a file-loading issue because the same problem occurs when running it on localhost through wrangler dev. It could be that the logs are out of sync because the initialising code is async?

It could be that the logs are out of sync because the initialising code is async?

I don’t think so because if you return from the worker before calling await import(...) you still get the slow performance.

Sounds like this is unsolvable without changes from Cloudflare, a real shame.

Ah I see, hopefully we can get an official answer for this.

Quick update on this:

I asked [email protected] to get someone knowledgeable in the matter at CF to answer this question. That was 8 days ago. They’ve replied a number of times saying they’re looking into it, but no answer yet.

I guess that’s an implicit admission that there’s a problem their end…

I’ll update here when I hear anything.

Thanks for the update, hopefully they can solve the issue :pray:

I got the following somewhat frustrating response from CF:

Hi Samuel,

Thank you for your patience here, our workers engineering team have updated to advise pretty much as I did that there isn’t any ETA on this and this will involve a lot of effort. So we can’t guarantee any dates on when this will be resolved.

The problem itself is due to the size of the WASM module, it takes a long time for V8 to compile and optimize this, as such, a work around for this might be to look at reducing the size of your code.

One possibility is to see if you can remove some large dependency libraries to get the code size down.

Kind regards

Basically, workers are useless for anything but “hello world” WebAssembly usage.

Hi Samuel,

It looks like the response from support is a filtered version of what I told them.

We have many customers using WebAssembly successfully today.

Unfortunately, using Wasm today – whether in Workers, or in the browser – generally requires putting some effort into dependency management to get code size down to a reasonable level. Fundamentally, the problem is that Wasm modules end up like statically-linked binaries – they include not just your program itself, but also your programming language’s entire standard library, and all of your other transitive dependencies. Making matters worse, many programming languages that target Wasm were not historically designed to produce small binaries.

Contrast this with JavaScript. The entire JavaScript standard library is “built in” to V8 and therefore into the Workers Runtime. You do not have to bundle your own copy of the library with your application. Moreover, the Workers Runtime APIs aim to provide built-in support for many higher-level features too – like HTTP, TLS, WebCrypto, etc. – which would normally be provided by additional libraries in other languages.

The same trade-off exists in the browser. When using JavaScript, you get the standard library and all the APIs offered by the browser built-in. When using WebAssembly, you have to ship a Wasm module containing your language’s standard library.

Meanwhile, both browsers and edge compute are environments where small code footprint is important. In the browser, you don’t want the user to have to download a huge module before your web site can load. And in Workers, since we deploy your code to thousands of machines in order to be as close to the user as possible, we need to impose some limits on how big that code can be. And in both environments, since code needs to be loaded on-demand, large modules may lead to a further delay at load time.

Because of all this, as of today, Wasm may not be the best technology for packaging “whole applications”. Instead, it is often best used to target specific tasks that would be hard to do in JavaScript, like running a particular preexisting library, or doing number crunching that would be slow in JS.

Generally, when using Wasm, it’s important to use options like Rust’s no_std, which omits the standard library from your program. This can make binary sizes much smaller, but it does create a bit of a challenge in that you will need to work around missing library features. Again, this is best practice when using Wasm in both browsers and Workers.

In the future

In order for Wasm to work really well on the edge, we need to come up with a “shared library” standard. We need each major programming language to package its standard library as a shared module, so that one copy of that module can be loaded into all the different Workers written in that language running on the same machine. That way, individual apps can stay small. (This would also help in browsers, if those shared runtimes can be cached and shared across web sites.)

The Wasm standard itself already supports the notion of multiple modules that call each other. However, we also need the compilers for each language to support this concept. Unfortunately, at present, none of them do, as far as I know.

There are some tricks we could do on the Workers end to make large wasm modules load a bit faster, such as by using V8’s code cache features to effectively precompile modules. However, that’s a big project for us, and it’s not clear how much it will really help, if the fundamental problems with Wasm dependency management are not solved.


Thank you so much @KentonVarda, as you can probably imagine, it’s really nice to hear a thorough explanation of the problem and a confirmation that it’s not just me being daft or missing something obvious. My faith in the point of this forum is somewhat restored.

Do you think it would be fair for Cloudflare to include this caveat in your documentation on running wasm on Workers? It seems pretty significant that if you’re using Rust with Cloudflare Workers you’ll probably need no_std. It precludes a lot of use cases, and it would be good to learn that before building the application, not after.

I still don’t quite understand why the worker is slow if you exit before initiating the WebAssembly. Are your workers doing something similar to WebAssembly.instantiate before user code is ever executed?

On the specific question of what I wanted to do: render jinja (or jinja-like) templates on workers - do you have any suggestions? I originally wanted to use nunjucks (without pre-compiling the templates) but that requires eval() which is prohibited. I then tried tera in rust, but three weeks later it has become clear that this wasm problem means tera won’t work. Do you have any alternative suggestions?

EDIT: A few people are linking to this comment from the internet, but note that my previous comment is probably the one you want. The previous comment discusses challenges of Wasm in general; this comment is more specifically our implementation.


Yeah some documentation improvements probably make sense. It’s tricky because Wasm best practices are still an area of active exploration by the Wasm community. TBH we on the Workers team aren’t the experts on this topic; we probably need to spend more time building Wasm apps ourselves to really get a feel for the limitations.

I still don’t quite understand why the worker is slow if you exit before initiating the WebAssembly.

Workers compiles the Wasm file to a WebAssembly.Module before your application code starts executing. This is the slow part. WebAssembly.instantiate starts up the module that has already been compiled – this is more like starting up a native-code program and is pretty fast.

During the compilation step, V8 translates Wasm intermediate representation into its own internal implementation, then runs its optimizer over it, and finally outputs native code that can execute. The optimization is not very fast. One thing V8 in the browser does that we haven’t been able to enable yet is run a low-optimization “baseline” compiler first (what they call “liftoff”) which translates directly from Wasm to native code quickly – but the native code is not optimized so runs relatively slowly. Then, V8 spawns a background thread that works on the real optimized build.

We could enable this but there are a few potential problems:

  1. If your first few requests run unoptimized, they might confusingly go over the 50ms limit and fail, while later requests succeed.
  2. We have a general “no background threads” policy due to potential for Spectre attacks (an attacker could time their code execution by watching for the wasm optimization pass to finish).
  3. Very long Wasm builds running in the background could chew up a lot of CPU time that the developer might not notice, which could end up being pretty wasteful. In a way, the fact that the optimizations are blocking today puts some backpressure on the developer to get them to trim some fat, which we need.

That said, one thing we’re considering experimenting with as a short-term fix is disabling the optimized build altogether and using only Liftoff. We need to run some tests to see just how much slower Wasm code built by Liftoff is, to make sure it won’t cause problems for existing users. Unfortunately V8 doesn’t currently give us a way to change this flag on a per-isolate basis, only process-wide, so it’s all or nothing (unless we patch V8?). In any case, going Liftoff-only could reduce compilation time by 5x-10x (but if you see multi-second compiles now, you’ll still be seeing hundreds of milliseconds, which is still not great).

Longer-term, we could use V8 code cache to effectively do compilation centrally, distributing precompiled modules to the edge. There’s a bunch of complications to doing this, though – V8 code cache is specific to one version of V8 and one hardware architecture, so we’d potentially have to build multiple versions (e.g. for x86-64 and arm64) and proactively update them every time V8 has a major update (every 6 weeks). Big project!

Prioritizing any of this vs. other work is a struggle, though, since:

  1. The vast, vast majority of our users use JavaScript and don’t express much interest in Wasm.
  2. It’s not really clear if these incremental measures will make big Wasm applications usable. They might just run up against memory limits or the 1MB code size limit next, and then we’re back to square 1. We really need shared modules to fully solve the problem.

That’s why we couldn’t give a timeline for improvements. There are definitely things we want to do, but we have a lot of important stuff on our plate so I just don’t know when we’ll get to it. :confused:

On the specific question of what I wanted to do: render jinja (or jinja-like) templates on workers - do you have any suggestions?

Well, I guess my question is: Do you really need to load templates dynamically? Could you rig up a system where you proactively update the worker code, precompiling the template, every time a template is updated? That seems like the best of all worlds in terms of performance.


Thanks again Kenton for taking the time to give such a detailed response and explanation.

To make the performance considerations slightly more concrete, I’ve built a simple worker samuelcolvin/cloudflare-worker-speed-test which renders a string template in a number of ways, then recorded the difference in performance, from a simple JavaScript string replace with no wasm, to Rust’s replace(), right up to a full tera template.

The results are in the readme, but a plot of the timings gives a good summary:

I think it would be sufficient for a whole range of applications if CF workers could have reasonable performance (perhaps <100ms of overhead) with a wasm module less than 1MB in size. Does that sound feasible in the medium term?

That said, one thing we’re considering experimenting with as a short-term fix is disabling the optimized build altogether and using only Liftoff.

That makes lots of sense to me. CF workers aren’t designed for CPU intensive tasks so I would imagine liftoff mode would be fine in most scenarios. But I see you have a problem with existing usage.

Well, I guess my question is: Do you really need to load templates dynamically?

My idea is to develop a new way to build web applications: instead of rendering templates either on the origin server or in the browser, I want to do it at the edge. I think this would have numerous advantages (performance, organisational, monitoring & error handling, dev, etc.) over either django/rails-style templates or react/angular SPAs. “Edge rendering” would already add another component to set up, so it needs to be as easy as possible to use - in particular, I want to be able to update the frontend by uploading new templates without having to rebuild and deploy the worker.

The thing is, one reason people use WebAssembly on Workers today is specifically to perform CPU-intensive tasks faster, like image resizing. So we need to carefully study the performance penalty for using liftoff only.

Anyway, we definitely agree that some work is needed here. I just can’t say exactly when we’ll get a chance to work on it.

I noticed that the new Workers Unbound page https://www.cloudflare.com/workers-unbound-beta/ and https://workers.cloudflare.com/ both say “no cold starts” - does this work for the WebAssembly use case?

@nesh The “0ms cold starts” thing refers to our new optimization of parallelizing cold start with TLS handshake. This eliminates the perceived cold start if it finishes faster than a TLS handshake, which is typically some 10ms-50ms. The vast majority of Workers in production today do in fact start faster than this, but large Wasm modules that start slowly today unfortunately will not have that start time eliminated.

However, good news: We found a way to enable V8’s Liftoff+Tier-up that we like. This should go out to production next week. I think it should cut Wasm startup time by a factor of 5-10.

(This still doesn’t use code cache, which is a much bigger project that should cut Wasm startup time to almost nothing. No timeline on that yet, but let’s see how much Liftoff helps.)

Also, thanks @samuelcolvin, your speed test is now used in a unit test in the Workers Runtime codebase. :smiley:


That’s great news. I’ll have a play towards the end of next week and update the readme.

Glad my code was of use.

@KentonVarda Awesome that is great news :slight_smile: Thank you.

Thanks very much. I’ve been suffering from these issues on one of my Workers even though I’ve taken significant steps to reduce the bundle size (I have my Wasm module down to 192KB - a bit heavy, but significantly improved from the nearly 500KB it was with zero effort at optimization). Looking forward to the rollout!

It seems like the update to reduce cold start times has rolled out. I am still seeing occasional spikes to 300-500ms when responding to a request on a cold start; however, I am seeing 100-200ms response times much more frequently. Great work!