[FIXED] Cloudflare Workers slow with moderate-sized WebAssembly bindings

Quick update on this:

I asked [email protected] to get someone knowledgeable in the matter at CF to answer this question. That was 8 days ago. They’ve replied a number of times saying they’re looking into it, but no answer yet.

I guess that’s an implicit admission that there’s a problem at their end…

I’ll update here when I hear anything.

Thanks for the update, hopefully they can solve the issue :pray:

I got the following somewhat frustrating response from CF:

Hi Samuel,

Thank you for your patience here, our Workers engineering team have updated to advise, pretty much as I did, that there isn’t any ETA on this and that it will involve a lot of effort. So we can’t guarantee any dates on when this will be resolved.

The problem itself is due to the size of the WASM module; it takes a long time for V8 to compile and optimize this. As such, a workaround might be to look at reducing the size of your code.

One possibility is to see if you can remove some large dependency libraries to get the code size down.

Kind regards

Basically, workers are useless for anything but “hello world” web-assembly usage.

Hi Samuel,

It looks like the response from support is a filtered version of what I told them.

We have many customers using WebAssembly successfully today.

Unfortunately, using Wasm today – whether in Workers, or in the browser – generally requires putting some effort into dependency management to get code size down to a reasonable level. Fundamentally, the problem is that Wasm modules end up like statically-linked binaries – they include not just your program itself, but also your programming language’s entire standard library, and all of your other transitive dependencies. Making matters worse, many programming languages that target Wasm were not historically designed to produce small binaries.

Contrast this with JavaScript. The entire JavaScript standard library is “built in” to V8 and therefore into the Workers Runtime. You do not have to bundle your own copy of the library with your application. Moreover, the Workers Runtime APIs aim to provide built-in support for many higher-level features too – like HTTP, TLS, WebCrypto, etc. – which would normally be provided by additional libraries in other languages.

The same trade-off exists in the browser. When using JavaScript, you get the standard library and all the APIs offered by the browser built-in. When using WebAssembly, you have to ship a Wasm module containing your language’s standard library.

Meanwhile, both browsers and edge compute are environments where small code footprint is important. In the browser, you don’t want the user to have to download a huge module before your web site can load. And in Workers, since we deploy your code to thousands of machines in order to be as close to the user as possible, we need to impose some limits on how big that code can be. And in both environments, since code needs to be loaded on-demand, large modules may lead to a further delay at load time.

Because of all this, as of today, Wasm may not be the best technology for packaging “whole applications”. Instead, it is often best used to target specific tasks that would be hard to do in JavaScript, like running a particular preexisting library, or doing number crunching that would be slow in JS.

Generally, when using Wasm, it’s important to use options like Rust’s no_std, which omits the standard library from your program. This can make binary sizes much smaller, but it does create a bit of a challenge in that you will need to work around missing library features. Again, this is best practice when using Wasm in both browsers and Workers.
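To illustrate what that looks like in practice, here is a minimal sketch of a no_std crate built for the wasm32-unknown-unknown target; the exported add function is purely illustrative, and in a real project you would combine this with size-focused release settings in Cargo.toml (e.g. opt-level = "z", lto, panic = "abort") and a wasm-opt pass.

    // Minimal no_std sketch for wasm32-unknown-unknown (illustrative only).
    // Build with: cargo build --release --target wasm32-unknown-unknown
    #![no_std]

    use core::panic::PanicInfo;

    // Without the standard library you must provide your own panic handler.
    #[panic_handler]
    fn panic(_info: &PanicInfo) -> ! {
        loop {}
    }

    // Purely illustrative export; a real module would expose whatever the
    // Worker actually calls, either as raw exports or via wasm-bindgen.
    #[no_mangle]
    pub extern "C" fn add(a: i32, b: i32) -> i32 {
        a + b
    }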

In the future

In order for Wasm to work really well on the edge, we need to come up with a “shared library” standard. We need each major programming language to package its standard library as a shared module, so that one copy of that module can be loaded into all the different Workers written in that language running on the same machine. That way, individual apps can stay small. (This would also help in browsers, if those shared runtimes can be cached and shared across web sites.)

The Wasm standard itself already supports the notion of multiple modules that call each other. However, we also need the compilers for each language to support this concept. Unfortunately, at present, none of them do, as far as I know.

There are some tricks we could do on the Workers end to make large wasm modules load a bit faster, such as by using V8’s code cache features to effectively precompile modules. However, that’s a big project for us, and it’s not clear how much it will really help, if the fundamental problems with Wasm dependency management are not solved.

9 Likes

Thank you so much @KentonVarda, as you can probably imagine, it’s really nice to hear a thorough explanation of the problem and a confirmation that it’s not just me being daft or missing something obvious. My faith in the point of this forum is somewhat restored.

Do you think it would be fair for Cloudflare to include this caveat in your documentation on running wasm on workers? It seems pretty significant to me that if you’re using rust with Cloudflare workers you’ll probably need no_std. It precludes a lot of use cases and it would be good to learn that before building the application, not after.

I still don’t quite understand why the worker is slow if you exit before initiating the WebAssembly. Are your workers doing something similar to WebAssembly.instantiate before user code is ever executed?

On the specific question of what I wanted to do: render jinja (or jinja-like) templates on workers - do you have any suggestions? I originally wanted to use nunjucks (without pre-compiling the templates) but that requires eval() which is prohibited. I then tried tera in rust, but three weeks later it has become clear that this wasm problem means tera won’t work. Do you have any alternative suggestions?

EDIT: A few people are linking to this comment from the internet, but note that my previous comment is probably the one you want. The previous comment discusses challenges of Wasm in general; this comment is more specifically about our implementation.


Yeah some documentation improvements probably make sense. It’s tricky because Wasm best practices are still an area of active exploration by the Wasm community. TBH we on the Workers team aren’t the experts on this topic; we probably need to spend more time building Wasm apps ourselves to really get a feel for the limitations.

I still don’t quite understand why the worker is slow if you exit before initiating the WebAssembly.

Workers compiles the Wasm file to a WebAssembly.Module before your application code starts executing. This is the slow part. WebAssembly.instantiate starts up the module that has already been compiled – this is more like starting up a native-code program and is pretty fast.

During the compilation step, V8 translates the Wasm intermediate representation into its own internal representation, then runs its optimizer over it, and finally outputs native code that can execute. The optimization is not very fast. One thing V8 in the browser does that we haven’t been able to enable yet is run a low-optimization “baseline” compiler first (what they call “Liftoff”), which translates directly from Wasm to native code quickly – but the native code is not optimized, so it runs relatively slowly. Then, V8 spawns a background thread that works on the real optimized build.

We could enable this but there are a few potential problems:

  1. If your first few requests run unoptimized, they might confusingly go over the 50ms limit and fail, while later requests succeed.
  2. We have a general “no background threads” policy due to potential for Spectre attacks (an attacker could time their code execution by watching for the wasm optimization pass to finish).
  3. Very long Wasm builds running in the background could chew up a lot of CPU time that the developer might not notice, which could end up being pretty wasteful. In a way, the fact that the optimizations are blocking today puts some backpressure on the developer to get them to trim some fat, which we need.

That said, one thing we’re considering experimenting with as a short-term fix is disabling the optimized build altogether and using only Liftoff. We need to run some tests to see just how much slower Wasm code built by Liftoff is, to make sure it won’t cause problems for existing users. Unfortunately V8 doesn’t currently give us a way to change this flag on a per-isolate basis, only process-wide, so it’s all or nothing (unless we patch V8?). In any case, going Liftoff-only could reduce compilation time by 5x-10x (but if you see multi-second compiles now, you’ll still be seeing hundreds of milliseconds, which is still not great).

Longer-term, we could use V8 code cache to effectively do compilation centrally, distributing precompiled modules to the edge. There’s a bunch of complications to doing this, though – V8 code cache is specific to one version of V8 and one hardware architecture, so we’d potentially have to build multiple versions (e.g. for x86-64 and arm64) and proactively update them every time V8 has a major update (every 6 weeks). Big project!

Prioritizing any of this vs. other work is a struggle, though, since:

  1. The vast, vast majority of our users use JavaScript and don’t express much interest in Wasm.
  2. It’s not really clear if these incremental measures will make big Wasm applications usable. They might just run up against memory limits or the 1MB code size limit next, and then we’re back to square 1. We really need shared modules to fully solve the problem.

That’s why we couldn’t give a timeline for improvements. There are definitely things we want to do, but we have a lot of important stuff on our plate so I just don’t know when we’ll get to it. :confused:

On the specific question of what I wanted to do: render jinja (or jinja-like) templates on workers - do you have any suggestions?

Well, I guess my question is: Do you really need to load templates dynamically? Could you rig up a system where you proactively update the worker code, precompiling the template, every time a template is updated? That seems like the best of all worlds in terms of performance.
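For illustration, a rough sketch of this approach on the Rust side using tera, with templates embedded into the wasm module at compile time (the file path and function name are placeholders, not anything from the actual project); a template change then means rebuilding and redeploying the worker rather than loading templates at request time.

    // Sketch: templates baked into the wasm binary at build time, so
    // updating a template means rebuilding/redeploying the worker.
    use tera::{Context, Tera};

    // include_str! embeds the file's contents into the binary at compile time.
    const PAGE_TMPL: &str = include_str!("../templates/page.html");

    pub fn render_page(title: &str, body: &str) -> Result<String, tera::Error> {
        // In real code you would build the Tera instance once and reuse it.
        let mut tera = Tera::default();
        tera.add_raw_template("page.html", PAGE_TMPL)?;

        let mut ctx = Context::new();
        ctx.insert("title", title);
        ctx.insert("body", body);
        tera.render("page.html", &ctx)
    }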

2 Likes

Thanks again Kenton for taking the time to give such a detailed response and explanation.

To make the performance considerations slightly more concrete, I’ve built a simple worker samuelcolvin/cloudflare-worker-speed-test which renders a string template in a number of ways and records the difference in performance, from a simple JavaScript string replace with no wasm, to Rust’s replace(), right up to a tera template.
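For context, the Rust side of each wasm case is essentially just a function exported to the worker via wasm-bindgen; a hypothetical minimal version of the replace() case might look like the following (illustrative only, not the actual code from the repo):

    // Hypothetical minimal wasm-bindgen export, roughly the shape of the
    // "rust replace()" case in the speed test (not the repo's actual code).
    use wasm_bindgen::prelude::*;

    #[wasm_bindgen]
    pub fn render(template: &str, name: &str) -> String {
        // Naive single-placeholder substitution, standing in for a real
        // template engine such as tera.
        template.replace("{{ name }}", name)
    }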

The results are in the readme, but a plot of the timings gives a good summary:

I think it would be sufficient for a whole range of applications if CF workers could have reasonable performance (perhaps <100ms of overhead) with a wasm module less than 1MB in size. Does that sound feasible in the medium term?

That said, one thing we’re considering experimenting with as a short-term fix is disabling the optimized build altogether and using only Liftoff.

That makes lots of sense to me. CF workers aren’t designed for CPU intensive tasks so I would imagine liftoff mode would be fine in most scenarios. But I see you have a problem with existing usage.

Well, I guess my question is: Do you really need to load templates dynamically?

My idea is to develop a new way to build web applications: instead of rendering templates either on the origin server or in the browser, I want to do it at the edge. I think this would have numerous advantages (performance, organisational, monitoring & error handling, dev, etc.) over either django/rails style templates or react/angular SPAs. I want to be able to update the frontend by uploading new templates without having to rebuild and deploy the worker - “edge rendering” would already add another component to set up, so it needs to be as easy as possible to use.

The thing is, one reason people use WebAssembly on Workers today is specifically to perform CPU-intensive tasks faster, like image resizing. So we need to carefully study the performance penalty for using Liftoff only.

Anyway, we definitely agree that some work is needed here. I just can’t say exactly when we’ll get a chance to work on it.

I noticed on the new Workers Unbound page (Workers Unbound Serverless Compute Platform) and https://workers.cloudflare.com/ that it says no cold starts – does this work for the WebAssembly use case?

@nesh The “0ms cold starts” thing refers to our new optimization of parallelizing cold start with TLS handshake. This eliminates the perceived cold start if it finishes faster than a TLS handshake, which is typically some 10ms-50ms. The vast majority of Workers in production today do in fact start faster than this, but large Wasm modules that start slowly today unfortunately will not have that start time eliminated.

However, good news: We found a way to enable V8’s Liftoff+Tier-up that we like. This should go out to production next week. I think it should cut Wasm startup time by a factor of 5-10.

(This still doesn’t use code cache, which is a much bigger project that should cut Wasm startup time to almost nothing. No timeline on that yet, but let’s see how much Liftoff helps.)

Also, thanks @samuelcolvin, your speed test is now used in a unit test in the Workers Runtime codebase. :smiley:

7 Likes

That’s great news. I’ll have a play towards the end of next week and update the readme.

Glad my code was of use.

@KentonVarda Awesome that is great news :slight_smile: Thank you.

Thanks very much. I’ve been suffering from these issues on one of my Workers even though I’ve taken significant steps to reduce the bundle size (I have my Wasm module down to 192KB – a bit heavy, but still significantly improved from the ~500KB it was with zero effort at optimization). Looking forward to the rollout!

It seems like the update to reduce cold start times has rolled out. I am still seeing some occasional spikes to 300-500ms to reply to a request from a cold client, but I am seeing 100-200ms response times much more frequently. Great work!

I just tested out my code too; the cold start went from around a 5 second average to around 500ms now, which is great :-).

Indeed, this is out now! Seems to be a ~10x improvement.

I’m still not happy that large Wasm takes hundreds of milliseconds on first start. We should be able to get that down a lot further with code caching. Hoping to be able to work on that in the next few months.

@samuelcolvin This thread has been linked from a few places around the internet. I don’t suppose you’d be willing to edit the title or add a note to the top saying the situation has been improved?

Done. Thank you so much.

I’m on holiday right now, but will try it out next week. Very exciting…

I’ve tried this and updated the speed-test repo readme.

TL;DR: For the largest case the response time has dropped from 2844ms to 211ms.

6 Likes

@samuelcolvin That’s an amazing improvement! Big applause to @KentonVarda and the team!

@samuelcolvin you still around? Side note to your epic thread with @KentonVarda here:

My idea is to develop a new way to build web applications: instead of rendering templates either on the origin server or in the browser, I want to do it at the edge. I think this would have numerous advantages (performance, organisational, monitoring & error handling, dev, etc.) over either django/rails style templates or react/angular SPAs. I want to be able to update the frontend by uploading new templates without having to rebuild and deploy the worker - “edge rendering” would already add another component to set up, so it needs to be as easy as possible to use.

This really speaks to me.

I am mostly a hobbyist with this type of dev, so forgive me for probably being a big idiot though.

I used to think simply having an SSR app at the edge was a dream setup, but I am not sure it’s enough now for what I really want.

I have been giving a ton of thought to this idea of “edge rendering”, and, if performance can get there, I think it can really expand to something even bigger and more exciting.

I imagine it drifting more into the lane of a build process for dynamically loaded JavaScript components.

For example, imagine you just hit a simple single worker route (/about) with a bunch of virtual JavaScript modules:

<Hero2 />
<Accordion />
<Slider 4 />
<Article />
<Map />
<Footer />

Then during the request it loads them in however it needs to. Right before the response:

  • Render
  • Tree shaking / Dead Code Elimination
  • CSS Purging
  • Uglify
  • Minify

Giving you the smallest footprint possible for that single routed page on load (quickly!). Then just cache the response for other visitors.

I imagine that without having to deal with multi-page input and output / code chunking, by keeping to a single page route and leveraging Cloudflare’s WASM setup… you could maybe, just maybe, get this performant enough to produce these hyper-tiny single routes.

This setup would allow people with small or gigantic sites to skip huge builds or an annoying build pipeline.

I would call it “JIT Simple Bundling” or “Runtime Simple Bundling” or something…

I’ve seen how ridiculously fast the Rollup and Svelte REPLs are as client-side web workers on some heavy JS. And ESBuild (Go bundler) and the SWC Project (Rust bundler) have shocking benchmarks. Both of those come with a WASM API, though I think they are multi-threaded. I’m working towards throwing a proof of concept together.

Either way. Again full disclosure, I’m a big dummy and know nothing.

Would love to hear more about your specific experiment, performance, thoughts, use case, where the heck you landed with it, etc. Happy to connect separately / offline from this thread too if you’re interested in sharing/discussing.

Cheers and thanks!