If you have a ticket number, post it here so @cloonan can track it.
I don’t think they actually differentiate CPU from RAM. Could it be RAM exhaustion? Just a guess, not sure if it’s based in fact.
Hm, that does sound plausible. I’ll test it, thanks for the hint!
You were probably right. I isolated the crypt code and ran load tests on that alone.
It’s no problem for workers to generate random strings as large as 5000 characters.
I’ll refactor the whole thing and try to clear all variables to make it eligible for garbage collection.
It’s tricky because when the isolate comes close to running out of memory, the garbage collector tends to work much harder, and as a result you tend to hit the CPU limit before the memory limit even though your problem is actually memory.
With that said, in practice, if you see error 1102, it is almost certainly from the CPU limit. These days when an isolate goes over its memory limit, we let it finish up in-flight requests before evicting it – unless it goes way over. So it’s actually hard to observe errors from hitting the memory limit.
Note that it’s common to see requests working correctly for a while, and then start throwing 1102. This is because we allow a worker to exceed its limit from time to time as long as it is not consistently over the limit. The enforced limit is 50ms (regardless of plan), so if you have a worker that runs for 60ms every time, it will succeed a few times and then start failing after a while. TBH we should probably improve this so that the first few requests have strong enforcement and then let up a bit later on, so that errors are easier to see early…
FWIW generating a 50-character string should be very cheap, assuming you are using crypto.getRandomValues(). I’d guess there’s something else going on in your code that’s making it expensive…
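For reference, here is a minimal sketch of generating a random string with crypto.getRandomValues(). The function name randomString and the 64-character alphabet are my own choices for illustration, not anything from the original code. Using exactly 64 characters means each byte can be masked down to an index without bias.

```javascript
// Workers expose `crypto` globally; the fallback is only for running this
// sketch locally on older Node versions (an assumption, not Workers behavior).
const cryptoApi = globalThis.crypto ?? require("node:crypto").webcrypto;

// Hypothetical alphabet: 64 URL-safe characters, so `byte & 63` is unbiased.
const ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

// Returns a random string of `length` characters drawn from ALPHABET.
// getRandomValues() accepts at most 65536 bytes per call, which is plenty
// for the 50 to 5000 character strings discussed here.
function randomString(length) {
  const bytes = new Uint8Array(length);
  cryptoApi.getRandomValues(bytes);
  let out = "";
  for (const b of bytes) {
    out += ALPHABET[b & 63];
  }
  return out;
}
```

Calling randomString(50) does one getRandomValues() call and one short loop, which is why it should not come anywhere near the CPU limit.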
@KentonVarda It was actually an earlier hash comparison function that caused it. It’s doing PBKDF2 with SHA-512 and 30K iterations. Lowering it to 25K solves the issue.
Ah, yes, PBKDF2 is very expensive in terms of CPU – in fact it’s explicitly intended to be expensive.
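To make the cost concrete, here is a rough sketch of a PBKDF2 derivation through the Web Crypto API, plus a helper for timing it locally. The names pbkdf2Hex and timePbkdf2, the 16-byte salt, and the sample password are assumptions for illustration; the actual code from this thread was not posted. Local wall-clock timings only approximate what the Workers runtime counts as CPU time.

```javascript
// Workers expose `crypto` globally; the fallback is only for running this
// sketch locally on older Node versions.
const webCrypto = globalThis.crypto ?? require("node:crypto").webcrypto;

// Derives `bits` bits with PBKDF2 and returns them hex-encoded.
// `iterations` is the knob that drives CPU cost roughly linearly.
async function pbkdf2Hex(password, salt, iterations, hash = "SHA-512", bits = 512) {
  const keyMaterial = await webCrypto.subtle.importKey(
    "raw",
    new TextEncoder().encode(password),
    "PBKDF2",
    false,
    ["deriveBits"]
  );
  const derived = await webCrypto.subtle.deriveBits(
    { name: "PBKDF2", hash, salt, iterations },
    keyMaterial,
    bits
  );
  return Array.from(new Uint8Array(derived), (b) =>
    b.toString(16).padStart(2, "0")
  ).join("");
}

// Hypothetical helper: wall-clock milliseconds for one derivation.
async function timePbkdf2(iterations) {
  const salt = webCrypto.getRandomValues(new Uint8Array(16));
  const start = performance.now();
  await pbkdf2Hex("correct horse battery staple", salt, iterations);
  return performance.now() - start;
}
```

Running timePbkdf2(25000) and timePbkdf2(30000) a few times locally should give a feel for how close each setting sits to the 50ms limit, though the numbers will depend on the machine.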
Yes, I’m aware. I would actually have loved it if bcrypt or Argon2 were officially supported.
Now that I know the limit is enforced at a later stage, it makes more sense, so thanks for that. Now I can stop pulling my hair out trying to find a reason.
Now that I’ve run even larger load tests, the CPU exhaustion happened again at ~50K requests, for about 13% of requests.
Seems that it accumulates somehow?
It sounds like your average request time is slightly over 50ms, but due to random noise a significant number of them are coming in slightly under 50ms, hence some requests succeed and some fail.
For now, there’s nothing you can do except try to reduce the time further. In the future we hope to add the option to pay for more CPU time.
Yeah, I lowered it again, but now it’s at a limit where it’s not strong enough.
PBKDF2 with SHA-256, 25K iterations, 32-bit salt length.
In comparison, Django’s default is 180K rounds and 64-bit salt key length.
Which means that, for storing password hashes for anything sensitive, it might not be enough.
But, I guess that since you already encrypt the database contents, it shouldn’t be a problem and if the code-base is compromised, the hashing doesn’t matter anyway.
Unfortunately PBKDF2 may simply be a bad fit for Workers right now, due to the CPU time limit.
The security of PBKDF2 is proportional to how much CPU time you spend on it. The whole idea is that an attacker performing a brute force attack has to do that amount of work for each guess, so the longer you can make it take, the slower the attack runs. So e.g. if you do 25k iterations, then a brute force attack takes 1/4 as long as it would take if you did 100k.
Naturally, since Workers limits you to 50ms, the best you can do is force the attacker to do 50ms of work per guess. Meanwhile, guidelines for using PBKDF2 usually aim in the range of hundreds of milliseconds. Unfortunately, these guidelines are inherently incompatible with the Workers CPU limit as it stands.
To make matters worse, PBKDF2 is sort of obsolete because it can run much, much faster on a GPU or ASIC. A GPU can brute-force PBKDF2 something like 1000x as fast as a CPU can, while an ASIC could be 10,000x or more. The time taken will still be proportional to the number of iterations, but it’s not clear if any number of iterations that can reasonably be performed on a CPU will really get you that far against a dedicated attacker. (Modern password-hashing algorithms like scrypt are better because they are harder to implement in GPUs and ASICs.)
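The arithmetic above can be sketched in a few lines. The function names and the figure of one billion guesses are made up for illustration; the 50ms per guess and the 1000x GPU speedup are the rough numbers from this thread, not measurements.

```javascript
// Relative brute-force cost of one iteration count versus a baseline.
// PBKDF2 cost scales linearly with the number of iterations.
function relativeAttackCost(iterations, baselineIterations) {
  return iterations / baselineIterations;
}

// Estimated seconds to try `guesses` candidate passwords, given the CPU
// cost per guess in milliseconds and a hardware speedup factor
// (1 = same CPU as the defender, 1000 = the rough GPU figure above).
function bruteForceSeconds(guesses, msPerGuessOnCpu, hardwareSpeedup = 1) {
  return (guesses * msPerGuessOnCpu) / 1000 / hardwareSpeedup;
}

// 25k iterations cost the attacker a quarter of what 100k would.
console.log(relativeAttackCost(25000, 100000)); // 0.25

// One billion guesses (an arbitrary assumption) at 50ms each:
console.log(bruteForceSeconds(1e9, 50)); // 50000000 seconds, about 1.6 years
console.log(bruteForceSeconds(1e9, 50, 1000)); // 50000 seconds, about 14 hours
```

So even at the full 50ms budget, the assumed GPU speedup turns years of CPU work into hours, which is the concern with relying on iteration count alone.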
@KentonVarda I’m well aware, I’d rather use bcrypt or Argon2, but that’s not possible. I’ll try scrypt.
Never mind, I see scrypt is not available in the Web Crypto API.
@KentonVarda I see that even scrypt has ASIC implementations now. Argon2 seems safest.
I guess I’ll have to set up an external API just for the hashing, which is kind of silly.
Maybe a paid feature/function just for hashing/crypto can be added to Workers?
We’d like to let people pay for longer CPU limits, which would solve the problem, but I’m not sure yet if/when that will happen.
We could maybe add better algorithms to the WebCrypto implementation as well, but probably not useful until we have longer CPU limits in place, since I think all of them try to use multiple-hundred-milliseconds. (To be honest I’m not really sure why we added PBKDF2…)
@KentonVarda I’d like to understand the attack surface. Considering that the database entries are encrypted at rest and decrypted only during a request, the plausible password leaks here are all code-based, and if someone has access to the worker account/code, then they can capture the actual passwords anyway and the hashing doesn’t matter. Or am I missing something here?
Have you considered Rust/WASM?
That’ll get you scrypt and PBKDF2, might be faster…