Changes last week:
- Fixed a regression which could cause HTMLRewriter.transform() to throw spurious “The parser has stopped.” errors.
- Upgraded V8 from 8.4 to 8.5.
Hi @harris, thanks for this change. We’ve seen a significant reduction in the number of truncated responses we serve to customers as a result of it — we really appreciate it.
Curious if you can go into any more detail about the nature of the fix for “The parser has stopped” errors? The reason I ask is that while our response truncation rate has dropped significantly, we still see two or three truncated responses a day (previously we were at ~60 per day).
I’m also wondering if there are any plans to improve the visibility of these errors in the future? I’ve been monitoring the exceptions we see in the dashboard and I haven’t seen them drop off at all, leading me to assume that that graph doesn’t actually include these kinds of failures. That makes it really difficult for us to track these issues on our end, because we have to use fuzzier heuristics like response size to determine whether or not HTML transforming failed.
Finally, thanks for all the hard work — Cloudflare Workers is an awesome product.
Hi Ryan, I’m sorry for my slow reply.
Sure, I’d be happy to. The TL;DR: there was a thread-local object used to communicate errors across a language boundary that wasn’t getting reset between invocations of the parser, causing a client disconnection (which stops the parser) from one request to become visible as a stopped parser error in the next request. Subsequent requests would recover.
As you’re probably aware, HTMLRewriter is implemented using lol-html under the hood. lol-html only supports synchronous handlers directly – when a selector matches, it calls your handler, and that handler must complete processing before returning. There’s no way to suspend the parser, return from the handler, do something else like run an event loop, then later resume the parser and the handler from where they left off. This was why, up until recently, you could not await promises inside HTMLRewriter handlers.
Coming up with a good way to allow HTMLRewriter handlers to save and later resume the parser state seemed a bit difficult, so we decided to implement asynchronous handlers in HTMLRewriter using fibers (stackful coroutines). By running the parser on a fiber, a handler can suspend the entire fiber, allowing our event loop to resume processing on the thread’s main stack while the parser’s state is saved, safe and sound, on the fiber’s stack.
This worked great, but introduced a new error path: what if a client disconnects while the parser is suspended? In that case, we must wake the suspended fiber and have our handler abort parsing by returning a special stop token to the parser. When the parser sees a handler return the stop token, it finishes whatever write() was in flight by immediately returning an error.
lol-html is written in Rust, though, and the Workers runtime is written in C++, so of course it’s not as simple as “returning an error”. We embed lol-html using its C API, and that C API uses a fairly standard error reporting mechanism: functions can return a special value like -1 or NULL, indicating a thread-local error string can be retrieved with a separate free function. What’s easy to forget is that if we don’t call that function, the error never gets reset. When we introduced async handlers with their new client disconnection error path, we forgot to call that function to clear the “parser stopped” error, and the next request to use the parser on that thread would run into it.
I certainly can’t preclude the existence of more HTMLRewriter bugs, but I could imagine many other explanations as well. I take it you are detecting them based on response size? I’m curious if there’s any pattern to them, or if they’re more or less random sizes?
The best tool we have for visibility into errors at the moment is wrangler tail, which I’d encourage you to try if you haven’t already. However, it is currently limited to tailing logs from scripts which serve 10 requests per second or fewer, so if you serve a lot of traffic, it may not yet be of help.
As for the dashboard error rate being seemingly unaffected, this is likely because the truncated responses were never reported as errors. We report errors in the dashboard based on the response status and headers generated by the worker – errors or disconnections which occur mid-response-stream aren’t counted as errors. Indeed this is not a super great experience, and we do plan to build out our tail logging support more to provide better visibility into such errors. We’re getting there.
Thank you for the kind words, and for using Workers!
I’ll allow myself to jump in just to say the explanation was super extensive and professional. Impressive, @harris! Even though I was not directly affected by the bug, I’ve learned quite a few things reading this. Thanks.
I’d like to second the response above, this kind of post epitomises why I like Cloudflare as a company. This transparency is so incredibly helpful to us when we’re making decisions about building vs mitigating.
Yes, unfortunately we have been unable to use it reliably. It was useful in first uncovering the fact that we were getting errors from our workers — we managed to get it running early in the morning once and got lucky seeing the “Parser has been stopped” error in the wild. Ideally, though, we’d love a way to tail exceptions exclusively, or some better control over upstream filtering of logs so that we can tail even under peak load. Good to hear something might already be in the works, though — looking forward to it!
The response sizes do seem to be pretty random. I’m going to try to carve out some time on our side to investigate further; if I find anything interesting, I’ll let you know.
Thanks again for all of this. It’s been a frustrating investigation from our end but it makes the experience much more positive knowing why we were seeing issues.