Another valuable use case for ‘parsing then querying/mutating the DOM inside workers’ is to automatically generate server-push headers based on the content of the HTML file, i.e. based on elements such as <link rel='stylesheet' href='file/on/the/same/server.css'>, <link rel=preload>, etc. in the body of the response.
We can use HTMLRewriter to automate server push. And it works well: it can even mirror the behavior of Chrome when the <base href> element is ‘illegally’ positioned after a <link href> element.
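As a rough illustration of the idea, here is a minimal sketch of collecting stylesheet links with HTMLRewriter and turning them into a Link header. The names buildPushHeader and handle are made up for this example. Using `rel=preload; as=style` in a Link header as the push trigger is an assumption about how the server is configured. The sketch also resolves every link against the first <base href> it finds, wherever that element sits, which is a simplification and not a verified copy of Chrome's behavior.

```javascript
// Hypothetical helper: turn collected hrefs into a Link header value.
// Each href is resolved against the <base href> (if any), else the page URL.
function buildPushHeader(hrefs, pageUrl, baseHref) {
  const base = new URL(baseHref || pageUrl, pageUrl);
  const origin = new URL(pageUrl).origin;
  return hrefs
    .map((href) => new URL(href, base))
    // Only files on the same server are candidates for push.
    .filter((url) => url.origin === origin)
    .map((url) => `<${url.pathname}${url.search}>; rel=preload; as=style`)
    .join(', ');
}

// Hypothetical worker handler (only runs where HTMLRewriter exists).
async function handle(request) {
  const upstream = await fetch(request);
  const hrefs = [];
  let baseHref = null;

  const rewritten = new HTMLRewriter()
    .on('base[href]', {
      // Assumption: the first <base href> wins, regardless of position.
      element(el) { baseHref = baseHref ?? el.getAttribute('href'); },
    })
    .on('link[href]', {
      element(el) {
        const rel = (el.getAttribute('rel') || '').toLowerCase().split(/\s+/);
        if (rel.includes('stylesheet')) hrefs.push(el.getAttribute('href'));
      },
    })
    .transform(upstream);

  // The handlers only fire while the body is consumed, so the whole body
  // has to be read before the headers can be computed.
  const body = await rewritten.text();

  const headers = new Headers(upstream.headers);
  const link = buildPushHeader(hrefs, request.url, baseHref);
  if (link) headers.append('Link', link);
  return new Response(body, { status: upstream.status, headers });
}
```

The header logic is kept in a pure function so it can be tried outside a worker, e.g. buildPushHeader(['a.css'], 'https://example.com/dir/page.html', '/assets/').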
HTMLRewriter is a good solution/API here. But it has two issues:
1. The body of the response has to be read inside the worker before the response is passed out of the worker. This is because we use the content of the response body to add extra response headers. Cart before the horse. This is a similar problem to the one referred to by @davidbarratt here. On the surface, this seems to wipe out any performance benefit of using HTMLRewriter rather than, say, DOMParser (although this doesn’t seem to be the case, as I describe below).
2. Are there any other server providers that supply HTMLRewriter? If not, that means vendor lock-in and uncertainty about long-term prospects. I don’t mean to raise any alarms: HTMLRewriter looks solid and Cloudflare seems utterly committed++. All is good. What I mean is: wouldn’t it be nice if HTMLRewriter became a standard solution for all web workers, everywhere? I vote yes!
If the content of the HTML files were parsed on every request to the worker, then HTMLRewriter and low latency would become paramount. In this scenario, HTMLRewriter seems to be the only solution.
But if the worker caches the result, whether in its own memory, in the Cloudflare cache, or in a KV store, then it might not matter much whether it takes 20ms or 4ms to rewrite the HTML or the HTTP headers. You might then have one request every other week that takes 16ms extra, and 10,000 requests in between that take the same time regardless. If the worker mostly serves the same handful of files, or always caches the result anyway, then 16ms is not an issue.
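To put a number on that: one 16ms parse spread over 10,000 requests adds about 0.0016ms per request on average. A minimal sketch of the in-memory variant could look like the following. The names linkHeaderCache and cachedLinkHeader are hypothetical, and compute stands in for whatever parse produces the header. The sketch also assumes that module-level state survives between requests, which a worker runtime does not guarantee.

```javascript
// Hypothetical in-memory cache: URL -> Promise of the computed header value.
const linkHeaderCache = new Map();

// `compute` is a stand-in for the expensive parse (HTMLRewriter, DOMParser, ...).
// Storing the promise means concurrent requests for the same URL share one parse.
function cachedLinkHeader(url, compute) {
  if (!linkHeaderCache.has(url)) {
    const pending = Promise.resolve()
      .then(() => compute(url))
      .catch((err) => {
        linkHeaderCache.delete(url); // do not cache failures
        throw err;
      });
    linkHeaderCache.set(url, pending);
  }
  return linkHeaderCache.get(url);
}
```

Only the first request for a given URL pays for the parse; later ones reuse the stored value until the worker instance is recycled.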
But running DOMParser is not necessarily 16ms. In fact, it might be closer to 160ms. I did the following speed tests:
I pasted the following code into devtools on a YouTube page:
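// Serialize the current page so there is a large HTML string to parse.
const myString2 = new XMLSerializer().serializeToString(document);
const start = performance.now();
const myDocument2 = new DOMParser().parseFromString(myString2, 'text/html');
// Report how long the parse took, in milliseconds.
console.log(performance.now() - start);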
This took 155.09500000916887ms. (I guesstimate that my Chrome devtools takes about 1.5 times as long as the Cloudflare worker runtime environment, so that would translate to roughly 100ms if a DOMParser were to parse this YouTube page in a Cloudflare worker. This is a very, very rough estimate.)
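A single timing like this is noisy, so one way to firm it up a little is to repeat the parse a few times and take the median. The helper below is only a sketch, and medianTime is a made-up name. It uses nothing but performance.now(), so it can be pasted into devtools as is.

```javascript
// Hypothetical helper: run `task` several times and return the median
// duration in milliseconds.
function medianTime(task, runs = 5) {
  const samples = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    task();
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  return samples[Math.floor(samples.length / 2)];
}

// Usage in devtools (DOMParser exists there, not in Node):
// const html = new XMLSerializer().serializeToString(document);
// medianTime(() => new DOMParser().parseFromString(html, 'text/html'));
```

The median is used rather than the mean so that one slow outlier (e.g. a garbage-collection pause) does not dominate the result.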
Similarly, I ran the same YouTube page through HTMLRewriter as many times as I could before the worker was shut down after 50ms. It ran approximately 10 times, i.e. each run takes roughly 5ms.
Conclusion: even though this use case reads the entire body of the response before it is passed out of the worker, HTMLRewriter is still maybe 10-20 times faster than DOMParser, if my guesstimates are correct. And the time spent is not insignificant when the HTML files are large: 10ms vs 100ms is significant even when the result is cached. (For small HTML files this would not be true: they might take 1ms vs 10ms, which might be tolerable even if the result was not cached.)
Caveat: These are very rough estimates. I might be totally off the mark. If anyone spots any errors in my calculations, please post them so I can update my post.