DOMParser in Worker

I’m trying to do a little scraping of metadata (like the og:title and og:image tags, etc.) from pages. However, there doesn’t appear to be an easy way to do this in a Worker. The only way to do it is with HTMLRewriter, but that API is really difficult to use for this purpose.

I would prefer to use DOMParser, which is available from jsdom.

Would it be possible to add DOMParser to Workers?
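
For context, this is roughly the shape of extraction I have in mind, written against DOMParser as it exists in the browser (or via a jsdom window in Node). It’s a sketch only, since DOMParser is not available inside a Worker, and the helper names are just illustrative:

// Illustrative only: DOMParser does not exist in the Workers runtime.
// In a browser this runs as-is; in Node the DOMParser could come from a jsdom window.
async function scrapeMetadata(url) {
  const html = await fetch(url).then((res) => res.text());
  const doc = new DOMParser().parseFromString(html, 'text/html');

  // Open Graph tags are ordinary <meta> elements, so querySelector is enough.
  const og = (name) =>
    doc.querySelector(`meta[property="${name}"]`)?.getAttribute('content') ?? null;

  return {
    title: og('og:title') ?? doc.querySelector('title')?.textContent ?? null,
    image: og('og:image'),
  };
}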


I forgot to mention that the DOMParser API in the browser is available in a ServiceWorker, so it seems like it should be available on Cloudflare. :slight_smile:

The reader may already be wondering: “Isn’t this a solved problem, aren’t there many widely used open-source browsers out there with HTML parsers that can be used for this purpose?”. The reality is that writing code to run in 190+ PoPs around the world with a strict low latency requirement turns even seemingly trivial problems into complex engineering challenges.

It’s HTMLRewriter or nothing unless you can build a parser that executes in <10ms.

Does HTMLRewriter work with XML as well?

I’m not opposed to using it, but it doesn’t fit the same use case. I’m parsing HTML/XML to get data out of it and return it in a different response; it won’t be rewriting anything in the original response, only gathering data.
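
To be concrete about the “only gathering data” part, here is a minimal sketch of HTMLRewriter used purely as a read-only parser (the ?url= parameter and the JSON response shape are my own invention): the transformed body is drained and thrown away, and a completely different response is returned.

// Sketch: HTMLRewriter as a data gatherer only; the rewritten body is discarded.
export default {
  async fetch(request) {
    // Hypothetical input: /scrape?url=https://example.com/some/page
    const target = new URL(request.url).searchParams.get('url');
    if (!target) return new Response('missing ?url=', { status: 400 });

    const collected = {};
    const rewritten = new HTMLRewriter()
      .on('meta[property]', {
        element(el) {
          collected[el.getAttribute('property')] = el.getAttribute('content');
        },
      })
      .transform(await fetch(target));

    // Handlers only run as the body streams through, so it has to be consumed.
    await rewritten.arrayBuffer();

    return new Response(JSON.stringify(collected), {
      headers: { 'content-type': 'application/json' },
    });
  },
};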

Do you have a project name in mind, or is this still just an idea in its infant state? If I may ask.

This is for chickar.ee; you can see this happen on the client in HTML or in XML (both using DOMParser).

My plan was to rewrite the static page’s (it’s an SPA) metadata (using HTMLRewriter) with the metadata retrieved from the provided 3rd party.
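
Something roughly like this is what I have in mind, as a sketch only; the shape of the third-party metadata object ({ title, image }) is assumed:

// Sketch: overwrite the SPA shell's metadata with values fetched from the third party.
// `meta` is assumed to look like { title, image }.
function applyMetadata(staticPage, meta) {
  return new HTMLRewriter()
    .on('title', {
      element(el) {
        el.setInnerContent(meta.title);
      },
    })
    .on('meta[property="og:title"]', {
      element(el) {
        el.setAttribute('content', meta.title);
      },
    })
    .on('meta[property="og:image"]', {
      element(el) {
        el.setAttribute('content', meta.image);
      },
    })
    .transform(staticPage);
}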


Thanks. I want to take a look at the code being used when I have the time to do so uninterrupted.

Here is the source code for how that works in HTML and XML. :slight_smile:


I did a little test and it looks like HTMLRewriter does appear to work with XML, so I think this should work by abstracting my existing code to work with either DOMParser or HTMLRewriter (I guess by collecting all the things that might be useful for the serializer).

I don’t think this use-case really fits within HTMLRewriter, and I wonder if it actually provides any performance enhancements over using DOMParser since the whole response has to be read anyways. :confused:

Abstracting this code will actually result in a bit of waste for both the client and the edge because I’ll have to query for things that I’m going to throw away.
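
To make that concrete, here is a rough sketch of the kind of abstraction I mean; the collect-style interface and both function names are invented, and it only gathers content attributes (text content would need extra text handlers on the HTMLRewriter side):

// Hypothetical "collect" interface: each backend takes a list of selectors and returns,
// per selector, the matched elements' content attributes.

// Client-side backend (DOMParser).
function collectWithDomParser(html, selectors) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return selectors.map((sel) =>
    [...doc.querySelectorAll(sel)].map((el) => el.getAttribute('content'))
  );
}

// Edge backend (HTMLRewriter). Every selector needs its own handler registered up front,
// which is where the waste comes in: everything that might be useful gets collected.
async function collectWithHtmlRewriter(response, selectors) {
  const results = selectors.map(() => []);
  const rewriter = new HTMLRewriter();
  selectors.forEach((sel, i) =>
    rewriter.on(sel, {
      element(el) {
        results[i].push(el.getAttribute('content'));
      },
    })
  );
  await rewriter.transform(response).arrayBuffer();
  return results;
}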

Again, I’d recommend reading through the blog post I mentioned, which explains why a streaming HTML parser specifically is required:

For this reason, most parsers don’t even try to perform streaming parsing and instead take the input as a whole and produce a document tree as an output. This is not something we could do for streaming transformation without adding significant delays to page loading.

And the part pertaining to parsing in workers:

This does raise a problem though, let’s say there is an Atom feed like this:
https://www.wikidata.org/w/api.php?hidebots=1&hidecategorization=1&urlversion=1&days=7&limit=50&action=feedrecentchanges&feedformat=atom

How would I get the <title> and the <link> for all of the <entry> elements? I know I could use selectors like entry title and entry link, but what if one (or more) of the entries doesn’t have one or the other? Then the lists won’t line up… Is there a way to traverse an item’s children, or to limit the scope of HTMLRewriter to just that specific entry rather than the entire document?

Even in the worst-case scenario, 500ms / 100 iterations is 5ms… well within the 10ms limit and the 50ms paid limit… I’m not seeing the problem.

Regardless, I’m perfectly willing to use HTMLRewriter if the parent/child problem I mentioned can be resolved, but it doesn’t seem to resolve a hierarchy (unless I’m missing something?).
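
One way I could imagine approximating the per-entry scoping, with stateful handlers rather than real hierarchy (untested, and assuming the child combinator behaves the same on this XML as it does on HTML): an entry handler starts a fresh record, and entry > title / entry > link handlers fill in whichever fields actually exist, so a missing child just leaves a blank field instead of shifting the whole list.

// Sketch (untested): group <title>/<link> per <entry> with stateful handlers.
async function collectEntries(feedResponse) {
  const entries = [];
  const current = () => entries[entries.length - 1];

  await new HTMLRewriter()
    .on('entry', {
      element() {
        // Each <entry> start tag opens a new record.
        entries.push({ title: '', link: undefined });
      },
    })
    .on('entry > title', {
      text(chunk) {
        // Title text arrives in chunks; append them to the current record.
        current().title += chunk.text;
      },
    })
    .on('entry > link', {
      element(el) {
        // Atom puts the URL in the href attribute.
        current().link = el.getAttribute('href');
      },
    })
    .transform(feedResponse)
    .arrayBuffer(); // drain the body so the handlers actually run

  return entries;
}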

Whoops… DOMParser is not available in a service worker. :frowning:

I’ll find some other solution then!

Another valuable use-case for ‘parsing then querying/mutating the DOM inside workers’ is to automatically generate server-push headers based on the content of the HTML file, i.e. <link rel='stylesheet' href='file/on/the/same/server.css'>, <base>, <link rel=preload>, etc. in the body of the response.

We can use HTMLRewriter to automate server-push. And it works well: it can even mirror the behavior of Chrome when the <base href> element is ‘illegally’ positioned after a <link href> element.
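
As a minimal sketch of what I mean, trimmed down to stylesheets only (the <base> handling here is reduced to a naive prefix, so it does not reproduce the Chrome quirk mentioned above):

// Sketch: derive preload Link headers (which can drive server push) from the HTML body itself.
async function withPushHeaders(upstream) {
  const hrefs = [];
  let base = '';

  const rewritten = new HTMLRewriter()
    .on('base', {
      element(el) {
        base = el.getAttribute('href') ?? '';
      },
    })
    .on('link[rel="stylesheet"]', {
      element(el) {
        const href = el.getAttribute('href');
        if (href) hrefs.push(href);
      },
    })
    .transform(upstream);

  // Cart before the horse: the whole body has to be read before the extra
  // headers are known, so the response cannot simply stream straight through.
  const body = await rewritten.text();

  const headers = new Headers(upstream.headers);
  for (const href of hrefs) {
    headers.append('Link', `<${base}${href}>; rel=preload; as=style`);
  }
  return new Response(body, { status: upstream.status, headers });
}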

HTMLRewriter is a good solution/API here. But. It has two issues:

  1. The body of the response has to be read inside the worker before the response is passed out of the worker. This is because we use the content of the response body to add extra response headers. Cart before the horse. A similar problem to the one referred to by @davidbarratt here. On the surface, this seems to wipe out any performance benefit of using HTMLRewriter rather than, say, DOMParser (although this doesn’t seem to be the case, as I describe below).

  2. Are there any other providers that supply HTMLRewriter? If not, then that means vendor lock-in and uncertainty about long-term prospects. I don’t mean to raise any alarms; HTMLRewriter looks solid and Cloudflare seems utterly committed++. All is good :slight_smile: What I mean is: wouldn’t it be nice if HTMLRewriter became a standard solution for all web workers, everywhere? I vote yes!

On performance.
If parsing the content of the HTML files were done on each request to the worker, then low latency becomes paramount. In this scenario, HTMLRewriter seems to be the only solution.

But if the worker caches the result in its memory, in the Cloudflare cache, or in a KV store, then it might not matter much whether it takes 20ms or 4ms to rewrite the HTML or the HTTP header. Then you might have one request every other week that takes 16ms extra, and 10,000 requests in between that take the same time regardless. If the worker mostly provides the same handful of files, or always caches the result anyway, then 16ms is not an issue.
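
A sketch of that caching idea with the Workers Cache API, reusing the withPushHeaders sketch from above (cache keys and lifetimes are left at their defaults, so treat it as illustrative):

// Sketch: only pay the parse/rewrite cost on a cache miss.
export default {
  async fetch(request, env, ctx) {
    const cache = caches.default;
    const cached = await cache.match(request);
    if (cached) return cached;

    const upstream = await fetch(request);
    const response = await withPushHeaders(upstream); // the rewrite from the sketch above

    // Store a copy for subsequent requests; they skip the rewrite entirely.
    ctx.waitUntil(cache.put(request, response.clone()));
    return response;
  },
};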

But. Running DOMParser is not necessarily 16ms. In fact, it might be closer to 160ms. I did the following speed tests:

  1. Opened https://www.youtube.com/watch?v=EV4J_UrUTnI

  2. Pasted the following code into devtools:

function tst() {
  // Serialize the live document back into an HTML string,
  // so the parse below starts from plain markup.
  const myString2 = new XMLSerializer().serializeToString(document);
  const start = performance.now();
  // Time how long DOMParser takes to turn that string back into a document tree.
  const myDocument2 = new DOMParser().parseFromString(myString2, 'text/html');
  console.log(performance.now() - start);
}
tst();
  3. This took 155.09500000916887ms. (I guesstimate that my Chrome devtools environment is 150% slower than the Cloudflare Workers runtime, so that would translate to roughly 100ms if DOMParser were to parse this YouTube page in a Cloudflare Worker. This is a very, very rough estimate.)

  4. Similarly, I ran the same YouTube page through HTMLRewriter as many times as I could before the worker was shut down after 50ms (a rough sketch of such a loop follows this list).

  5. This ran approximately 10 times, i.e. roughly 5ms per iteration.
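
For reference, the loop might look roughly like the following; this is a reconstruction rather than the exact code that was run, and the selector and iteration cap are arbitrary:

// Rough reconstruction of the measurement: rewrite the same body repeatedly and log the
// iteration count; the last number logged before the worker is terminated for exceeding
// its CPU limit gives iterations per 50ms.
export default {
  async fetch() {
    const page = await fetch('https://www.youtube.com/watch?v=EV4J_UrUTnI');
    const html = await page.text();

    for (let i = 1; i <= 100; i++) {
      await new HTMLRewriter()
        .on('link[rel="stylesheet"]', { element() {} })
        .transform(new Response(html))
        .arrayBuffer();
      console.log(`completed iteration ${i}`);
    }

    return new Response('finished without hitting the CPU limit');
  },
};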

Conclusion: even though this use case reads the entire body of the response before it is passed out of the worker, HTMLRewriter is still maybe 10-20 times faster than DOMParser, if my guesstimates are correct. And the time spent is not insignificant when the HTML files are large: 10ms vs 100ms is significant even when the result is cached. (On small HTML files this would not be true; small HTML files might be 1ms vs 10ms, which might be tolerable even if the file was not cached.)

Caveat: These are very rough estimates. I might be totally off the mark. If anyone spots any errors in my calculations, please post them so I can update my post.