Possible to fetch HTML from a URL and DOM Parse it?

Based on the API methods exposed here:

It doesn’t look like I can fetch the HTML from a URL, parse it / turn it into a DOM object that I could operate over, and then output the HTML from my modified DOM.

Do I have that correct?

Do I need to do simple search-and-replace against the text / HTML to modify the contents of a webpage? Is no CSS-selector or tag targeting possible?

If I can turn it into a DOM to operate on, do you have an example?

Thanks!!!

Unfortunately, we don’t (yet) have a built-in way to parse HTML. We do, however, have wonderful users who have tackled this problem before and written about it. Take a look at this, for example: https://simon-thompson.me/simple-dom-manipulation-via-jquery-in-Cloudflare-workers/

This is awesome!

I can see that I am able to target elements and make changes… It’s hitting my images, etc., but I know how to avoid messing with those.

I’m getting this when I copy it straight from his playground and from gist:

But I can see the manipulations happening, so is this a known issue / just noise?

This is exactly what I was looking for! Just need to get it to work correctly.

Works fine from the https://Cloudflareworkers.com/ site but when I use Workers from my actual account, I get the error message above.

Hey @mattp - glad to see you’re finding this useful, and thanks @zack for sharing the post!

I’ve just tested this against an actual site (as opposed to the Playground) and oddly I’m not seeing the same error - could I check if you copy/pasted the source from the raw gist?

If you’re still having trouble, let me know and I’ll try a few other things this evening :slight_smile:

THANKS! I copied from the gist before but should have used the raw gist… that did the trick and it is time to play!

Working great over here,
Matt

Awesome, happy to hear it worked! I’ve updated the link in my blog post to go directly to the raw gist, so hopefully that helps anybody stumbling across this in the future.

Thanks, again!

Also, you are probably already doing this, but to make sure you don’t mess with images, CSS, etc., I am using this hack right now:

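  // Make sure we only modify text, not images.
  let type = response.headers.get("Content-Type") || ""

  if (!type.startsWith("text/")) {
    // Not text. Don't modify.
    return response
  }
  if (type.startsWith("text/css")) {
    // Stylesheets are text too, but leave them alone. Match on the prefix
    // because the header may carry parameters such as "; charset=utf-8".
    return response
  }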

Images were getting broken and it took me a bit to figure it out…

Thanks for sharing that, that’s a really good shout. I’ll look into modifying the worker to incorporate that too and let you know! In this case I think we can just bypass the response unless it’s exactly text/html?

Yeah, that was my thought, too! I can probably narrow my code down to just text/html… I was worried that if the headers weren’t set right, I wouldn’t catch the right content, but that’s probably a silly concern.
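
For anyone following along, here is a minimal sketch of that stricter check. It assumes the only signal available is the Content-Type response header; `isHtmlContentType` is a hypothetical helper name, not something from the blog post or the Workers API. It compares only the media type, so a value like "text/html; charset=utf-8" still counts as HTML.

```javascript
// Hypothetical helper: true only when the media type is exactly text/html.
// Parameters after the semicolon (e.g. charset) are ignored.
function isHtmlContentType(headerValue) {
  if (!headerValue) {
    // Missing header: treat as "not HTML" and leave the response alone.
    return false;
  }
  const mediaType = headerValue.split(";")[0].trim().toLowerCase();
  return mediaType === "text/html";
}

// Possible use inside a Worker's fetch handler (sketch only):
//   if (!isHtmlContentType(response.headers.get("Content-Type"))) {
//     return response;
//   }
```

With this approach a response that has no Content-Type header at all is passed through untouched, which is the conservative choice if broken images are the main worry.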

Native DOM parsing is an absolute MUST HAVE feature for me.

For the past 5+ years, I have written all my websites in raw HTML5/CSS3/JS, using only the standards and consistently refusing to use any of the plethora of frameworks and libraries like Node.js or jQuery or anything like that. I really like to write all my stuff by hand and have full control.

Now I want to port my global status page from client-side JavaScript to server-side JavaScript using Cloudflare Workers. This would be very easy to do if Cloudflare Workers natively supported DOM manipulation.

Without native DOM manipulation, the simplest approach I can think of that would technically work would be to use a huge template string and generate the HTML as text, but this would be very painful compared to simple DOM manipulation. I simply DO NOT want to bring in random dependencies for this.
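
To make the template-string approach concrete, here is a rough, dependency-free sketch. The `services` array and its `name` / `status` fields are made up for illustration, as are the function names. The part that tends to make this painful is that every interpolated value has to be escaped by hand.

```javascript
// Escape the characters that are significant in HTML text and attributes.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Build the whole page as one string. `services` is a hypothetical
// array of { name, status } objects.
function renderStatusPage(services) {
  const rows = services
    .map((s) => `<li>${escapeHtml(s.name)}: ${escapeHtml(s.status)}</li>`)
    .join("\n");

  return `<!DOCTYPE html>
<html>
<head><title>Status</title></head>
<body>
<ul>
${rows}
</ul>
</body>
</html>`;
}
```

In a Worker, the resulting string could be returned with `new Response(html, { headers: { "Content-Type": "text/html" } })`.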

Please implement native DOM manipulation soon! This would open a world of possibilities!

Was interested in this too, thanks for posting!

Thank you for implementing this so quickly! I’m burning to try it out!
(Here’s the documentation.)
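
For later readers: the post above does not name the feature, so this sketch assumes it refers to HTMLRewriter, the streaming HTML rewriting API available as a global in the Workers runtime. It is not a full DOM, but it does allow CSS-selector targeting. The handler class name and the choice to add target="_blank" to links are purely illustrative.

```javascript
// Element handler: HTMLRewriter calls element() once for every element
// that matches the selector the handler was registered with.
class LinkTargetHandler {
  element(element) {
    // Illustrative change only: open every link in a new tab.
    element.setAttribute("target", "_blank");
  }
}

// Rewrites a fetched response. HTMLRewriter exists only inside the
// Workers runtime, so this function will not run under plain Node.js.
function rewriteLinks(response) {
  return new HTMLRewriter()
    .on("a", new LinkTargetHandler())
    .transform(response);
}

// Possible use inside a Worker (sketch only):
//   addEventListener("fetch", (event) => {
//     event.respondWith(fetch(event.request).then(rewriteLinks));
//   });
```

Combined with a Content-Type check like the one discussed earlier in the thread, non-HTML responses can be returned untouched before `rewriteLinks` is ever called.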