Possible to fetch HTML from a URL and DOM Parse it?


#1

Based on the API methods exposed here:

It doesn’t look like I can fetch the HTML from a URL, parse it / turn it into a DOM object that I could operate over, and then output the HTML from my modified DOM.

Do I have that correct?

Do I need to do simple searches and replaces against the text / HTML to modify the contents of a webpage? Is no CSS or tag targeting possible?

If I can turn it into a DOM to operate on, do you have an example?

Thanks!!!


#2

We don’t (yet) have a built-in way to parse HTML unfortunately. We do, however, have wonderful users who have tackled this problem before and written about it. Take a look at this, for example: https://simon-thompson.me/simple-dom-manipulation-via-jquery-in-cloudflare-workers/


#3

This is awesome!

I can see that I am able to target elements and make changes… It’s hitting my images / etc. but I know how to avoid messing with those.

I’m getting this when I copy it straight from his playground and from gist:

But I can see manipulations happening so is this known / noise?

This is exactly what I was looking for! Just need to get it to work correctly.

Works fine from the https://cloudflareworkers.com/ site but when I use Workers from my actual account, I get the error message above.


#4

Hey @mattp - glad to see you’re finding this useful, and thanks @zack for sharing the post!

I’ve just tested this against an actual site (as opposed to the Playground) and oddly I’m not seeing the same error - could I check if you copy/pasted the source from the raw gist?

If you’re still having trouble, let me know and I’ll try a few other things this evening 🙂


#5

THANKS! I copied from gist before but should have gone raw gist…that did the trick and it is time to play!

Working great over here,
Matt


#6

Awesome, happy to hear it worked! I’ve updated the link in my blog post to go directly to the raw gist, so hopefully that helps anybody stumbling across this in the future.


#7

Thanks, again!

Also, you are probably already doing this but to make sure you don’t mess with images, css, etc, I am doing this hack right now:

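  // Make sure we only modify text, not images.
  const type = response.headers.get("Content-Type") || ""

  if (!type.startsWith("text/")) {
    // Not text (e.g. an image). Don't modify.
    return response
  }
  if (type.startsWith("text/css")) {
    // CSS is text, but pass it through untouched. startsWith also
    // covers values such as "text/css; charset=utf-8".
    return response
  }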

Images were getting broken and it took me a bit to figure it out…


#8

Thanks for sharing that, that’s a really good shout. I’ll look into modifying the worker to incorporate that too and let you know! In this case I think we can just bypass unless it’s exactly text/html?
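A rough sketch of what "bypass unless it's exactly text/html" could look like. The names isHtmlContentType and bypassUnlessHtml are made up for illustration, and transform stands in for whatever DOM manipulation the worker actually does; none of this is taken from the blog post or gist. The parameter stripping is there because the header often carries a charset, e.g. "text/html; charset=utf-8".

```javascript
// Hypothetical helper: true only when the media type is exactly text/html.
// Parameters such as "; charset=utf-8" are ignored, and the comparison is
// case-insensitive because media types are case-insensitive.
function isHtmlContentType(headerValue) {
  if (!headerValue) {
    return false; // missing or empty header: treat as "not HTML"
  }
  const mediaType = headerValue.split(";")[0].trim().toLowerCase();
  return mediaType === "text/html";
}

// Hypothetical wrapper: return the response untouched unless it is HTML.
// `transform` is an assumed callback that performs the actual rewriting.
function bypassUnlessHtml(response, transform) {
  const type = response.headers.get("Content-Type");
  if (!isHtmlContentType(type)) {
    return response; // images, CSS, JSON, etc. pass straight through
  }
  return transform(response);
}
```

Inside the worker this would be called as something like bypassUnlessHtml(response, modifyDom), where modifyDom is whatever function does the rewriting.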


#9

Yeah, that was my thought, too! I can probably localize my code down to just text/html…was worried that if the headers weren’t set right, I wouldn’t catch the right content but probably a silly concern.
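If the worry about missing or wrong headers ever turns out to be real, one possible fallback is to peek at the start of the body. This is only a sketch: looksLikeHtml is a made-up name, and the check is a crude heuristic (it only recognises documents that open with a doctype or an html tag), not a real content sniffer. In a worker you would need to read the text from a clone of the response so the original body stays usable.

```javascript
// Hypothetical fallback for responses with no usable Content-Type header.
// Returns true when the text starts (after optional whitespace) with an
// HTML doctype or an opening <html> tag. Heuristic only.
function looksLikeHtml(bodyText) {
  return /^\s*(?:<!doctype\s+html|<html[\s>])/i.test(bodyText);
}
```

HTML fragments that start with some other tag are not detected, so this is best kept as a last resort behind the Content-Type check.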