HTMLRewriter dynamic text replacement on all elements

Thanks for the explanation and example, @harris. I reworked my code based on your example and now I get the results I want. I must’ve made a mistake in my regex matching or something similar and misattributed the strange behavior to the HTMLRewriter.

On that note, can you answer the following:

  1. Individual text chunks are streamed to the client, right?

  2. Is there any way to buffer the response to the client until the HTMLRewriter encounters a specific tagName, and only then start streaming? I’m trying to parse some content in the head and add it to the response headers before streaming (see my other thread, “Non-streaming HTMLRewriter response?”).

Hi @ahmed2,

Assuming the response returned from HTMLRewriter.transform() is being streamed to the client, then yes. More pedantically, once the JS handler is done with the chunk, the (possibly modified) text chunk is enqueued to be read from the transformed response body. It will wait for something to read it, either the script itself via something like Response.arrayBuffer(), a ReadableStream reader, or by passively streaming back to the client via event.respondWith(). Further chunks will not be read and parsed until the already-processed chunk has been consumed, to minimize instantaneous memory usage.

Good question – I’ll respond in that thread.

How to replace parameters within an xml sitemap?
I have an xml sitemap in the previous version of my site in the following format:


http://myold.com/support/index.php?id=8423


I’m going to change the addresses from the old form
myold.com/support/index.php?id=8423
to the new form
mynew.com/kb/8423
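
One way to picture the mapping described above is a plain string rewrite, sketched below. This is only an illustration under assumptions: the helper names rewriteSupportUrl and rewriteSitemap are made up, the sitemap is assumed to hold its URLs in <loc> elements, the id value is assumed to become the last path segment, and the scheme of the old URL is simply kept.

```javascript
// Hypothetical helper: maps an old support URL to the new knowledge-base form.
// Assumes the `id` query parameter becomes the last path segment.
function rewriteSupportUrl(oldUrl) {
  let url;
  try {
    url = new URL(oldUrl);
  } catch (err) {
    return oldUrl; // not a parseable URL, leave it alone
  }
  const id = url.searchParams.get("id");
  if (url.hostname !== "myold.com" || url.pathname !== "/support/index.php" || !id) {
    return oldUrl; // anything that doesn't match the old pattern is untouched
  }
  return `${url.protocol}//mynew.com/kb/${encodeURIComponent(id)}`;
}

// Hypothetical helper: applies the mapping to every <loc> entry in a sitemap string.
// Assumes URLs live in <loc> elements, as in the standard sitemap format.
function rewriteSitemap(xml) {
  return xml.replace(
    /<loc>([^<]*)<\/loc>/g,
    (match, loc) => `<loc>${rewriteSupportUrl(loc.trim())}</loc>`
  );
}
```

In a Worker, the same per-URL helper could presumably be called on the sitemap body text before returning the response.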

I am also trying to use a Worker to do text replacement, and I noticed that text.replace() automatically encodes special characters. If the text being modified is JavaScript, operators like & are encoded to &amp; and the inline JavaScript fails.

Example: passing the following inline JavaScript to text.replace() changes every & operator into &amp;
!function(e,a,t){var r,n,o,i,p=a.createElement("canvas"),s=p.getContext&&p.getContext("2d");function c(e,t){var a=String.fromCharCode;s.clearRect(0,0,p.width,p.height),s.fillText(a.apply(this,e),0,0);var r=p.toDataURL();return s.clearRect(0,0,p.width,p.height),s.fillText(a.apply(this,t),0,0),r===p.toDataURL()}

It comes out as:
!function(e,a,t){var r,n,o,i,p=a.createElement("canvas"),s=p.getContext&amp;&amp;p.getContext("2d");function c(e,t){var a=String.fromCharCode;s.clearRect(0,0,p.width,p.height),s.fillText(a.apply(this,e),0,0);var r=p.toDataURL();return s.clearRect(0,0,p.width,p.height),s.fillText(a.apply(this,t),0,0),r===p.toDataURL()}

Does anyone know how to stop text.replace() from encoding the JavaScript?

Thanks,
Eric

Hi @user6964, try using text.replace(content, { html: true }).

The { html: true } option tells HTMLRewriter not to escape any of the new content, but instead to insert it raw. Most of the various .before(), .after(), .prepend(), .append(), and .replace() functions accept the option.
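
As a rough sketch of how that option might be used in a text handler: the handler below swaps a placeholder inside script text and passes { html: true } so that operators such as && are inserted raw. The strings OLD_VALUE and NEW_VALUE and the handler name are made up for illustration, and the sketch ignores the case where the placeholder is split across text chunks.

```javascript
// Text handler sketch: replaces a (hypothetical) placeholder in inline script
// text and inserts the result raw, so `&&` is not escaped to `&amp;&amp;`.
const scriptTextHandler = {
  text(text) {
    if (text.text.includes("OLD_VALUE")) {
      const updated = text.text.replace("OLD_VALUE", "NEW_VALUE");
      // { html: true } asks HTMLRewriter to insert the content without escaping it.
      text.replace(updated, { html: true });
    }
  },
};

// In a Worker this would be wired up roughly as:
//   new HTMLRewriter().on("script", scriptTextHandler).transform(response)
```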

See: https://developers.cloudflare.com/workers/reference/apis/html-rewriter/#global-types


Thanks Harris, it is working now.

Hi @harris,

Thanks for sharing the code snippet!

I am trying to replace a hostname globally in an HTML document and tried your solution. But unfortunately, I always only get two elements back, even if I disable replacing:

<!DOCTYPE html>
<html lang="en-US" ...>

The rest of the document isn’t returned. Any idea why this could be the case? Is your code snippet supposed to work with the current HTMLRewriter version?

Thanks in advance!

Hi @user6251, as far as I know the snippet should still work. I’m not sure what could be wrong – could you share code to reproduce the issue?

Hi @harris, thank you for your quick response! Please find my (shortened) _worker.js code for Cloudflare Pages below:

export default {
    async fetch(request, env) {
          const url = new URL(request.url)
          const path = url.pathname.slice(1)
          ...
          switch (true) {
          case /^...$/.test(path): {

                let response = await env.ASSETS.fetch(...)

                let domainRewriter = {
                    text(text) {
                      buffer += text.text
                      if (text.lastInTextNode) {
                        text.replace(buffer.replace("https://domain1.com", "https://domain2.com"))
                        buffer = ""
                      } else {
                        text.remove()
                      }
                    }
                  }
                  
                response = new HTMLRewriter({ html: true })
                      .on("*", domainRewriter)
                      .transform(response)
                return response
                break;
            }
        }
    }
}

It looks like buffer is never declared with let, const, or var, which I think would cause the first time it’s referenced to throw an exception, potentially truncating the response. Try declaring it next to domainRewriter.

let buffer  // NEW
let domainRewriter = {
  text(text) {
    buffer += text.text

Thanks for catching that, @harris, I am also surprised the worker didn’t throw an exception. I had to use let buffer = "", otherwise I would get an “undefined” between the html tag and the following tags.
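
For reference, the fix discussed above (declaring buffer and initializing it to an empty string) could be wrapped in a small factory so that each rewriter gets its own buffer. This is only a sketch: makeTextReplacer is a made-up name, and it uses replaceAll where the posted code used replace, so every occurrence in a text node is swapped.

```javascript
// Returns a text handler that accumulates chunks and replaces `from` with `to`
// once the whole text node has been seen. `makeTextReplacer` is a hypothetical name.
function makeTextReplacer(from, to) {
  let buffer = ""; // must start as a string, otherwise "undefined" gets prepended

  return {
    text(text) {
      buffer += text.text;
      if (text.lastInTextNode) {
        // Emit the whole accumulated text node with the replacement applied.
        text.replace(buffer.replaceAll(from, to));
        buffer = "";
      } else {
        // Drop intermediate chunks; their content is held in `buffer`.
        text.remove();
      }
    },
  };
}

// In a Worker this would be wired up roughly as:
//   new HTMLRewriter()
//     .on("*", makeTextReplacer("https://domain1.com", "https://domain2.com"))
//     .transform(response)
```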

However, now that it works, I am completely questioning my approach as I need to replace the hostname in the entire HTML, including attributes. Do I then have to use element and iterate through the attributes or is there a more global “search and replace”?

Could it make sense to convert the element to a string and do the replace on that?

Iterating through the attributes sounds reasonable to me.

The only alternative I can think of would be to avoid parsing HTML at all, and search-and-replace the hostname across the whole HTTP message body. That has some drawbacks, though – for example, you’d need to detect when the hostname is split across read chunk boundaries – so, the HTMLRewriter method may end up more robust.
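
To make the chunk-boundary problem concrete, here is a minimal sketch of a replacer that holds back any trailing text that could be the start of the hostname until the next chunk arrives. makeChunkReplacer and its push/flush methods are made-up names, and the sketch works on already-decoded strings rather than raw bytes.

```javascript
// Streaming search-and-replace sketch over string chunks.
// Holds back the longest suffix that is a prefix of `from`, so a match that is
// split across two chunks is still found. All names here are hypothetical.
function makeChunkReplacer(from, to) {
  if (from === "") throw new Error("`from` must not be empty");
  let tail = ""; // unfinished text carried over from the previous chunk

  return {
    push(chunk) {
      const raw = tail + chunk;
      let out = "";
      let pos = 0;
      let idx;
      // Replace every complete match, left to right.
      while ((idx = raw.indexOf(from, pos)) !== -1) {
        out += raw.slice(pos, idx) + to;
        pos = idx + from.length;
      }
      const rest = raw.slice(pos);
      // Find the longest suffix of `rest` that could begin a match.
      let hold = 0;
      for (let n = Math.min(from.length - 1, rest.length); n > 0; n--) {
        if (rest.endsWith(from.slice(0, n))) {
          hold = n;
          break;
        }
      }
      tail = rest.slice(rest.length - hold);
      return out + rest.slice(0, rest.length - hold);
    },
    // Call once at end of input to emit whatever is still held back.
    flush() {
      const rest = tail;
      tail = "";
      return rest;
    },
  };
}
```

In a Worker, something like this could presumably sit inside a TransformStream together with a TextDecoder and TextEncoder, which is part of why the HTMLRewriter route may be less work.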

Thank you, @harris, that’s what I did over the weekend, and iterating through the attributes works very well! noscript was a bit tricky, but I figured out that its text content contains the HTML elements.
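
For anyone finding this later, the attribute iteration could look roughly like the sketch below. It is not the exact code from my worker: makeAttributeReplacer is a made-up name, and it relies on element.attributes yielding [name, value] pairs and on element.setAttribute(name, value).

```javascript
// Element handler sketch: replaces `from` with `to` in every attribute value.
// `makeAttributeReplacer` is a hypothetical name.
function makeAttributeReplacer(from, to) {
  return {
    element(element) {
      // Copy the pairs first so attributes are not modified while iterating.
      for (const [name, value] of [...element.attributes]) {
        if (value.includes(from)) {
          element.setAttribute(name, value.replaceAll(from, to));
        }
      }
    },
  };
}

// In a Worker this would be wired up roughly as:
//   new HTMLRewriter()
//     .on("*", makeAttributeReplacer("https://domain1.com", "https://domain2.com"))
//     .transform(response)
```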