HTMLRewriter dynamic text replacement on all elements

My point was that all that concatenation is not necessary to begin with.

What the OP wants to achieve does not appear to be possible, but maybe there is some workaround, hence the suggestion to contact support or tag Harris.

Also, the problem is not necessarily that anything has already been sent to the client (I haven’t found any indication of that so far) but rather that the content reference is chunk-specific and not “global”.

The HTMLRewriter help content states:

remove(): Element: Removes the element with all its content.

Try matching a large block of text that gets chunked and then executing:

text.remove()

If you match the first text chunk, the entire element gets removed (opening and closing tags and all content contained within). This is the intended behavior.

If you match anything other than the first text chunk, the first chunk still gets streamed to the client while all subsequent data is removed (except the closing tag). This results in broken website text / code.

Please try the test I described and you’ll see remove() affects every text chunk from when it’s called until the end of the element. Thus when you execute it on the first match it removes the entire element, including opening and closing tags. When you execute it on any match other than the first, however, any prior text chunks are streamed to the client while all subsequent text chunks are removed.

  text(text) {
    if (text.text.match('some text to match')) {
      text.remove()
    }
  }

Try that on a large block of text that gets chunked, once with the matching text in the first chunk and once with the matching text in the second or later chunks.

That’s definitely not the behavior I see on my own websites. I just re-ran my tests to confirm.

I assume your example isn’t being streamed in chunks so that’s why it works?

Yes, .replace() only replaces that individual chunk of text. To deal with the split-chunk problem you’ll need to both buffer and remove the text chunks as you see them, until you see text.lastInTextNode === true. At that point, run the replacement code on the entire buffer, replace the last text chunk, and reset the buffer. (This is essentially the same as what @sandro suggested.)

Something like this should do the trick:

// Accumulates the chunks of the current text node.
let buffer = ''

let handler = {
  text(text) {
    buffer += text.text

    if (text.lastInTextNode) {
      // We're done with this text node -- search and replace and reset.
      text.replace(buffer.replace(pattern, replacement))
      buffer = ''
    } else {
      // This wasn't the last text chunk, and we don't know if this chunk
      // will participate in a match. We must remove it so the client
      // doesn't see it.
      text.remove()
    }
  }
}

// `pattern` and `replacement` are assumed to be defined elsewhere, and
// `response` is assumed to have been declared earlier with let.
response = new HTMLRewriter()
    .on("*", handler)
    .transform(response)

Something to keep in mind is that this will be defeated when an element opens in the middle of the text you want to match:

<p>some.domain.to<strong>replace</strong></p>

This situation would be much more difficult to handle generically.

Harris

@ahmed2, what you describe is not the intended behavior, nor what I see in my own tests. Could you post a self-contained example?

Note that if you were to call .remove() on an element (not a text chunk), then it would have the described effect.

That makes a lot of sense, thanks harris!

Thanks for the explanation and example, @harris. I reworked my code based on your example and now I get the results I want. I must’ve made a mistake in my regex matching or something similar and misattributed the strange behavior to the HTMLRewriter.

On that note, can you answer the following:

  1. Individual text chunks are streamed to the client, right?

  2. Is there any way to buffer the response to the client until the HTMLRewriter encounters a specific tagName and then start streaming? I’m trying to parse some content in the head and add it to the response headers before streaming (see my other thread, “Non-streaming HTMLRewriter response?”).

Hi @ahmed2,

Assuming the response returned from HTMLRewriter.transform() is being streamed to the client, then yes. More pedantically, once the JS handler is done with the chunk, the (possibly modified) text chunk is enqueued to be read from the transformed response body. It will wait for something to read it, either the script itself via something like Response.arrayBuffer(), a ReadableStream reader, or by passively streaming back to the client via event.respondWith(). Further chunks will not be read and parsed until the already-processed chunk has been consumed, to minimize instantaneous memory usage.

Good question – I’ll respond in that thread.

How to replace parameters within an xml sitemap?
I have an xml sitemap in the previous version of my site in the following format:


http://myold.com/support/index.php?id=8423


I want to change each old address, such as
myold.com/support/index.php?id=8423
to the corresponding new address:
mynew.com/kb/8423
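
One possible way to express that mapping is a small string helper, sketched below. The function name rewriteSitemapUrl and the regular expression are assumptions made for illustration; only the two host names and the id-to-path layout come from the example above.

```javascript
// Hypothetical helper: maps an old support URL to the new knowledge-base URL.
// URLs that do not match the old layout are returned unchanged.
function rewriteSitemapUrl(url) {
  return url.replace(
    /^(https?:\/\/)?myold\.com\/support\/index\.php\?id=(\d+)$/,
    // Keep whatever scheme the old URL had (if any) and reuse the numeric id.
    (_match, scheme, id) => `${scheme || ''}mynew.com/kb/${id}`
  )
}
```

The helper could then be applied to each URL found in the sitemap, for example from a text handler that has already buffered the full text of the entry.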

I am also trying to use a Worker to do text replacement and noticed that text.replace() automatically encodes special characters. If the text being modified is JavaScript, operators like & are encoded to &amp; and the inline JavaScript fails.

Example: when the following inline JavaScript is passed to text.replace(), every & operator is changed into &amp;

!function(e,a,t){var r,n,o,i,p=a.createElement("canvas"),s=p.getContext&&p.getContext("2d");function c(e,t){var a=String.fromCharCode;s.clearRect(0,0,p.width,p.height),s.fillText(a.apply(this,e),0,0);var r=p.toDataURL();return s.clearRect(0,0,p.width,p.height),s.fillText(a.apply(this,t),0,0),r===p.toDataURL()}

The result is:

!function(e,a,t){var r,n,o,i,p=a.createElement("canvas"),s=p.getContext&amp;&amp;p.getContext("2d");function c(e,t){var a=String.fromCharCode;s.clearRect(0,0,p.width,p.height),s.fillText(a.apply(this,e),0,0);var r=p.toDataURL();return s.clearRect(0,0,p.width,p.height),s.fillText(a.apply(this,t),0,0),r===p.toDataURL()}

Does anyone know how to stop text.replace() from encoding the JavaScript?

Thanks,
Eric

Hi @user6964, try using text.replace(content, { html: true }).

The { html: true } option tells HTMLRewriter not to escape any of the new content, but instead to insert it raw. Most of the various .before(), .after(), .prepend(), .append(), and .replace() functions accept the option.
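
As a rough illustration of the option, the sketch below shows a text handler that passes { html: true } to text.replace(). The names scriptHandler and fakeChunk are made up for this example, and fakeChunk is only a stand-in that records calls so the handler can be exercised outside the Workers runtime. For brevity the sketch ignores the split-chunk problem discussed earlier in the thread.

```javascript
// Text handler that swaps in new script content without escaping it.
const scriptHandler = {
  text(text) {
    if (text.text.includes('p.getContext')) {
      // Without { html: true }, the "&&" would be emitted as "&amp;&amp;".
      text.replace('s=p.getContext&&p.getContext("2d");', { html: true })
    }
  }
}

// Stand-in for an HTMLRewriter text chunk (an assumption for local testing only).
function fakeChunk(content) {
  return {
    text: content,
    lastInTextNode: true,
    calls: [],
    // Records what the handler asked for instead of rewriting anything.
    replace(newContent, options) {
      this.calls.push({ content: newContent, options })
    }
  }
}
```

In a Worker, the handler would be registered with something like new HTMLRewriter().on("script", scriptHandler).transform(response).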

See: https://developers.cloudflare.com/workers/reference/apis/html-rewriter/#global-types

Thanks Harris, it is working now.

Hi @harris,

Thanks for sharing the code snippet!

I am trying to replace a hostname globally in an HTML document and tried your solution. But unfortunately, I always only get two elements back, even if I disable replacing:

<!DOCTYPE html>
<html lang="en-US" ...>

The rest of the document isn’t returned. Any idea why this could be the case? Is your code snippet supposed to work with the current HTMLRewriter version?

Thanks in advance!

Hi @user6251, as far as I know the snippet should still work. I’m not sure what could be wrong – could you share code to reproduce the issue?

Hi @harris, thank you for your quick response! Please find my (shortened) _worker.js code for Cloudflare Pages below:

export default {
    async fetch(request, env) {
          const url = new URL(request.url)
          const path = url.pathname.slice(1)
          ...
          switch (true) {
          case /^...$/.test(path): {

                let response = await env.ASSETS.fetch(...)

                let domainRewriter = {
                    text(text) {
                      buffer += text.text
                      if (text.lastInTextNode) {
                        text.replace(buffer.replace("https://domain1.com", "https://domain2.com"))
                        buffer = ""
                      } else {
                        text.remove()
                      }
                    }
                  }
                  
                response = new HTMLRewriter({ html: true })
                      .on("*", domainRewriter)
                      .transform(response)
                return response
                break;
            }
        }
    }
}

It looks like buffer is never declared with let, const, or var, which I think would cause the first time it’s referenced to throw an exception, potentially truncating the response. Try declaring it next to domainRewriter.

let buffer  // NEW
let domainRewriter = {
  text(text) {
    buffer += text.text

Thanks for catching that, @harris, I am also surprised the worker didn’t throw an exception. I had to use let buffer = "", otherwise I would get an “undefined” between the html tag and the following tags.
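
The stray “undefined” follows from how += behaves on a variable that was declared but never assigned. A minimal sketch in plain JavaScript, with variable names chosen only for illustration:

```javascript
// Declared without a value, so it starts out as undefined.
let uninitialized
// undefined is coerced to the string "undefined" before concatenation.
uninitialized += '<head>'

// Starting from an empty string avoids the stray prefix.
let initialized = ''
initialized += '<head>'
```

Here uninitialized ends up as "undefined<head>", while initialized is just "<head>".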

However, now that it works, I am completely questioning my approach as I need to replace the hostname in the entire HTML, including attributes. Do I then have to use element and iterate through the attributes or is there a more global “search and replace”?

Could it make sense to convert the element to a string and do the replace on that?

Iterating through the attributes sounds reasonable to me.
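
A minimal sketch of that approach, assuming the element handler receives an object whose attributes property iterates over [name, value] pairs and which offers setAttribute(). The names attributeRewriter and fakeElement are made up for this example, and fakeElement is only a stand-in so the handler can be tried outside the Workers runtime.

```javascript
const OLD_HOST = 'https://domain1.com'
const NEW_HOST = 'https://domain2.com'

// Element handler that rewrites the hostname in every attribute value.
const attributeRewriter = {
  element(element) {
    // Copy the pairs first so nothing is mutated while iterating.
    for (const [name, value] of [...element.attributes]) {
      if (value.includes(OLD_HOST)) {
        element.setAttribute(name, value.split(OLD_HOST).join(NEW_HOST))
      }
    }
  }
}

// Stand-in for an HTMLRewriter element (an assumption for local testing only).
function fakeElement(attrs) {
  const map = new Map(Object.entries(attrs))
  return {
    get attributes() { return map.entries() },
    setAttribute(name, value) { map.set(name, value); return this },
    getAttribute(name) { return map.has(name) ? map.get(name) : null }
  }
}
```

In a Worker, this handler could be registered alongside the text handler, for example with .on("*", attributeRewriter).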

The only alternative I can think of would be to avoid parsing HTML at all, and search-and-replace the hostname across the whole HTTP message body. That has some drawbacks, though – for example, you’d need to detect when the hostname is split across read chunk boundaries – so, the HTMLRewriter method may end up more robust.
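
To make the chunk-boundary drawback concrete, here is a rough sketch of a stateful replacer that works on string chunks. The name createChunkReplacer is hypothetical, and the sketch assumes the body has already been decoded to strings (a real stream would deliver bytes). It holds back the last needle.length - 1 characters of each chunk, because they could be the start of a match that completes in the next chunk.

```javascript
// Hypothetical helper: replaces `needle` across a sequence of string chunks.
function createChunkReplacer(needle, replacement) {
  if (needle === '') throw new Error('needle must not be empty')
  let carry = ''
  return {
    // Returns the text that is safe to emit after seeing this chunk.
    push(chunk) {
      const data = carry + chunk
      let out = ''
      let pos = 0
      let idx
      // Replace every complete match found in the combined data.
      while ((idx = data.indexOf(needle, pos)) !== -1) {
        out += data.slice(pos, idx) + replacement
        pos = idx + needle.length
      }
      // The remainder has no full match; keep a tail that might begin one.
      const rest = data.slice(pos)
      const keep = Math.min(rest.length, needle.length - 1)
      out += rest.slice(0, rest.length - keep)
      carry = rest.slice(rest.length - keep)
      return out
    },
    // Call once at the end of the body to emit whatever is still held back.
    flush() {
      const tail = carry
      carry = ''
      return tail
    }
  }
}
```

For example, pushing '<a href="https://doma' followed by 'in1.com/x">' and then calling flush() yields the full tag with the hostname replaced, even though the hostname was split across the two chunks.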