HTMLRewriter dynamic text replacement on all elements

In your example you are checking the full buffer but only replacing the current chunk.

Yeah, I’ve seen that happen. My example is limited, but in the real world, with elements that contain thousands of characters of text, a match can be split across chunks mid-text.

That’s all you can replace with the current implementation. You can’t keep track of each text chunk and replace after the fact, as it’s already been sent to the client, which makes sense, but then results in this limitation.

The only real solution I can think of, which @ahmed2 alluded to, is an option in HTMLRewriter to not chunk text nodes.

Well, that’s what you were asking, weren’t you?

I am not sure about that.

I was looking to see if there was a solution to this limitation that I was missing, but it doesn’t seem so.

That’s how the HTMLRewriter works: it’s a streaming response, so as not to affect TTFB. If you keep track of old text chunks and try to replace them after the fact, you’ll see:
TypeError: This content token is no longer valid. Content tokens are only valid during the execution of the relevant content handler.

Well, the posted code is virtually identical to the one I posted two days ago, save for the concatenation and the includes check, neither of which really serves a purpose in this context.

The best advice would probably be to clarify this with support. Maybe @harris has some idea too.

@sandro : I was writing example code to show what should work, except it doesn’t because Cloudflare streams the chunks before lastInTextNode is true. I wouldn’t even use the HTMLRewriter for text replacement at this time since there isn’t any way to ensure match patterns aren’t split across text chunks.
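To make the split-match problem concrete, here is a small sketch in plain JavaScript that does not use HTMLRewriter at all. The chunk boundary, the pattern and the replacement are made up for illustration; the real boundaries are chosen by the runtime.

```javascript
// Hypothetical chunking of a single text node.
const chunks = ['Visit some.dom', 'ain.to for details'];
const pattern = 'some.domain.to';
const replacement = 'example.com';

// Per-chunk replacement: each chunk is searched in isolation,
// so a match that straddles the boundary is never found.
const perChunk = chunks.map(c => c.replace(pattern, replacement)).join('');

// Whole-node replacement: the same text searched as one string.
const wholeNode = chunks.join('').replace(pattern, replacement);

console.log(perChunk);  // Visit some.domain.to for details
console.log(wholeNode); // Visit example.com for details
```

The per-chunk version leaves the text untouched, which is the behavior described above when a pattern is split across text chunks.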

My point was that all that concatenation is not necessary to begin with.

What the OP wants to achieve appears not to be possible, but maybe there is some workaround, hence the suggestion to contact support and the tag for @harris.

Also, the problem is not necessarily that anything has been sent to the client already (I haven’t found any indication of that so far) but rather that the content reference is chunk-specific and not “global”.

The HTMLRewriter help content states:

remove(): Element: Removes the element with all its content.

Try matching a large block of text that gets chunked and then executing:

text.remove()

If you match the first text chunk, the entire element gets removed (opening and closing tags and all content contained within). This is the intended behavior.

If you match anything other than the first text chunk, the first chunk still gets streamed to the client while all subsequent data is removed (except the closing tag). This results in broken website text / code.

As far as I can tell, remove() only affects the current chunk, not any previous or subsequent one. The same goes for replace().

Please try the test I described and you’ll see remove() affects every text chunk from when it’s called until the end of the element. Thus when you execute it on the first match it removes the entire element, including opening and closing tags. When you execute it on any match other than the first, however, any prior text chunks are streamed to the client while all subsequent text chunks are removed.

Could you post the code you tested that with?

  text(text) {
    if (text.text.match('some text to match')) {
      text.remove()
    }
  }

Try that on a large block of text that gets chunked, once with the matching text in the first chunk and once with the matching text in the second or later chunks.

Well, it only removes the chunk, not everything:

class elementHandler
{
	text(text)
	{
		if (text.text.match('text')) text.remove();
	}
}

async function handleRequest(request)
{
	const response = new Response(`<div>-----text----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------</div>`);

	const rewriter = new HTMLRewriter();
	rewriter.on('*', new elementHandler());

	return rewriter.transform(response);
}


addEventListener('fetch', event => {
	event.respondWith(handleRequest(event.request));
});

That’s definitely not the behavior I see on my own websites. I just re-ran my tests to confirm.

I assume your example isn’t being streamed in chunks so that’s why it works?

I can’t comment on your examples, I am afraid, but in the case at hand the chunk methods only apply to the current chunk.

Are you sure you are working with chunks? If you are not, remove() would certainly remove everything.

Yes, .replace() only replaces that individual chunk of text. To deal with the split-chunk problem you’ll need to both buffer and remove the text chunks as you see them, until you see text.lastInTextNode === true. At that point, run the replacement code on the entire buffer, replace the last text chunk, and reset the buffer. (This is essentially the same as what @sandro suggested.)

Something like this should do the trick:

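// Placeholder pattern and replacement text; substitute your own.
const pattern = /some\.domain\.to/g
const replacement = 'example.com'

// Text of the current text node seen so far. Create one per request
// so that concurrent requests do not share it.
let buffer = ''

let handler = {
  text(text) {
    buffer += text.text

    if (text.lastInTextNode) {
      // We're done with this text node -- search and replace and reset.
      text.replace(buffer.replace(pattern, replacement))
      buffer = ''
    } else {
      // This wasn't the last text chunk, and we don't know if this chunk
      // will participate in a match. We must remove it so the client
      // doesn't see it.
      text.remove()
    }
  }
}

// originalResponse is the Response to rewrite, e.g. the result of fetch(request).
let response = new HTMLRewriter()
    .on("*", handler)
    .transform(originalResponse)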

Something to keep in mind is that this will be defeated by opening elements:

<p>some.domain.to<strong>replace</strong></p>

This situation would be much more difficult to handle generically.
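For anyone who wants to check both cases without deploying a Worker, here is a rough sketch that drives the same buffering logic with mock text chunks. The `simulate` function is a hypothetical stand-in for HTMLRewriter, not part of its API; it only mimics the `text`, `lastInTextNode`, `replace()` and `remove()` members used above, and it ignores tags entirely. The second case uses a variation of the markup above, `<p>some.domain.<strong>to</strong></p>`, where the pattern itself spans the element boundary.

```javascript
// Factory so every run gets its own buffer (names are illustrative).
function makeBufferingHandler(pattern, replacement) {
  let buffer = '';
  return {
    text(text) {
      buffer += text.text;
      if (text.lastInTextNode) {
        text.replace(buffer.replace(pattern, replacement));
        buffer = '';
      } else {
        text.remove();
      }
    },
  };
}

// Hypothetical stand-in for HTMLRewriter: textNodes is an array of text
// nodes, each given as an array of chunk strings. Returns the emitted text.
function simulate(handler, textNodes) {
  let out = '';
  for (const chunks of textNodes) {
    chunks.forEach((chunk, i) => {
      let emitted = chunk; // passes through unless replaced or removed
      handler.text({
        text: chunk,
        lastInTextNode: i === chunks.length - 1,
        replace(s) { emitted = s; },
        remove() { emitted = ''; },
      });
      out += emitted;
    });
  }
  return out;
}

// One text node split into two chunks: the buffered match succeeds.
const withinNode = simulate(
  makeBufferingHandler('some.domain.to', 'example.com'),
  [['Visit some.dom', 'ain.to']]
);

// Two separate text nodes: each is buffered on its own, so nothing matches.
const acrossNodes = simulate(
  makeBufferingHandler('some.domain.to', 'example.com'),
  [['some.domain.'], ['to']]
);

console.log(withinNode);  // Visit example.com
console.log(acrossNodes); // some.domain.to
```

Handling the second case generically would mean buffering across element boundaries as well, which is why it is much harder.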

Harris

@ahmed2, what you describe is not the intended behavior, nor what I see in my own tests. Could you post a self-contained example?

Note that if you were to call .remove() on an element (not a text chunk), then it would have the described effect.

That makes a lot of sense, thanks harris!

Thanks for the explanation and example, @harris. I reworked my code based on your example and now I get the results I want. I must’ve made a mistake in my regex matching or something similar and misattributed the strange behavior to the HTMLRewriter.

On that note, can you answer the following:

  1. Individual text chunks are streamed to the client, right?

  2. Is there any way to buffer the response to the client until the HTMLRewriter encounters a specific tagName and then start streaming? I’m trying to parse some content in the head and add it to the response headers before streaming (see the thread “Non-streaming HTMLRewriter response?”).

Hi @ahmed2,

Assuming the response returned from HTMLRewriter.transform() is being streamed to the client, then yes. More pedantically, once the JS handler is done with the chunk, the (possibly modified) text chunk is enqueued to be read from the transformed response body. It will wait for something to read it, either the script itself via something like Response.arrayBuffer(), a ReadableStream reader, or by passively streaming back to the client via event.respondWith(). Further chunks will not be read and parsed until the already-processed chunk has been consumed, to minimize instantaneous memory usage.
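As a loose mental model of the "further chunks are not processed until the previous one has been consumed" part, a plain JavaScript generator behaves similarly. This is only an analogy and not how the Workers runtime is implemented: a generator does its work when asked for the next value, whereas the rewriter processes a chunk and then waits for it to be read. The point it illustrates is that unconsumed input is never processed.

```javascript
const log = [];

// Stands in for the parser plus handler: it only advances when the
// consumer asks for the next chunk.
function* processedChunks(chunks) {
  for (const chunk of chunks) {
    log.push(`processed ${chunk}`);
    yield chunk;
  }
}

const stream = processedChunks(['A', 'B', 'C']);
log.push('stream created');

// The consumer reads two chunks and then stops.
const first = stream.next().value;
log.push(`consumed ${first}`);
const second = stream.next().value;
log.push(`consumed ${second}`);

console.log(log);
// [ 'stream created', 'processed A', 'consumed A', 'processed B', 'consumed B' ]
```

Chunk C is never processed because nothing asked for it, which is the memory-saving property described above.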

Good question – I’ll respond in that thread.