HTMLRewriter dynamic text replacement on all elements

I’ve been playing with HTMLRewriter recently and love its flexibility, but I’ve run into a limitation that I’m not sure how to solve, unless I’m just missing something in the API.

My use case is proxying my blog from another domain, and essentially rewriting all content, HTML attributes, link/meta tags etc. from one domain to another. Up until this point, I’ve just had the entire payload in memory, replaced with regex, and then served it back in one go. That works just fine, but if I can make these changes faster and more efficiently without affecting TTFB… that’s very cool.

With text nodes, does .replace replace only that individual chunk of text? I guess it would make sense if so, but then how can I read the contents of elements like <div>HELLO replace.example.com</div> and make replacements where necessary, with any degree of reliability, given the text chunking?

For a very stripped down code example, I currently have this:

addEventListener('fetch', event => {
	event.respondWith(handleRequest(event.request));
});

class elementHandler{
	element(){
		// reset nextText for chunking
		this.nextText = '';
	}
	text(text){
		// append to nextText for chunking
		this.nextText += text.text;
		if(text.lastInTextNode){
			// this is the last bit of text in the chunk. Check and replace as necessary
			if(this.nextText.includes('replace.example.com')){
				text.replace(this.nextText.replace(/replace\.example\.com/g, 'some.other.domain.com'));
			}
		}
	}
}

async function handleRequest(request){
	const response = new Response(`<div>HELLO replace.example.com <span>nope.example.com</span></div>`);

	const rewriter = new HTMLRewriter();
	rewriter.on('*', new elementHandler());
	return rewriter.transform(response);
}

As you can see, I’m essentially just trying to replace all occurrences of replace.example.com with some.other.domain.com.

I would expect this to produce:

<div>HELLO some.other.domain.com <span>nope.example.com</span></div>

But instead, it actually produces:

<div>HELLO replace.example.com HELLO some.other.domain.com <span>nope.example.com</span></div>

Wouldn’t this do the trick?

class elementHandler
{
	text(text)
	{
		text.replace(text.text.replace(/replace\.example\.com/g, 'some.other.domain.com'));
	}
}

In my limited example, yes, but not in the real world, with hundreds of elements, each of which could be chunked mid-URL to produce, for example, replace.ex and ample.com as separate chunks.
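
To make that failure mode concrete, here is a minimal sketch in plain JavaScript with no HTMLRewriter involved. The `replacePerChunk` helper and the chunk boundaries are made up for illustration; real boundaries depend on how the response happens to be streamed.

```javascript
// Hypothetical helper: applies the replacement to each chunk independently,
// which is effectively what a per-chunk text() handler does.
function replacePerChunk(chunks, from, to) {
  return chunks.map(chunk => chunk.split(from).join(to)).join('');
}

const from = 'replace.example.com';
const to = 'some.other.domain.com';

// The match sits entirely inside one chunk, so the replacement works.
const intact = replacePerChunk(['HELLO replace.example.com', ' bye'], from, to);

// Same text, but split mid-URL: neither chunk contains the full match.
const split = replacePerChunk(['HELLO replace.ex', 'ample.com bye'], from, to);

console.log(intact); // HELLO some.other.domain.com bye
console.log(split);  // HELLO replace.example.com bye
```

The second call leaves the old domain in place even though the concatenated text clearly contains it.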

I’ve tested replacing text on my own websites and I think I know what’s happening in your case - the previous text chunk (or chunks) have already been passed to the client so they cannot be replaced. You can only replace any subsequent text chunks that still haven’t been passed to the client.

Have you tried replacing the text before lastInTextNode is true? That has worked for me.

I think the proper fix here is for Cloudflare to update the HTMLRewriter such that it only passes text chunks to the client after lastInTextNode is true and any rewrites for said node have been parsed.

Thanks for the info.

I’ve tried replacing the text at all instances I can think of. Replacing before lastInTextNode can work if you’re lucky that the full domain is in the previous text chunk, but if it gets split between two chunks, it breaks.

I’ve been wracking my brain for a way to fix this, but I don’t think it’s possible with the current implementation. I’ve gone back to doing a full-page fetch and regex replace for now, but I’d love to find a way to do this using HTMLRewriter in the future, perhaps by passing an option to the HTMLRewriter to waitForLastText or something.
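
For reference, the full-page fallback described here might look roughly like the sketch below. The function names are hypothetical and the code is only an illustration of the approach: it buffers the whole body before replacing, so it gives up streaming and delays the first byte. `Response` and `Headers` are the standard Fetch API classes.

```javascript
// Escape regex metacharacters so the dots in a hostname match literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Replace every occurrence of one hostname with another in a full HTML string.
function rewriteHost(html, fromHost, toHost) {
  // A replacer function avoids "$" in toHost being treated as a pattern.
  return html.replace(new RegExp(escapeRegExp(fromHost), 'g'), () => toHost);
}

// Buffers the entire upstream body, so nothing reaches the client until
// the replacement has run over the complete document.
async function rewriteResponse(upstream, fromHost, toHost) {
  const html = await upstream.text();
  const headers = new Headers(upstream.headers);
  // The body length may change, so drop any stale Content-Length.
  headers.delete('content-length');
  return new Response(rewriteHost(html, fromHost, toHost), {
    status: upstream.status,
    headers,
  });
}
```

Because the replacement sees the whole document at once, a match can never be split, which is exactly the property the chunked text handler lacks.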

Here is example code that works for me:

text(text) {
  if (this.textChunk === null) {
    // Initialize text chunk
    this.textChunk = text.text
  } else {
    // Concatenate text chunks
    this.textChunk += text.text
  }

  // Rewrite some subdomains
  if (this.textChunk.includes('old.example.domain')) {
    try {
      text.replace(text.text.replace(/old\.example\.domain/gi, 'new.example.domain'))
    } catch (error) {
      console.log('text.replace', error)
    }
  }

  // On last text chunk
  if (text.lastInTextNode) {
    // Reset text chunk
    this.textChunk = null
  }
}

The code should work unless the text gets chunked and passed to the client mid match pattern (this goes back to my point that Cloudflare should update HTMLRewriter to wait for lastInTextNode).

In your example you are checking the full buffer but only replacing the current chunk.

Yeah, I’ve seen that happen. My example is limited, but in the real world, with elements that have thousands of text characters, it can get chunked mid-text.

That’s all you can replace with the current implementation. You can’t keep track of each text chunk and replace after the fact, as it’s already been sent to the client, which makes sense, but then results in this limitation.
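
One idea that stays within that constraint is to never touch an old token at all: within each handler call, hold back any trailing characters that could be the start of a match and prepend them to the next chunk. The sketch below has not been run against the Workers runtime. It assumes that calling `replace()` on the current chunk during its own handler call takes effect, and `CarryOverRewriter` and the mock chunk objects are hypothetical.

```javascript
class CarryOverRewriter {
  constructor(from, to) {
    if (!from) throw new Error('from must be a non-empty string');
    this.from = from;
    this.to = to;
    this.carry = ''; // withheld tail of the previous chunk
  }

  text(chunk) {
    const data = this.carry + chunk.text;
    let out = '';
    let pos = 0;
    let hit;
    // Replace every complete match in the carried tail plus this chunk.
    while ((hit = data.indexOf(this.from, pos)) !== -1) {
      out += data.slice(pos, hit) + this.to;
      pos = hit + this.from.length;
    }
    let rest = data.slice(pos);
    this.carry = '';
    if (!chunk.lastInTextNode) {
      // Withhold the longest suffix that could still grow into a match.
      for (let n = Math.min(this.from.length - 1, rest.length); n > 0; n--) {
        if (rest.endsWith(this.from.slice(0, n))) {
          this.carry = rest.slice(rest.length - n);
          rest = rest.slice(0, rest.length - n);
          break;
        }
      }
    }
    chunk.replace(out + rest);
  }
}

// Stand-in for HTMLRewriter text chunks, purely for trying the logic locally.
function runOnMockChunks(handler, pieces) {
  let output = '';
  pieces.forEach((piece, i) => {
    let emitted = piece;
    handler.text({
      text: piece,
      lastInTextNode: i === pieces.length - 1,
      replace(content) { emitted = content; },
    });
    output += emitted;
  });
  return output;
}

const carried = runOnMockChunks(
  new CarryOverRewriter('replace.example.com', 'some.other.domain.com'),
  ['HELLO replace.ex', 'ample.com and replace', '.example.com', '']
);
console.log(carried); // HELLO some.other.domain.com and some.other.domain.com
```

At most `from.length - 1` characters are delayed per chunk, so most of the text would still stream. A match spread across separate text nodes, such as one interrupted by a tag, would still be missed.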

The only real solution I can think of, which @ahmed2 alluded to, is an option in HTMLRewriter to not chunk text nodes.

Well, that’s what you were asking, weren’t you?

I am not sure about that.

I was looking to see if there was a solution to this limitation that I was missing, but it doesn’t seem so.

That’s how the HTMLRewriter works - it’s a streaming response, so as not to affect TTFB. If you keep track of old text nodes and try to replace them after the fact, you’ll see:
TypeError: This content token is no longer valid. Content tokens are only valid during the execution of the relevant content handler.

Well, the posted code is virtually identical to the one I posted two days ago, save for the concatenation and the includes check, neither of which really serves a purpose in this context.

The best advice would probably be to clarify this with support. Maybe @harris has some idea too.

@sandro : I was writing example code to show what should work, except it doesn’t because Cloudflare streams the chunks before lastInTextNode is true. I wouldn’t even use the HTMLRewriter for text replacement at this time since there isn’t any way to ensure match patterns aren’t split across text chunks.

My point was that all that concatenation is not necessary to begin with.

What the OP wants to achieve appears not to be possible, but maybe there is some workaround, hence the suggestion to contact support and the tagging of Harris.

Also, the problem is not necessarily that anything has been sent to the client already (I haven’t found any indication of that so far) but rather that the content reference is chunk-specific and not “global”.

The HTMLRewriter help content states:

remove(): Element: Removes the element with all its content.

Try matching a large block of text that gets chunked and then executing:

text.remove()

If you match the first text chunk, the entire element gets removed (opening and closing tags and all content contained within). This is the intended behavior.

If you match anything other than the first text chunk, the first chunk still gets streamed to the client while all subsequent data is removed (except the closing tag). This results in broken website text / code.

As far as I can tell remove() only seems to affect the current chunk and neither the previous nor any subsequent. The same for replace().
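
If remove() and replace() really do act only on the current chunk, one workaround might be to remove every non-final chunk while buffering its text, then emit the whole node from the final chunk. The following is a minimal sketch under that assumption, exercised only against mock chunk objects rather than the real HTMLRewriter. `BufferedTextRewriter` and `runOnMockChunks` are hypothetical names.

```javascript
// Sketch of a handler that withholds a text node until its last chunk arrives.
class BufferedTextRewriter {
  constructor(from, to) {
    this.from = from;
    this.to = to;
    this.buffer = '';
  }

  text(chunk) {
    this.buffer += chunk.text;
    if (!chunk.lastInTextNode) {
      // Drop this chunk from the output; its text lives on in the buffer.
      chunk.remove();
      return;
    }
    // Final chunk of the node: emit the whole node once, with replacements.
    chunk.replace(this.buffer.split(this.from).join(this.to));
    this.buffer = '';
  }
}

// Stand-in for HTMLRewriter text chunks, purely for trying the logic locally.
function runOnMockChunks(handler, pieces) {
  let output = '';
  pieces.forEach((piece, i) => {
    let emitted = piece;
    handler.text({
      text: piece,
      lastInTextNode: i === pieces.length - 1,
      remove() { emitted = ''; },
      replace(content) { emitted = content; },
    });
    output += emitted;
  });
  return output;
}

const result = runOnMockChunks(
  new BufferedTextRewriter('replace.example.com', 'some.other.domain.com'),
  ['HELLO replace.ex', 'ample.com', '']
);
console.log(result); // HELLO some.other.domain.com
```

The trade-off is that each text node is held back until its last chunk, so very long text nodes would no longer stream piece by piece.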

Please try the test I described and you’ll see remove() affects every text chunk from when it’s called until the end of the element. Thus when you execute it on the first match it removes the entire element, including opening and closing tags. When you execute it on any match other than the first, however, any prior text chunks are streamed to the client while all subsequent text chunks are removed.

Could you post the code you tested that with?

  text(text) {
    if (text.text.match('some text to match')) {
      text.remove()
    }
  }

Try that on a large block of text that gets chunked, once with the matching text in the first chunk and once with the matching text in the second or later chunks.

Well, it only removes the chunk, not everything:

class elementHandler
{
	text(text)
	{
		if (text.text.match('text')) text.remove();
	}
}

async function handleRequest(request)
{
	const response = new Response(`<div>-----text----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------</div>`);

	const rewriter = new HTMLRewriter();
	rewriter.on('*', new elementHandler());

	return rewriter.transform(response);
}


addEventListener('fetch', event => {
	event.respondWith(handleRequest(event.request));
});

That’s definitely not the behavior I see on my own websites. I just re-ran my tests to confirm.

I assume your example isn’t being streamed in chunks so that’s why it works?
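
One way to check that assumption would be to record how each text node actually arrives. The sketch below is a hypothetical `ChunkLogger` handler that stores only chunk metadata; here it is fed mock chunk objects, and in a Worker it would be registered with `rewriter.on(...)` in place of the removing handler.

```javascript
class ChunkLogger {
  constructor() {
    this.log = [];
  }

  text(chunk) {
    // Record only metadata; the chunk object itself is not kept for later use.
    this.log.push({ length: chunk.text.length, last: chunk.lastInTextNode });
  }

  // Number of text() calls that carried actual text.
  nonEmptyChunks() {
    return this.log.filter(entry => entry.length > 0).length;
  }
}

// Mock chunks standing in for a single text node split into two pieces,
// followed by an empty final chunk.
const logger = new ChunkLogger();
[
  { text: '-----text-----', lastInTextNode: false },
  { text: '----------', lastInTextNode: false },
  { text: '', lastInTextNode: true },
].forEach(chunk => logger.text(chunk));

console.log(logger.nonEmptyChunks()); // 2
console.log(logger.log);
```

If a single text node produces more than one non-empty entry, it was chunked; if it produces only one, the test never exercised the split case.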