Solution: using HTMLRewriter to extract/scrape, not rewrite

According to some people, HTMLRewriter can be used as a generic XML or HTML parser. I'm surprised that “SAX”, lol-html, and HTMLRewriter have never been name-checked on the same page. In the first version of my code, I realized that I didn’t return a Response object to the client until the whole origin file had been read and parsed. It is pointless to spend latency/CPU parsing to the end, although IMO the latency penalty on the client side is unmeasurable next to the I/O. On Discord someone suggested I use try/“throw” to stop HTMLRewriter execution. But I get

worker.js:48 Uncaught (in promise) Error: foundCarrier
    at Object.text (worker.js:48)
text @ worker.js:48
Uncaught (async) Error: foundCarrier

Yes, it is my own exception, so it is not a problem. Bizarrely, though, in the CFW runtime, executing

throw "string";

turns into

e = String("Error: internal error")

You have to throw an Error object, not a string; IDK why. On top of that, you get 2 ugly warnings in the preview editor.

try/catch/throw isn’t supposed to be “normal control flow” by the book, so I decided to rewrite it with “modern” JS.

The first error I ran into: inside an HTMLRewriter handler, you can’t call event.respondWith(new Response(…))

Too late to call FetchEvent.respondWith(). It must be called synchronously in the event handler.

and also

A hanging Promise was canceled. This happens when the worker runtime is waiting for a Promise from JavaScript to resolve, but has detected that the Promise cannot possibly ever resolve because all code and events related to the Promise's request context have already finished.

After a lot of work, the secret turned out to be not using await and instead using

event.respondWith(new Promise(function(resolveCB) {
  resolveCB(new Response(respBody));
}));

I got the runtime to return a Response object mid-parse, from inside HTMLRewriter. It is really not obvious that you have to create a new “fake event”, manually fire it from inside an HTMLRewriter callback, and use .then() instead of the ubiquitous await that nearly all CFW code samples and production code use.
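Why this works can be illustrated outside the Workers runtime. The sketch below is plain JavaScript, not the worker itself: fakeParser and findFirst are hypothetical stand-ins for HTMLRewriter and its handlers. A promise settles the first time its resolve function is called and ignores later calls, so a fallback handler that fires afterwards cannot overwrite the early answer. Note that resolving does not stop the parser; it only lets the answer go out before parsing finishes.

```javascript
// Hypothetical stand-in for HTMLRewriter: feeds each token to a callback,
// then signals the end of input.
function fakeParser(tokens, onToken, onEnd) {
  for (const token of tokens) onToken(token);
  onEnd();
}

// Returns a promise that is resolved from inside the parser callbacks.
function findFirst(tokens, wanted) {
  return new Promise(function (resolveCB) {
    fakeParser(
      tokens,
      token => { if (token === wanted) resolveCB('found ' + token); },
      () => resolveCB('not found') // ignored if resolveCB was already called
    );
  });
}

findFirst(['a', 'b', 'c'], 'b').then(result => console.log(result)); // logs "found b"
findFirst(['a', 'b', 'c'], 'z').then(result => console.log(result)); // logs "not found"
```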

I am posting both the try/catch/throw version and the promises-only version of the same worker so that anyone in the future can see how to use HTMLRewriter as a parser/scraper ALONE, not as a rewriter. Would anyone do this differently? Are the bare return statements (void context) in the promise callback functions the right way to avoid the CPU cost of firing “resolve(undefined)” events? The promises-only version comes first:

addEventListener("fetch", event => {
  event.respondWith(new Promise(function(resolveCB) {
    let num = new URL(event.request.url).pathname.substring(1);
    //match foo.com/2125551234 or foo.com/2125551 only
    if (!/^\d{7,10}$/.test(num)) {
      resolveCB(new Response(null, {
        status: 400
      }));
      //console.log tracing shows exec continues even though
      //client gets code 400
      return;
    }
    //todo use Cache API and client Cache-Control some how
    let responseOrigin = fetch('https://www.telcodata.us/search-area-code-exchange-detail?npa=' +
      num.substr(0, 3) + '&exchange=' + num.substr(3, 3));
    let metaCarrier;

    let textBuf;
    let curExch;
    //reformat number to origin-like string
    num = num.substr(0, 3) + '-' + num.substr(3, 3) + '-' + num.substr(6, 1);
    let rewriter = new HTMLRewriter()
      .on('tr[class="results"]>td:nth-child(1)>a', {
        element: function() {
          textBuf = '';
        },
        text: function(text) {
          textBuf += text.text; // concatenate new text with existing text buffer
          if (text.lastInTextNode) {
            curExch = textBuf;
            //console.log("saw xch "+textBuf);
          }
        }
      })
      .on('tr[class="results"]>td:nth-child(3)>a', {
        element: function() {
          textBuf = '';
        },
        text: async function(text) {
          textBuf += text.text; // concatenate new text with existing text buffer
          if (text.lastInTextNode) {
            metaCarrier ??= textBuf;
            //console.log(textBuf + 'cur xchg ' + curExch + ' match xch ' + num);
            if (curExch == num) {
              resolveCB(new Response(textBuf, {
                headers: {
                  "content-type": "text/plain",
                  "cache-control": "no-transform"
                }
              }));
            }
          }
        }
      })
    //originally this fired at the end of the origin document, but to save
    //CPU/parse time, abandon the thousands-block search at an element
    //right after the <table> element
    //   .onDocument({
    //       end: function() {
    //         if (!response) {
    //           response = new Response(metaCarrier, {
      .on('div[id="WSPadding"]', {
          element: function() {
            resolveCB(new Response(metaCarrier, {
              status: (metaCarrier ? 200 : 404),
              headers: {
                "content-type": "text/plain",
                "cache-control": "no-transform"
              }
            }));
          }
        }
      );
    //Promise {[[PromiseState]]: "pending", [[PromiseResult]]: undefined}
    responseOrigin.then(function(resp) {
    //Promise {[[PromiseState]]: "pending", [[PromiseResult]]: undefined}
    //but lets just toss the arrayBuffer() object/void return, just in case
      rewriter.transform(resp).arrayBuffer();
      return;
    });
    //toss promise just in case
    return;
  }));
})
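On the bare return; statements: as far as standard Promise behavior goes, the value returned by the executor function passed to new Promise() is discarded, and a .then() callback that returns nothing still fulfills the derived promise with undefined. So the explicit returns look harmless but are not needed to suppress anything. The check below is plain JavaScript with illustrative names and no Workers APIs; it says nothing about CPU accounting in the CFW runtime.

```javascript
// The executor's return value is discarded; only resolve/reject matter.
const p = new Promise(function (resolve) {
  resolve('first');
  return 'ignored'; // has no effect on the promise
});

// A .then() callback without a return statement still produces a derived
// promise, which is fulfilled with undefined.
const derived = p.then(function (value) {
  // intentionally returns nothing
});

const check = Promise.all([p, derived]).then(function ([a, b]) {
  console.log(a); // logs "first"
  console.log(b); // logs undefined
  return [a, b];
});
```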

And now for the try/catch/throw version:

async function handleRequest(event) {
    let num = new URL(event.request.url).pathname.substring(1);
    //match foo.com/2125551234 or foo.com/2125551 only
    if (!/^\d{7,10}$/.test(num)) {
      return new Response(null, {
        status: 400
      })
    }
    //todo use Cache API and client Cache-Control some how
    let responseOrigin = fetch('https://www.telcodata.us/search-area-code-exchange-detail?npa=' +
      num.substr(0, 3) + '&exchange=' + num.substr(3, 3));
    let metaCarrier;
    let response;
    let cfstatus;

    let textBuf;
    let curExch;
    num = num.substr(0, 3) + '-' + num.substr(3, 3) + '-' + num.substr(6, 1);
    let rewriter = new HTMLRewriter()
      .on('tr[class="results"]>td:nth-child(1)>a', {
        element: function() {
          textBuf = '';
        },
        text: function(text) {
          textBuf += text.text; // concatenate new text with existing text buffer
          if (text.lastInTextNode) {
            curExch = textBuf;
            console.log(textBuf);
          }
        }
      })
      .on('tr[class="results"]>td:nth-child(3)>a', {
        element: function() {
          textBuf = '';
        },
        text: async function(text) {
          textBuf += text.text; // concatenate new text with existing text buffer
          if (text.lastInTextNode) {
            metaCarrier ??= textBuf;
            if (curExch == num) {
              response = new Response(textBuf, {
                headers: {
                  "content-type": "text/plain",
                  "cache-control": "no-transform"
                }
              });
              throw new Error('foundCarrier');
            }
          }
        }
      })
      //not optimal, other example stops on an element
      //right after the table element for no 1000s block
      //numbers
      .onDocument({
          end: function() {
            if (!response) {
              response = new Response(metaCarrier, {
                status: (metaCarrier? 200 : 404),
                headers: {
                  "content-type": "text/plain",
                  "cache-control": "no-transform"
                }
              });
            }
          }
        }

      );
    responseOrigin = await responseOrigin;

    //stop parsing early once match hit
    try {
    await rewriter.transform(responseOrigin).arrayBuffer();
    } catch(e) {
      if(e != 'Error: foundCarrier') {
        return errorResponse(e);
      }
    }
    return response;
}

addEventListener("fetch", event => {
  return event.respondWith(handleRequest(event))
})
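The e != 'Error: foundCarrier' test above only passes because loose equality converts the Error to a string. A slightly more explicit variant, sketched here in plain JavaScript, compares e.message and rethrows anything it does not recognize. parse and stoppedEarly are hypothetical stand-ins for the rewriter and the handler, and whether the Workers runtime hands the thrown Error back with its message intact is an assumption.

```javascript
const STOP_MESSAGE = 'foundCarrier';

// Hypothetical stand-in for the rewriter: throws the sentinel on a match.
function parse(rows, wanted) {
  for (const row of rows) {
    if (row === wanted) throw new Error(STOP_MESSAGE);
  }
}

// Returns true if parsing stopped early on a match, false if it ran to the end.
function stoppedEarly(rows, wanted) {
  try {
    parse(rows, wanted);
    return false;
  } catch (e) {
    // Compare the message instead of relying on Error-to-string coercion.
    if (e && e.message === STOP_MESSAGE) return true;
    throw e; // anything else is a real failure
  }
}

console.log(stoppedEarly(['212-555-0', '212-555-1'], '212-555-1')); // true
console.log(stoppedEarly(['212-555-0'], '212-555-9')); // false
```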

Hi @bulk88,

Yes, respondWith() is a confusing API that doesn’t play well with async/await. This comes from the Service Workers API standard. We’re thinking of moving away from that standard in large part because of this.

To avoid this problem, I recommend always starting a worker like this:

addEventListener("fetch", event => {
  event.respondWith(handle(event.request));
});

async function handle(request) {
  // ... you can now use async/await here ...
  return new Response("OK");
}

This way you never have to use .then().

Regarding throwing strings, yes, there are some issues around that when throwing from a callback. I recommend always throwing Error objects, never strings or any other type.

@bulk88: Every time you call .arrayBuffer() you are forcing the entire response body through the HTMLRewriter. Rather than trying to escape this by throwing exceptions, it would be better to stop consuming the body when you are done.

Unfortunately, this is easier said than done, but something like the following works:

import '@worker-tools/event-target-polyfill';
import 'yet-another-abortcontroller-polyfill';

/** 
 * Consumes a `Response` body while discarding all chunks. 
 * Useful for pulling data into `HTMLRewriter`. 
 */
export async function consume(r: Response, signal?: AbortSignal) {
  const reader = r.body!.getReader();
  if (!signal) {
    while (await reader.read().then(x => !x.done)) { /* noop */ }
  } else {
    const aborted = new Promise(res => signal.addEventListener('abort', res));
    while (await Promise.race([
      reader.read().then(x => !x.done),
      aborted.then(() => false),
    ])) { /* noop */ }
  }
}

self.addEventListener('fetch', event => {
  event.respondWith((async () => {
    const htmlResponse = await fetch('/some/endpoint');

    const ctrl = new AbortController();
    const ids: string[] = [];

    const rewriter = new HTMLRewriter()
      .on('*[id]', {
        element(el) {
          const id = el.getAttribute('id')
          ids.push(id as string) 
          if (id === 'special') ctrl.abort();
        }
      })

    await consume(rewriter.transform(htmlResponse), ctrl.signal);

    return new Response(ids.join());
  })());
});
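For comparison, a simpler variant is sketched below in plain JavaScript. It is not the code above: instead of racing each read against the abort event, it checks signal.aborted between reads and then cancels the reader. This assumes the handler that calls abort() runs while a chunk is being processed, so the flag is already set by the time the pending read() settles, and that cancelling a body that has passed through HTMLRewriter is acceptable. consumeUntilAborted and countingResponse are illustrative names, and the counting stream only stands in for a transformed response.

```javascript
// Reads and discards chunks until the stream ends or the signal is aborted.
async function consumeUntilAborted(response, signal) {
  const reader = response.body.getReader();
  while (!signal.aborted) {
    const { done } = await reader.read();
    if (done) return;
  }
  await reader.cancel(); // tell the source the rest is not needed
}

// Stand-in for a transformed response: counts the chunks it produces and
// calls onChunk, which plays the role of an HTMLRewriter handler.
function countingResponse(total, onChunk) {
  let produced = 0;
  const body = new ReadableStream({
    pull(controller) {
      if (produced >= total) return controller.close();
      produced++;
      onChunk(produced);
      controller.enqueue(new TextEncoder().encode('chunk ' + produced + '\n'));
    },
  });
  return { response: new Response(body), produced: () => produced };
}

async function demo() {
  const ctrl = new AbortController();
  // Abort as soon as the third chunk is produced.
  const source = countingResponse(1000, n => { if (n === 3) ctrl.abort(); });
  await consumeUntilAborted(source.response, ctrl.signal);
  return source.produced(); // only a handful of the 1000 chunks were produced
}
```

Calling demo() resolves with the number of chunks the source had produced when consumption stopped, which stays close to the chunk that triggered the abort rather than running to 1000.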