Parsing HTML with Cheerio using too much CPU time?


#1

Hi
There have been a couple of similar posts, but I’m not sure this has specifically been answered…

I’ve followed https://simon-thompson.me/simple-dom-manipulation-via-jquery-in-cloudflare-workers/ to utilise Cheerio to parse/modify the HTML content of my page.

The only problem is that I occasionally (and seemingly at random) get 1102 errors when viewing the pages live. If I refresh a few times, the error goes away. I don’t get an error in the preview window.

I’m only testing on the free plan at the moment, but obviously wouldn’t want this on a live implementation. Is parsing HTML like this likely to use too much CPU time to be reliable in production?

Is there a way to track the amount of CPU time we’re utilising in a call?

Are there plans to implement HTML parsing in Cloudflare workers natively?

thanks


#2

Cheerio is about 0.5MB on its own, so just parsing the library might take too much CPU time. What does your Cloudflare Workers dashboard show as the 99th percentile CPU time?

I’m a bit worried about the 1MB limit and the CPU time limit; they will become a problem when people start including Node.js libraries like these…


#3

Thanks for the reply. The dashboard says 78.7ms for the 99th percentile - I have 570 requests (19 fails) with a total CPU time of 7.1s

I guess that makes the average request about 12ms, which may answer my question about whether using Cheerio is sensible in production. Even on the Pro plan, it’s too close to the 10ms limit?
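
As a rough check of that estimate, here is the arithmetic using only the dashboard figures quoted above (the variable names are just for illustration):

```javascript
// Figures quoted from the Workers dashboard in this post.
const totalCpuMs = 7100; // total CPU time: 7.1s
const requests = 570;
const failures = 19;

// Mean CPU time per request and share of failed requests.
const averageCpuMs = totalCpuMs / requests;
const failureRatePct = (failures / requests) * 100;

console.log(`average CPU per request: ${averageCpuMs.toFixed(1)}ms`); // 12.5ms
console.log(`failure rate: ${failureRatePct.toFixed(1)}%`);           // 3.3%
```

The mean hides the tail: the 99th percentile quoted above is 78.7ms, so the slowest requests are far above the average.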


#4

You’d need at least 15ms for that to work somewhat reliably. I think that’s only on the Enterprise plan?


#5

If you want a reliable way to change the site, you can use the xml2js package.

It’s 1/5th the size and much faster.


#6

Thanks a lot. The Business plan gives < 50ms per request, which could be suitable with a faster package.

I’ll give it a try - thanks!


#7

Ah, sorry, I meant the Business plan (the highest one with a listed price).

Confused it with some other “plan”, there’s plans everywhere! :wink:


#8

Parsing with a DOM-based parser is generally problematic, as it requires the upstream page to be fully downloaded before you can start streaming a response to your users. A streaming parser might be a better option: https://github.com/fb55/htmlparser2, particularly if combined with our TransformStream support.
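
To illustrate the streaming idea, here is a minimal sketch that rewrites a response chunk by chunk instead of buffering the whole page. It is not an HTML parser: it swaps in a plain string substitution where a streaming parser such as htmlparser2 would go. The helper names `createStreamingReplacer` and `replaceInStream` are hypothetical, and the sketch assumes the runtime's `TransformStream` accepts a custom transformer, as the standard Streams API does.

```javascript
// Hypothetical helper: replaces `needle` with `replacement` in text that
// arrives in arbitrary chunks, without holding the whole document in memory.
function createStreamingReplacer(needle, replacement) {
  if (needle.length === 0) throw new Error("needle must not be empty");
  let carry = ""; // tail of earlier input that might begin a match

  return {
    // Takes the next chunk and returns the text that is safe to emit now.
    push(chunk) {
      const text = carry + chunk;
      let out = "";
      let pos = 0;
      let idx;
      while ((idx = text.indexOf(needle, pos)) !== -1) {
        out += text.slice(pos, idx) + replacement;
        pos = idx + needle.length;
      }
      // Hold back up to needle.length - 1 characters, in case a match
      // straddles the boundary with the next chunk.
      const keep = Math.min(needle.length - 1, text.length - pos);
      out += text.slice(pos, text.length - keep);
      carry = text.slice(text.length - keep);
      return out;
    },
    // Returns whatever is still held back once the input has ended.
    flush() {
      const rest = carry;
      carry = "";
      return rest;
    },
  };
}

// Hypothetical wiring: pipes a byte stream (e.g. a fetch response body)
// through the replacer. Assumes TransformStream, TextDecoder and
// TextEncoder are available as globals.
function replaceInStream(readable, needle, replacement) {
  const replacer = createStreamingReplacer(needle, replacement);
  const decoder = new TextDecoder();
  const encoder = new TextEncoder();
  return readable.pipeThrough(new TransformStream({
    transform(chunk, controller) {
      const out = replacer.push(decoder.decode(chunk, { stream: true }));
      if (out) controller.enqueue(encoder.encode(out));
    },
    flush(controller) {
      const out = replacer.push(decoder.decode()) + replacer.flush();
      if (out) controller.enqueue(encoder.encode(out));
    },
  }));
}
```

In a Worker, this could be used along the lines of `new Response(replaceInStream(upstream.body, "<title>Old</title>", "<title>New</title>"), upstream)`, where `upstream` is the fetched origin response. Only a few characters are held back at a time, so the response can start streaming before the upstream page has finished downloading.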


#9

Many thanks for that - I’ll try this out too.