TextDecoder doesn't support any legacy charsets

@KentonVarda @harris
Is this something that can be added? There are a LOT of use cases where data in legacy encodings has to be read and sent.

Example that works in node/browser but not in Workers:

const isoDecoder = new TextDecoder('ISO-8859-1');
const bytes = new Uint8Array([49, 59, 49, 56, 59, 73, 110, 103, 97, 32, 102, 101, 108, 32, 112, 229, 116, 114, 228, 102, 102, 97, 100, 101, 115]);
console.log(isoDecoder.decode(bytes)); // 1;18;Inga fel påträffades

Playground example:

Sure, we can build character replacement maps manually. However, that would consume several MB of objects and exhaust the CPU-time limit long before we've covered even the ISO-8859 family.


For ISO-8859-1 specifically, you can write this instead:

String.fromCharCode(...bytes)

I agree we should support more charsets here, but there’s a lot on our plate so I don’t know how soon we’ll get to it. :confused:


Shouldn't that be chunked, though? .apply() has a limit on the number of arguments (~125k?) in Chrome. I'm not sure whether the same limit applies in Workers, but I would guess that it does.

https://github.com/google/closure-library/commit/da353e0265ea32583ea1db9e7520dce5cceb6f6a
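
For reference, a chunked version of the String.fromCharCode approach might look roughly like the sketch below. The helper name decodeLatin1 and the chunk size of 8192 are assumptions made for illustration, not anything Workers prescribes. The size is just a conservative value well below the argument limits mentioned above, which also apply to spread calls.

```javascript
// Hypothetical helper (name and chunk size are assumptions, not a Workers API).
// Decodes ISO-8859-1 bytes by mapping each byte to the code point of the same value.
const CHUNK_SIZE = 8192; // assumed conservative value, well under engine argument limits

function decodeLatin1(bytes) {
  let result = '';
  for (let i = 0; i < bytes.length; i += CHUNK_SIZE) {
    // subarray() returns a view on the same buffer, so no bytes are copied here.
    const chunk = bytes.subarray(i, i + CHUNK_SIZE);
    // Each call passes at most CHUNK_SIZE arguments to String.fromCharCode.
    result += String.fromCharCode(...chunk);
  }
  return result;
}
```

For short inputs like the one in the original post this behaves the same as the one-liner. The chunking only matters once the payload grows to many thousands of bytes.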

Ah, yes, I suppose so. But @thomas4’s example was a shorter string, which I’d guess is the more common use case in Workers.


Thanks for the reply. I'm using that method now. It works because it matches parts of the UTF-8 charset. We still need the full sets so that we can support Chinese, for example.

It works because ISO-8859-1 includes exactly the first 256 codepoints of Unicode, one byte per codepoint. (But UTF-8 and ISO-8859-1 only match for the first 128 codepoints, i.e. ASCII.)

But yes, this won’t work for any other charset.
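
To make the difference concrete, here is a small sketch using the "å" (U+00E5, decimal 229) from the example above. The variable names are only illustrative. It assumes the standard TextEncoder global, which always encodes to UTF-8.

```javascript
// ISO-8859-1: one byte per code point, and the byte value equals the code point.
const latin1Bytes = new Uint8Array([229]);
const fromLatin1 = String.fromCharCode(...latin1Bytes);
console.log(fromLatin1 === '\u00e5'); // true

// UTF-8: code points above 127 take more than one byte.
const utf8Bytes = new TextEncoder().encode('\u00e5');
console.log(Array.from(utf8Bytes)); // [195, 165]

// For ASCII (code points 0-127), both encodings produce the same single byte.
const asciiBytes = new TextEncoder().encode('A');
console.log(Array.from(asciiBytes)); // [65]
```

So running String.fromCharCode over UTF-8 bytes would turn that one character into two wrong ones, which is why the trick is only safe when the input really is ISO-8859-1.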

On the web, Chinese, like most languages, is almost always encoded as UTF-8, in which case you don’t need any other charsets to support it.

If you have a use case where you commonly encounter charsets other than UTF-8, I’d be curious to know more about it.

You're right, on the web UTF-8 dominates, but on the API integration side there are a lot of legacy Windows systems that default to ISO-8859. In China, old Windows systems are still way too common for their own good…