Detecting invalid JS strings
Is there any built-in way to "detect" invalid strings?
Here's an example:
Some questions:
1. Can I somehow detect "bad" Unicode strings?
2. Why can I console.log it, and what does it do?
3. What happens in the Deno REPL that makes it throw an error?
2 Replies
There's a stage 3 proposal to detect strings that aren't valid UTF-16, i.e. strings containing lone surrogates (or to use the term they're going with in the proposal, "non-well-formed" strings). It adds String.prototype.isWellFormed(): https://github.com/tc39/proposal-is-usv-string
For now, you can check that a string is well-formed (the expression is true when there are no lone surrogates) with:
!/\p{Surrogate}/u.test(str)
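To make that check concrete, here is a minimal sketch. The helper name hasLoneSurrogate is made up for illustration, and it assumes a runtime that supports Unicode property escapes in regular expressions; it uses the proposal's isWellFormed() method only if the runtime happens to ship it.

```javascript
// Hypothetical helper name; not part of any library or of the proposal.
function hasLoneSurrogate(str) {
  // Prefer the built-in from the proposal when the runtime provides it.
  if (typeof String.prototype.isWellFormed === "function") {
    return !str.isWellFormed();
  }
  // Fallback: with the `u` flag a proper surrogate pair is matched as a
  // single code point, so \p{Surrogate} only matches unpaired surrogates.
  return /\p{Surrogate}/u.test(str);
}

const good = "a\u{1F600}b"; // emoji stored as a valid surrogate pair
const bad = "a\uD83Db";     // lone high surrogate

console.log(hasLoneSurrogate(good)); // false
console.log(hasLoneSurrogate(bad));  // true
```

Note that the two halves of a pair in the wrong order ("\uDE00\uD83D") also count as lone surrogates.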
2. console.log() and the REPL have different printing implementations, with console.log() being written in JS and the REPL mainly in Rust. Rust's string type is UTF-8-based, and it doesn't support invalid UTF-16.
Looks like what console.log() is essentially doing is encoding the JS string into UTF-8 using a "lossy" conversion, which turns each invalid UTF-16 code unit (a lone surrogate) into the replacement character (U+FFFD, �).
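The lossy behavior can be reproduced from JS with the standard TextEncoder, which replaces lone surrogates with U+FFFD when encoding to UTF-8. This is only a sketch of the effect described above, not Deno's actual console.log() code path, and the function name lossyUtf8Bytes is made up for illustration.

```javascript
// Hypothetical helper: encode a JS string to UTF-8 the "lossy" way.
// TextEncoder converts its input to a well-formed string first, so every
// lone surrogate becomes U+FFFD (bytes EF BF BD) instead of causing an error.
function lossyUtf8Bytes(str) {
  return new TextEncoder().encode(str);
}

const bytes = lossyUtf8Bytes("a\uD83Db"); // lone high surrogate in the middle
console.log(Array.from(bytes, (b) => b.toString(16))); // [ '61', 'ef', 'bf', 'bd', '62' ]

// Decoding the bytes again shows the replacement character.
const roundTripped = new TextDecoder().decode(bytes);
console.log(roundTripped === "a\uFFFDb"); // true
```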
I think there's an open issue about making the REPL output do the same.
What the REPL currently does is use a lossless conversion, which can fail when given invalid UTF-16, as in this case.
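For contrast, a lossless conversion has nothing it can legitimately emit for a lone surrogate, so its only option is to fail. The sketch below imitates that in JS; it is not the REPL's Rust code, and encodeUtf8Strict is a hypothetical name.

```javascript
// Hypothetical helper: a strict UTF-16 -> UTF-8 conversion that refuses
// lone surrogates instead of replacing them with U+FFFD.
function encodeUtf8Strict(str) {
  if (/\p{Surrogate}/u.test(str)) {
    throw new TypeError("string contains a lone surrogate");
  }
  return new TextEncoder().encode(str);
}

console.log(encodeUtf8Strict("ok").length); // 2

try {
  encodeUtf8Strict("a\uD83Db");
} catch (err) {
  console.log(err.message); // string contains a lone surrogate
}
```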