abi · 17mo ago

Detecting invalid JS strings

Is there any built-in way to "detect" invalid strings? Here's an example:
// this is a lone (unpaired) low surrogate, so the string is not well-formed Unicode
const bad = "\udc11"

// but i can console.log it:
console.log(bad)
// prints: �

// and i can use it in other strings:
const foo = bad + "-" + bad

// but when i try to evaluate it in the repl:
> foo
Unterminated string literal Unknown exception
Some questions:
1. Can I somehow detect "bad" Unicode strings?
2. Why can I console.log it, and what does it do?
3. What happens in the Deno REPL that makes it throw an error?
2 Replies
Andreu Botella (they/them)
There's a stage 3 proposal to detect strings that can't be encoded as valid UTF-8 because they contain lone surrogates (or, to use the term they're going with in the proposal, "non-well-formed" strings): https://github.com/tc39/proposal-is-usv-string
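To make the proposal concrete: it defines String.prototype.isWellFormed(), and whether a given runtime ships it depends on the version. The sketch below feature-detects the method and otherwise falls back to a manual scan of the UTF-16 code units. The helper name isWellFormedString is made up for illustration and is not part of any API.

```javascript
// Hypothetical helper: true when `str` contains no unpaired surrogates.
function isWellFormedString(str) {
  // Use the built-in from the proposal when the runtime provides it.
  if (typeof String.prototype.isWellFormed === "function") {
    return str.isWellFormed();
  }
  // Fallback: walk the UTF-16 code units and look for unpaired surrogates.
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: it must be followed by a low surrogate.
      const next = str.charCodeAt(i + 1); // NaN at the end of the string
      if (!(next >= 0xdc00 && next <= 0xdfff)) return false;
      i++; // skip the low half of the valid pair
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no high surrogate before it.
      return false;
    }
  }
  return true;
}

console.log(isWellFormedString("\udc11")); // false
console.log(isWellFormedString("hello"));  // true
```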
Andreu Botella (they/them)
1. For now you can test it with !/\p{Surrogate}/u.test(str), which is true when the string contains no lone surrogates.
2. console.log() and the REPL have different printing implementations: console.log() is written in JS, and the REPL mainly in Rust. Rust's string type is UTF-8-based, and it doesn't support invalid UTF-16. It looks like what console.log() is essentially doing is encoding the JS string into UTF-8 using the "lossy" conversion, which turns lone surrogates into the replacement character (U+FFFD, �).
3. What the REPL currently does is use a lossless conversion that can fail when given invalid UTF-16, as in this case. I think there's a bug open for the REPL output to use the lossy conversion instead.
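A small sketch of both points from the reply, assuming a runtime with the standard TextEncoder and TextDecoder globals (Deno, Node, and browsers have them). The round trip through UTF-8 only illustrates what a "lossy" conversion does in general; it is not a claim about how Deno's console.log() is implemented internally. The wellFormed name is made up for illustration.

```javascript
const bad = "\udc11";

// The regex check from the reply: true means no lone surrogates were found.
const wellFormed = (str) => !/\p{Surrogate}/u.test(str);

// Lossy round trip through UTF-8: TextEncoder substitutes U+FFFD
// for any lone surrogate before producing bytes.
const lossy = new TextDecoder().decode(new TextEncoder().encode(bad));

console.log(wellFormed(bad));      // false
console.log(wellFormed("ok 😀"));  // true (a valid surrogate pair is fine)
console.log(lossy === "\ufffd");   // true
```

A lossless conversion has no replacement character to fall back on, so it can only fail on input like this, which matches the error seen in the REPL.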