Saturday, March 16, 2019

Normalizing Unicode strings

A very informative story about encoding text with Unicode.

When "Zoë" !== "Zoë". Or why you need to normalize Unicode strings | With Blue Ink

"the dog emoji 🐶 has the code point U+1F436.

When encoded, the dog emoji can be represented in multiple byte sequences:

UTF-8: 4 bytes, 0xF0 0x9F 0x90 0xB6
UTF-16: 4 bytes, 0xD83D 0xDC36
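To verify these byte sequences yourself (an aside from me, not from the article), here is a quick sketch in Node.js, which exposes raw bytes through its Buffer API:

// UTF-8 encoding of the dog emoji: four bytes.
console.log(Buffer.from('🐶', 'utf8')) // => <Buffer f0 9f 90 b6>

// The two 16-bit UTF-16 code units (a surrogate pair), as JavaScript strings store them.
console.log('🐶'.charCodeAt(0).toString(16)) // => 'd83d'
console.log('🐶'.charCodeAt(1).toString(16)) // => 'dc36'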


Most JavaScript interpreters (including Node.js and modern browsers) use UTF-16 internally. Which means that the dog emoji is stored using two UTF-16 code units (of 16 bits each). So, this should not surprise you:
console.log('🐶'.length) // => 2
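If you want to count code points rather than UTF-16 code units (again my aside), the string iterator walks code points, so spreading the string into an array gives the intuitive answer:

// The spread operator iterates code points, not code units.
console.log([...'🐶'].length) // => 1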

...some characters appearing identical, but having different representations.

...The problem is that some of these characters could be represented in multiple ways.

For example, the letter é could be represented using either:
A single code point U+00E9
The combination of the letter e and the acute accent, for a total of two code points: U+0065 and U+0301

The two characters look the same, but do not compare as equal, and the strings have different lengths. In JavaScript:
console.log('\u00e9') // => é
console.log('\u0065\u0301') // => é
console.log('\u00e9' == '\u0065\u0301') // => false
console.log('\u00e9'.length) // => 1
console.log('\u0065\u0301'.length) // => 2 
...

This can cause unexpected bugs, such as records not found in a database, password mismatches leaving users unable to authenticate, etc."
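The fix the article's title points at is String.prototype.normalize(), available since ES2015. A minimal sketch of comparing the two é representations after normalizing both to NFC (the default form, which composes sequences into precomposed characters):

// Normalize both strings to the same Unicode form before comparing.
const precomposed = '\u00e9'      // é as a single code point
const decomposed = '\u0065\u0301' // e + combining acute accent
console.log(precomposed === decomposed)                         // => false
console.log(precomposed.normalize() === decomposed.normalize()) // => true
console.log(decomposed.normalize().length)                      // => 1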
