When "Zoë" !== "Zoë". Or why you need to normalize Unicode strings | With Blue Ink
"the dog emoji 🐶 has the code point U+1F436.
When encoded, the dog emoji can be represented in multiple byte sequences:
UTF-8: 4 bytes, 0xF0 0x9F 0x90 0xB6
UTF-16: 4 bytes, 0xD83D 0xDC36
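The byte sequences listed above can be checked from JavaScript itself. The following is a minimal sketch, assuming a runtime that exposes the standard `TextEncoder` global (modern browsers and recent Node.js versions); the helper name `hex` is made up for this illustration and is not part of the quoted article.

```javascript
// Hypothetical helper: format a number as uppercase hex with a 0x prefix.
const hex = (n) => '0x' + n.toString(16).toUpperCase();

const dog = '\u{1F436}'; // the dog emoji, written as a code point escape

// UTF-8: TextEncoder always encodes to UTF-8 and returns a Uint8Array.
const utf8Bytes = Array.from(new TextEncoder().encode(dog), hex);
console.log(utf8Bytes.join(' ')); // => 0xF0 0x9F 0x90 0xB6

// UTF-16: charCodeAt returns the individual 16-bit code units of the string.
const utf16Units = [dog.charCodeAt(0), dog.charCodeAt(1)].map(hex);
console.log(utf16Units.join(' ')); // => 0xD83D 0xDC36
```

Both encodings take 4 bytes here: four 8-bit units in UTF-8, and two 16-bit units (a surrogate pair) in UTF-16.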
Most JavaScript interpreters (including Node.js and modern browsers) use UTF-16 internally, which means that the dog emoji is stored using two UTF-16 code units (of 16 bits each). So, this should not surprise you:
console.log('🐶'.length) // => 2
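To count code points rather than UTF-16 code units, one option is to rely on the string iterator, which walks a string by code point. A small sketch (the variable name `emoji` is only illustrative):

```javascript
const emoji = '\u{1F436}'; // the dog emoji

console.log(emoji.length);                      // => 2 (UTF-16 code units)
console.log([...emoji].length);                 // => 1 (code points, via the string iterator)
console.log(emoji.codePointAt(0).toString(16)); // => 1f436
```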
...some characters appearing identical, but having different representations.
...The problem is that some of these characters could be represented in multiple ways.
For example, the letter é could be represented using either:
A single code point U+00E9
The combination of the letter e and the acute accent, for a total of two code points: U+0065 and U+0301
The two characters look the same, but do not compare as equal, and the strings have different lengths. In JavaScript:
console.log('\u00e9') // => é
console.log('\u0065\u0301') // => é
console.log('\u00e9' == '\u0065\u0301') // => false
console.log('\u00e9'.length) // => 1
console.log('\u0065\u0301'.length) // => 2
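The usual remedy, hinted at in the title, is to bring both strings to the same normalization form before comparing them. A minimal sketch using the built-in `String.prototype.normalize`; the variable names are illustrative, and choosing NFC (composed) over NFD (decomposed) is an assumption here, since either works as long as both sides use the same form:

```javascript
const precomposed = '\u00e9';      // é as a single code point
const decomposed = '\u0065\u0301'; // e followed by a combining acute accent

console.log(precomposed === decomposed); // => false

// After normalizing both sides to the same form, they compare as equal.
console.log(precomposed.normalize('NFC') === decomposed.normalize('NFC')); // => true

// NFC composes into one code point, NFD decomposes into two.
console.log(decomposed.normalize('NFC').length);  // => 1
console.log(precomposed.normalize('NFD').length); // => 2
```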
...
This can cause unexpected bugs, such as records not being found in a database, or mismatched passwords leaving users unable to authenticate."
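As an illustration of the "record not found" case, here is a sketch of a lookup that normalizes keys on both write and read. Everything in it is hypothetical: the in-memory `Map` stands in for a database, and `normalizeKey`, `addUser`, and `findUser` are names invented for this example. NFC is again an assumed choice of normalization form.

```javascript
// Hypothetical in-memory "user table", keyed by the normalized name.
const users = new Map();

// Normalize every key the same way on both write and read (NFC assumed).
const normalizeKey = (name) => name.normalize('NFC');

function addUser(name, record) {
  users.set(normalizeKey(name), record);
}

function findUser(name) {
  return users.get(normalizeKey(name));
}

addUser('Zo\u00eb', { id: 1 }); // stored with a precomposed ë

// Typed as e + combining diaeresis: the normalized lookup still finds the record.
console.log(findUser('Zoe\u0308').id); // => 1

// A raw lookup with the decomposed spelling misses the record.
console.log(users.get('Zoe\u0308')); // => undefined
```

The point of the design is that normalization happens at the boundary, in one place, so no caller can forget it.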