The bare minimum we need to know about character encoding
After fixing a DOM-based XSS vulnerability, I was made aware that the missing sanitization of user input wasn’t a bug or an oversight, but known and a feature (unfortunately, nobody knew before we pushed an update). By sanitizing the window location hash, I broke the following (simplified) scenario: a user creates an "element" with the title mb點, we turn that title into an id attribute on the element (#prefix-mb%e9%bb%9e) and append it to the URL as a hash, which we then read back from window.location.hash to select the element with that id.
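Here is a minimal sketch of that round trip, assuming the id is simply the percent-encoded title behind a fixed prefix. ID_PREFIX, titleToFragment and fragmentToTitle are names made up for illustration; they are not the code from the actual fix. Note that encodeURIComponent emits uppercase hex digits, while the example above shows lowercase ones; decodeURIComponent accepts both.

```typescript
// Hypothetical prefix, modelled on the "#prefix-..." example above.
const ID_PREFIX = "prefix-";

// Turn a user-supplied title into a URL fragment.
// encodeURIComponent percent-encodes the UTF-8 bytes of non-ASCII characters.
function titleToFragment(title: string): string {
  return "#" + ID_PREFIX + encodeURIComponent(title);
}

// Recover the title from a fragment such as window.location.hash.
// Returns null when the prefix is missing or the percent-encoding is malformed.
function fragmentToTitle(hash: string): string | null {
  const raw = hash.charAt(0) === "#" ? hash.slice(1) : hash;
  if (raw.indexOf(ID_PREFIX) !== 0) {
    return null;
  }
  try {
    return decodeURIComponent(raw.slice(ID_PREFIX.length));
  } catch (_err) {
    // decodeURIComponent throws a URIError on broken sequences like "%e9%bb".
    return null;
  }
}
```

Decoding first and then treating the result strictly as text (comparing it against known titles, or looking an element up by id instead of building HTML or a selector string from it) is one way to keep titles like mb點 working without reopening the XSS hole.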
Somewhere along the way, I came across The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and (re-)learned a lot:
As long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work.
The whole article comes down to: Don’t assume anyone using your software is speaking your language or using the same character set as you are. Digital colonialism sucks balls!
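To make the "same character set" point concrete, here is a small sketch of what happens when the UTF-8 bytes of 點 are read back as if they were ISO-8859-1 (Latin-1). The helpers utf8Bytes and readAsLatin1 are hypothetical names for this illustration only; in real code you would more likely reach for TextEncoder and TextDecoder.

```typescript
// Collect the UTF-8 bytes of a string by parsing encodeURIComponent's output:
// "%XX" escapes are raw bytes, everything else is a single ASCII byte.
function utf8Bytes(text: string): number[] {
  const encoded = encodeURIComponent(text);
  const bytes: number[] = [];
  for (let i = 0; i < encoded.length; i++) {
    if (encoded.charAt(i) === "%") {
      bytes.push(parseInt(encoded.slice(i + 1, i + 3), 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      bytes.push(encoded.charCodeAt(i));
    }
  }
  return bytes;
}

// Misread bytes as ISO-8859-1, where every byte maps to the code point
// with the same number.
function readAsLatin1(bytes: number[]): string {
  return bytes.map((b) => String.fromCharCode(b)).join("");
}

const bytes = utf8Bytes("點");        // three bytes: 0xE9, 0xBB, 0x9E
const garbled = readAsLatin1(bytes);  // three unrelated characters, not "點"
```

One character went in, three unrelated ones came out, and nothing threw an error along the way. That silence is why the assumption is so easy to make.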