The bare minimum we need to know about character encoding

#13722 · April 15, 2021

After fixing a DOM-based XSS vulnerability, I was made aware that the missing sanitization of user input wasn’t a bug or an oversight, but a known feature (unfortunately, nobody knew that before we pushed the update). By sanitizing the window location hash, I broke the following (simplified) scenario: a user creates an "element" with the title mb點; we turn that title into an id attribute on the element, #prefix-mb%e9%bb%9e, and append it to the URL as a hash, which we then read back from window.location.hash to select the element with that id.
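To make the round trip concrete, here is a minimal sketch of how a title could become an id and how a hash could be checked without breaking non-ASCII titles. It is not our actual code: the function names titleToId and normalizeHash are made up for illustration, and only the "prefix-" string comes from the example above. Note that encodeURIComponent emits uppercase hex (%E9%BB%9E) while the id above uses lowercase; percent-decoding treats both the same.

```typescript
// Hypothetical helpers; only the "prefix-" string is taken from the post.
const PREFIX = "prefix-";

// Turn a user-supplied title into an id by percent-encoding its UTF-8 bytes.
function titleToId(title: string): string {
  return PREFIX + encodeURIComponent(title);
}

// Validate a location hash and return a canonical id, or null if the hash
// is not one of ours. Decoding and re-encoding keeps titles like "mb點"
// intact while any markup characters end up percent-encoded.
function normalizeHash(hash: string): string | null {
  const raw = hash.startsWith("#") ? hash.slice(1) : hash;
  if (!raw.startsWith(PREFIX)) {
    return null;
  }
  try {
    const title = decodeURIComponent(raw.slice(PREFIX.length));
    return PREFIX + encodeURIComponent(title);
  } catch {
    // decodeURIComponent throws a URIError on malformed sequences such as "%E9%".
    return null;
  }
}
```

In a browser, the result would then go to document.getElementById rather than being interpolated into a selector string or into HTML, so the decoded user input is never interpreted as markup.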

Somewhere along the way, I came across The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and (re-)learned a lot:

As long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work.

The whole article comes down to: Don’t assume anyone using your software is speaking your language or using the same character set as you are. Digital colonialism sucks balls!