Substring function?

A significant difference from UTF-8 is that UTF-16 has no structurally malformed sequences. The only thing that can go wrong is an unpaired surrogate code unit, and even that still maps to a code point, just not one that corresponds to a valid Unicode character. So every UTF-16 string can be iterated as code points; if the string is invalid, you just get the code point of an unpaired surrogate. Compare that with UTF-8, where some byte sequences don’t follow the right structure at all.
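To make that concrete, here is a minimal Python sketch of iterating raw UTF-16 code units as code points. The function name `utf16_code_points` and the sample input are made up for illustration; the input is assumed to be a list of 16-bit integers rather than any particular language's string type.

```python
def utf16_code_points(units):
    """Yield code points from a sequence of 16-bit UTF-16 code units.

    Hypothetical helper: it never fails, because an unpaired surrogate
    is simply yielded as its own code point.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # High surrogate followed by low surrogate: one supplementary code point.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP code unit or unpaired surrogate: the unit is the code point.
            yield u
            i += 1


# "a", then U+1F600 as a surrogate pair, then a lone high surrogate.
units = [0x0061, 0xD83D, 0xDE00, 0xD800]
print([hex(cp) for cp in utf16_code_points(units)])

# UTF-8 has no equivalent fallback: a lone continuation byte is not
# the encoding of any code point, so decoding raises.
try:
    bytes([0x80]).decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid UTF-8:", err.reason)
```

The loop has no error branch at all, which is the point: every code unit either joins a pair or stands alone.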

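(Another way to put this is that every UTF-16 sequence, valid or invalid, can be represented as a sequence of code points using WTF-8, which is an extension of UTF-8 that also allows unpaired surrogate code points. Properly paired surrogates are still encoded as the single supplementary code point they represent.)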
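As a rough illustration of the WTF-8 idea, the sketch below converts UTF-16 code units to WTF-8-style bytes. The name `utf16_to_wtf8` is hypothetical, and it leans on Python's `surrogatepass` error handler to emit the three-byte form for a lone surrogate; treat it as a sketch under those assumptions, not a reference implementation of the WTF-8 spec.

```python
def utf16_to_wtf8(units):
    """Encode 16-bit UTF-16 code units as WTF-8-style bytes (illustrative only)."""
    out = bytearray()
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # A proper pair is combined first, so it is encoded as one
            # 4-byte sequence, never as two 3-byte surrogate sequences.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            cp = u
            i += 1
        # "surrogatepass" allows a lone surrogate code point to be encoded.
        out += chr(cp).encode("utf-8", "surrogatepass")
    return bytes(out)


# Valid input gives ordinary UTF-8; the trailing lone surrogate still encodes.
print(utf16_to_wtf8([0x0061, 0xD83D, 0xDE00, 0xD800]))
```

For input without unpaired surrogates, the output is byte-for-byte the same as regular UTF-8.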
