Stupid question on Unicode

Julia does allow you to use \uD83D\uDE3F to create a String value with two invalid surrogate code points:

julia> "\uD83D\uDE3F"
"\ud83d\ude3f"

Since the String type is UTF-8 based, that gives you two separate characters, each encoded in a well-formed way but carrying an invalid (surrogate) code point. Specifically, this is an instance of the WTF-8 extension of UTF-8.
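
You can poke at this from the REPL. Roughly (writing this from memory, so the exact printing may differ slightly), something like:

julia> chars = collect("\uD83D\uDE3F");   # two Char values, one per surrogate

julia> length(chars), any(isvalid, chars)
(2, false)

julia> ncodeunits("\uD83D\uDE3F")   # each surrogate takes three WTF-8 bytes
6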

This is different from what would happen if you constructed a UTF-16 string from the same two surrogates in the right order: there you'd get a correctly encoded :crying_cat_face: (U+1F63F) character, since a surrogate pair is exactly how UTF-16 represents characters outside the basic plane. It's similar to how you can form a correctly encoded UTF-8 character by stringing together the right sequence of code units, each of which would independently be invalid:

julia> "\xe2\x88\x80"
"∀"

But these “tricks” for writing valid characters as a sequence of invalid code units are inherently tied to a given encoding, which I suspect is what @stevengj means by a leaky abstraction. The benefit of allowing invalid strings to be written directly in terms of individual bytes is that you can represent arbitrary data that is mostly string-like, which is often quite useful (see the sketch below). It’s much more useful with a UTF-8 based string type than with a UTF-16 based one, since UTF-8’s code unit is the individual byte whereas UTF-16’s code unit is a two-byte pair.
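
As a small illustration of that last point (all of these calls are in Base, though I haven't double-checked the exact REPL display), a String can carry arbitrary bytes and you can still get at them through codeunits:

julia> s = "ab\xffcd"   # \xff is not valid UTF-8, but this is still a String
"ab\xffcd"

julia> isvalid(s)
false

julia> codeunits(s)[3]
0xff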
