Julia does allow you to use \uD83D\uDE3F to create a String value with two invalid surrogate code points:
julia> "\uD83D\uDE3F"
"\ud83d\ude3f"
Since the String type is UTF-8 based, the result is two separate characters, each of which is a well-formed byte sequence but invalid because it encodes a surrogate code point. Specifically, this is an instance of the WTF-8 extension of UTF-8.
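To make that concrete, here is a minimal sketch of how one could inspect such a string in a script or at the REPL. The variable names are only for illustration, and Base.ismalformed is an unexported Base function, so its use here assumes current Julia 1.x behavior rather than stable API.

```julia
# Illustrative sketch: inspect the string built from two escaped surrogates.
s = "\uD83D\uDE3F"

# Each surrogate is stored as its own 3-byte sequence (6 code units total).
units = collect(codeunits(s))   # UInt8[0xed, 0xa0, 0xbd, 0xed, 0xb8, 0xbf]

# The string iterates as two characters, not one.
nchars = length(s)              # 2

# Neither the string nor its characters count as valid Unicode...
string_valid = isvalid(s)                  # false
chars_valid = [isvalid(c) for c in s]      # [false, false]

# ...yet the characters are not malformed byte sequences either.
# (Base.ismalformed is unexported; assumed available as in Julia 1.x.)
chars_malformed = [Base.ismalformed(c) for c in s]   # [false, false]
```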
This is different from what would happen if you constructed a UTF-16 string from the same two surrogates in the right order: there the pair would form a single correctly encoded character. This is similar to how you can form a correctly encoded UTF-8 character by stringing together the right sequence of code units, each of which would be invalid on its own:
julia> "\xe2\x88\x80"
"∀"
But these “tricks” for writing valid characters as a sequence of invalid code units are inherently tied to a given encoding, which I suspect is what @stevengj means by a leaky abstraction. The benefit of allowing invalid strings to be written directly in terms of individual bytes is that you can write arbitrary data that is mostly string-like, which is often quite useful. It’s much more useful with a UTF-8 based string type than with a UTF-16 based one, since UTF-8 can work with individual bytes whereas the code unit of UTF-16 is a byte pair.
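Here is a rough sketch of both points, assuming a recent Julia 1.x. The variable names and the sample `record` string are made up for illustration; transcode is the Base function for converting between Unicode encodings, used here to build a String from UTF-16 code units for comparison.

```julia
# 1. UTF-8 "trick": three bytes that are each invalid alone form "∀" together.
parts = ["\xe2", "\x88", "\x80"]
parts_valid = [isvalid(p) for p in parts]      # [false, false, false]
joined = join(parts)
joined_ok = joined == "∀" && isvalid(joined)   # true

# 2. UTF-16 "trick": the surrogate pair decodes to one character,
#    which is not the same String as the two escaped surrogates.
paired = transcode(String, UInt16[0xd83d, 0xde3f])   # "😿"
escaped = "\uD83D\uDE3F"
same = paired == escaped                             # false

# 3. Mostly-text data with stray bytes still works as a String.
record = "name=\xff\xfe;value"                 # hypothetical sample data
record_valid = isvalid(record)                 # false
has_prefix = startswith(record, "name=")       # true
roundtrip = String(collect(codeunits(record))) == record   # true
text_only = filter(isvalid, record)            # "name=;value"
```

The first two parts only work because of how each encoding lays out its code units, while the third relies on nothing more than String being a sequence of bytes.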