Stupid question on Unicode

Julia does allow you to use \uD83D\uDE3F to create a String value with two invalid surrogate code points:

julia> "\uD83D\uDE3F"
"\ud83d\ude3f"

Since the String type is UTF-8 based, that gives you two separate characters, each encoded in a well-formed way but carrying an invalid (surrogate) code point. Specifically, this is an instance of the WTF-8 extension of UTF-8.
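
You can poke at this from the REPL. Roughly (writing this from memory, so the exact printing may differ slightly), something like:

julia> chars = collect("\uD83D\uDE3F");   # two Char values, one per surrogate

julia> length(chars), any(isvalid, chars)
(2, false)

julia> ncodeunits("\uD83D\uDE3F")   # each surrogate takes three WTF-8 bytes
6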

This is different from what would happen if you constructed a UTF-16 string from the same two surrogates in the right order: there you'd get a correctly encoded :crying_cat_face: (U+1F63F) character, since a surrogate pair is exactly how UTF-16 represents characters outside the basic plane. It's similar to how you can form a correctly encoded UTF-8 character by stringing together the right sequence of code units, each of which would independently be invalid:

julia> "\xe2\x88\x80"
"∀"

But these “tricks” for writing valid characters as a sequence of invalid code units are inherently tied to a given encoding, which I suspect is what @stevengj means by a leaky abstraction. The benefit of allowing invalid strings to be written directly in terms of individual bytes is that you can represent arbitrary data that is mostly string-like, which is often quite useful (see the sketch below). It’s much more useful with a UTF-8 based string type than with a UTF-16 based one, since UTF-8’s code unit is the individual byte whereas UTF-16’s code unit is a two-byte pair.
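
As a small illustration of that last point (all of these calls are in Base, though I haven't double-checked the exact REPL display), a String can carry arbitrary bytes and you can still get at them through codeunits:

julia> s = "ab\xffcd"   # \xff is not valid UTF-8, but this is still a String
"ab\xffcd"

julia> isvalid(s)
false

julia> codeunits(s)[3]
0xff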
