How does Char get stored?

I was under the impression that, for 4-byte UTF-8 characters,
Char stored a UInt32 that was the code point containing all 4 code units.

This is clearly not the case, as reinterpret disagrees with codepoint:

julia> '🍰'
'🍰': Unicode U+1F370 (category So: Symbol, other)

julia> codepoint('🍰')
0x0001f370

julia> reinterpret(UInt32, '🍰')
0xf09f8db0

julia> Char(0x0001f370)
'🍰': Unicode U+1F370 (category So: Symbol, other)

julia> reinterpret(Char, 0x0001f370)
'\x00\x01\xf3\x70': Malformed UTF-8 (category Ma: Malformed, bad data)

julia> reinterpret(Char,  0xf09f8db0)
'🍰': Unicode U+1F370 (category So: Symbol, other)

julia> Char(0xf09f8db0)
ERROR: Base.CodePointError{UInt32}(0xf09f8db0)

It seems like Char actually uses UTF-8 encoding as well, just padded with zeros, so it’s always 4 bytes:

julia> codeunits("🍰")
4-element Base.CodeUnits{UInt8,String}:
 0xf0
 0x9f
 0x8d
 0xb0

julia> reinterpret(Char, reverse(collect(ans)))
1-element reinterpret(Char, ::Array{UInt8,1}):
 '🍰': Unicode U+1F370 (category So: Symbol, other)

I didn’t know that either and thought it just stored the value of the code point directly. My guess would be that it’s done this way because it makes converting to a string and back more efficient.
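To see the padding concretely, here is a small sketch that prints the raw bits of characters whose UTF-8 encodings are 1 to 4 bytes long. The helper name `rawbits` is made up for this example; `reinterpret`, `ncodeunits` and `string` are from Base.

```julia
# Hypothetical helper (not part of Base): view the 32 bits stored inside a Char.
rawbits(c::Char) = reinterpret(UInt32, c)

for c in ('a', 'é', '€', '🍰')
    # pad=8 prints all four bytes, including the zero padding
    hex = string(rawbits(c); base=16, pad=8)
    println(repr(c), "  ", ncodeunits(c), " byte(s)  0x", hex)
end
```

The UTF-8 bytes sit at the high-order end of the UInt32 and the unused low-order bytes are zero.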


I’m amazed it has taken so long for anyone to discover my ruse! There are two reasons for this representation of characters…

First, we can often avoid UTF-8 decoding entirely. If all you need to do is read a character and compare it with other characters, and maybe print it back out, that can all be done without decoding. Why pay the cost of all that bit twiddling unnecessarily? Especially since it’s paid in both directions if you’re both reading and writing the characters. All you really need to do is identify how long the character is and pass its bytes around.
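A rough illustration of that first point, under the assumption (my reading of Base, not something stated above) that Char comparison works on the raw bits: since UTF-8 byte order agrees with code point order, comparing the undecoded UInt32 values gives the same answer as comparing the characters, and writing a character just copies its bytes.

```julia
a, b = 'é', '🍰'

# Comparing the raw bits gives the same result as comparing the Chars.
@show a < b
@show reinterpret(UInt32, a) < reinterpret(UInt32, b)

# Writing a character moves its bytes; `write` returns how many were written.
buf = IOBuffer()
nbytes = write(buf, b)
@show nbytes
@show ncodeunits(b)
@show String(take!(buf)) == "🍰"
```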

Second, this allows invalid UTF-8 data to be processed as characters. Suppose you have written a collect_string function that does something like this:

function collect_string(src::IO)
    n = 0
    str = sprint() do dst
        for c in readeach(src, Char)
            write(dst, c)
            n += 1
        end
    end
    return n, str
end

This collects the contents of an IO stream, character by character, into a string and returns the number of characters along with the collected string:

julia> open("hello.txt", write=true) do io
           println(io, "Hello, world!")
       end

julia> open(collect_string, "hello.txt")
(14, "Hello, world!\n")

But what happens if there’s some invalid UTF-8 data in there? Let’s see:

julia> open("hello.txt", write=true) do io
           name = String(rand(UInt8, 12))
           println(io, "Hello, $(name)!")
       end

julia> read("hello.txt", String)
"Hello, N\x15e=c\xf4O\x06\xc0?(\t!\n"

julia> open(collect_string, "hello.txt")
(21, "Hello, N\x15e=c\xf4O\x06\xc0?(\t!\n")

It works just fine. How? How does invalid UTF-8 data get iterated as characters and written back out so that it ends up being identical? Let’s try it:

julia> str = String(rand(UInt8, 12))
"\bU\xbbgX\x81i\xaa\xec\x83t\x95"

julia> collect(str)
11-element Vector{Char}:
 '\b': ASCII/Unicode U+0008 (category Cc: Other, control)
 'U': ASCII/Unicode U+0055 (category Lu: Letter, uppercase)
 '\xbb': Malformed UTF-8 (category Ma: Malformed, bad data)
 'g': ASCII/Unicode U+0067 (category Ll: Letter, lowercase)
 'X': ASCII/Unicode U+0058 (category Lu: Letter, uppercase)
 '\x81': Malformed UTF-8 (category Ma: Malformed, bad data)
 'i': ASCII/Unicode U+0069 (category Ll: Letter, lowercase)
 '\xaa': Malformed UTF-8 (category Ma: Malformed, bad data)
 '\xec\x83': Malformed UTF-8 (category Ma: Malformed, bad data)
 't': ASCII/Unicode U+0074 (category Ll: Letter, lowercase)
 '\x95': Malformed UTF-8 (category Ma: Malformed, bad data)

Here you can see that we generate a random 12-byte string whose contents are 11 characters, a mix of valid and invalid ones. If we represented characters as code points, we would not be able to represent these and the collect_string function wouldn’t work. By representing characters as a sequence of 1-4 UTF-8-like bytes, we can represent both valid and invalid character sequences, even completely malformed ones like those we see here, which have no corresponding code point.
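Since the string above is random, here is a reproducible sketch that spells out the same 12 bytes by hand and checks that iterating and re-joining the characters gives back the identical bytes. The variable names are just for illustration.

```julia
# The same 12 bytes as the random string above, written out explicitly.
bytes = UInt8[0x08, 0x55, 0xbb, 0x67, 0x58, 0x81, 0x69, 0xaa, 0xec, 0x83, 0x74, 0x95]
str = String(copy(bytes))   # String() takes ownership of its argument, so pass a copy

chars = collect(str)        # valid and invalid Chars mixed together
roundtrip = join(chars)     # writes each Char's bytes back out

@show length(chars)
@show count(isvalid, chars)
@show codeunits(roundtrip) == bytes
```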

Aside: the one two-byte character in that string is a malformed one: '\xec\x83' (not a valid char literal). It consists of a valid leading byte followed by a valid continuation byte, but the next byte is 't', which is an ASCII character and not a valid continuation byte, so the sequence is malformed. The decoding of invalid characters is specified in the Unicode standard: you try to decode a well-formed sequence of code units; if you find a well-formed sequence, that’s the character (it may still be invalid if it is, say, an unpaired surrogate or a too-high encoding); if you reach a code unit such that the sequence cannot be well-formed, the invalid character consists of all the code units before that one.
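A small sketch of the two situations described in the aside. Note that `Base.ismalformed` is an unexported Base function, so treat its use here as an assumption about internals rather than stable API; the variable names are made up.

```julia
# Leading byte 0xec announces a 3-byte sequence, but 't' interrupts it
# after one continuation byte, so the character is the first two bytes only.
truncated = first("\xec\x83t")
@show ncodeunits(truncated)
@show Base.ismalformed(truncated)
@show isvalid(truncated)

# ED A0 80 has the shape of a complete 3-byte sequence, but it encodes a
# surrogate (U+D800): not malformed, yet still not a valid character.
surrogate = first("\xed\xa0\x80")
@show ncodeunits(surrogate)
@show Base.ismalformed(surrogate)
@show isvalid(surrogate)
```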

Back to the less weedy weeds… As you might expect, an invalid character is equal to itself and unequal to valid characters:

julia> good = '\uff'
'ÿ': Unicode U+00FF (category Ll: Letter, lowercase)

julia> good == good
true

julia> good == 'x'
false

julia> bad = "\xff"[1]
'\xff': Malformed UTF-8 (category Ma: Malformed, bad data)

julia> bad == bad
true

julia> bad == good
false

In fact, the only problem when handling invalid characters is if you explicitly need to know their code point value for some reason:

julia> codepoint(good)
0x000000ff

julia> codepoint(bad)
ERROR: Base.InvalidCharError{Char}('\xff')

If you need to look at code points and you want your program to be robust against invalid UTF-8 data, then you must check if a character is valid before taking its code point:

julia> isvalid(good)
true

julia> isvalid(bad)
false
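One way to package that check, as a minimal sketch: the helper name `safe_codepoint` is hypothetical, and returning `nothing` for invalid characters is just one possible convention.

```julia
# Hypothetical helper: the code point for valid characters, `nothing` otherwise.
safe_codepoint(c::AbstractChar) = isvalid(c) ? codepoint(c) : nothing

good = '\uff'
bad = "\xff"[1]

@show safe_codepoint(good)
@show safe_codepoint(bad)
```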

This design was a bit of a gamble and I had some longish discussions with Jeff about whether it was a good idea or not, in the lead up to the 1.0 release. But it’s panned out quite well. The proof is that:

  1. Almost no one ever notices that Char isn’t represented as a code point;
  2. Julia “just works” when processing strings and characters, even with stray invalid UTF-8.

The fact that the representation of Char is mostly irrelevant to people was already demonstrated by the fact that hardly any code broke when we changed the representation for 1.0, but it’s nice that it’s so rare for anyone to notice.


Oh, other fun (and potentially useful) things you can do thanks to this design:

julia> str = String(rand(UInt8, 100))
"\x87\xbe\xb3Z'\x94\xa1\xad\xec\xd4.X\xba\xbc\x8d\x0f\xc8\xee2⒆\x1c\x15\x0e1H1|\xabU\x82Ik\xc6'\xedj\x9c\x96\xa9\xaf\x8e%QV0\x05\xc0\xa3\xf6\xdf\xd3@('D\xc1Ի\xe74\xb7\xcb\xfc>\xac\xd1'ݫ\xe2\xc9/*\xbd\xe6c0A\x7f\xd8\x06\x02u\xbf\x04n\xe9\x11u\x16\x13l\x15y\xaa\xd5\xeca"

julia> filter(isvalid, str)
"Z'.X\x0f2⒆\x1c\x15\x0e1H1|UIk'j%QV0\x05@('DԻ4>'ݫ/*c0A\x7f\x06\x02u\x04n\x11u\x16\x13l\x15ya"

It’s simple and obvious what this does, right? But it only works because we can represent invalid characters!
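In the same spirit, a sketch of a variation: instead of dropping invalid characters, replace each one with U+FFFD (the Unicode replacement character). The name `sanitize` is made up for this example.

```julia
# Hypothetical helper: swap every invalid character for U+FFFD.
# `map` over a string returns a String when the function returns characters.
sanitize(s::AbstractString) = map(c -> isvalid(c) ? c : '\ufffd', s)

@show sanitize("a\xffb")
@show filter(isvalid, "a\xffb")
```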


OK, it seems like what Julia does is the most sensible thing (other than being backwards?), since you can just reinterpret Chars out of chunks of Strings (as long as you reverse the bytes?).

julia> raw = unsafe_load(Ptr{UInt32}(pointer("🍰")))
0xb08d9ff0

julia> reinterpret(Char, ntoh(raw))
'🍰': Unicode U+1F370 (category So: Symbol, other)

I was assuming that this was also what a code point was.
Why would a UInt32 code point not be equal to the sequence of code units that need to be put into the string to insert this character?
What is a code point?
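To make the distinction concrete: the code point is the number Unicode assigns to the character (U+1F370 here), while the code units are the bytes UTF-8 uses to carry that number. A hand-rolled sketch of the 4-byte case, stripping the UTF-8 marker bits and gluing the payload bits together (illustration only; `codepoint` does this for you):

```julia
units = codeunits("🍰")   # the four UTF-8 code units: f0 9f 8d b0

# Layout of a 4-byte sequence: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
# Keep only the x bits of each byte and concatenate them.
cp = (UInt32(units[1] & 0x07) << 18) |
     (UInt32(units[2] & 0x3f) << 12) |
     (UInt32(units[3] & 0x3f) << 6)  |
      UInt32(units[4] & 0x3f)

@show cp
@show cp == codepoint('🍰')
```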

Not the way you did it; you just got lucky by picking a character whose UTF-8 encoding is exactly 4 bytes. Here it is with my name (yes, it would work with an individual Char):

julia> raw = unsafe_load(Ptr{UInt32}(pointer("Páll")));

julia> reinterpret(Char, ntoh(raw))
'\x50\xc3\xa1\x6c': Malformed UTF-8 (category Ma: Malformed, bad data)

Also, the byte reversal is a big- vs. little-endian issue, and it’s dangerous to assume one or the other (your code will not be portable). The same goes for assuming that your platform supports unaligned loads: most do nowadays, even ARM (which historically didn’t). Since UTF-8 is byte-based, none of this matters when you do the loading correctly, as Base does.
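For illustration, a sketch of building a Char from the code units of exactly one character without any pointer loads or byte swapping. The helper `char_from_units` is hypothetical; in real code you would just index or iterate the string.

```julia
# Hypothetical helper: pack the 1-4 code units of ONE character into a Char.
function char_from_units(units)
    1 <= length(units) <= 4 || throw(ArgumentError("expected 1 to 4 code units"))
    u = UInt32(0)
    for (k, b) in enumerate(units)
        u |= UInt32(b) << (32 - 8k)   # first byte goes into the highest-order position
    end
    return reinterpret(Char, u)
end

@show char_from_units(codeunits("á"))
@show char_from_units(codeunits("🍰"))
@show first("Páll")   # in practice, let Base do the work
```

Because this works byte by byte, it does not depend on the host’s endianness or on unaligned loads.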


[This is mostly trivia, except for those implementing new string types.]

It’s good to know how Char is implemented; in my opinion it’s very important to support invalid byte sequences (e.g. in UTF-8) too. Julia supports all important text encodings to some degree.* But it seems this representation wouldn’t work for an encoding I designed (not yet implemented!), UTF-4, since that is nibble-based rather than byte-based (and that’s OK).

Actually, I abandoned that plan for a better, byte-based one that would also compress text by half. The most common letters (in English) number fewer than 16, or even 8, which is why extending the UTF-8 idea down to 4-bit code units could have worked.

* By important encodings I mean: GB18030, standardized in China and supporting all Unicode letters; some other, less capable Asian encodings; and legacy 8-bit ones, e.g. for Russian and Greek. UTF-16, which I want to see die, is supported by a package (and to some degree by Julia):

I see e.g. GB18030 is supported by https://github.com/JuliaStrings/StringEncodings.jl
but that library converts to (or from) UTF-8, so Char just works, on the converted stuff of course.

If you iterate over UTF-16 (or another encoding, even an 8-bit one) without taking the encoding into account, you will get garbage-looking Chars, but writing them back out still gives you the same byte sequence.
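A quick sketch of that, using "Pá" encoded as UTF-16LE. The bytes are written out by hand (my own encoding of the example, so the snippet does not depend on host endianness):

```julia
# "Pá" in UTF-16LE: U+0050 -> 50 00, U+00E1 -> e1 00
utf16le = UInt8[0x50, 0x00, 0xe1, 0x00]
s = String(copy(utf16le))   # treated as if it were UTF-8

chars = collect(s)          # garbage-looking: NULs and a malformed character
@show chars
@show count(isvalid, chars)
@show codeunits(join(chars)) == utf16le   # but the bytes survive the round trip
```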

But if you do take the encoding into account (not just by converting the whole string, but by trying to support Char directly), it’s unclear to me whether you can use it with Char. A similar Char-like type <: AbstractChar could work, and I believe some other packages take that approach; it’s just unclear to me how well it handles illegal sequences. Julia’s Char could be made to work for legal sequences of e.g. UTF-16, but I think not for illegal ones.


Right, the currently legal UTF-8, i.e. up to the maximum 4-byte UTF-8 sequence.

Some trivia: UTF-8 used to allow sequences of up to 6 bytes, back when UTF-32 had a maximum value of 31 bits; it now has a 21-bit maximum. Either way, Julia’s Char could have supported UTF-32, while as implemented it can’t hold the now-illegal 5- and 6-byte sequences in one Char. It can, and will, decode those illegal sequences as more than one Char.
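A sketch of that last point, using a 5-byte sequence in the old layout (lead byte 0xf8 followed by four continuation bytes). The particular bytes are my own example:

```julia
# A 5-byte sequence that was legal in the original UTF-8 design but is not today.
old5 = UInt8[0xf8, 0x88, 0x80, 0x80, 0x80]
s = String(copy(old5))

chars = collect(s)
@show length(chars)                      # more than one Char
@show any(isvalid, chars)                # none of them is a valid character
@show codeunits(join(chars)) == old5     # the bytes still round-trip
```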
