How does Char get stored?

I’m amazed it has taken so long for anyone to discover my ruse! There are two reasons for this representation of characters…

First, we can often avoid UTF-8 decoding entirely. If all you need to do is read a character and compare it with other characters, and maybe print it back out, that can all be done without decoding. Why pay the cost of all that bit twiddling unnecessarily? Especially since it’s paid in both directions if you’re both reading and writing the characters. All you really need to do is identify how long the character is and pass its bytes around.
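
To make that a bit more concrete, here is a rough sketch of what a valid Char looks like under the hood, assuming Julia 1.1 or later. The `rawbits` helper is a made-up name for illustration only; it just uses `reinterpret` to peek at the 32 bits of a Char, which is not something ordinary code needs to do.

```julia
# Hypothetical helper (not part of Base): view a Char's 32 bits as a UInt32.
rawbits(c::Char) = reinterpret(UInt32, c)

# 'a' is one UTF-8 byte (0x61), stored in the highest byte of the word.
println(repr(rawbits('a')))

# '\uff' (ÿ) is two UTF-8 bytes (0xc3 0xbf), left-aligned in the word.
println(repr(rawbits('\uff')))

# The encoded length in bytes is available without decoding a code point.
println(ncodeunits('\uff'))

# Decoding to the code point U+00FF only happens when you ask for it.
println(repr(codepoint('\uff')))
```

Reading, comparing and writing characters only ever touch those raw bytes; the bit twiddling is deferred to calls like `codepoint`.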

Second, this allows invalid UTF-8 data to be processed as characters. Suppose you have written a collect_string function that does something like this:

function collect_string(src::IO)
    n = 0
    str = sprint() do dst
        for c in readeach(src, Char)
            write(dst, c)
            n += 1
        end
    end
    return n, str
end

This collects the contents of an IO stream as a sequence of characters into a string and returns the number of characters as well as the collected string:

julia> open("hello.txt", write=true) do io
           println(io, "Hello, world!")
       end

julia> open(collect_string, "hello.txt")
(14, "Hello, world!\n")

But what happens if there’s some invalid UTF-8 data in there? Let’s see:

julia> open("hello.txt", write=true) do io
           name = String(rand(UInt8, 12))
           println(io, "Hello, $(name)!")
       end

julia> read("hello.txt", String)
"Hello, N\x15e=c\xf4O\x06\xc0?(\t!\n"

julia> open(collect_string, "hello.txt")
(21, "Hello, N\x15e=c\xf4O\x06\xc0?(\t!\n")

It works just fine. How? How does invalid UTF-8 data get iterated as characters and written back out so that it ends up being identical? Let’s try it:

julia> str = String(rand(UInt8, 12))
"\bU\xbbgX\x81i\xaa\xec\x83t\x95"

julia> collect(str)
11-element Vector{Char}:
 '\b': ASCII/Unicode U+0008 (category Cc: Other, control)
 'U': ASCII/Unicode U+0055 (category Lu: Letter, uppercase)
 '\xbb': Malformed UTF-8 (category Ma: Malformed, bad data)
 'g': ASCII/Unicode U+0067 (category Ll: Letter, lowercase)
 'X': ASCII/Unicode U+0058 (category Lu: Letter, uppercase)
 '\x81': Malformed UTF-8 (category Ma: Malformed, bad data)
 'i': ASCII/Unicode U+0069 (category Ll: Letter, lowercase)
 '\xaa': Malformed UTF-8 (category Ma: Malformed, bad data)
 '\xec\x83': Malformed UTF-8 (category Ma: Malformed, bad data)
 't': ASCII/Unicode U+0074 (category Ll: Letter, lowercase)
 '\x95': Malformed UTF-8 (category Ma: Malformed, bad data)

Here you can see that we generate a random 12-byte string whose contents are 11 characters, a mix of valid and invalid ones. If we represented characters as code points, we would not be able to represent these and the collect_string function wouldn’t work. By representing characters as a sequence of 1-4 UTF-8-like bytes, we can represent both valid and invalid character sequences, even completely malformed ones like we see here that have no corresponding code point.
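
As a quick sanity check of that claim, the sketch below reuses the 12-byte string from the example and confirms that splitting it into characters and joining them again gives back exactly the same bytes. It also peeks at the bits of a malformed Char with `reinterpret`, which is for illustration only; `Base.ismalformed` is an unexported Base function.

```julia
# A lone invalid byte, obtained by indexing into a string.
bad = "\xff"[1]
println(repr(reinterpret(UInt32, bad)))  # the raw byte sits in the top byte
println(Base.ismalformed(bad))

# The 12-byte string from the example above.
str = "\bU\xbbgX\x81i\xaa\xec\x83t\x95"
chars = collect(str)

println(ncodeunits(str))     # number of bytes
println(length(chars))       # number of characters
println(join(chars) == str)  # the round trip preserves every byte
```

Because each Char carries its original bytes, writing it back out reproduces the input whether or not those bytes were valid UTF-8.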

Aside: the one two-byte character in that string is a malformed one: '\xec\x83' (not a valid char literal). It consists of a valid leading byte (0xec starts a three-byte sequence) followed by a valid continuation byte, but that is then followed by 't', which is an ASCII character and not a valid continuation byte, so the sequence is malformed. The decoding of invalid characters is specified in the Unicode standard: you try to decode a well-formed sequence of code units; if you find a well-formed sequence, that’s the character (it may still be invalid if it is, say, an unpaired surrogate or a too-high encoding); if you reach a code unit such that the sequence cannot be well-formed, the invalid character consists of all the code units before that one.
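
Here is a small sketch of that splitting rule in action, using a string with an overlong encoding, two truncated sequences and one ordinary ASCII character. `Base.ismalformed` and `Base.isoverlong` are unexported Base functions; the variable names are just for illustration.

```julia
# An overlong encoding, a truncated 3-byte sequence, a lone lead byte, and '|'.
s = "\xc0\xa0\xe2\x88\xe2|"

for c in s
    println(repr(c),
            "  bytes=", ncodeunits(c),
            "  malformed=", Base.ismalformed(c),
            "  overlong=", Base.isoverlong(c),
            "  valid=", isvalid(c))
end

chars = collect(s)  # kept around so the pieces can be inspected individually
```

The first character is well formed as a byte sequence but still invalid because it is overlong; the next two are malformed because the sequence is cut short before it can be completed.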

Back to the less weedy weeds… As you might expect, an invalid character is equal to itself and unequal to valid characters:

julia> good = '\uff'
'ÿ': Unicode U+00FF (category Ll: Letter, lowercase)

julia> good == good
true

julia> good == 'x'
false

julia> bad = "\xff"[1]
'\xff': Malformed UTF-8 (category Ma: Malformed, bad data)

julia> bad == bad
true

julia> bad == good
false

In fact, the only problem when handling invalid characters is if you explicitly need to know their code point value for some reason:

julia> codepoint(good)
0x000000ff

julia> codepoint(bad)
ERROR: Base.InvalidCharError{Char}('\xff')

If you need to look at code points and you want your program to be robust against invalid UTF-8 data, then you must check if a character is valid before taking its code point:

julia> isvalid(good)
true

julia> isvalid(bad)
false
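
One way to package that check, as a minimal sketch: the helper below returns `nothing` instead of throwing when a character has no usable code point. `codepoint_or_nothing` is a hypothetical name, not something in Base.

```julia
# Hypothetical helper: the code point of `c`, or `nothing` if `c` is not
# valid Unicode according to `isvalid`.
codepoint_or_nothing(c::Char) = isvalid(c) ? codepoint(c) : nothing

# Walk a string containing one stray invalid byte without ever throwing.
for c in "a\xffz"
    cp = codepoint_or_nothing(c)
    label = cp === nothing ? "no code point" : "U+" * uppercase(string(cp, base=16, pad=4))
    println(repr(c), " => ", label)
end
```

Returning `nothing` pushes the decision about what to do with bad data to the caller, which fits the idea that most code never needs the code point at all.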

This design was a bit of a gamble, and I had some longish discussions with Jeff in the lead-up to the 1.0 release about whether it was a good idea or not. But it’s panned out quite well. The proof is that:

  1. Almost no one ever notices that Char isn’t represented as a code point;
  2. Julia “just works” when processing strings and characters, even with stray invalid UTF-8.

The fact that the representation of Char is mostly irrelevant to people was already demonstrated by the fact that hardly any code broke when we changed the representation for 1.0, but it’s nice that it’s so rare for anyone to notice.
