Changes to the representation of Char

Here’s the new replacement PR for the one linked above:

https://github.com/JuliaLang/julia/pull/24999

This turned into a bigger overhaul than expected – there were a lot of conceptual inconsistencies that had crept into the string code over years of various people with slightly different mental models working on it. Changing the character representation itself wasn’t such a big deal, but rewriting all the low-level functions and having everything work required some rethinking of what the fundamental API that a string type has to provide should be. The core methods that a string type must provide are now:

  • ncodeunits(s::AbstractString) – the number of code units
  • codeunit(s::AbstractString) – the code unit type of a string
  • codeunit(s::AbstractString, i::Integer) – extracting a single code unit
  • isvalid(s::AbstractString, i::Integer) – is an index the start of a character
  • next(s::AbstractString, i::Integer) – string iteration

Everything else has pretty efficient generic definitions in terms of these. (Except for string reversal, which we don’t currently have an efficient way to express generically, but that can be added later.)
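
As a rough illustration of how small that core API is, here is a minimal sketch of a custom string type. The `WrappedString` type is hypothetical and simply forwards to an inner `String`. The sketch also assumes a Julia 1.x release, where `iterate` took over the role of the `next` function named in the list above.

```julia
# Hypothetical wrapper type, for illustration only: it forwards the core
# string API to an inner String.
struct WrappedString <: AbstractString
    data::String
end

# Number of code units and the code unit type.
Base.ncodeunits(s::WrappedString) = ncodeunits(s.data)
Base.codeunit(s::WrappedString) = codeunit(s.data)

# Extract a single code unit.
Base.codeunit(s::WrappedString, i::Integer) = codeunit(s.data, i)

# Is index i the start of a character?
Base.isvalid(s::WrappedString, i::Integer) = isvalid(s.data, i)

# String iteration (assumption: Julia 1.x spelling of the post's `next`).
Base.iterate(s::WrappedString, i::Integer=1) = iterate(s.data, i)

w = WrappedString("héllo")
ncodeunits(w)   # 6, since 'é' takes two code units
length(w)       # 5, computed by the generic fallback
```

Functions such as `length`, `collect`, `==`, and `uppercase` are not defined for `WrappedString` here; they come from the generic `AbstractString` definitions.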

I changed the representation of Char so that it now holds the bytes of data from the underlying String object. Iterating a String works whether or not it contains valid UTF-8 data: the data is broken into chunks of 1-4 bytes, each stored in a 32-bit Char value, and each chunk is either:

  • well-formed UTF-8 characters
  • a sequence of bytes that Unicode 10 would replace with a single replacement character

For the latter case, the Char value captures the underlying bytes instead of replacing them, which allows code to operate on malformed values, choosing to ignore them, replace them, or pass them through as is. You only get an error for malformed UTF-8 in a String if you try to convert a malformed Char value to an integer code point. At that point there’s no other choice, because the character data is malformed and there is no correct value to return.
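
The following sketch shows the three options on a string containing one invalid byte. It assumes a Julia 1.x release; the variable names are made up for illustration, and `Base.InvalidCharError` is the exception type I would expect current releases to throw, not something stated in this post.

```julia
# A String literal containing the invalid byte 0xff between two ASCII letters.
s = "a\xffb"

chars = collect(s)   # three Char values; the middle one holds the raw byte
bad = chars[2]

# Pass the data through unchanged: the raw bytes round-trip.
passthrough = join(chars)

# Replace malformed characters with U+FFFD.
replaced = map(c -> isvalid(c) ? c : '\ufffd', s)

# Ignore malformed characters entirely.
ignored = filter(isvalid, s)

# Only converting the malformed Char to a code point raises an error.
threw = try
    UInt32(bad)
    false
catch err
    err isa Base.InvalidCharError
end
```

Nothing fails until the final conversion; iteration, comparison, and writing the characters back out all work on the malformed value.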

The phrase “well-formed UTF-8” is more lenient than what Unicode defines as “valid UTF-8”. It includes anything with the basic structure of UTF-8: ASCII values or a lead byte (with 2-4 leading one bits) followed by enough continuation bytes (with 1 leading one bit). Any such sequence can be decoded to a code point value, and that’s what is returned when you do Int(c). This allows decoding non-standard UTF-8 schemes like Modified UTF-8 and WTF-8, which commonly occur in practice. So if you want to check whether a character is valid UTF-8, you could check that it is well-formed, which guarantees that converting it to an integer will not fail, and then check that converting it back to a character gives you the same character value. Of course, there’s an isvalid(c) predicate that checks this more efficiently, but you get the point.
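
Here is a small sketch of that round-trip check, assuming a Julia 1.x release. The `roundtrips` helper is hypothetical, and `Base.ismalformed` and `Base.isoverlong` are unexported helpers in current releases rather than anything introduced in this post. The input is an overlong two-byte encoding of a space, followed by two truncated sequences and an ASCII character.

```julia
# Hypothetical helper: well-formed, and decoding then re-encoding gives the
# same Char back.
roundtrips(c::Char) = !Base.ismalformed(c) && Char(UInt32(c)) == c

# Overlong space (0xc0 0xa0), truncated sequences (0xe2 0x88 and 0xe2), then '|'.
s = "\xc0\xa0\xe2\x88\xe2|"

for c in s
    println(repr(c),
            "  malformed=", Base.ismalformed(c),
            "  roundtrips=", roundtrips(c),
            "  isvalid=", isvalid(c))
end
```

The overlong character is well-formed, so it decodes to a code point, but re-encoding that code point gives the ordinary one-byte space, and the comparison fails. In real code, isvalid(c) is the check to use.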

The base string functions no longer try to combine CESU-8 surrogate pairs and instead will yield a separate Char value for each surrogate in a pair. You can detect this and combine them easily, but combining them automatically would violate the “give the programmer what’s actually there” maxim that the new approach takes. Higher-level functionality can easily be provided for normalizing and “fixing” strange/broken UTF-8 data.
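
As one possible sketch of such higher-level functionality, the function below combines CESU-8 surrogate pairs into single characters and leaves everything else untouched. The names `combine_surrogates`, `is_high`, and `is_low` are hypothetical, and the code assumes a Julia 1.x release; it is not part of the PR.

```julia
# Hypothetical helpers: is this Char a well-formed high or low surrogate?
is_high(c::Char) = !Base.ismalformed(c) && 0xd800 <= UInt32(c) <= 0xdbff
is_low(c::Char)  = !Base.ismalformed(c) && 0xdc00 <= UInt32(c) <= 0xdfff

# Combine each high/low surrogate pair into one supplementary character.
# Unpaired surrogates and malformed data are passed through as is.
function combine_surrogates(s::AbstractString)
    chars = collect(s)
    out = Char[]
    i = 1
    while i <= length(chars)
        c = chars[i]
        if i < length(chars) && is_high(c) && is_low(chars[i + 1])
            hi, lo = UInt32(c), UInt32(chars[i + 1])
            push!(out, Char(0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00)))
            i += 2
        else
            push!(out, c)
            i += 1
        end
    end
    return join(out)
end

# CESU-8 encoding of U+1F600: surrogates D83D and DE00, three bytes each.
cesu = "\xed\xa0\xbd\xed\xb8\x80"
fixed = combine_surrogates(cesu)
```

Iterating `cesu` yields two Char values, one per surrogate, while `fixed` holds the single character U+1F600.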
