Changes to the representation of Char


#1

As a very high level comment here, I’m currently working on changes to how Julia handles string data that will make errors on invalid UTF-8 far less common. The work-in-progress is here:

The basic concept is to allow any invalid data in String values and allow extracting invalid character fragments as Char values, all without error. You will only encounter an error if you try to convert a Char value for an invalid character fragment into a code point value – something which most string processing code doesn’t actually need to ever do. I just wanted to mention that here because every time I see someone struggling with errors from invalid UTF-8 data, it makes me sad.


Unicode related error when reading a .csv
#2

I believe this will seriously impact performance, for those of us who are doing lots of string processing, esp. when using UCS-2, UTF-16, and UTF-32 representations.
At the very least, I hope that there would be an AbstractChar type, with the current Char type still available, as UChar maybe, and the code in Base/Stdlib made generic, instead of forcing this UTF-8 based representation on all of us.

In my experience supporting applications that were heavily into string processing for many years, dealing with all sorts of pre-Unicode single- and multibyte representations, Unicode 1.0 UCS-2, and all sorts of Unicode encoding variants over the years, I haven’t seen cases where people really wanted to deal with invalid data in their applications. What they wanted was:

  • use of a replacement character (possibly of their choice),
  • exceptions with enough information returned to figure out how to proceed if desired,
  • user handling of invalid sequences (to allow more complex behavior than a simple replacement character or exception),

as well as the ability to optionally allow on input things like overlong UTF-8 sequences (frequently used for representing a NUL byte as \xc0\x80, as Java does), or characters in the 0x10000–0x10FFFF range represented as two surrogate characters encoded as 3 bytes each, instead of a single 4-byte sequence.

Do you have any benchmarking results, using this changed representation of Char?
In particular, using things like UTF-16 and UTF-32, needed for interfacing with other languages and packages such as ICU?


#3

That’s exactly the kind of thing that allowing for invalid Unicode in strings will make possible.

I don’t see why we wouldn’t be able to use AbstractChar if it turns out that the standard Char has a noticeable performance impact for some particular use cases.


#4

No, not in my experience. The last thing customers wanted was to have invalid data in their databases, leading to logical corruption of their data. We almost never saw cases where the data was actually “corrupt”, it was pretty much always cases of misidentification of the encoding.
Trying to make all of the code cognizant of the possibility of garbage in strings 1) makes applications much more complex, and 2) runs into the fact that most programmers don’t know about all the possible issues with different character sets and encodings, so applications generally don’t handle that correctly.

That’s why I feel the best philosophy for string handling is to have good functionality for 1) handling of common variants of Unicode encoding on input, 2) detection of input data’s likely character set, 3) flexible replacement and exception schemes, and most importantly, to only handle valid Unicode strings once input (which can greatly speed up performance in some of the string functions, when you don’t have to worry about dealing with garbage data), and greatly reduce the burden on applications logic to handle invalid data.

The best way to handle potentially invalid data would be to simply have a binary string type (which I’ve wanted from the beginning in Julia), that is actually <: AbstractString, that could be either reinterpreted (only if valid!) or converted to String (or another valid string type) once the data has been checked for validity or what encoding it is from.
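The check-then-convert workflow described here can already be sketched in current Julia using `isvalid` on a `String` built from raw bytes (a minimal sketch only; the dedicated binary string type proposed above is hypothetical):

```julia
# Raw bytes from some I/O source; may or may not be valid UTF-8.
good = [0x68, 0x69]        # the bytes for "hi"
bad  = [0x68, 0xff]        # 0xff can never appear in valid UTF-8

# Wrap the bytes as a String, then check validity before
# treating the result as clean Unicode text.
s = String(copy(good))     # copy: String(v) takes ownership of v
println(isvalid(s))        # true: safe to use as Unicode text

t = String(copy(bad))
println(isvalid(t))        # false: handle, replace, or reject
```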

I can’t think of a single customer who ever wanted to allow potentially invalid strings in their data, in all the time I worked in just that area.

I think it’s very important to help people to avoid “Garbage In, Garbage Out”.


#5

Here’s the new replacement PR for the one linked above:

https://github.com/JuliaLang/julia/pull/24999

This turned into a bigger overhaul than expected – there were a lot of conceptual inconsistencies that had crept into the string code over years of various people with slightly different mental models working on it. Changing the character representation itself wasn’t such a big deal, but rewriting all the low-level functions and having everything work required some rethinking of what the fundamental API that a string type has to provide should be. The core methods that a string type must provide are now:

  • ncodeunits(s::AbstractString) – the number of code units
  • codeunit(s::AbstractString) – the code unit type of a string
  • codeunit(s::AbstractString, i::Integer) – extracting a single code unit
  • isvalid(s::AbstractString, i::Integer) – is an index the start of a character
  • next(s::AbstractString, i::Integer) – string iteration

Everything else has pretty efficient generic definitions in terms of these. (Except for string reversal, which we don’t currently have an efficient way to express generically, but that can be added later.)
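For concreteness, here is roughly how those primitives behave on a plain String (a small sketch against a released Julia, where the iteration method `next` was later renamed to `iterate`):

```julia
s = "αb"                    # 'α' is 2 code units in UTF-8, 'b' is 1

println(ncodeunits(s))      # 3: total number of code units
println(codeunit(s))        # UInt8: the code unit type of String
println(codeunit(s, 1))     # 0xce: the first byte of 'α'
println(isvalid(s, 1))      # true: index 1 starts a character
println(isvalid(s, 2))      # false: index 2 is mid-character
```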

I changed the representation of Char: it now holds raw bytes of data from an underlying String object, and iterating a String works whether it contains valid UTF-8 data or not, by breaking the data into chunks of 1–4 bytes, each stored in a 32-bit Char value, which are either:

  • well-formed UTF-8 characters, or
  • minimal invalid byte sequences, split the way the Unicode 10 recommendation splits them for replacement-character substitution

For the latter case, the Char value captures the underlying bytes instead of replacing them, which allows code to operate on malformed values, choosing to ignore them, replace them, or pass them through as is. You only get an error for malformed UTF-8 in a String if you try to convert a malformed Char value to an integer code point. At that point there’s no other choice: since the character data is malformed, there is no correct value to return.
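A small illustration of this behavior against current Julia (a sketch): a String containing a stray 0xff byte iterates without error, and only the final conversion of the malformed fragment to an integer throws.

```julia
s = "a\xffb"                     # 0xff is never valid in UTF-8
cs = collect(s)                  # iterates fine: 3 Char values
println(length(cs))              # 3
println(isvalid(cs[2]))          # false: a malformed character fragment
println(cs[2] == 'a')            # false: comparison needs no decoding

# Only converting the malformed fragment to a code point errors:
try
    Int(cs[2])
catch err
    println("no code point for malformed fragment")
end
```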

The phrase “well-formed UTF-8” is more lenient than what Unicode defines as “valid UTF-8”. It includes anything with the basic structure of UTF-8: ASCII values, or a lead byte (with 2–4 leading one bits) followed by enough continuation bytes (with 1 leading one bit). Any such sequence can be decoded to a code point value, and that’s what is returned when you do Int(c). This allows decoding non-standard UTF-8 schemes like Modified UTF-8 and WTF-8, which commonly occur in practice. So if you want to check whether a character is valid UTF-8, you could check that it is well-formed, which guarantees that converting it to an integer will not fail, and then check that converting it back to a character gives you the same character value. Of course, there’s an isvalid(c) predicate that checks this more efficiently, but you get the point.

The base string functions no longer try to combine CESU-8 and will instead yield a separate Char value for each half of a surrogate pair. You can detect this and combine them easily if desired, but combining them automatically would violate the “give the programmer what’s actually there” maxim that the new approach takes. Higher-level functionality can easily be provided for normalizing and “fixing” strange/broken UTF-8 data.
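For instance, the CESU-8 encoding of U+1F600 is a surrogate pair encoded as two 3-byte sequences; under the new scheme each half comes through as its own (invalid) Char rather than being silently combined (a sketch against current Julia; the bytes below are the standard CESU-8 encoding of that code point):

```julia
# CESU-8 encodes U+1F600 as the surrogates U+D83D U+DE00,
# each in 3-byte UTF-8-style form.
cesu = String([0xed, 0xa0, 0xbd, 0xed, 0xb8, 0x80])

cs = collect(cesu)
println(length(cs))            # 2: one Char per surrogate half
println(any(isvalid, cs))      # false: surrogates are not valid characters
```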


#6

I’ve done a variety of benchmarking and there are no performance issues that I can find. Here’s a comparison of the following benchmark functions, which compute the total number of characters in the dictionary by iterating over words:

words = readlines("/usr/share/dict/words");

function countchars1(words::Vector{String})
    t = 0
    for word in words, c in word
        t += 1
    end
    return t
end

function countchars2(words::Vector{String})
    t = 0
    for word in words
        t += length(word)
    end
    return t
end

On 0.7 master:

julia> @benchmark countchars1($words)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.050 ms (0.00% GC)
  median time:      4.164 ms (0.00% GC)
  mean time:        4.180 ms (0.00% GC)
  maximum time:     4.692 ms (0.00% GC)
  --------------
  samples:          1193
  evals/sample:     1

julia> @benchmark countchars2($words)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.226 ms (0.00% GC)
  median time:      3.323 ms (0.00% GC)
  mean time:        3.338 ms (0.00% GC)
  maximum time:     3.978 ms (0.00% GC)
  --------------
  samples:          1493
  evals/sample:     1

After the string overhaul:

julia> @benchmark countchars1($words)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.781 ms (0.00% GC)
  median time:      3.869 ms (0.00% GC)
  mean time:        3.883 ms (0.00% GC)
  maximum time:     4.474 ms (0.00% GC)
  --------------
  samples:          1284
  evals/sample:     1

julia> @benchmark countchars2($words)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.628 ms (0.00% GC)
  median time:      3.764 ms (0.00% GC)
  mean time:        3.780 ms (0.00% GC)
  maximum time:     5.027 ms (0.00% GC)
  --------------
  samples:          1319
  evals/sample:     1

So character iteration got faster (7%) and string length got slower (12%). There are lots of tricks that can be done to make string length faster despite the somewhat tricky character-splitting rules, so I’m not too worried about that slowdown. The current numbers prove that this is a viable approach from the performance perspective.

Regarding performance issues of the new character representation with respect to UTF-32, UTF-16, UCS-2 and whatever, I think those are a bit overstated since whenever you have an algorithm on a particular string encoding that’s a bottleneck, you generally want to rewrite it not to operate character by character. Operating on individual characters is mostly useful for providing reasonable fallbacks for generic string operations when performance isn’t paramount. However, it seems fine to introduce an AbstractChar type and allow packages to subtype that and choose whatever character representation they want to use. That could be introduced at any time since we’re specifically reserving the right to insert abstract types into the type hierarchy in the 1.x timeframe, but someone may get around to it before the 1.0 feature freeze.


#7

The second test doesn’t seem to have anything at all to do with the proposed change to the representation of Char, and the first test is doing no processing at all on the character values.

These tests don’t show what the effects would be with strings in different languages, and I disagree that you would generally rewrite things not to operate character by character if there is a performance bottleneck.
One of the things I liked very early on in Julia was the way generic algorithms on strings and characters could get optimized automatically because the compiler could take advantage of things like knowing the types of the codeunits.

I don’t think that these tests prove that this approach is at all viable from a performance perspective.
The PR #24999 also seems to do a lot of the things that people such as @tkelman and yourself complained about a few years ago, of not being focused on a single issue, i.e. changing the representation of Char, and instead has incorporated a whole grab bag of unrelated string changes (which makes it very difficult, if not impossible, to determine the performance effects of just the Char changes).


#8

The first test measures character iteration for the new Char type. I can’t see how that’s not relevant. The second test does not produce characters, but since the PR affects what counts as a character, it’s relevant to see how fast one can count the number of characters in strings. These aren’t two random benchmarks – these test the two operations that got trickier in the new scheme: character iteration and character counting. Everything else is the same or easier than it was.

These tests don’t show what the effects would be with strings in different languages, and I disagree that you would generally rewrite things not to operate character by character if there is a performance bottleneck.

Almost all of the performance improvements to string code in Base Julia that you made some years ago did precisely that. The performance-sensitive operations on the String type all do exactly this as well. Decoding UTF-8 is tricky. The way you make an operation on UTF-8 faster is by not actually decoding it fully.

One of the things I liked very early on in Julia was the way generic algorithms on strings and characters could get optimized automatically because the compiler could take advantage of things like knowing the types of the codeunits.

This is still true. Nothing has changed with respect to that.

I don’t think that these tests prove that this approach is at all viable from a performance perspective.

I’ve shown that the new character iteration is slightly faster. This makes sense since it does less work. It just determines where the boundaries between characters are, it doesn’t have to convert those bytes into code points. It is really quite rare to need an integer code point value – what you typically actually want to do is compare characters to other characters using equality and inequality. The new representation allows you to do that without decoding characters into code points because of the property that UTF-8 bytes sort the same as the code points they represent. It’s one of the brilliant things about the design of UTF-8.
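Concretely (a small sketch): Char comparisons operate directly on the stored bytes, so they work even for malformed fragments, and for valid characters the ordering matches code point order, thanks to that property of UTF-8.

```julia
println('a' < 'b')            # true: no code point computation needed
println('α' > 'z')            # true: byte order matches code point order

c = "a\xffb"[2]               # a malformed one-byte fragment
println(c == 'a')             # false: equality needs no decoding
println(c == c)               # true
```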

Regarding whether converting characters to code points is typical or not, I analyzed Base, which one would expect to do such low-level operations as much as any code out there. There were only six instances of explicitly converting a character to an integer code point in all of Base:

  • one is the definition of how to convert to general integer types,
  • two are in character display code for showing code point values,
  • three should have been doing character comparisons without decoding, so I’ve fixed those.

That’s only two real uses of code points – both for printing code points as numerical values. Based on that analysis, it’s a bit implausible that extracting code points is essential to working with strings. Equality and inequality comparisons on character values are sufficient for almost all character processing.

And keep in mind that even in cases where code point conversion does happen, you are still only doing the same amount of work as was done to produce the old character representation. Previously, we had to determine where characters end and then do the conversion of bytes to a code point. Now we get to skip the second part almost all of the time.

I can understand the concern with using this character representation with string encodings other than UTF-8. Particularly when working with UTF-32 where producing code points in the old representation was trivial. But here’s the thing: UTF-8 won. Here’s a graph of popularity of encodings in web pages over time:

And that’s only up to 2014 – 90% of the web is now UTF-8. If you are one of the increasingly rare people working with UTF-32 data, you can use a Char32 <: AbstractChar type that represents code points in the old way. No problem.

The PR #24999 also seems to do a lot of the things that people such as @tkelman and yourself complained about a few years ago, of not being focused on a single issue, i.e. changing the representation of Char, and instead has incorporated a whole grab bag of unrelated string changes (which makes it very difficult, if not impossible, to determine the performance effects of just the Char changes).

Surely, you understand the difference here? I have just a bit more credibility at making large changes to Julia than you do. A couple years ago, you showed up, decided you wanted to change everything about how strings work in Julia in ways that people did not agree with and that you did not persuade them were better, which, frankly, I still think were fundamentally misguided. These changes left strings in Julia in a state where a program throws an error any time it encounters invalid UTF-8 string data. As it turns out, this happens quite a lot in the real world. Getting errors all the time has forced anyone who wants to write robust string code to not use our strings at all and use byte vectors instead. It’s a disaster. I’ve been intending to undo the damage for a long time and this PR is that change. I can understand that you don’t like it or agree with it, but I’m ok with that. The new philosophy is quite different but very simple: bad data should never cause an error, only a bad program should. With this change, Julia can gracefully and efficiently handle any kind of UTF-8 data the world throws at it, good or bad.

A large part of why this PR is so sprawling is that the string code was in an incoherent state without a single vision, after being pulled in different directions by various people over the years. Changing one thing (character representation) that should, in theory, have been isolated ended up forcing the resolution of a number of fundamental questions about what it means to be a string in Julia. If I had more time, I would factor the PR into smaller, more manageable pieces, but we’re under a deadline here since the Julia 1.0 feature freeze is on Friday.


#9

Thanks for the summary of the PR.

A minor question: if I understand correctly, codeunit(string) returns a type and codeunit(string, i) returns a codeunit. I am wondering if it would be more clear to rename the first method to codeunittype or something similar.


#10

Yeah, I was a bit unsure about that, but codeunittype is just sooo looong.


#11

Fair point. OTOH, I guess you don’t need to query the type that often. My main concern is that I associate the joint existence of forms like

something(arg1, arg2)
something(arg1)

with optional arguments, or at least methods that roughly do similar things.


#12

I agree, codeunittype isn’t that bad for such a rarely used operation.


#13

BTW, sorry for not commenting on the PR, but (somewhat embarrassingly) I could not find the location of this definition among the changes.


#14

It’s a downside of literate programming: it’s one line buried under a huge doc string.


#15

Please… Base’s irrational hatred of underscores needs to have some bounds! Not choosing camelCase is increasingly rearing its ugly head in Julia; people are just so crazy about 1 or 2 extra characters at the huge expense of readability. codeunit_type themadnessmustendsometime