The first test measures character iteration for the new Char type. I can’t see how that’s not relevant. The second test does not produce characters, but since the PR affects what counts as a character, it’s relevant to see how fast one can count the number of characters in a string. These aren’t two random benchmarks – they test the two operations that got trickier in the new scheme: character iteration and character counting. Everything else is the same or easier than it was.
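For concreteness, here’s a hedged sketch of what those two measurements look like; the string contents, size, and use of BenchmarkTools are my illustration, not the exact benchmark code:

```julia
# Illustrative only: time the two operations the new scheme makes trickier,
# character iteration and character counting, on a mixed ASCII/non-ASCII string.
using BenchmarkTools

s = repeat("αβγδε abcde ", 10_000)

@btime foreach(identity, $s)   # character iteration
@btime length($s)              # character counting
```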
These tests don’t show what the effects would be with strings in different languages, and I disagree that you would generally rewrite things not to operate character by character if there is a performance bottleneck.
Almost all of the performance improvements to string code in Base Julia that you made some years ago did precisely that. The performance-sensitive operations on the String type all do exactly this as well. Decoding UTF-8 is tricky; the way you make operations on UTF-8 faster is by not fully decoding it.
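To make the idea concrete, here’s a minimal sketch (not Base’s actual, more heavily optimized implementation) of counting the characters in a UTF-8 string without decoding any of them:

```julia
# Count characters in a UTF-8 string without decoding code points:
# every character contributes exactly one byte that is not a continuation
# byte, and continuation bytes have the bit pattern 10xxxxxx.
function count_chars(s::String)
    n = 0
    for b in codeunits(s)
        n += (b & 0xc0) != 0x80
    end
    return n
end

count_chars("naïve café")  # => 10
```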
One of the things I liked very early on in Julia was the way generic algorithms on strings and characters could get optimized automatically because the compiler could take advantage of things like knowing the types of the codeunits.
This is still true. Nothing has changed with respect to that.
I don’t think that these tests prove at all that this approach is viable from a performance perspective.
I’ve shown that the new character iteration is slightly faster. This makes sense since it does less work: it just determines where the boundaries between characters are; it doesn’t have to convert those bytes into code points. It is really quite rare to need an integer code point value – what you typically want to do is compare characters to other characters using equality and inequality. The new representation allows you to do that without decoding characters into code points, because UTF-8 byte sequences sort in the same order as the code points they represent. It’s one of the brilliant things about the design of UTF-8.
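Here’s a hypothetical sketch of that point (MyChar is illustrative, not the actual Char definition):

```julia
# Pack a character's UTF-8 bytes into a UInt32 (high byte first) and compare
# the integers directly. Because UTF-8 byte sequences order the same way as
# the code points they encode, these comparisons agree with code point order
# without ever decoding anything.
struct MyChar
    bytes::UInt32   # the character's UTF-8 bytes, left-justified
end

Base.:(==)(a::MyChar, b::MyChar) = a.bytes == b.bytes
Base.isless(a::MyChar, b::MyChar) = a.bytes < b.bytes
```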
Regarding whether converting characters to code points is typical or not, I analyzed Base, which one would expect to do such low-level operations as much as any code out there. There were only six instances of explicitly converting a character to an integer code point in all of Base:
- one is the definition of how to convert to general integer types,
- two are in character display code for showing code point values,
- three should have been doing character comparisons without decoding, so I’ve fixed those.
That’s only two real uses of code points – both for printing code points as numerical values. Based on that analysis, it’s a bit implausible that extracting code points is essential to working with strings. Equality and inequality comparisons on character values are sufficient for almost all character processing.
And keep in mind that even in cases where code point conversion does happen, you are still only doing the same amount of work as was done to produce the old character representation. Previously, we had to determine where characters end and then do the conversion of bytes to a code point. Now we get to skip the second part almost all of the time.
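In code, the cost model looks roughly like this (an illustrative sketch assuming the new design, not code from the PR):

```julia
s = "résumé"
for c in s
    # no decoding has happened yet; c just carries the character's bytes
    if c == 'é'                      # byte-wise comparison, still no decoding
        cp = UInt32(c)               # decoding to a code point happens here, on demand
        println("U+", string(cp, base = 16, pad = 4))
    end
end
```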
I can understand the concern with using this character representation with string encodings other than UTF-8, particularly UTF-32, where producing code points in the old representation was trivial. But here’s the thing: UTF-8 won. Here’s a graph of the popularity of encodings in web pages over time:

[graph: popularity of character encodings in web pages over time, through 2014]

And that’s only up to 2014 – 90% of the web is now UTF-8. If you are one of the increasingly rare people working with UTF-32 data, you can use a Char32 <: AbstractChar type that represents code points in the old way. No problem.
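Here’s a hypothetical sketch of what such a type could look like (Char32 is not something Base ships; the name and details are only meant to illustrate the AbstractChar interface):

```julia
# A code-point-backed character type, the natural representation for UTF-32
# data: it stores the code point directly, the way the old Char did.
struct Char32 <: AbstractChar
    codepoint::UInt32
end

# The AbstractChar interface centers on getting at the code point and
# constructing from other character types.
Base.codepoint(c::Char32) = c.codepoint
Char32(c::AbstractChar) = Char32(codepoint(c))

codepoint(Char32('é'))   # => 0x000000e9
```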
The PR #24999 also seems to do a lot of the things that people such as @tkelman and yourself complained about a few years ago: rather than being focused on a single issue, i.e. changing the representation of Char, it has incorporated a whole grab bag of unrelated string changes (which makes it very difficult, if not impossible, to determine the performance effects of just the Char changes).
Surely, you understand the difference here? I have just a bit more credibility at making large changes to Julia than you do. A couple of years ago, you showed up and decided you wanted to change everything about how strings work in Julia, in ways that people did not agree with, that you did not persuade them were better, and that, frankly, I still think were fundamentally misguided. Those changes left strings in Julia in a state where a program throws an error any time it encounters invalid UTF-8 string data. As it turns out, this happens quite a lot in the real world. Getting errors all the time has forced anyone who wants to write robust string code to not use our strings at all and use byte vectors instead. It’s a disaster. I’ve been intending to undo the damage for a long time, and this PR is that change. I can understand that you don’t like it or agree with it, but I’m OK with that. The new philosophy is quite different but very simple: bad data should never cause an error; only a bad program should. With this change, Julia can gracefully and efficiently handle any kind of UTF-8 data the world throws at it, good or bad.
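To make that philosophy concrete, here’s the kind of thing that works under this design (illustrative bytes, not code from the PR):

```julia
# Invalid UTF-8 never throws just from being stored, iterated, or measured;
# only explicitly decoding a malformed character into a code point is an error.
s = String([0x61, 0xff, 0x62])   # the bytes for "a", one invalid byte, "b"
length(s)                        # 3 "characters", no error
collect(s)                       # ['a', '\xff', 'b'], no error
count(!isvalid, s)               # 1 malformed character, no error
```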
A large part of why this PR is so sprawling is that the string code was in an incoherent state, without a single vision, after being pulled in different directions by various people over the years. Changing one thing (the character representation) that should, in theory, have been isolated ended up forcing the resolution of a number of fundamental questions about what it means to be a string in Julia. If I had more time, I would factor the PR into smaller, more manageable pieces, but we’re under a deadline here since the Julia 1.0 feature freeze is on Friday.