While that was a concern initially, it didn't turn out to be an issue in the end. I did have to write some very tricky low-level implementations of functions, like this:
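(What follows is just a simplified sketch with a made-up name, not the actual implementation, but it gives the flavor: every step has to cope with bytes that may not form valid UTF-8.)

```julia
# Simplified sketch (made-up name, not the actual Base implementation): find the
# index of the next character in a byte buffer that may contain invalid UTF-8.
# Invalid bytes are treated as characters of their own instead of raising errors.
function nextcharind(s::AbstractVector{UInt8}, i::Int)
    n = length(s)
    i ≥ n && return n + 1
    b = s[i]
    b < 0x80 && return i + 1          # ASCII: single code unit
    # How many code units the lead byte *claims* the character has
    len = b < 0xc0 ? 1 :              # stray continuation byte: one malformed char
          b < 0xe0 ? 2 :
          b < 0xf0 ? 3 : 4
    j = i + 1
    # Only consume bytes that really are continuation bytes (0x80–0xbf), and stop
    # early otherwise, so malformed data can never make us run past a character.
    while j < i + len && j ≤ n && 0x80 ≤ s[j] < 0xc0
        j += 1
    end
    return j
end
```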
Assuming validity of UTF-8 data doesn't end up buying you much additional performance – if any. There are a few more iterated index arithmetic functions that I'll write optimized versions of after the 0.7 feature freeze, and I'll also do some more benchmarking against the previous UTF-8 code.
This was a big issue. Validating incoming data requires looking at all of it, which is not acceptable for large enough text data. And the validation was extremely spotty – some ways of getting strings would error if the data was invalid, while with others you'd end up with a string holding invalid data and no error. So we were paying the price for validation and not even getting any validity guarantee from it. Moreover, as I've said, the assertion that you can decode UTF-8 much faster by assuming it is valid seems not to be correct, and would need to be backed up by actual benchmarks to that effect (e.g. an implementation that decodes UTF-8 assuming validity and is faster than my implementation above, which doesn't).
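To make the first point concrete, here's a bare-bones validation sketch (a hypothetical helper, not the Base code – a real validator also has to reject overlong encodings, surrogates, and out-of-range code points, which is even more work). Even this stripped-down version has no choice but to look at every single byte:

```julia
# Minimal sketch of UTF-8 validation (hypothetical helper, not Base code).
function looks_like_valid_utf8(s::AbstractVector{UInt8})
    i, n = 1, length(s)
    while i ≤ n
        b = s[i]
        if b < 0x80                              # 1-byte (ASCII)
            i += 1
        elseif 0xc2 ≤ b ≤ 0xdf                   # 2-byte sequence
            i + 1 ≤ n && s[i+1] & 0xc0 == 0x80 || return false
            i += 2
        elseif 0xe0 ≤ b ≤ 0xef                   # 3-byte sequence
            i + 2 ≤ n && s[i+1] & 0xc0 == 0x80 &&
                         s[i+2] & 0xc0 == 0x80 || return false
            i += 3
        elseif 0xf0 ≤ b ≤ 0xf4                   # 4-byte sequence
            i + 3 ≤ n && s[i+1] & 0xc0 == 0x80 &&
                         s[i+2] & 0xc0 == 0x80 &&
                         s[i+3] & 0xc0 == 0x80 || return false
            i += 4
        else                                     # invalid lead or stray continuation
            return false
        end
    end
    return true
end
```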
Using a hybrid encoding like Python 3's strings or @ScottPJones's UniStr means that not only do you need to look at every byte of incoming data, but you generally also have to transcode it. This is a total performance nightmare for dealing with large text files. It's also the reason why his benchmarks are extremely misleading: he's comparing operations that are O(n) for variable-width encodings like UTF-8 but O(1) for fixed-width encodings like UTF-32. But how did you get that fixed-width encoded string data in the first place? You aren't getting data in UniStr form – since that's not an actual encoding that exists in the wild. So you had to scan each incoming string to find its largest code point value, and then transcode it to the appropriate choice of Latin-1, UCS-2, or UTF-32. After all of that work, sure, indexing and counting code points are O(1), but up front you already did exactly the work that the benchmarks are timing the UTF-8 string type doing.
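To spell out that up-front work, here's a hedged sketch with made-up helper names – this isn't UniStr's actual code, just the shape of the ingestion cost it implies:

```julia
# Hypothetical helpers sketching the ingestion cost of a hybrid representation;
# not UniStr's actual code. Choosing a width requires decoding every character
# once, and transcoding is a second O(n) pass that allocates a new buffer.
function choose_width(s::AbstractString)
    maxcp = UInt32(0)
    for c in s                         # full decode pass over the incoming data
        maxcp = max(maxcp, UInt32(c))
    end
    return maxcp ≤ 0xff   ? 1 :        # Latin-1
           maxcp ≤ 0xffff ? 2 :        # UCS-2
                            4          # UTF-32
end

function to_fixed_width(s::AbstractString)
    w = choose_width(s)
    w == 1 && return UInt8[UInt8(c) for c in s]      # second pass + allocation
    w == 2 && return UInt16[UInt16(c) for c in s]
    return UInt32[UInt32(c) for c in s]
end
```

Only after both of those passes do you get the O(1) indexing and counting that the benchmarks show off.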
Note also that in UniStr, if a large string that is mostly ASCII has a single emoji in it, then it needs to be stored in UTF-32, so it will be roughly 4x larger than it would be in UTF-8. That's an extreme example, but also not all that contrived – mostly-ASCII data with a few emoji is not exactly an unlikely scenario.
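You can see the factor directly (a quick illustration – the exact numbers obviously depend on the text):

```julia
# Mostly-ASCII text with a single emoji: UTF-8 vs. a UTF-32-width representation.
s = repeat("mostly ASCII text, ", 1_000) * "🙂"
sizeof(s)        # UTF-8 bytes: 19_004  (1 byte per ASCII char + 4 for the emoji)
4 * length(s)    # UTF-32 bytes: 76_004 (4 bytes per character, ~4x larger)
```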
Are there use cases for the UniStr kind of hybrid encoding? Sure. If you want to ingest a bunch of string data once and the strings are each going to be fairly small (limiting the potential effect of a single emoji), then it might be a good way to represent strings. But that's a fairly specific scenario and hardly a typical one for data processing.