String indices: byte indexing feels wrong

The way Python3 handles this internally is that every string is transcoded to a fixed-width format based on the largest code point in the string. (Specifically one of Latin-1, UCS-2 or UTF-32—yes, the names of these are comically all over the place, but these are the three fixed-width encodings that represent Unicode code points directly as UInt8, UInt16 and UInt32 values). What is the problem with this? Here are the major issues:
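You can actually observe this kind-selection from within Python: the marginal size of one more character reveals which fixed width CPython picked for a string. A minimal sketch (the helper name is my own, and the exact `sys.getsizeof` behavior is CPython-specific):

```python
import sys

def bytes_per_char(ch):
    # The marginal cost of one extra copy of ch exposes the fixed-width
    # representation CPython chose for the whole string.
    return sys.getsizeof(ch * 1000) - sys.getsizeof(ch * 999)

print(bytes_per_char("a"))           # 1 byte/char: Latin-1
print(bytes_per_char("\u03a9"))      # 2 bytes/char: UCS-2 (Ω)
print(bytes_per_char("\U0001F600"))  # 4 bytes/char: UTF-32 (😀)
```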

  1. The Python interpreter has to read and decode every string before the user can do anything with it. For large data, you might not need to look at most of it at all, so this can be a real performance problem. E.g. you might want to mmap an entire large file as a string and then access only the parts you need without ever touching the rest. In Julia you can do this, whereas in Python3 you cannot: the system will read the entire thing, and worse still, transcode the whole thing if there’s anything non-ASCII in there…

  2. Any UTF-8 string that contains non-ASCII data needs to be transcoded at least once. Not only do you have to look at all string data, but if any of it isn’t ASCII, then you’ll need to expand the string into a buffer up to 4x larger than the original. If you later print that string, you’ll need to transcode it back to UTF-8 in order to send it anywhere external.

  3. As alluded to above, the blow-up to represent a string can be up to 4x. How common is this? I’ve heard arguments made that high code points are very rare, which is mostly true, with the exception of one rather popular kind of character: emoji. Code points for emoji start at U+1F600, which is too big to fit in a UInt16. And of course, they’re constantly sprinkled into text that would otherwise be pure ASCII, which is basically the worst possible case for Python3: any otherwise-ASCII text that contains even a single emoji ends up using four bytes per character, inflating it by 4x. Not ideal given how common emoji are in real-world text data these days.

  4. Since Python3 represents strings as vectors of code point values, it cannot, by design, handle invalid string data. Is that a problem? In a perfect world, string data would all be valid and this would be fine, but unfortunately, that’s not the world we live in. CSV files regularly include invalid UTF-8 in fields even if the file is otherwise encoded as UTF-8. When you’re writing a server, you really don’t want your server to just die when someone sends a poorly formatted request. File systems will happily give you invalid string data for path names—yes, this is real and it sucks, but both UNIX and Windows do this. How does Python let you handle this? It’s a mix between forcing you to work with byte arrays instead of strings and just not having any way to handle it. It’s extremely common for long-running data analysis processes in Python to just suddenly die deep in some work because they encountered a stray \xff byte. Any operation that gets external data and produces strings is susceptible.
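All four points can be sketched in a few lines of Python (the file contents here are made up for illustration, and the `sys.getsizeof` comparison is CPython-specific):

```python
import mmap
import os
import sys
import tempfile

# 1. You can lazily view file data as bytes via mmap, but not as a str:
#    text mode would force decoding the whole file up front.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"header," + b"x" * 10_000 + b",\xff,footer")  # note the stray \xff
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    header = bytes(mm[:7])  # touch only the bytes we actually need
    mm.close()

# 2. Decoding UTF-8 transcodes into the internal fixed-width format,
#    and writing the string anywhere transcodes it back out again.
raw = "naïve café".encode("utf-8")  # 12 bytes: ï and é take 2 bytes each
s = raw.decode("utf-8")             # transcode #1: UTF-8 -> internal
out = s.encode("utf-8")             # transcode #2: back to UTF-8 for I/O

# 3. A single emoji inflates otherwise-ASCII text to 4 bytes per character.
ascii_text = "x" * 1000
with_emoji = ascii_text + "\U0001F600"
inflated = sys.getsizeof(with_emoji) > 3 * sys.getsizeof(ascii_text)

# 4. One invalid byte kills a strict decode of otherwise-valid data...
data = open(path, "rb").read()
try:
    data.decode("utf-8")
    died = False
except UnicodeDecodeError:
    died = True

# ...unless you opt into surrogateescape, which smuggles each invalid
# byte through as a lone surrogate and can round-trip it back out.
rescued = data.decode("utf-8", errors="surrogateescape")
roundtrip = rescued.encode("utf-8", errors="surrogateescape")

os.remove(path)
```

Note that the `surrogateescape` escape hatch in point 4 only helps if every consumer of the string remembers to use it too; any plain `encode()` downstream will still blow up.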

A common defense of this behavior is to say that strings shouldn’t be allowed to be invalid in the first place—picture a strict Victorian school teacher wagging her finger at you. But amusingly, Python3 isn’t even disciplined about that, since it does allow strings to be invalid in some ways. Specifically, unpaired surrogate code units are allowed in strings even though they are invalid. You see, it’s only some kinds of invalid string data that are morally reprehensible and too dangerous to allow. Conveniently, the morally acceptable invalid strings are exactly the ones that Python happens to be able to represent, and the morally repugnant ones are the kind that it can’t represent.
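A minimal illustration of that inconsistency in CPython:

```python
# Python happily stores an unpaired surrogate in a str...
s = "ok\ud800"
assert len(s) == 3

# ...even though no valid Unicode string can contain it, which is
# exactly why encoding it to UTF-8 fails.
try:
    s.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
```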

Anyway, why the long rant? Because it’s very annoying having people hold up Python3’s string handling as some paragon of excellence. Literally the only thing it has going for it is that once you have a string—you’ve already paid an O(n) scan and transcode cost—you can do O(1) indexing by character. Never mind that indexing by character isn’t particularly useful and almost never what you actually need: for efficiency, code units are better; for human-perceived behavior, normalized grapheme clusters are what you should work with. Never mind that even if you don’t want or need O(1) indexing by character, you are still forced to pay the O(n) up-front cost everywhere.
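The gap between code points and graphemes is easy to demonstrate: ‘é’ written as a base letter plus a combining accent is one grapheme but two code points, and Python’s O(1) character indexing happily lands in the middle of it:

```python
s = "e\u0301"  # 'é' as base letter + combining acute accent: one grapheme

# O(1) "character" indexing is code-point indexing, which splits the
# grapheme a human would consider a single character.
assert len(s) == 2
assert s[0] == "e"
assert s[1] == "\u0301"
```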
