A Python rant about types

StefanKarpinski · July 18, 2020, 8:08pm

It’s a matter of opinion, but I believe that Python 3 bungled strings quite badly (and I know there are many, even in the Python community, who have also expressed this view). Here’s Python 2 reading a string that’s not valid UTF-8 without any problem:

>>> io = open("/dev/random")
>>> line = io.readline()
>>> line
'N\xe2\x97\x8e@\xe8T2[8\xef\xb3 T\x06\x98\x86\xeb\xbcR\xfdxu\x97\x0b \x9b\xfc:\xb4\xdb\xa6_j\x1e"\xf0|\xf2B\x07Rs\x13\x88\x8bJ\x06.L\xa2\xb0\xe7\xba\xcc\x1a^\x98?\xcaR\xcb\x0b\xe3\xdc?\xf4\xdb\x04\x98yHK^\xf4t\x8c\xff\x83\x07\xeaV\xe2\xf8\x8b\xeb\x17%\xde)\xdcl\n'
>>> type(line)
<type 'str'>

Here’s Python 3 choking on the same thing:

>>> io = open("/dev/random")
>>> line = io.readline()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 2: invalid start byte

Some would chastise me and say, “Silly Stefan, you shouldn’t be reading /dev/random as a string—it’s never going to produce valid UTF-8!” And of course, they’re right. But the trouble is that even though this example is indeed silly, it’s very common for data sources that are reasonably expected to be strings to occasionally fail to be perfectly valid UTF-8. People are bad at encodings and sometimes data is just corrupt. What then? Then your Python 3 program that works with strings simply crashes, losing whatever work it was doing. In fact, if you want your Python 3 code to be robust to all kinds of input, then you must avoid using the str type at all and use the bytes type instead. What then is the point of the str type? What use is a string type that cannot reliably be used to work with string input?
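To make the less silly version of this concrete, here is a minimal sketch. The input bytes and the two function names are made up for illustration. The middle line is assumed to have been written as Latin-1 rather than UTF-8, which is one typical way for otherwise ordinary text to end up invalid, and the decode is the strict UTF-8 decode seen in the traceback above.

```python
# Hypothetical input: three short lines. The second was encoded as Latin-1,
# so the byte 0xE9 ("é" in Latin-1) is not part of a valid UTF-8 sequence.
raw = b"alpha\ncaf\xe9\nomega\n"


def count_lines_as_str(data: bytes) -> int:
    """Count lines after a strict UTF-8 decode into str."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError on the stray byte
    return len(text.splitlines())


def count_lines_as_bytes(data: bytes) -> int:
    """Count lines without ever decoding, staying in bytes throughout."""
    return len(data.splitlines())


try:
    count_lines_as_str(raw)
except UnicodeDecodeError as err:
    # One bad byte is enough to abort processing of the whole input.
    print("str version failed at byte offset", err.start)

print("bytes version counted", count_lines_as_bytes(raw), "lines")
```

The str version gives up on the single stray byte even though the other two lines are perfectly fine, while the bytes version handles all three lines.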

It seems to me that Python 3’s string design violates a very deep principle in API design: it’s ok if your program crashes because your code is bad, but it is not ok if your program crashes because the data is bad. It must be possible to write code that is correct and works no matter what the input may be. It’s fundamentally impossible to write code that works with strings in Python 3 and robustly handles all possible inputs.

Compare this with what Julia does:

julia> io = open("/dev/random")
IOStream(<file /dev/random>)

julia> line = readline(io)
"hmr[\xc0{\xab\xf7\xe9\xab\xe3\xb7\xc9}|UT;\xcdz\xa6-B\xf2\xeb\xcc\xc2)\xc2\xd0\xf6rU}\xaf\xbc\xac\xd4\xd8h\xbd[\x83t\x1d\x01'\x85\xe3\x9c\xc4\xf8\xd9\x18\xb5\x03\xf4\xba\xe2\xebN\x9c\xde\\m\x973\xd4\xf5z\xe5\x97"

julia> isvalid(line)
false

julia> line[4]
'[': ASCII/Unicode U+005B (category Ps: Punctuation, open)

julia> line[5]
'\xc0': Malformed UTF-8 (category Ma: Malformed, bad data)

julia> isvalid(line[4])
true

julia> isvalid(line[5])
false

julia> line′ = sprint() do io
           for c in line
               print(io, c)
           end
       end
"hmr[\xc0{\xab\xf7\xe9\xab\xe3\xb7\xc9}|UT;\xcdz\xa6-B\xf2\xeb\xcc\xc2)\xc2\xd0\xf6rU}\xaf\xbc\xac\xd4\xd8h\xbd[\x83t\x1d\x01'\x85\xe3\x9c\xc4\xf8\xd9\x18\xb5\x03\xf4\xba\xe2\xebN\x9c\xde\\m\x973\xd4\xf5z\xe5\x97"

julia> line′ == line
true

julia> codepoint(line[4])
0x0000005b

julia> codepoint(line[5])
ERROR: Base.InvalidCharError{Char}('\xc0')

Here are the significant points:

You can read and write any data, valid or not.
The data is interpreted as UTF-8 where possible and as invalid characters otherwise.
You can simply check if strings or chars are valid UTF-8 or not.
You can work with individual characters easily, even invalid ones.
You can losslessly read and write any string data, valid or not, as strings or chars.
You only get an error when you try to ask for the code point of an invalid char.

Most Julia code that works with strings is automatically robust with respect to invalid UTF-8 data. Only code that needs to look at the code points of individual characters will fail on invalid data; in order to do that robustly, you simply need to check if the character is valid before taking its code point and handle that appropriately.