A Python rant about types

johnh · July 18, 2020, 5:07pm

Permit me a Python rant, with relevance to Julia.

I have been stuck for over two days on running a Python script written by a colleague.
The script runs the system command ‘lscpu’, splits the output into lines, then parses it.
I get the obscure error TypeError: a bytes-like object is required, not ‘str’
I finally Googled this and found that it is a major change between Python 2 and Python 3
Sure enough - I am running on a RHEL 8 system which has now defaulted to Python 3

What I want to say is
a) thank goodness we have a robust type system in Julia, instead of ‘hey ho, let’s just do the best job we can’ - and then change it
b) if things broke like this in Julia 0.x to 1.0 there would have been gnashing of teeth
c) In Julia I would have run up a REPL and run the code snippets in isolation and used the typeof function to try to debug this (I know Python has a REPL but it is not in my list of skills)

The explanation of this error:
This is because, in Python 2, strings are treated as bytes by default. The original strings in Python 2 are 8-bit strings, which play a crucial role when working with byte sequences and ASCII text. Python 2 also supports automatic coercion between bytes and Unicode objects.

But in Python 3, by default, the strings are treated as Unicode. But unlike Python 2 there is no facility of automatic type coercion between Unicode strings and bytes. So in the code mentioned above, when you are trying to open the file in binary mode, Python 3 throws an error.
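A minimal sketch of how this error tends to show up in a script like the one described, and one way around it. The child command here is a stand-in for lscpu (an assumption, so that the sketch runs on any machine), and the parsing into a dict is hypothetical rather than the colleague's actual code.

```python
import subprocess
import sys

# Stand-in for `lscpu`: run the Python interpreter itself so this works anywhere.
raw = subprocess.check_output(
    [sys.executable, "-c", "print('Architecture: x86_64')"]
)

# In Python 3, check_output returns bytes, not str.
error_message = None
try:
    raw.split("\n")  # str separator applied to a bytes object
except TypeError as exc:
    error_message = str(exc)

# Decode once at the boundary, then work with str everywhere else.
text = raw.decode("utf-8")
fields = {}
for line in text.splitlines():
    key, sep, value = line.partition(":")
    if sep:
        fields[key.strip()] = value.strip()
```

The alternative is to stay in bytes throughout and split on b"\n" instead; either works as long as the two types are not mixed.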

purplishrock · July 18, 2020, 6:58pm

LOL. I ran into this exact problem and had the same reaction. I was stuck using python, so therefore, very sad.

What was particularly silly was that a DEPRECATED function with “tobytes” in the name was now returning a string, and not bytes. You are not reading that wrong: a function with “bytes” in its name was returning a string, which, as you point out, Python 3 will not treat as bytes. I then had to convert it to actual 8-bit values, which turned out to be a pain to figure out.

StefanKarpinski · July 18, 2020, 8:08pm

It’s a matter of opinion, but I believe that Python 3 bungled strings quite badly (and I know there are many even in the Python community who also have expressed this view). Here’s Python 2 reading a string that’s not valid UTF-8 no problem:

>>> io = open("/dev/random")
>>> line = io.readline()
>>> line
'N\xe2\x97\x8e@\xe8T2[8\xef\xb3 T\x06\x98\x86\xeb\xbcR\xfdxu\x97\x0b \x9b\xfc:\xb4\xdb\xa6_j\x1e"\xf0|\xf2B\x07Rs\x13\x88\x8bJ\x06.L\xa2\xb0\xe7\xba\xcc\x1a^\x98?\xcaR\xcb\x0b\xe3\xdc?\xf4\xdb\x04\x98yHK^\xf4t\x8c\xff\x83\x07\xeaV\xe2\xf8\x8b\xeb\x17%\xde)\xdcl\n'
>>> type(line)
<type 'str'>

Here’s Python 3 choking on the same thing:

>>> io = open("/dev/random")
>>> line = io.readline()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 2: invalid start byte

Some would chastise me and say, “Silly Stefan, you shouldn’t be reading /dev/random as a string—it’s never going to produce valid UTF-8!” And of course, they’re right. But the trouble is that even though this example is indeed silly, it’s very common for data sources that are reasonably expected to be strings to occasionally fail to be perfectly valid UTF-8. People are bad at encodings and sometimes data is just corrupt. What then? Then your Python 3 program that works with strings simply crashes, losing whatever work it was doing. In fact, if you want your Python 3 code to be robust to all kinds of input, then you must avoid using the str type at all and use the bytes type instead. What then is the point of the str type? What use is a string type that cannot reliably be used to work with string input?

It seems to me that Python 3’s string design violates a very deep principle in API design: it’s ok if your program crashes because your code is bad, but it is not ok if your program crashes because the data is bad. It must be possible to write code that is correct and works no matter what the input may be. It’s fundamentally impossible to write code that works with strings in Python 3 and robustly handles all possible inputs.

Compare this with what Julia does:

julia> io = open("/dev/random")
IOStream(<file /dev/random>)

julia> line = readline(io)
"hmr[\xc0{\xab\xf7\xe9\xab\xe3\xb7\xc9}|UT;\xcdz\xa6-B\xf2\xeb\xcc\xc2)\xc2\xd0\xf6rU}\xaf\xbc\xac\xd4\xd8h\xbd[\x83t\x1d\x01'\x85\xe3\x9c\xc4\xf8\xd9\x18\xb5\x03\xf4\xba\xe2\xebN\x9c\xde\\m\x973\xd4\xf5z\xe5\x97"

julia> isvalid(line)
false

julia> line[4]
'[': ASCII/Unicode U+005B (category Ps: Punctuation, open)

julia> line[5]
'\xc0': Malformed UTF-8 (category Ma: Malformed, bad data)

julia> isvalid(line[4])
true

julia> isvalid(line[5])
false

julia> line′ = sprint() do io
           for c in line
               print(io, c)
           end
       end
"hmr[\xc0{\xab\xf7\xe9\xab\xe3\xb7\xc9}|UT;\xcdz\xa6-B\xf2\xeb\xcc\xc2)\xc2\xd0\xf6rU}\xaf\xbc\xac\xd4\xd8h\xbd[\x83t\x1d\x01'\x85\xe3\x9c\xc4\xf8\xd9\x18\xb5\x03\xf4\xba\xe2\xebN\x9c\xde\\m\x973\xd4\xf5z\xe5\x97"

julia> line′ == line
true

julia> codepoint(line[4])
0x0000005b

julia> codepoint(line[5])
ERROR: Base.InvalidCharError{Char}('\xc0')

Here are the significant points:

You can read and write any data, valid or not.
It is interpreted as UTF-8 where possible and as invalid characters otherwise.
You can simply check if strings or chars are valid UTF-8 or not.
You can work with individual characters easily, even invalid ones.
You can losslessly read and write any string data, valid or not, as strings or chars.
You only get an error when you try to ask for the code point of an invalid char.

Most Julia code that works with strings is automatically robust with respect to invalid UTF-8 data. Only code that needs to look at the code points of individual characters will fail on invalid data; in order to do that robustly, you simply need to check if the character is valid before taking its code point and handle that appropriately.

StefanKarpinski · July 18, 2020, 8:33pm

On a lighter note, this is a really fun way to generate random valid UTF-8 strings of printable characters:

julia> filter(c -> isvalid(c) && isprint(c), readline("/dev/random"))
"_BSȔ 3p=i?gZ;C3w,h!̃"

julia> filter(c -> isvalid(c) && isprint(c), readline("/dev/random"))
"PgJjov@-ѾR%K{eK1Cm]j]<3Ia]<_<z/O`Z;MbRZU8HKK^(_{<"

julia> filter(c -> isvalid(c) && isprint(c), readline("/dev/random"))
"<L߫hEj,i-/ۊ\\Kgk}`_w<\"|(uYj-s?t(̷j6YvU:î`\\.\$;h~v!*gk>nxi<*j/\$~&-<^t=0^y8:rsWPe{60R9G;7Sͭw#̡(%38+lV@Y@[^;=٘ZOPϙeYռ*Ia,R,;<f3C"

julia> filter(c -> isvalid(c) && isprint(c), readline("/dev/random"))
"POi@Xa.<[am(+\"VKAo|CbT>dC{ernsytYB`{gr3eA?}]*O05"

julia> filter(c -> isvalid(c) && isprint(c), readline("/dev/random"))
"<\$Y瓮[ոNnt*YjY8\"}4p-wkt\\\"6-Eɮd@(u<u#/(#\"4@`vev_\$oR<E[Jm8\$cak"

julia> filter(c -> isvalid(c) && isprint(c), readline("/dev/random"))
"j)<bQbr.N"

julia> filter(c -> isvalid(c) && isprint(c), readline("/dev/random"))
"ճ\\^m2he}1')ٚϼ`?DAt)bNhERDFO;k<ט"

Great for making passwords that you can’t type

apo383 · July 18, 2020, 9:52pm

While I empathize with your experience, I don’t think Python is particularly to blame here. In most modern languages, there can be breaking changes, which are now usually relegated to major versions much like Python 2 to 3 (e.g. Julia 0.x to 1, C++17 to 20).

The Python 3 transition was particularly painful in part because Python 2 was extraordinarily successful. Python 2 was basically cemented in the enterprise in the days of ASCII. As Unicode became established, Python 2 was flexible enough and had a good enough package ecosystem that one could deal with Unicode alright. (For years it was also not clear that UTF-8 would become the de facto standard, as opposed to -16 or -32 or UCS-2.) Python 3 was a chance to fix a bunch of lingering pain points and do the Unicode transition at once, but the changes also made it nontrivial to update a lot of code. Enterprise generally doesn’t want to fix something that isn’t broken, so it was a tough transition, and you are feeling some of the aftershocks.

Julia deals better with Unicode largely because it is more recent. It also learned from some of Python’s mistakes, and still has more to learn. Some would say that Julia’s error messages are not universally awesome, maybe even less so than Python’s. You can also expect some breaking changes with Julia 2.0, and there’s a lot of orphaned code from Julia <0.7 that never made the transition to 0.7/1.0 (note 1.0 is basically a re-branding of 0.7). It is with some recent pain that Julia has become as awesome as it is.

Your frustration with Python Unicode is understandable, but I would expect similar frustrations in any major version change in any language, especially newer ones. I also prefer Julia to Python, but Python has been and still is extraordinarily successful, and helped usher in the open source revolution.

ToucheSir · July 19, 2020, 1:10am

To be a little contrarian here, Python 2 to 3 could have been much smoother in spite of the fundamental breaking changes for more reasons than enterprise inertia.

(Just a disclaimer: all of the following is very much stated with the benefit of hindsight. This can be evidenced by the thought put into migrations and backwards compatibility in newer languages such as Rust and Julia. If the positions had been reversed, perhaps we would be lauding Python for pulling off a successful transition between major versions)

With that out of the way, I’d agree that Python was a victim of its own success. However, that extends beyond just widespread usage. For one, pip’s dependency compatibility management was (and remains to this day) absolutely lackluster. Bundler was providing such a service in the Ruby ecosystem back in 2010, so it’s a bit of a head-scratcher as to why those ideas weren’t quickly copied to aid in the (still new) Python 2 to 3 migration. In fact, I’d argue it would have been easier to introduce a Pipenv or Poetry then because pip hadn’t yet been integrated into the standard Python distribution.

Stepping back a bit, it feels like Python didn’t commit enough to the migration. If the core ecosystem had doubled down and torn off the band-aid, there might have been more motivation to come up with grander and more novel ideas such as per-module versioning, C API versioning, shims with deep interpreter support (more powerful than six), JS-style “use strict” for __future__, Rust-style editions, etc. Heck, maybe we would’ve seen wheels and declarative package manifests come up 10 years earlier once it became clear that
a) the core team and co. are putting their full weight behind this happening now, and
b) that they will do whatever it takes in terms of creative solutions and boots-on-the-ground support to craft a smooth migration path.
Instead, the core ecosystem waffled on supporting this wholesale (to wit, eternal Python 2.7) and only threw out a few crumbs (2to3 and eventually six) for those looking to take the plunge.

To your final point, Python has remained an open source juggernaut in spite of all these historical shortfalls. However, so have its circa 2010 contemporaries Ruby and (Node.)JS! As such, I don’t agree that all of the struggles in the 2 to 3 saga were a foregone conclusion.

Tamas_Papp · July 19, 2020, 7:18am

To be fair, Julia has quite a few tiny band-aids in sensitive spots, let’s see how many we will be willing to tear off for 2.0. It is always tempting to Do It Right, but in practice there are inevitably trade-offs.

tamasgal · July 19, 2020, 8:14am

While I share your thoughts and the str/bytes disaster is totally annoying, I’d say that your example with the call to readline() is a bit misleading, as open() defaults to text mode and will try to decode the data with the default encoding, which is typically UTF-8 in Python 3. You can pass "rb" to open() to at least get access to the raw data and then use whatever you like to check the UTF-8 validity.

In [1]: io = open("/dev/random", "rb")

In [2]: line = io.readline()

In [3]: line  # truncated manually
Out[3]: b't"\x0c\x0b?)oW\xa5\xad\x04\xfa\xb7p\xb....4\x08{\xf7A\xabS\x13\xed\n'

In [4]: line.decode("utf-8")
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-4-f9dbfeea92bc> in <module>
----> 1 line.decode("utf-8")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 8: invalid start byte

So it reports a helpful message UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 8: invalid start byte but yes, one needs to go this way, which might confuse a lot of people.
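For illustration, the "read bytes, then check" route can be wrapped in a small helper. The name is_valid_utf8 is made up for this sketch; Python has no built-in with that name, and this is only a rough counterpart to Julia's isvalid.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict decoding raises on the first bad byte
    except UnicodeDecodeError:
        return False
    return True


good = "naïve".encode("utf-8")
bad = b't"\x0c\x0b?)oW\xa5\xad'  # prefix of the bytes shown in the session above

good_ok = is_valid_utf8(good)
bad_ok = is_valid_utf8(bad)
```

Note that this only answers valid or not for the whole buffer; unlike the Julia session earlier in the thread, it does not let you keep working with the individual invalid characters.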

Of course the isvalid() function in Julia is nice and its error reporting is superior.

rfourquet · July 19, 2020, 12:37pm

I don’t know much about strings, but your example made me think of this blog post Parse, don’t validate, which is written in the context of Haskell and static languages, but is certainly applicable more widely. Applying its ideas to this string case would probably imply having two string types, e.g. ValidatedString which contains only valid UTF8 and UnvalidatedString. Of course this is annoying and defining only one is simpler, but which one to choose?!

ToucheSir · July 19, 2020, 2:18pm

For sure, I should clarify that this is more a matter of attitude than technical implementation. Python wanted to fix their string/encoding model without putting much effort into creating a good transition layer from old to new. They also moved and renamed APIs without adding intermediate aliases. It felt like the worst of both worlds because the impact was comparable to a radical language redesign (i.e. no easy migration path), but many fundamental limitations like the GIL still remained. This is what I mean by ripping off the band-aid: wanting folks to migrate without providing compelling features (as mentioned, many didn’t care much about string encoding) with even less library/tooling support signals “all pain, no gain”. If more time and resources were fed into other compelling back compat breaking fixes/features or more back compat, I’d argue that users would notice and feel more compelled to attempt a migration.

Oscar_Smith · July 19, 2020, 2:21pm

Note that the solution of two separate types is exactly what Python does. str is validated, bytes isn’t.

StefanKarpinski · July 19, 2020, 5:33pm

Of course you can handle invalid data in Python, but there are significant issues.

Most code won’t do this because working with str is the default and far more convenient. When invalid data is not the norm, this ends up being a land mine, just waiting for some CSV file that’s got one stray mojibake entry, or an HTTP server that’s misconfigured, or a file that someone saved in Latin-1 by accident.

Suppose you’re a responsible Python programmer and you do use the raw bytes (rb) flag when opening files (and sockets and anything else that might get data from the outside world) and you check for UTF-8 validity of all your input. What then? How do you handle the invalid case? You have three choices:

1. Throw an error. In that case you might as well just use str and let Python throw the error for you. Your program isn’t robust against invalid data.
2. Add a second code path that handles invalid data separately. Possible ways to handle it are to replace invalid data with replacement characters, scrub out the invalid data entirely, or mirror the entire logic of handling strings with bytes objects instead of str objects.
3. Don’t use str at all, just do everything with bytes. If you’re going to do 2 with the last option of mirroring the logic, then you might as well just not use str in the first place and use bytes everywhere. This is inconvenient, but at least it won’t die on invalid data.

The first case is not great because the code isn’t robust (fails on some inputs). The middle case where you replace invalid characters with replacement characters and then work with strings is ok, since at least you get the convenience, but it’s not super general because it’s lossy; it would also be better if the language just helped with that instead of making you do this annoying “raw bytes” dance before reading any string that might potentially be invalid. The last option is not great either because in that case, why even have the str type?
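As a rough illustration of the middle option, the "replace" and "scrub" variants correspond to the standard errors argument of bytes.decode. The sample bytes are invented for this sketch; the point is that both variants produce a usable str but cannot be reversed.

```python
# Invented sample: "café" saved as Latin-1, so 0xe9 is not valid UTF-8 here.
raw = b"caf\xe9,ok\n"

# Replace each undecodable byte with U+FFFD REPLACEMENT CHARACTER.
replaced = raw.decode("utf-8", errors="replace")

# Scrub undecodable bytes out entirely.
scrubbed = raw.decode("utf-8", errors="ignore")

# Both are lossy: encoding the result does not give back the original bytes.
replace_roundtrips = replaced.encode("utf-8") == raw
scrub_roundtrips = scrubbed.encode("utf-8") == raw
```

Here replaced is 'caf\ufffd,ok\n' and scrubbed is 'caf,ok\n', and neither round-trips, which is the lossiness mentioned above.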

Here’s the thing: being so fussy about invalid UTF-8 data is entirely unnecessary. Most code can just skip over anything invalid — UTF-8 is self-synchronizing, so if you’re looking for some valid substring it can never match invalid data. In other words if you’re trying to parse a CSV and there’s some invalid data in a field, it can never look like a comma, double quote or newline, so all of the CSV-parsing logic is unaffected. A working CSV parser would, if allowed to do its thing, automatically parse out the invalid data as a CSV field containing an invalid string. The only time you really need to raise an error is when the program needs to get the code point of an invalid character—because that’s not well-defined—or transcode invalid data to some other encoding—also not well-defined (although some well-formed but invalid UTF-8 can be reasonably transcoded, such as WTF-8).
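The CSV claim can be sketched in Python at the bytes level. The row below is invented, and a bare split on commas stands in for a real CSV parser (it ignores quoting); decode_field is a hypothetical helper, not part of any library.

```python
# Invented row: field 2 holds a Latin-1 byte, field 3 a stray 0xff byte.
row = b"name,caf\xe9,x\xff y,42"

# Invalid bytes and all bytes of multi-byte UTF-8 sequences are >= 0x80,
# so none of them can be mistaken for the ASCII comma (0x2c).
fields = row.split(b",")


def decode_field(field: bytes):
    """Return str when the field is valid UTF-8, else keep the raw bytes."""
    try:
        return field.decode("utf-8")
    except UnicodeDecodeError:
        return field


parsed = [decode_field(f) for f in fields]
```

The field boundaries come out right even though two fields contain invalid data; only those two fields need special handling, and only if something later asks for their code points.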

Oscar_Smith · July 19, 2020, 5:55pm

I think that scripts failing on invalid input is perfectly good behavior. There’s a reason Julia throws a domain error on sqrt(-1). Most times when files have invalid UTF-8 characters it’s because the files aren’t UTF-8. Knowing that is really useful since then you can actually parse it with the correct character set instead of silently ignoring the “invalid” data.

purplishrock · July 19, 2020, 6:04pm

exactly right, I have had this happen to me several times. Test equipment putting unicode for “copyright” in their generated csv files for example.

of course, I solved the problem by writing my test equipment interface and data processing code in julia

StefanKarpinski · July 19, 2020, 6:32pm

That’s a fair point. However, when sqrt gets a negative value, there’s no question about whether that value was meant to be negative or not. On the other hand, it’s impossible to mechanically distinguish between “this data wasn’t meant to be UTF-8 at all” and “this data was meant to be UTF-8 but isn’t quite”. It takes near-human judgement to make that evaluation. Occasionally you’ll have data that’s valid if interpreted as UTF-16 or UTF-32 but not as UTF-8, but it’s pretty rare since \0 bytes are valid. If you can assume no embedded \0 bytes then you can much more easily distinguish them. And of course Latin-1 looks almost exactly like UTF-8 except for the occasional code point above 127. Practically, Python 3’s string design makes it extremely common for programs to work until some stray invalid byte occurs and then crash. In theory this is “correct behavior” but it’s really not great in practice.
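The Latin-1 ambiguity can be seen with a single character, borrowing the copyright sign from the test-equipment anecdote above. This is only a sketch of the two failure directions, not a detection method.

```python
copyright_sign = "\u00a9"  # the copyright sign

as_latin1 = copyright_sign.encode("latin-1")  # one byte: b'\xa9'
as_utf8 = copyright_sign.encode("utf-8")      # two bytes: b'\xc2\xa9'

# Direction 1: Latin-1 data read as UTF-8 fails loudly.
try:
    as_latin1.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

# Direction 2: UTF-8 data read as Latin-1 never fails, it just yields mojibake,
# because every byte value is a valid Latin-1 character.
mojibake = as_utf8.decode("latin-1")
```

Decoding as Latin-1 cannot raise, so "try the next encoding until one works" does not tell you which encoding the author intended.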

sijo · July 19, 2020, 7:13pm

I would add that sqrt(-1) is closer to the “program needs to get the code point of an invalid character” case, where we agree that an error must be thrown.

I will abuse the comparison the other way: if Python 3 behaved with numbers as it does with strings, it would throw an error when any array element is set to a negative value, because it could be used with sqrt and that would be invalid. It’s of course an exaggeration, but not completely wrong: the point is that strings are useful containers for byte sequences that include invalid UTF-8, just like arrays are useful containers for number sequences that include negative numbers.

Mason · July 22, 2020, 8:05pm

For what it’s worth, I don’t really feel that it’s obvious that things like sqrt(-1) and sin(Inf) should error. We often end up paying a great runtime cost for that because it makes various optimizations much harder to do, since throwing an error is a side effect. Even in cases where we get the optimizations working, it becomes a whack-a-mole process.

I think I’d rather these functions just return a NaN and we just develop better tooling for detecting where a given NaN was produced. This is the whole reason NaN exists and is a floating point number, after all.
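The same two behaviours can be sketched in Python, where math.sqrt raises on negative input much as Julia's sqrt does. sqrt_or_nan is a hypothetical wrapper written for this illustration; it says nothing about the optimization cost discussed above.

```python
import math


def sqrt_or_nan(x: float) -> float:
    """Return NaN instead of raising for inputs outside the domain (hypothetical)."""
    return math.sqrt(x) if x >= 0 else math.nan


# Erroring style: the failure is reported at the point where it happens.
try:
    math.sqrt(-1.0)
    raised = False
except ValueError:
    raised = True

# NaN style: no exception, the NaN flows through later arithmetic.
result = sqrt_or_nan(-1.0) + 1.0
```

With the NaN style the origin of the bad value is no longer visible at the point where it finally matters, which is why tooling for tracing NaNs would be needed.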

moble · July 23, 2020, 8:45pm

Though I am tempted to join the rant about this particular python pain point, I worry that piling on would be a bit myopic, so I’ll point out the real problem in @johnh’s experience: the reliance on system python — and more specifically, python’s failure to make johnh’s life easy in that regard. (This amplifies something @ToucheSir touched on.)

One of the python community’s biggest problems is understanding the importance of environments. RHEL needs its own version of python, and it has to be stable, and it has to be thoroughly tested for the good of the OS — but you should not be using it. You really need to install python via a separate environment manager if you want to use python.

Whereas Pkg.jl appears to have ~solved this problem in julia, python still has a large and ever-changing set of tools that don’t work well together. Don’t get me started about the nightmare that is python’s build system (even for plain python packages, never mind when compiling C or Fortran). But just look at the situation with environment managers: virtualenv, pyenv, pyenv-virtualenv, virtualenvwrapper, pyenv-virtualenvwrapper, venv, pyvenv, pipenv, and conda.

Python’s web developers have pipenv as the new shiny thing — designed to work like npm, which is fair enough for them — but whose developers are openly and explicitly hostile to the different needs of scientists, and frequently to the scientists themselves. The scientific python community has conda (and importantly, conda-forge), which works really well as far as it goes, but is still hobbled by the build system and the rest of the ecosystem.

Ideally, johnh would have had conda installed already and used it to load the environment.yml file the colleague provided along with the script, which would have just worked (by creating an environment with python 2 and compatible versions of any other packages). It’s not their fault that this isn’t the workflow everyone already knows and uses by default; it’s a failure of the python community.

I know it’s been said a million times, but this is one of Julia’s greatest strengths: its developers understand that a language is more than just the code; it is also the tooling that makes it easy to develop and run the code, and the community one finds when trying to make sense of it all.

johnh · July 23, 2020, 9:02pm

@moble Now it can be told. At a certain point in my career I worked for a very publicly known company.
We bought the second SGI Ultraviolet supercomputer in the UK. I never have forgiven Stephen Hawking for getting the first one. This machine cost in the high six figures… it performed flawlessly.
Till I was asked to update the system version of bash on that machine.
What would go wrong you ask? What indeed?
Till the next time I rebooted the machine and it failed to boot…
The boot sequence had been tested with the supplied bash version.

Thank goodness I had the moxie to boot this $$$ machine from a USB stick and reinstall the original bash version.
Boot. Hold breath and… breathe…

Liso · July 24, 2020, 8:36am

StefanKarpinski:

Suppose you’re a responsible Python programmer and you do use the raw bytes (rb) flag when opening files (and sockets and anything else that might get data from the outside world) and you check for UTF-8 validity of all your input. What then? How do you handle the invalid case? You have three choices:

1. Throw an error. In that case you might as well just use str and let Python throw the error for you. Your program isn’t robust against invalid data.

2. Add a second code path that handles invalid data separately. Possible ways to handle it are to replace invalid data with replacement characters, scrub out the invalid data entirely, or mirror the entire logic of handling strings with bytes objects instead of str objects.

3. Don’t use str at all, just do everything with bytes. If you’re going to do 2 with the last option of mirroring the logic, then you might as well just not use str in the first place and use bytes everywhere. This is inconvenient, but at least it won’t die on invalid data.

Python offers the surrogateescape error handler for encoding and decoding. It is used, for example, to handle POSIX file names. See:

$ cd /tmp
$ mkdir $(echo -e '\xa5')    
$ cd $(echo -e '\xa5')
$ pwd
/tmp/�
$ python
>>> import os
>>> os.getcwd()  # this returns a str, with the undecodable byte escaped
'/tmp/\udca5'
>>> os.getcwdb()  # this returns bytes
b'/tmp/\xa5'

You are right. But you could get strings from invalid UTF-8 input in python too:

>>> with open('/dev/random', errors='surrogateescape') as f:
...     f.readline(20)
...
'\udce8\udcc3K\udcd7\udce9\udcfde\udced\udcd7A\udcbc\udcc1\x04\udc94v\udc9f\udcf2\udc93>\x7f'
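A short sketch of why surrogateescape is interesting for this discussion: unlike the replace or ignore handlers, it is reversible. The bytes reuse the directory name from the shell session above.

```python
raw = b"/tmp/\xa5"

# Undecodable bytes become lone surrogates in the range U+DC80..U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")

# A strict encode still refuses the escaped data.
try:
    text.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```

So the str can be passed around and written back losslessly, but any code path that encodes without the handler (printing to a UTF-8 terminal, for instance) can still raise.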
