A Python rant about types

That option is the right concept but it only handles unpaired surrogate code points. So with this option what Python 3 handles reasonably is expanded from UTF-8 to WTF-8 but still doesn’t include the vast majority of ways that string data can be invalid.


According to PEP 383, it is based on UTF-8b.

It seems that this type of solution was analyzed but abandoned in Python. See the explanation from that PEP:

“… , the approach of escaping each byte XX with the sequence U+0000 U+00XX has the disadvantage that encoding to UTF-8 will introduce a NUL byte in the UTF-8 sequence. As a consequence, C libraries may interpret this as a string termination, even though the string continues. In particular, the gtk libraries will truncate text in this case; other libraries may show similar problems.”

(Some security concerns about supporting everything are also described there.)
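For comparison, here is a small sketch of the handler Python ended up with, `surrogateescape`, next to the rejected NUL-based escape from the quote. The file name is made up for illustration.

```python
raw = b"caf\xe9.txt"  # hypothetical Latin-1 file name; not valid UTF-8

# surrogateescape maps each undecodable byte 0xXX to the lone surrogate U+DCXX.
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))  # 'caf\udce9.txt'

# Encoding with the same handler restores the original bytes exactly.
roundtrip = name.encode("utf-8", errors="surrogateescape")
print(roundtrip == raw)  # True

# The rejected U+0000 U+00XX escape would put a NUL byte into the UTF-8 output,
# which is what C libraries may treat as the end of the string.
rejected = "caf\x00\xe9.txt".encode("utf-8")
print(rejected)  # b'caf\x00\xc3\xa9.txt'
```

Note that the escaped string is only safe to encode again with the same handler; a strict `name.encode("utf-8")` raises `UnicodeEncodeError` because of the lone surrogate.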

Python has some self-inflicted problems here. The choice was made when designing Python 3 that O(1) character indexing was of paramount importance. In order to accomplish that while supporting Unicode, they decided on a design where they transcode each input string to a fixed-width encoding: one of Latin-1, UCS-2 or UCS-4 (aka UTF-32), depending on the largest code point appearing in the string. Each string also needs to be transcoded back to UTF-8 again on output, so this is all quite costly and means that any string that isn’t pure ASCII (the intersection of UTF-8 and Latin-1) needs to be transcoded twice. This is fine for small strings, but really bad for processing large text data.
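The width selection can be observed from Python itself. This is a rough sketch that assumes CPython 3.3 or later (other implementations may store strings differently); `bytes_per_char` is a hypothetical helper that estimates the per-character storage from the size difference between two strings.

```python
import sys

def bytes_per_char(ch: str) -> int:
    """Estimate per-character storage (CPython-specific) for a one-character string."""
    n = 1000
    # The fixed object header cancels out in the difference of the two sizes.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

samples = ["a", "\xe9", "\u20ac", "\U0001F600"]  # ASCII, Latin-1, BMP, astral
widths = [bytes_per_char(ch) for ch in samples]
print(widths)  # [1, 1, 2, 4] on CPython
```

A single astral character anywhere in the string is enough to push the whole string to 4 bytes per character.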

This design means that Python 3 has to be able to represent any input string in terms of code points. Which doesn’t work for invalid data, of course. There are some limited ways to represent some kinds of invalid UTF-8, such as UTF-8b (which, as far as I can tell isn’t any kind of standard, just something that some Python people cooked up) or WTF-8 (which at least has a published specification). But fundamentally needing to turn every string into a fixed-width sequence of code points puts them in a tough position with respect to invalid strings where there is simply no corresponding sequence of code points.
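To make the tough position concrete, here is a minimal sketch with made-up input bytes: strict decoding refuses the data outright, and the `replace` handler accepts it only by throwing the original byte away.

```python
data = b"valid \xe2\x82\xac then invalid \xff"  # illustrative input

# Strict decoding: there is no code point for the byte 0xFF, so this fails.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lossy decoding: 0xFF becomes U+FFFD and the original byte is gone for good.
lossy = data.decode("utf-8", errors="replace")

print(strict_ok)                      # False
print(lossy.endswith("\ufffd"))       # True
print(lossy.encode("utf-8") == data)  # False
```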

Why is this easier in Julia? We don’t insist on O(1) character indexing, instead using O(1) byte indexing (but returning characters). This means that string data doesn’t need to be transcoded into code points, it can just be left as-is. The only thing you need is a definition of what constitutes an invalid character (we follow the Unicode spec) and a way of representing invalid characters as Char values (we just leave the bytes as-is). With this you can just leave invalid data alone and only need to error if the user actually asks for the code point of an invalid character.
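The idea can be modeled in Python on top of plain `bytes`. To be clear, this is a simplified sketch and not Julia's actual implementation: `char_at` and `codepoint` are hypothetical names, and the rules for grouping invalid bytes are cruder than what the Unicode spec and Julia use.

```python
def char_at(data: bytes, i: int) -> bytes:
    """Return the raw bytes of the character starting at byte index i.

    The work is bounded (at most 4 bytes are inspected), so it is O(1).
    This sketch does not check that i is a character boundary.
    """
    lead = data[i]
    if lead < 0x80:
        width = 1
    elif 0xC0 <= lead < 0xE0:
        width = 2
    elif 0xE0 <= lead < 0xF0:
        width = 3
    elif 0xF0 <= lead < 0xF8:
        width = 4
    else:
        return data[i:i + 1]  # stray continuation byte or invalid lead byte
    j = i + 1
    # Take only as many continuation bytes as are actually present.
    while j < min(i + width, len(data)) and 0x80 <= data[j] < 0xC0:
        j += 1
    return data[i:j]

def codepoint(ch: bytes) -> int:
    """Return the code point; raises UnicodeDecodeError only for invalid characters."""
    return ord(ch.decode("utf-8"))

data = b"a\xe2\x82\xac\xffb"  # "a", the euro sign, an invalid byte, "b"
print(char_at(data, 1))                  # b'\xe2\x82\xac'
print(hex(codepoint(char_at(data, 1))))  # 0x20ac
print(char_at(data, 4))                  # b'\xff', kept as-is with no error
```

Nothing fails until `codepoint(char_at(data, 4))` is called, which mirrors the point above: invalid data can be carried around untouched, and the error is deferred to the moment someone asks for a code point that doesn't exist.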
