Julia's UTF-8 handling [vs. Python 3.7's new UTF-8 PEP 540]

TL;DR We’re on the right track; maybe no changes are needed except dropping byte indexing into strings (i.e. making it not exported; it’s not needed/wanted? and is (kind of) redundant with SubString). (I’ve changed my mind about wanting exceptions on illegal UTF-8: I want them out by default, as they already are, with the only possible exception being files with a BOM, which indicate conclusively that your file can’t be UTF-8. That seems to be what Python does as well: “Python 3 raises a UnicodeDecodeError on the first undecodable byte”, but that is for open(), not for e.g. stdin/stdout.)
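(For readers unfamiliar with the issue, here is a minimal sketch of what “byte indexing into strings” means in current Julia, and how SubString overlaps with it; nothing here is a proposed API, just standard Base behavior at the time of writing:)

s = "Páll"
s[1]                 # 'P': string indexing is by byte (code unit), not by character count
# s[3]               # would throw StringIndexError: byte 3 is inside the two-byte 'á'
nextind(s, 1)        # 2: the next valid character index after 1
SubString(s, 1, 2)   # "Pá": a view specified by the same kind of byte indices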

Note, Julia has binary file support, but no option to read in text mode (unlike most other languages). Maybe that influences what we should or could do, and we need not follow that PEP as is.

I think we should learn as much as we can from Python and this new PEP, and use their non-default option (I think they would have wanted it as the default). I also bring this up because of PyCall.jl. There may be implications there(?): does it/Python need to be configured to fit Julia?

From the PEP (we do away with the “default strict error handler”, as we should, and as seems appropriate(?) for us):

"When decoding bytes from UTF-8 using the default strict error handler, Python 3 raises a UnicodeDecodeError on the first undecodable byte. […] Python 3 already has a solution to behave like Unix tools and Python 2: the surrogateescape error handler (PEP 383). It allows processing data as if it were bytes, but uses Unicode in practice; undecodable bytes are stored as surrogate characters.

UTF-8 Mode sets the surrogateescape error handler for stdin and stdout, since these streams are commonly associated to Unix command line tools.

However, users have a different expectation on files. Files are expected to be properly encoded, and Python is expected to fail early when open() is called with the wrong options, like opening a JPEG picture in text mode."

I’ve been losing sleep over [Julia’s] UTF-8, how to best support it, or I should say the illegal superset of it (illegal bytes, overlong or too-short sequences), and e.g. whether we should detect the BOM and then support UTF-16 at least minimally, to read files in (should the file descriptor remember that it opened a UTF-16 file?). It seems it might help some people, i.e. all those adding file-reader support e.g. for CSV, to centralize that logic in one place, benefiting all file readers.
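To make the “centralize that logic” idea concrete, a rough sketch of such a helper in Julia follows. The name sniff_bom! and the Symbol return values are hypothetical, not an existing API, and it assumes a seekable stream:

# Hypothetical helper: detect a BOM at the start of a seekable IO, skip it,
# and return a guess of the encoding (:UTF8, :UTF16LE, :UTF16BE, or :none).
function sniff_bom!(io::IO)
    start = position(io)
    bytes = UInt8[]
    while length(bytes) < 3 && !eof(io)
        push!(bytes, read(io, UInt8))
    end
    if length(bytes) >= 3 && bytes[1:3] == [0xEF, 0xBB, 0xBF]
        seek(io, start + 3); return :UTF8       # UTF-8 "BOM" (EF BB BF)
    elseif length(bytes) >= 2 && bytes[1:2] == [0xFF, 0xFE]
        seek(io, start + 2); return :UTF16LE    # FF FE
    elseif length(bytes) >= 2 && bytes[1:2] == [0xFE, 0xFF]
        seek(io, start + 2); return :UTF16BE    # FE FF
    else
        seek(io, start); return :none           # no BOM: leave the stream untouched
    end
end

sniff_bom!(IOBuffer(UInt8[0xFF, 0xFE, 0x50, 0x00]))   # :UTF16LE

A file-reader package (CSV etc.) could call something like this once, and then decide how to decode the rest of the stream.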

Then I happened to read this PEP when looking into Python 3.7. They must have thought this all through, at least in the context of their language.

A file with a BOM conclusively(?) can’t be ANY 1-byte encoding of legal text (see my Quora question below), at least not ISO 8859-1 (and -15, if not all the variants) nor Windows-1252 (should that be the assumed encoding of illegal bytes?).

We’re already using at least part of their proposed solution. They can’t have it as the default, but since Julia 1.0 is out soon, we can choose to make their non-default option the default for us.

Add a new “UTF-8 Mode” to enhance Python’s use of UTF-8. When UTF-8 Mode is active, Python will:

  • use the utf-8 locale, regardless of the locale currently set by the current platform, and
  • change the stdin and stdout error handlers to surrogateescape.

This mode is off by default, but is automatically activated when using the “POSIX” locale.

Add the -X utf8 command line option and PYTHONUTF8 environment variable to control UTF-8 Mode.

At the bottom of the PEP are tables summarizing the encodings and error handlers; they have, for example (do we have anything similar to these fs* functions?):

  • open(): UTF-8/strict
  • os.fsdecode(), os.fsencode(): UTF-8/surrogateescape
  • sys.stdin, sys.stdout: UTF-8/surrogateescape

“On Windows, the encodings and error handlers are different” [another table], i.e. for os.fsdecode() and os.fsencode().

And then:

  • sys.stderr: UTF-8/backslashreplace


https://www.quora.com/Is-the-byte-order-mark-BOM-as-code-FE-FF-or-FF-FE-code-confusable-with-part-of-any-word-from-any-language-in-any-non-UTF-16-encoding-8-byte-one-e-g-EBCDIC-variant-or-East-Asian-one-It’s-e-g-ЧЪ-or-ЪЧ-in-Russian-using-KOI8-R-encoding


This is an additional mode, not the default one; it depends on the locale, and is only part of an alpha version of Python 3.7.

The way it coerces the locale in certain cases is, I have a feeling, not going to be very well received around the world. For example, in China, GB 18030 is the official character set encoding. It is considered an encoding of Unicode (all Unicode characters can be encoded, and all official GB 18030 characters are either in Unicode or assigned Unicode Private Use Area code points by the Chinese standards body while awaiting assignment of regular code points by the Unicode.org standards body), but it is much more efficient than UTF-8 or UTF-16 for encoding Chinese text.

This sort of unsafe encoding (see the Unicode and W3C recommendations about handling invalid sequences) could be handled, if really wanted, in the Strs.jl framework.

To answer my own question about [an exception] on UTF-16 “files with BOM”, or reader support: it seems non-trivial, e.g. when a file has an odd number of bytes. There seems to be no good option for still reading such files (they’re not meaningful as UTF-16, but could in theory be a legal garbage-text file in some 1-byte encoding).

Do we want to reintroduce UTF-16 support, with an exception possibly thrown (or is there a workaround for that?):
https://github.com/JuliaArchive/LegacyStrings.jl/blob/master/src/utf16.jl#L226

What do other languages do, e.g. Python (when asked to read UTF-8 and the file has a BOM; and what about a UTF-8 “BOM”?)?

A “UTF-16” file with an odd number of bytes is not UTF-16, regardless of whether it has a BOM. Your application has to decide what it wants to do with such invalid data — it’s not so much that it’s “non-trivial” as that there is no sensible default choice for a package like LegacyStrings (which can handle BOMs in valid UTF-16 just fine) other than to throw an exception.

Python reads the BOM as part of the UTF-8 string, just like Julia’s String(bytes) or read(io, String), unless you explicitly specify the encoding as utf-8-sig.

(This is, of course, trivial to implement on your own if your application needs to read UTF-8 files that start with a BOM, mainly thanks to Microsoft. Other encodings are handled by e.g. StringEncodings.jl. Auto-detecting the encoding is another matter entirely.)
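For instance, the “trivial to implement on your own” part might look roughly like this in Julia (just a sketch; read_utf8_nobom is a made-up name, not an existing function):

# Read a whole file as a String, dropping a leading UTF-8 BOM (EF BB BF) if present.
function read_utf8_nobom(path::AbstractString)
    s = read(path, String)
    return startswith(s, "\ufeff") ? s[4:end] : s   # the BOM is 3 bytes in UTF-8
end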


Right, so this should never happen and not be an issue for a real UTF-16 file.

I was just thinking of the case where you wrongly assume UTF-16 and it’s actually some 1-byte encoding. It seems nearly impossible to support, in that order:

  • UTF-8
  • UTF-16
  • Latin1

As I said, auto-detecting encodings is another matter entirely. It is impossible to do completely reliably. There are various free/open-source packages that attempt an “educated guess”, including ICU (which has Julia wrappers, albeit somewhat unmaintained at the moment); it would be straightforward to port/interface one of them (e.g. cchardet) to Julia if someone wants to do the work.

Nowadays, fortunately, most applications should rarely need to auto-detect encodings. If you just want to pass the text around without knowing the encoding, of course, you can just use Vector{UInt8}.

Not necessarily. This can happen, for example, if a file has been truncated because of file size limitations: if a file was limited to a max size of 2^32-1 or 2^24-1 bytes, you’d get an odd number of bytes.

It could also just be misidentified; for example, you might find a UTF-8 BOM and an odd number of UTF-8 encoded bytes.

It is very useful to have a function that can check for these different cases, and when possible, return a valid string.

Yes, in which case it is not UTF-16. It’s data that was formerly UTF-16 and has now been corrupted. It’s not possible to return a valid UTF-16 string from such corrupted data without some application-specific choice (e.g. to truncate the data).

It is common for things like web pages to be misidentified, which is why validity checking and a bit of auto-detection can be rather important (usually just between UTF-8, UTF-16, ISO-8859-1, and CP-1252, which are usually pretty easy to distinguish).
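For illustration, a crude version of that kind of distinction could look like the sketch below; the thresholds and the 1-byte fallback are assumptions made up for the example, not a robust detector like the cchardet mentioned above:

# Very rough charset guess for raw bytes -- illustration only, not reliable detection.
function guess_charset(bytes::Vector{UInt8})
    length(bytes) >= 2 && bytes[1:2] in ([0xFF, 0xFE], [0xFE, 0xFF]) && return "UTF-16 (BOM)"
    isvalid(String, bytes) && return "UTF-8"              # also covers plain ASCII
    count(==(0x00), bytes) > length(bytes) ÷ 4 && return "UTF-16 (no BOM)?"   # many NUL bytes
    return "ISO-8859-1 or CP-1252"                         # 1-byte fallback guess
end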


Yes, but most applications don’t try to open arbitrary “text” files off the web. I didn’t say no applications need to auto-detect encoding. If you are writing a text editor or a web browser, for example, you obviously need this. Charset detection is useful and important functionality, albeit for specialized use-cases, and it would be great to have a package for this.


Where I’ve seen this problem frequently is with reading CSV or TSV files, and that is something where people often try to load arbitrary text files (they may be old, containing data from some experiment, or whatever).


But truncated UTF-8 is also not correct UTF-8, and we handle it the same as Perl (I thought Perl had the best UTF-8 support, or does Python?). It seems we could do the same for truncated UTF-16 as is done for truncated UTF-8: use the replacement char. That’s what it’s for, isn’t it?

$ echo "Páll" |cut -b1-2 |julia -e 'print(readline())'
P�

$ echo "Páll" |cut -b1-2 |perl -e 'use utf8; my $line = <STDIN>; print $line'
P�

$ echo "Páll" |cut -b1-2 |python3 -c 'print(input())'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 1: invalid continuation byte


$ echo "Páll" |cut -b1-2 |python2 -c 'print(input())'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1
    P�
     ^
SyntaxError: unexpected EOF while parsing

Yes, if you ask for replacements instead of errors, then it should return, with the last (truncated) character replaced.
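In Julia that opt-in “ask for replacements” behavior can be spelled out in one line; a sketch (scrub is a made-up name, not an existing function):

# Replace each malformed-UTF-8 character with U+FFFD -- an explicit, opt-in repair.
scrub(s::AbstractString) = join(isvalid(c) ? c : '\ufffd' for c in s)

scrub("P\xc3")    # "P\ufffd" -- the truncated 'á' byte becomes the replacement character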


I was thinking, as the default support for a BOM, to just throw an exception (but people seem dead set against exceptions as the default), and to postpone supporting UTF-16 until Julia 1.x. But would it be legal to change the whole file into replacement characters (with surrogate pairs counted as one or two)?

Or even the whole file as one?

At least getting replacement characters would be better than interpreting the file as garbage.

Is there any requirement to accept a BOM and thus UTF-16? I know there is for XML (and there’s a library supporting it, and I guess even UTF-16).


Unlike XML, many other formats, maybe most, standardize on UTF-8 only, e.g. JSON, and I’m not sure what should happen if such a file happens to have a BOM and be UTF-16.


EDIT: JSON is fucked up, with even UTF-32 supported; I wonder how much software actually supports that, or anything more than UTF-8 (for JSON or any other file format):

JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8.

To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a 12-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as “\uD834\uDD1E”.


EDIT2 (this is what I must have remembered: “Internet JSON”, which constrains what you output, while you still need to accept the above?):

I-JSON messages MUST be encoded using UTF-8 [RFC3629]

Neither Julia nor Perl actually produced the replacement char, despite it looking that way. It’s only my shell that displays it for the truncated UTF-8 (here the first byte of “á”).

I think most of the Base string devs would be strongly opposed to silently changing data when it is read. (Using the replacement character for display is another matter, but this is partly up to the terminal.)

But truncated UTF-8 is also not correct UTF-8, and we handle it the same as Perl

The exact data is preserved, not silently truncated/transformed. (See also the numerous discussions elsewhere for why/how the String type represents arbitrary byte streams, regardless of whether it is valid UTF-8; applications can opt-in to validate/re-encode if needed.)
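Concretely, a small sketch of that behavior using only Base functions (the byte values match the truncated “Pá” example above):

bytes = UInt8[0x50, 0xc3]     # "P" followed by a stray lead byte of 'á'
s = String(copy(bytes))       # no error: String accepts arbitrary byte data
codeunits(s) == bytes         # true -- the exact bytes are preserved
isvalid(s)                    # false -- validation is available, but opt-in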

Is there any requirement to accept BOM and thus UTF-16?

This is application-specific. A program is free to declare that it only accepts UTF-8 encoded JSON, or XML, CSV, or whatever.

If you have an application that requires automated encoding detection, please work on porting/wrapping a charset-detection library — that would be great. But guessing the encoding or rewriting corrupted data is the sort of thing that will have to be opt-in for specific applications, not the default behavior.


FWIW, JuliaString/ICU.jl supports it already (but nolta/ICU.jl doesn’t).


It would be great to see a port of FTFY to Julia: https://github.com/LuminosoInsight/python-ftfy

You could still support reading UTF-16 (if it comes with a BOM) without changing any bytes (or the byte order). You would just read the bytes in pairs; the only problems I see are if you end up with an odd number of bytes, and actually implementing this (for now I’m just suggesting an exception). That question was just about what the Unicode standard allows.

The BOM (FF FE, or FE FF) is illegal in UTF-8. You could still read it either way, and thereafter only read code units in pairs of bytes. We have a way to iterate through stray bytes. I was thinking, could we encode high bytes differently (similar to the UTF-8b scheme)?

I agree UTF-8 should be our top priority. You seem to want 1-byte encodings too, as the next priority.

Yes, OK. But why wouldn’t you still offer more (on input only, I’m not asking for more on output, and only for files; UTF-16 seems incompatible with STDIN)?

Where you really want to allow only UTF-8 seems to be on output rather than input, though it could be a non-default option on input. Not doing it means reading in garbage for sure, so I’m suggesting an exception for now.


In any case, if you want bytes read (e.g. for a file copy) and to be 100% sure nothing gets changed, there’s readbytes(). I was thinking that in case you read [past a header] that way, you would disable UTF-16 support.

I believe cat is implemented by reading bytes, which is what you could do. Would you really want it to NOT support concatenating a UTF-8 and a UTF-16 file and doing the right thing, by default?
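(A byte-faithful cat is only a couple of lines of Julia, for reference; a sketch that copies raw bytes and so concatenates files of any encoding unchanged:)

# cat-like byte copy: no decoding, so UTF-8 and UTF-16 files pass through untouched.
for f in ARGS
    write(stdout, read(f))
end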

You could still support reading UTF-16 […]

Reading non-UTF8 text can be (and is) supported in Julia. Automated encoding detection also. These are great features! But they are out of scope for the base library, and reading strings in different encodings is (and probably always will be) opt-in, via packages.

But why wouldn’t you still offer more (on input only, I’m not asking for more on output, and only for files […])

Because not all applications want to auto-detect encodings (which is never 100% reliable without additional assumptions about the input) or silently transform/truncate corrupted data or data in an unexpected encoding, and even if they do they may want a different behavior than you. Because handling files one way and other streams another way is not a good default. Because supporting multiple encodings in Base imposes a large amount of additional complexity that is not needed for the other parts of Base, and hence is best left to packages.

(Transcoding UTF-16 to/from UTF-8 is needed in Base to call Win32 API functions, which is why the transcode function is in Base. But for this purpose, the translation needs to succeed even for invalid Unicode data — filenames don’t need to be valid strings. And auto-translating strings from other encodings when a file is read goes far beyond this.)
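As a sketch of using that Base transcode machinery directly on UTF-16 bytes (the BOM stripping and byte-order handling here are done by hand and assume a little-endian host; a real reader would need more care):

bytes = UInt8[0xFF, 0xFE, 0x50, 0x00, 0xE1, 0x00]   # "Pá" as UTF-16LE with a BOM
iseven(length(bytes)) || error("odd number of bytes: not UTF-16")
u16 = collect(reinterpret(UInt16, bytes))           # pair the bytes up into code units
u16[1] == 0xFEFF && popfirst!(u16)                  # drop the BOM (little-endian case)
transcode(String, u16)                              # "Pá": Base converts UTF-16 to UTF-8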

Again, these are not bad features to have, just not in Base, and not as a default. (Lots of important functionality happens in packages. Putting code in a package doesn’t mean it isn’t valued!)
