Julia's UTF-8 handling [vs. Python 3.7's new UTF-8 mode, PEP 540]

Yes, by something other than readline() (or with some package). I did a test with files I got from here:

https://bitbucket.org/site/master/issues/5648/utf-16-little-endian-files-are-downloaded

[Note how their issue would have been avoided with autodetection of BOM.]

My point is really that you’re making assumptions anyway (and detecting the BOM means making fewer of them), and this is exactly what happens for UTF-16 now: you truncate (and possibly garble, as shown here for the text in quotes; look at the other file to compare):

$ cat Downloads/UTF-16-text |julia -e 'readline(); print(readline());'
# Use a utility like Butler to bind to a separate keystroke, such as "%#�!#I".

$ cat Downloads/UTF-16-text |julia -e 'print(readline());' |xxd
00000000: fffe 2300 2000 5300 6300 7200 6900 7000  ..#. .S.c.r.i.p.
00000010: 7400 2000 7400 6f00 2000 6300 6c00 6f00  t. .t.o. .c.l.o.
00000020: 7300 6500 2000 7400 6800 6500 2000 5300  s.e. .t.h.e. .S.
00000030: 6100 6600 6100 7200 6900 2000 5700 6500  a.f.a.r.i. .W.e.
00000040: 6200 2000 4900 6e00 7300 7000 6500 6300  b. .I.n.s.p.e.c.
00000050: 7400 6f00 7200 2e00 0a                   t.o.r....

Note the missing “00” at the end that would be needed to still make it valid UTF-16.

These are insidious errors (they can be avoided, but most people won’t avoid them, precisely because they are insidious). Superficially the UTF-16 looks like it got through unharmed (apart from the garbage in quotes at the end); Perl does exactly the same, while Python 2 and 3 error out. But it only looks ok because the shell happens to render it that way (I wouldn’t trust that, or know what happens on Windows).
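You can see the same from inside Julia, without relying on the shell to render it. A quick check, assuming the same UTF-16-text file downloaded from the issue above (path as in my example):

line = open(readline, "Downloads/UTF-16-text")
isvalid(line)          # false: the BOM and the 0x00 bytes passed straight through
codeunits(line)[1:4]   # 0xff 0xfe 0x23 0x00 — raw UTF-16LE silently kept inside a String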

If you look at the PEP (tables at the bottom), that’s exactly what Python does (in current 3.6): it uses UTF-8/surrogateescape on Windows for e.g. sys.stdin and sys.stdout, but not for open().

The PEP extends that divergence to Linux (and elsewhere); that is, it does NOT keep using the same handling for files as for sys.stdin and sys.stdout (in “UTF-8 Mode or POSIX locale”, as of Python 3.7).

An FTFY package would be great (or just a wrapper around this one).

I see it’s about other stuff (much more elaborate than what I propose, at least for now; see the line I point to):

https://github.com/LuminosoInsight/python-ftfy/commit/5b7b510995dfa8c7a030911daa112d7634524c73#diff-bd29ad198ce9a55553a5f2c82fc1e65aR187

What I’m proposing fixes handling of correct UTF-16 files (without harming support for any valid UTF-8 file, and not really other files either), and it even applies to the ASCII subset of those. FTFY is about garbled files.

Unless I misunderstand, String has nothing to do with files (it seems only file I/O, readline() etc. would need to change, not really the String type per se). And UTF-16 files are still streams of bytes; they just come in pairs.

Once you’ve detected that (unless you explicitly want to disable the proposed detection), I see no harm in reading the file in pairs of bytes, and you would still end up with the same String type… That would be a minimal change to Julia’s Base.

All legal UTF-16 files (as all files with a BOM should be) would get through. It seems we could also read the odd trailing byte of illegal UTF-16 files, if we wanted to, without an exception.
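To make it concrete, here is a minimal sketch of what I mean, in terms of functions Base already has (read, reinterpret, transcode); the function name and the exact policy are only illustrative, not a worked-out API:

function read_autodetect(io::IO)
    data = read(io)                                        # raw bytes, as today
    if length(data) >= 2 && data[1] == 0xff && data[2] == 0xfe
        # UTF-16LE BOM; a real version would also handle 0xfe 0xff (big-endian)
        isodd(length(data)) && (data = data[1:end-1])      # tolerate a stray odd byte instead of throwing
        units = Vector{UInt16}(reinterpret(UInt16, data))  # pairs of bytes (assumes a little-endian host)
        return transcode(String, units[2:end])             # drop the BOM, convert to the usual String
    end
    return String(data)                                    # no BOM: bytes as-is (UTF-8/ASCII), exactly as now
end

The conversion itself is already in Base; the only new behavior is looking at the first two bytes.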

Even mixing such strings (every other byte likely 00) with valid UTF-8 strings (e.g. by concatenation) wouldn’t be worse than what we have now.
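For example (current behavior; the bytes below are made up for illustration):

garbled = String(UInt8[0xff, 0xfe, 0x48, 0x00, 0x69, 0x00])  # "Hi" as raw UTF-16LE bytes
mixed   = garbled * " plus valid UTF-8: Páll"
isvalid(mixed)                                               # false, but nothing throws anywhere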


In many cases we would only have a group of either UTF-8 files or UTF-16 files, but as I said, mixing both would already be better than now, and it could be made better still…

Although I don’t think the current architecture of String and Char in master is good, I do think that Steven is correct, and that things like allowing for replacement schemes when converting, handling invalid strings, or things like UTF-8 variants (such as the overlong encodings used by Java, or CESU-8 encoding of surrogate pairs), as well as handling other character set encodings (and opposite-endian encodings of UTF-16/UTF-32), belong in packages, such as the one I’m making: JuliaString/Strs.jl (a string support package for Julia).

It’s good that you’re exploring string issues in a package.

You are aiming for something more than I am (at least with this proposal), e.g. “O(1) indexing to characters, not just code units [and] China’s official character set, GB18030”.

I started this thread with “drop for byte indexing into strings”. Do you agree? It’s a nice syntax to have, but not for byte-indexing, as then implementation details (that e.g. conflict with yours) are exposed in the API.

Hopefully we could reintroduce it later so that e.g. “Páll”[3:3] can get the third letter of my name, and in general gets the SubString for a grapheme cluster (like Swift does).
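Something close is already possible with the Unicode standard library, just without the bracket syntax (a sketch; the “Páll”[3:3] behavior itself is the part that doesn’t exist today — right now it throws, since 3 is the middle byte of ‘á’):

using Unicode
s = "Páll"
collect(graphemes(s))[3]   # "l": the third user-perceived character, as a SubString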

Getting support for UTF-16 files also seems doable, and if not now, what to do for Julia 1.x except not use Julia’s default file I/O?

read and write can be extended in Strs.jl to take keywords for character set encodings, I believe, and you will be able to do read(file, UTF16Str) to specify the return type desired.

I think it’s still useful for people wanting to deal with a multi-codeunit character set encoding, such as UTF-8, UTF-16, GB 18030, etc.

I mean e.g. “Páll”[3:3] is an “alias” for getindex(“Páll”, 3:3), and you (or anyone implementing a new type, or assuming UTF-8 encoding of the default strings) need to be able to do the latter; for Base it’s ok to have this, just not exported.
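Concretely, with today’s byte-based indexing (the indices are UTF-8 byte offsets, which is exactly the detail I’d rather keep out of the exported API):

s = "Páll"
s[2:2] == getindex(s, 2:2) == "á"   # true: both spellings are the same operation; 2 is the byte where 'á' starts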

But does anyone, just using strings, need the former (there’s also SubString, which you should probably be using)? If everyone keeps using byte addressing (and probably expecting a SubString), you can’t reuse that syntax later.

My view is that processing should not be done on multi-codeunit encodings, or other character set encodings other than Unicode. That’s why I came up with the UniStr type in Strs.jl.
Only input and output conversions and validation are really necessary for types like UTF8Str, UTF16Str, or the current String.

FYI: There’s another PEP (I highlighted text I found interesting):

On Mac OS X, iOS, and Android, many components, including CPython, already assume the use of UTF-8 as the system encoding, regardless of the locale setting. However, this isn’t the case for all components, and the discrepancy can cause problems in some situations […]

This approach aims to make CPython behave less like a locale-aware application, and more like locale-independent language runtimes like those for Go, Node.js (V8), and Rust

[…]

The current design means that earlier Python versions will instead retain their default strict error handling on the standard streams, while Python 3.7+ will consistently use the more permissive surrogateescape handler even when these locales are explicitly configured (rather than being reached through locale coercion).

Dropping official support for ASCII based text handling in the legacy C locale

We’ve been trying to get strict bytes/text separation to work reliably in the legacy C locale for over a decade at this point. Not only haven’t we been able to get it to work, neither has anyone else - the only viable alternatives identified have been to pass the bytes along verbatim without eagerly decoding them to text (C/C++, Python 2.x, Ruby, etc), or else to largely ignore the nominal C/C++ locale encoding and assume the use of either UTF-8 (PEP 540, Rust, Go, Node.js, etc) or UTF-16-LE (JVM, .NET CLR).