TL;DR: We’re on the right track; maybe no changes are needed, except dropping byte indexing into strings (i.e. making it not exported; it’s not needed/wanted(?) and (kind of) redundant with SubString). (I’ve changed my mind about wanting exceptions on illegal UTF-8: I want them gone (as is already the case) by default, with the only possible exception being files with a BOM that conclusively indicates the file can’t be UTF-8. That seems to be what Python does too: “Python 3 raises a UnicodeDecodeError on the first undecodable byte”, but that is for open(), not for e.g. stdin/stdout.)
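To illustrate what I mean (a quick sketch of current behavior as I understand it; `codeunit`/`codeunits` are what I mean by byte access, and `SubString` covers slicing):

```julia
s = String([0x61, 0xff, 0x62])  # construct a String from raw bytes; no exception
isvalid(s)                      # false: 0xff can't start a UTF-8 sequence

# Byte-level access without indexing the String itself by bytes:
codeunit(s, 2)                  # 0xff
codeunits(s)                    # byte view of the whole string

# Iteration still works; the bad byte comes out as an invalid Char:
[isvalid(c) for c in s]         # Bool[1, 0, 1]

SubString(s, 1, 1)              # "a" — slicing instead of byte indexing
```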
Note that Julia has binary file support, but no option to read in text mode (unlike most other languages). Maybe that influences what we should or could do; we need not follow the PEP as is.
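Concretely, reading today is byte-based, and constructing a String from a file does not validate or transcode (sketch; "data.txt" is just a placeholder):

```julia
# Julia's open/read is effectively "binary mode": you get bytes,
# and the String wrapper does not validate or transcode them.
bytes = read("data.txt")          # Vector{UInt8}
text  = read("data.txt", String)  # String over those bytes, even if invalid UTF-8
isvalid(text) || @warn "data.txt is not valid UTF-8"
```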
I think we should learn as much as we can from Python and this new PEP, and use their non-default option (I think they would have wanted it as the default). I also bring this up because of PyCall.jl; there may be implications there(?): does Python need to be configured to fit Julia?
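E.g., I believe something like this would be the way to opt PyCall’s embedded Python into UTF-8 Mode (untested sketch; PYTHONUTF8 is read at interpreter startup, so it has to be set before `using PyCall`, `sys.flags.utf8_mode` needs Python 3.7+, and the dot syntax needs a recent PyCall):

```julia
ENV["PYTHONUTF8"] = "1"        # must be set before Python is initialized
using PyCall

pysys = pyimport("sys")
pysys.flags.utf8_mode          # 1 when UTF-8 Mode is active (Python 3.7+)
pysys.getfilesystemencoding()  # "utf-8" under UTF-8 Mode
```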
From the PEP (we do away with the “default strict error handler”, as we should, and as seems appropriate(?) for us):
"When decoding bytes from UTF-8 using the default strict error handler, Python 3 raises a UnicodeDecodeError on the first undecodable byte. […] Python 3 already has a solution to behave like Unix tools and Python 2: the surrogateescape error handler (PEP 383). It allows processing data as if it were bytes, but uses Unicode in practice; undecodable bytes are stored as surrogate characters.
UTF-8 Mode sets the surrogateescape error handler for stdin and stdout, since these streams as commonly associated to Unix command line tools.
However, users have a different expectation on files. Files are expected to be properly encoded, and Python is expected to fail early when open() is called with the wrong options, like opening a JPEG picture in text mode."
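If we wanted the same “fail early for files” behavior, it could live in one helper rather than in the default String behavior (hypothetical sketch; `open_text` is not an existing Julia function):

```julia
# Hypothetical: mimic Python's UTF-8/strict open() for files only,
# while stream reads stay permissive as they are now.
function open_text(path::AbstractString)
    s = read(path, String)   # raw bytes, no validation
    isvalid(s) || throw(ArgumentError(
        "$path is not valid UTF-8 (opened in text mode)"))
    return s
end
```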
I’ve been losing sleep over [Julia’s] UTF-8 and how to best support it, or I should say the illegal superset of it (illegal bytes, overlong encodings, or truncated sequences), and e.g. whether we should detect the BOM and then support UTF-16 at least minimally, to read files in (should the file descriptor remember that it opened a UTF-16 file?). It seems it might help some people, e.g. all those adding file-reader support for CSV, to centralize that logic in one place, benefiting all file readers.
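Something along these lines is the kind of centralized BOM sniffing I have in mind (hypothetical sketch; `detect_bom` is not an existing API, and actually transcoding UTF-16 would still need a package such as StringEncodings.jl):

```julia
# Hypothetical BOM sniffer. UTF-32 BOMs are checked before UTF-16,
# since the UTF-32LE BOM begins with the UTF-16LE one.
function detect_bom(bytes::AbstractVector{UInt8})
    prefix(p...) = length(bytes) >= length(p) &&
                   all(bytes[i] == p[i] for i in 1:length(p))
    prefix(0xef, 0xbb, 0xbf)       && return :utf8
    prefix(0xff, 0xfe, 0x00, 0x00) && return :utf32le
    prefix(0x00, 0x00, 0xfe, 0xff) && return :utf32be
    prefix(0xff, 0xfe)             && return :utf16le
    prefix(0xfe, 0xff)             && return :utf16be
    return :none
end

detect_bom(read("export.csv"))   # e.g. :utf16le for a UTF-16 CSV export
```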
Then I happened to read this PEP when looking into Python 3.7. They must have thought this all through, at least in the context of their language.
A file with a BOM conclusively(?) can’t be in ANY 1-byte encoding of legal text (see my Quora question below), at least not ISO 8859-1 (and -15, if not all the variants) nor Windows-1252 (should that be the assumed encoding of illegal bytes?).
We’re already using at least part of their proposed solution. They can’t have it as the default, but since Julia 1.0 is soon out, we can choose to make their non-default option the default for us.
From the PEP’s abstract:

> Add a new “UTF-8 Mode” to enhance Python’s use of UTF-8. When UTF-8 Mode is active, Python will:
>
> - use the utf-8 locale, irregardless of the locale currently set by the current platform, and
> - change the stdin and stdout error handlers to surrogateescape.
>
> This mode is off by default, but is automatically activated when using the “POSIX” locale.
>
> Add the -X utf8 command line option and PYTHONUTF8 environment variable to control UTF-8 Mode.
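A quick way to see the flag in action from the Julia side (assuming a python3 ≥ 3.7 on PATH):

```julia
# -X utf8 forces UTF-8 Mode; sys.flags.utf8_mode reports it (Python 3.7+).
run(`python3 -X utf8 -c "import sys; print(sys.flags.utf8_mode)"`)   # prints 1
```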
At the bottom of the PEP are tables summarizing the encodings and error handlers. For UTF-8 Mode they have, e.g. (do we have anything similar to these fs* functions?):

| Function | Encoding / error handler |
| --- | --- |
| open() | UTF-8/strict |
| os.fsdecode(), os.fsencode() | UTF-8/surrogateescape |
| sys.stdin, sys.stdout | UTF-8/surrogateescape |

“On Windows, the encodings and error handlers are different” (the PEP has another table there, e.g. for os.fsdecode() and os.fsencode()).

And then: sys.stderr UTF-8/backslashreplace.
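Regarding the fs* question above: as far as I can tell, Julia’s filesystem APIs already hand back the raw bytes wrapped in a String (there is no decode step to fail), so an os.fsdecode analogue is nearly a no-op for us; the closest missing piece is something like backslashreplace for displaying such strings. A hypothetical sketch (`backslash_escape` is not an existing function):

```julia
# Hypothetical backslashreplace-style escaping for display, in the
# spirit of Python's sys.stderr handler: valid characters pass
# through, invalid code units are shown as \xHH escapes.
function backslash_escape(s::AbstractString)
    io = IOBuffer()
    for c in s
        if isvalid(c)
            print(io, c)
        else
            # A malformed Char prints back as its original raw bytes;
            # escape each one instead.
            for b in codeunits(string(c))
                print(io, "\\x", string(b, base = 16, pad = 2))
            end
        end
    end
    return String(take!(io))
end

backslash_escape(String([0x61, 0xff]))   # -> "a\\xff" (5 characters)
```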