Non-legal UTF-8 filenames (in e.g. TOML or JSON)

A major flaw with JSON is its inability to store non-UTF-8 strings, which inadvertently makes it unreliable for storing Unix filenames (here a filename is defined as any sequence of bytes except NUL and /, including invalid UTF-8).
[…]
Some prior art on this would be Rust’s raw string literal notation and qsn which is a format based on those ideas. It is documented quite nicely here, along with good general coverage of this topic: http://www.oilshell.org/release/latest/doc/qsn.html

ZON strings are no different from Zig string literals in this context, so the answer is: yes, ZON strings store any sequence of bytes, agnostic to their interpretation. The literal in the file must itself be UTF-8 encoded, because Zig sources - and by extension ZON - are always UTF-8 encoded, but you can represent any byte sequence which is not valid UTF-8 by using simple escape sequences (the main relevant one here being \xNN).

shell> touch \xff

julia> touch("\xff")  # The former didn't work, see:

-rw-rw-r-- 1 pharaldsson pharaldsson 0 feb 23 17:23 xff
-rw-rw-r-- 1 pharaldsson pharaldsson 0 feb 23 17:24 ''$'\377'

I can open this file, but let’s say I want to serialize the filename itself, or that Pkg has to handle such a name (is it even possible for that to happen, if generated from within Julia? I suppose you can make such a package/module).

You can do escape_string(filename), and it seems you always need to be careful unless your code or the Julia API already does it. Escaping redundantly does no harm beyond a bit of expansion, but then you must also unescape twice, so we the users need to know whether it has already been done.

Does it happen by default for [filenames] in JSON packages? Probably not; it would mean the package has to know whether a string is potentially illegal UTF-8 (which is always the case for non-validated strings, and filenames can’t be validated). It seems to me Pkg (using TOML) doesn’t escape (or only in one place), since it doesn’t deal with filenames. It’s unclear if TOML does it for you (in other contexts). Python and C++ have good path/filesystem APIs, and there’s talk of something similar in Julia, instead of just naked strings; such an API should then likely handle escaping for you. Strictly speaking, the escaping is needed for all Julia strings… unless they are known to be validated UTF-8.

On Windows with UTF-16, are all even-length byte sequences also valid? And even odd-length ones, like the file above?

I was looking into ZON encoding and whether we should possibly support it (it’s not just for Zig or its package system):

show(io, string) will write it in escaped form (with quotation marks) or print(io, escape_string(string)) if you don’t want quotation marks.
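
A minimal sketch of that escaping round trip, using only functions from Base (the variable names are just for illustration). It also shows the earlier point that escaping twice is harmless as long as you unescape twice:

```julia
name = "\xff"                       # a single byte that is not valid UTF-8

escaped = escape_string(name)       # plain ASCII: backslash, 'x', 'f', 'f'
@assert escaped == "\\xff"
@assert isvalid(escaped)            # now safe for a UTF-8-only format
@assert unescape_string(escaped) == name

# show produces the same escaped form, wrapped in quotation marks
@assert sprint(show, name) == "\"\\xff\""

# Escaping twice only adds backslashes, but must be undone twice
twice = escape_string(escaped)
@assert unescape_string(unescape_string(twice)) == name
```

Note that the reader of the file has to know that the string was escaped, since nothing in the escaped text itself says so.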

No, because \x escaping is not allowed in these formats as I understand it.

  • JSON.jl just writes the raw non-UTF8 bytes to the string. It can parse these back, but JSON parsers in other languages may give an error. e.g. JSON.parse(JSON.json(Dict("foo" => "\xff"))) works.
  • TOML.jl throws an error if you try to output strings with non-UTF8 bytes.

Of course, you could always use a different serialization format. The simplest is to save it as an array of byte values by outputting codeunits(str) instead of str: Dict("foo" => codeunits("\xff")) works just fine with both JSON and TOML.


I assumed it did, but you’re right: JSON (and TOML) doesn’t support arbitrary byte strings, and thus NOT all filenames, without help such as a QSN add-on:

https://www.oilshell.org/release/0.9.9/doc/qsn.html

QSN (“quoted string notation”) is a data format for byte strings […]

It’s an adaptation of Rust’s string literal syntax with a few use cases:

  • To print filenames to a terminal. Printing arbitrary bytes to a terminal is bad, so programs like coreutils already have informal QSN-like formats.
  • To exchange data between different programs, like JSON or UTF-8. Note that JSON can’t express arbitrary byte strings.
  • To solve the “framing problem” over pipes. QSN represents newlines like \n, so literal newlines can be used to delimit records. Oil uses QSN because it’s well-defined and parsable.

It’s both human- and machine-readable. Any programming language or tool that understands JSON should also understand QSN.

It’s a fucking annoying problem: either you output illegal UTF-8 by implicit agreement, violating the JSON standard, or you add QSN, which not everyone knows about.

You want to support all possible filenames, at least in some situations; think of implementing a backup system. You do not want to ignore those files or, worse, throw an error and not back up any files… But even if you support them on Unix with QSN, it’s unclear to me whether you could restore on Windows. I.e. is illegal UTF-8 mappable to some legal or illegal UTF-16, and then transportable back?

julia> length("\xff")
1

julia> length("\uff")  # Not same as above
1

julia> sizeof("\uff")
2

julia> bitstring('\uff')
"11000011101111110000000000000000"

If you have an arbitrary sequence of bytes, it’s arguably not a string at all, especially from the perspective of the format. You should just store it as a Vector{UInt8} via codeunits(str).

That can be argued, yes, Unicode-wise, but in Julia it is (still) currently a String.

That doesn’t help at all, or only makes it explicit that you intend to store illegal bytes too. It stores the same byte string, and it is still unclear to users how you should store it in JSON.

It is common to use base64 encoding if you want to store data in JSON.
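
As a rough sketch of the base64 route, using the Base64 standard library (the filename is made up for illustration): the encoded form is plain ASCII, so it can go into any JSON or TOML string, and decoding gives back the exact original bytes.

```julia
using Base64   # standard library

name = "\xff.txt"                          # not valid UTF-8
encoded = base64encode(name)               # ASCII-only text
decoded = String(base64decode(encoded))    # back to the original bytes

@assert isvalid(encoded)
@assert decoded == name
```

The cost is that the stored value is no longer readable as a filename, and whoever reads the file has to know the field is base64.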


If you have filename = [0x66, 0x6f, 0x6f, 0x2e, 0x74, 0x78, 0x74] it’s pretty clear IMO, but in general a JSON file is not completely self-documenting — in general you need some documentation anyway to tell people what the fields mean.

And it solves the problem of storing an arbitrary filename in a JSON file without violating the standard, while allowing any conforming implementation in any language to read it.
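
A small sketch of the byte-array approach with Base only (no particular JSON or TOML package is assumed here; the variable names are illustrative). The bytes become plain integers, which any conforming writer can emit, and are turned back into a String on the way in:

```julia
# The byte array above is just the name "foo.txt"
filename = UInt8[0x66, 0x6f, 0x6f, 0x2e, 0x74, 0x78, 0x74]
@assert String(copy(filename)) == "foo.txt"   # copy: String() takes over its argument

# The same works for a name that is not valid UTF-8
odd = "\xff"
bytes = collect(codeunits(odd))    # Vector{UInt8}, here [0xff]
ints = Int.(bytes)                 # plain integers for the serializer
restored = String(UInt8.(ints))    # what a reader would do after parsing
@assert restored == odd
```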

Right, and very valid for the contents of (some) files, but it bloats the data and is bad for filenames, making them unreadable. QSN is then better, and arguably for data too. “Ascii85 (also called Base85)” is in most cases a better alternative for binary data; Base64 is just more widely supported, including in Julia (it would be easy to support both).

If you’re worried about this, I would just give an error on !isvalid(filename) — at least in situations where you need to store the filename in a human-readable format like JSON — and tell users to rename their files. Non-Unicode-compatible filenames are likely to cause other problems (e.g. they won’t be portable across filesystems in general), and there’s no real need to support them in a typical application AFAICT (aside from low-level OS utilities).
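
One way such a check could look, as a sketch only: check_filename is a hypothetical helper, not an existing Julia API, and the error message is just an example. It escapes the offending name so the message itself stays printable.

```julia
# Hypothetical helper: refuse names that cannot be stored as UTF-8 text.
function check_filename(name::String)
    isvalid(name) || throw(ArgumentError(
        "filename is not valid UTF-8, please rename it: " * escape_string(name)))
    return name
end

@assert check_filename("foo.txt") == "foo.txt"

# An invalid name is rejected with an ArgumentError
err = try
    check_filename("\xff")
    nothing
catch e
    e
end
@assert err isa ArgumentError
```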


If you construct a filename, with a (future) type for it, then it might be valid to reject non-legal UTF-8 strings (and even to QSN-encode them by default). But if you e.g. read files from your directory, we must support all files there, including non-UTF-8 ones (at least I feel so, by default, as I explained with my backup example).

I marked my comment as the solution, for QSN (it seems the least bad; no solution seems perfect), which I propose Julia adopt when you cast to a String.

It can be input into this proposal:

You’re not just reading filenames. You’re storing them in a configuration file. You get to choose the format of your configuration file. You can:

  • Tell users that your configuration file doesn’t support non-Unicode filenames.
  • Store filenames in JSON or TOML as byte arrays or some other custom encoding.
  • Use a file format (not JSON or TOML) that allows strings that contain arbitrary bytes. Or use non-standard JSON (don’t worry about parsing it from anything but JSON.jl).

I don’t know what change in Julia you are proposing — Julia String already supports arbitrary byte streams, and Julia’s filesystem APIs work just fine with non-UTF8 filenames too. Your problem is with the JSON and TOML file formats, which are not part of Julia and are not controlled by Julia, and for those you have the three options above.
