Non-legal UTF-8 filenames (in e.g. TOML or JSON)

Palli · February 23, 2025, 5:33pm

A major flaw with JSON is its inability to store non-UTF-8 strings, which inadvertently makes it unreliable for storing Unix filenames (here a filename is defined as any sequence of bytes except NUL and /, including invalid UTF-8).
[…]
Some prior art on this would be Rust’s raw string literal notation and qsn which is a format based on those ideas. It is documented quite nicely here, along with good general coverage of this topic: http://www.oilshell.org/release/latest/doc/qsn.html

ZON strings are no different from Zig string literals in this context, so the answer is: yes, ZON strings store any sequence of bytes, agnostic to their interpretation. The literal in the file must itself be UTF-8 encoded, because Zig sources - and by extension ZON - are always UTF-8 encoded, but you can represent any byte sequence which is not valid UTF-8 by using simple escape sequences (the main relevant one here being \xNN).

shell> touch \xff

julia> touch("\xff")  # The former didn't work, see:

-rw-rw-r-- 1 pharaldsson pharaldsson 0 feb 23 17:23 xff
-rw-rw-r-- 1 pharaldsson pharaldsson 0 feb 23 17:24 ''$'\377'

I can open this file, but let’s say I want to serialize the filename itself, or Pkg has to handle such a filename (is it even a possibility of that happening, if generated from within Julia? I suppose you can make such a package/module).

You can do escape_string(filename), and it seems you always need to be careful, unless your code or the Julia API already does it. You can do it redundantly with no ill effect, only a bit of expansion; it’s just that you must then also unescape twice, so we the users need to know whether it is needed or not.
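To make the single versus double escaping point concrete, here is a minimal sketch using only Base functions (escape_string, unescape_string, isvalid). The variable names are just for illustration.

```julia
name = "\xff"                   # a single byte that is not valid UTF-8

once  = escape_string(name)     # backslash, 'x', 'f', 'f': plain ASCII
twice = escape_string(once)     # the backslash itself gets escaped again

# Escaping once is undone by unescaping once.
roundtrip_once = unescape_string(once)

# Escaping twice needs two unescapes; a single unescape only
# gets you back to the once-escaped form.
roundtrip_twice = unescape_string(unescape_string(twice))
half_way = unescape_string(twice)
```

So redundant escaping is harmless only if the reader knows how many layers to peel off.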

Does it happen by default for [filenames] in JSON packages? Probably not; it would mean the package needs to know whether a string might contain illegal UTF-8 (which is always possible for non-validated strings, and filenames can’t be validated). It seems to me Pkg (using TOML) doesn’t escape (or only in one place), since it doesn’t deal with filenames. It’s unclear if TOML does it for you (in other contexts). Python and C++ have good path/filesystem APIs, and there’s talk of something similar in Julia, rather than just naked strings, which should then likely handle escaping for you. Strictly speaking, the escaping is needed for all Julia strings… unless they are known to be validated UTF-8.

On Windows with UTF-16, are all even-length byte sequences also valid filenames? And even odd-length ones, like the file above?

I was looking into ZON encoding and whether we should possibly support it (it’s not just for Zig or its package system):

github.com/ziglang/zig

introduce Zig Object Notation and use it for the build manifest file (build.zig.zon)

master ← zon

opened 07:03AM - 03 Feb 23 UTC

andrewrk

+4561 -4098

Note: the actual diff is much smaller; I moved parse logic from one file to another, but it is unmodified except for an additional function.

## Description

This branch introduces Zig Object Notation which can be parsed using the standard library. The same function used to parse zig files is augmented to also parse zon files, using an enum to toggle between the two modes.

I considered using a separate implementation to parse zon files, however, I made the decision to reuse existing logic for the following reasons:

* Shared logic is nice for maintenance
* Shared logic keeps the binary size of the compiler smaller
* Reusing the logic for rendering the AST is nice. We get `zig fmt` for free.
* It also lets us reuse a bunch of error reporting logic

Next, this branch switches to `build.zig.zon` instead of `build.zig.ini` for the manifest file and enhances the relevant error reporting.

This is the last remaining planned breaking change to the package manager, other than these two:

* #14286
* #14307

These will not be quite as invasive breakages, so, after this branch is merged, it starts to become more feasible for people to jump in and start using this new feature of zig.

Closes #14290.

## Follow-up work to be done:

* [x] #14530
* [x] #14531
* [ ] #14532
* [ ] #14534
* [ ] #5039

stevengj · February 23, 2025, 7:20pm

show(io, string) will write it in escaped form (with quotation marks) or print(io, escape_string(string)) if you don’t want quotation marks.
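A small sketch of the two forms, using sprint to capture what would be written to io (the variable names are only illustrative):

```julia
name = "\xff"                                   # one byte, not valid UTF-8

quoted   = sprint(show, name)                   # escaped, with quotation marks
unquoted = sprint(print, escape_string(name))   # escaped, no quotation marks
```

Here quoted holds the six characters "\xff" including the surrounding quotation marks, and unquoted holds the four characters \xff.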

No, because \x escaping is not allowed in these formats as I understand it.

JSON.jl just writes the raw non-UTF8 bytes to the string. It can parse these back, but JSON parsers in other languages may give an error. e.g. JSON.parse(JSON.json(Dict("foo" => "\xff"))) works.
TOML.jl throws an error if you try to output strings with non-UTF8 bytes.

Of course, you could always use a different serialization format. The simplest is to save it as an array of byte values by outputting codeunits(str) instead of str — Dict("foo" => codeunits("\xff")) works just fine with both JSON and TOML.
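As a rough sketch of the byte-array approach without depending on any JSON or TOML package: the two helper functions below are hypothetical names, not part of JSON.jl or TOML.jl, and the array text is built by hand purely to show what ends up in the file.

```julia
# Hypothetical helpers: store a filename as a plain list of byte values,
# which fits in an ordinary JSON or TOML array of integers.
filename_to_bytes(name::AbstractString) = Int.(codeunits(name))
bytes_to_filename(bytes) = String(UInt8.(bytes))

name = "caf\xe9.txt"       # Latin-1 encoded "é": not valid UTF-8

bytes = filename_to_bytes(name)
array_text = "[" * join(bytes, ",") * "]"   # what the serialized array looks like

restored = bytes_to_filename(bytes)         # byte-for-byte identical to name
```

The stored value is just integers, so any conforming parser can read it; the cost is that the filename is no longer human-readable in the file.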

Palli · February 23, 2025, 9:02pm

I assumed so, but you’re right: JSON (and TOML) doesn’t support arbitrary byte strings, and thus NOT all filenames, without help such as a QSN add-on:

https://www.oilshell.org/release/0.9.9/doc/qsn.html

QSN (“quoted string notation”) is a data format for byte strings […]

It’s an adaptation of Rust’s string literal syntax with a few use cases:

To print filenames to a terminal. Printing arbitrary bytes to a terminal is bad, so programs like coreutils already have informal QSN-like formats.

To exchange data between different programs, like JSON or UTF-8. Note that JSON can’t express arbitrary byte strings.

To solve the “framing problem” over pipes. QSN represents newlines like \n, so literal newlines can be used to delimit records. Oil uses QSN because it’s well-defined and parsable.

It’s both human- and machine-readable. Any programming language or tool that understands JSON should also understand QSN.

It’s a fucking annoying problem: either you output illegal UTF-8, by implicit agreement, violating the JSON standard, or you add QSN, which not everyone knows of.

You want to support all possible filenames, at least in some situations; think of implementing a backup system. You do not want to ignore those files or, worse, throw an error and not back up any files… But even if you support them on Unix with QSN, it’s unclear to me whether you could restore on Windows. I.e. is illegal UTF-8 mappable to some legal or illegal UTF-16, and then transportable back?

julia> length("\xff")
1

julia> length("\uff")  # Not the same string as above: U+00FF, encoded as two bytes
1

julia> sizeof("\uff")
2

julia> bitstring('\uff')
"11000011101111110000000000000000"

stevengj · February 23, 2025, 9:08pm

If you have an arbitrary sequence of bytes, it’s arguably not a string at all, especially from the perspective of the format. You should just store it as a Vector{UInt8} via codeunits(str).

Palli · February 23, 2025, 9:13pm

That can be argued, yes, Unicode-wise, but in Julia it currently (still) is a String.

That doesn’t help at all, or it only makes explicit that you intend to store illegal bytes too. It stores the same byte string, and it is still unclear to users how it is meant to be stored in JSON.

nhz2 · February 23, 2025, 9:29pm

It is common to use base64 encoding if you want to store data in JSON.
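For instance, a minimal sketch with Julia's Base64 standard library (the variable names are just for illustration):

```julia
using Base64

name = "\xff"                         # one byte, not valid UTF-8

encoded = base64encode(name)          # plain ASCII, safe inside a JSON string
decoded = String(base64decode(encoded))   # back to the original bytes
```

The encoded form is always valid UTF-8, so it can be stored in standard JSON or TOML, at the cost of readability.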

stevengj · February 23, 2025, 9:33pm

If you have filename = [0x66, 0x6f, 0x6f, 0x2e, 0x74, 0x78, 0x74] it’s pretty clear IMO, but in general a JSON file is not completely self-documenting — in general you need some documentation anyway to tell people what the fields mean.

And it solves the problem of storing an arbitrary filename in a JSON file without violating the standard, while allowing any conforming implementation in any language to read it.

Palli · February 23, 2025, 9:39pm

Right, and very valid for the contents of (some) files, but it bloats the data and is bad for filenames, making them unreadable. QSN is then better, and arguably for data too. “Ascii85 (also called Base85)” is in most cases a better alternative for binary data; Base64 is just more widely supported, including in Julia (it would be easy to support both).

stevengj · February 23, 2025, 9:43pm

If you’re worried about this, I would just give an error on !isvalid(filename) — at least in situations where you need to store the filename in a human-readable format like JSON — and tell users to rename their files. Non-Unicode-compatible filenames are likely to cause other problems (e.g. they won’t be portable across filesystems in general), and there’s no real need to support them in a typical application AFAICT (aside from low-level OS utilities).
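A minimal sketch of such a check; require_utf8_filename is a hypothetical helper name, and the error message is only an example:

```julia
# Hypothetical helper: refuse to serialize a filename that is not valid UTF-8.
function require_utf8_filename(name::String)
    if !isvalid(name)
        throw(ArgumentError("filename $(repr(name)) is not valid UTF-8; please rename the file"))
    end
    return name
end

ok = require_utf8_filename("foo.txt")    # valid names pass through unchanged

# An invalid name is rejected with an ArgumentError.
rejected = try
    require_utf8_filename("\xff")
    false
catch err
    err isa ArgumentError
end
```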

Palli · February 23, 2025, 9:50pm

If you construct a filename, with a (future) type for it, then it might be valid to reject non-legal UTF-8 strings (and even to QSN-encode such, by default). But if you e.g. read files from your directory, we must support all files there, including non-UTF-8 ones (at least I feel so, by default, as I explained with my backup example).

I marked my comment as the solution, for QSN (it seems least bad; no solution seems perfect), which I propose Julia adopt when you cast to a String.

It can be input into this proposal:

stevengj · February 23, 2025, 10:00pm

You’re not just reading filenames. You’re storing them in a configuration file. You get to choose the format of your configuration file. You can:

1. Tell users that your configuration file doesn’t support non-Unicode filenames.
2. Store filenames in JSON or TOML as byte arrays or some other custom encoding.
3. Use a file format (not JSON or TOML) that allows strings that contain arbitrary bytes. Or use non-standard JSON (don’t worry about parsing it from anything but JSON.jl).

I don’t know what change in Julia you are proposing — Julia String already supports arbitrary byte streams, and Julia’s filesystem APIs work just fine with non-UTF8 filenames too. Your problem is with the JSON and TOML file formats, which are not part of Julia and are not controlled by Julia, and for those you have the three options above.
