Non-legal UTF-8 filenames (in e.g. TOML or JSON)

I had assumed otherwise, but you’re right: JSON (and TOML) don’t support arbitrary byte strings, and thus NOT all filenames, without help from something like the QSN add-on:

https://www.oilshell.org/release/0.9.9/doc/qsn.html

QSN (“quoted string notation”) is a data format for byte strings […]

It’s an adaptation of Rust’s string literal syntax with a few use cases:

  • To print filenames to a terminal. Printing arbitrary bytes to a terminal is bad, so programs like coreutils already have informal QSN-like formats.
  • To exchange data between different programs, like JSON or UTF-8. Note that JSON can’t express arbitrary byte strings.
  • To solve the “framing problem” over pipes. QSN represents newlines like \n, so literal newlines can be used to delimit records. Oil uses QSN because it’s well-defined and parsable.

It’s both human- and machine-readable. Any programming language or tool that understands JSON should also understand QSN.

It’s a fucking annoying problem: either you output illegal UTF-8 by implicit agreement, violating the JSON standard, or you add QSN, which not everyone knows about.
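To make the trade-off concrete, here is a minimal Julia sketch. It uses Base's `escape_string` and `unescape_string`, which is Julia's own escaping and not QSN itself (the two are similar in spirit, in that a raw byte is written as `\xNN`). The file name is made up for illustration.

```julia
# A made-up file name containing one byte (0xff) that is not valid UTF-8.
name = "caf\xff.txt"

# The raw name is not valid UTF-8, so it cannot be a standard JSON string as-is.
println(isvalid(name))        # false

# Escape it with Julia's own rules (an assumption: this stands in for QSN here).
escaped = escape_string(name)
println(escaped)              # caf\xff.txt  (backslash, x, f, f as plain ASCII)
println(isvalid(escaped))     # true

# The reader has to know the convention to get the original bytes back.
restored = unescape_string(escaped)
println(codeunits(restored) == codeunits(name))
```

The escaped form is ordinary text that JSON or TOML can carry, but only a consumer that knows to unescape it recovers the real file name, which is exactly the "not everyone knows about it" problem.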

You want to support all possible filenames, at least in some situations; think of implementing a backup system. You do not want to ignore those files, or worse, throw an error and not back up any files… But even if you support them on Unix with QSN, it’s unclear to me whether you could restore on Windows. I.e., is illegal UTF-8 mappable to some legal or illegal UTF-16, and then transportable back?
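One known way to get such a round trip is the "surrogateescape" idea from Python's PEP 383: each undecodable byte 0xNN becomes the lone low surrogate 0xDC00 + 0xNN, which is itself illegal UTF-16 and so cannot collide with a properly encoded character. Below is a rough Julia sketch of that idea. The function names `to_utf16_escaped` and `from_utf16_escaped` are hypothetical, and whether a given file system accepts such names is a separate question this sketch does not answer.

```julia
# Hypothetical helpers, loosely following PEP 383's surrogateescape idea.

# Bytes (possibly invalid UTF-8) -> UTF-16 code units.
function to_utf16_escaped(name::String)
    out = UInt16[]
    for c in name                          # invalid sequences come out as invalid Chars
        if isvalid(c)
            append!(out, transcode(UInt16, string(c)))
        else
            # Every raw byte of an invalid sequence is >= 0x80;
            # map it to a lone low surrogate in 0xDC80:0xDCFF.
            for b in codeunits(string(c))
                push!(out, 0xdc00 + UInt16(b))
            end
        end
    end
    return out
end

# UTF-16 code units produced above -> the original bytes.
function from_utf16_escaped(units::Vector{UInt16})
    out = UInt8[]
    i = 1
    while i <= length(units)
        u = units[i]
        if 0xd800 <= u <= 0xdbff && i < length(units) && 0xdc00 <= units[i+1] <= 0xdfff
            # A proper surrogate pair: decode both units together.
            append!(out, transcode(UInt8, units[i:i+1]))
            i += 2
        elseif 0xdc80 <= u <= 0xdcff
            # A lone low surrogate in the escape range: restore the raw byte.
            push!(out, UInt8(u - 0xdc00))
            i += 1
        else
            append!(out, transcode(UInt8, [u]))
            i += 1
        end
    end
    return String(out)
end

name  = "caf\xff.txt"                      # made-up name with one invalid byte
units = to_utf16_escaped(name)
println(from_utf16_escaped(units) == name)
```

The sketch only guarantees the round trip for code units it produced itself; UTF-16 input that already contains lone surrogates from another source would need extra care.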

julia> length("\xff")
1

julia> length("\uff")  # U+00FF (ÿ), not the same string as the raw byte 0xff above
1

julia> sizeof("\uff")
2

julia> bitstring('\uff')
"11000011101111110000000000000000"