Unicode diacritical marks in filenames

Anything with diacritics (dots, hats, squiggles above or below) is a bit more complicated

That is probably the main challenge here, and yes, one should be careful with these. I once had a Samba connection from macOS to Linux that converted between exactly the two encodings above, which messed up quite a few things. Browsers probably struggle with the same issue.

But sure, any single-letter symbols (Greek, Fraktur, calligraphic, …) should be fine.


According to the Unicode standard, this mark has a different semantic meaning than the diacritic in ö: it is used for the N’Ko script of West African languages. If you want ö, you should be using U+0308 “Combining Diaeresis”.
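
If you are not sure which combining mark a string actually contains, one quick way to check (a Python sketch using the standard `unicodedata` module; other languages have equivalents) is to print the name of each code point:

```python
import unicodedata

# Inspect which code points (and thus which combining mark) a string contains:
for ch in "o\u0308":
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+006F  LATIN SMALL LETTER O
# U+0308  COMBINING DIAERESIS
```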

There are still two ways to write ö, but "\u00f6" and "o\u0308" are canonically equivalent according to Unicode (they have the same semantic meaning), and are converted to one another if you normalize the string. You definitely need to have some understanding of normalization when comparing strings in Unicode or when dealing with Unicode filenames across filesystems.
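
For instance, in Python (the standard library’s `unicodedata` module; this is just an illustration, other languages offer the same normalization forms):

```python
import unicodedata

precomposed = "\u00f6"   # ö as a single code point, U+00F6
decomposed  = "o\u0308"  # o followed by U+0308 COMBINING DIAERESIS

print(precomposed == decomposed)  # False: the code-point sequences differ

# Canonical normalization converts one form into the other:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```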

(I don’t think this has anything to do with support for ö in README.md, so I’ve moved it to a new topic.)

This sounds implausible to me. There is no Unicode normalization that should convert the N’Ko combining character into an umlaut or vice versa.

Probably you are misremembering and actually encountered U+0308, which indeed could have gotten normalized away (or normalized into existence) in a filename transferred across systems.

In particular, macOS HFS+ filenames are NFD-normalized, so ö will get converted into o followed by the combining character U+0308 when it is saved to an HFS+ filesystem. Most Linux filesystems, by contrast, perform no normalization by default, IIRC.
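
To make that concrete, here is a small Python sketch of what that decomposition does to a (made-up) filename; the stored form and the form in your code look identical on screen but are different strings:

```python
import unicodedata

name   = "ö.txt"                              # precomposed form, U+00F6
stored = unicodedata.normalize("NFD", name)   # roughly what HFS+ stores

print(" ".join(f"U+{ord(c):04X}" for c in name))    # U+00F6 U+002E U+0074 U+0078 U+0074
print(" ".join(f"U+{ord(c):04X}" for c in stored))  # U+006F U+0308 U+002E U+0074 U+0078 U+0074

# This is exactly what breaks byte-for-byte filename comparisons
# after a file has passed through an NFD-normalizing filesystem.
print(name == stored)  # False
```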

(It would make a lot more sense to me for a filesystem to be normalization-insensitive and normalization-preserving, but I don’t know if there is any filesystem that does this? Update: it appears that ZFS can do this, but only if you set the normalization=formD property when the filesystem is created … but this is not the default, probably because it forces filenames to be UTF-8 and can conflict with legacy filenames.)

I seriously doubt that any modern browser will be confused by Unicode normalization when it comes to displaying text. They’ve had to deal with Unicode display for decades now; the main limitation is that they can’t render glyphs that don’t exist in the installed font(s).

Thanks for the detailed answer. I am not sure which conversion it was, but yours certainly sounds more plausible. It happened while moving files from macOS to Linux (via SMB), and afterwards the files were no longer found by my scripts. The Unicode in the scripts was of course not changed; it was the file names that got normalized.
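
In case it helps anyone who runs into the same thing: a minimal, purely illustrative Python workaround is to normalize both sides before comparing, e.g. a lookup helper (the name `find_file` is made up) like this:

```python
import os
import unicodedata

def find_file(dirpath, wanted_name):
    """Find `wanted_name` in `dirpath`, ignoring Unicode normalization
    differences in how the filesystem stored the name."""
    wanted = unicodedata.normalize("NFC", wanted_name)
    for entry in os.listdir(dirpath):
        # Compare in NFC so 'ö' (U+00F6) matches 'o' + U+0308
        if unicodedata.normalize("NFC", entry) == wanted:
            return os.path.join(dirpath, entry)
    return None
```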

Whether browsers struggle with that, I do not know; you are correct on that point.