Unicode diacritical marks in filenames

Anything with diacritics (dots, hats, squiggles above or below) is a bit more complicated

That is probably the main challenge here, and yes, one should be careful with these. I once had a Samba connection from macOS to Linux that converted between exactly the two encodings above, which messed up quite a few things. Browsers probably struggle with the same issue.

But sure, any single-letter symbols (Greek, Fraktur, calligraphic, …) should be fine.


According to the Unicode standard, this mark has a different semantic meaning than the diacritic in ö: it is used for the N’Ko script of West African languages. If you want ö, you should be using U+0308 “Combining Diaeresis”.
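
If you are not sure which combining mark a string actually contains, one quick way to check (a Python sketch using the standard `unicodedata` module; other languages have equivalents) is to print the name of each code point:

```python
import unicodedata

# Inspect which code points (and thus which combining mark) a string contains:
for ch in "o\u0308":
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+006F  LATIN SMALL LETTER O
# U+0308  COMBINING DIAERESIS
```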

There are still two ways to write ö, but "\u00f6" and "o\u0308" are canonically equivalent according to Unicode (they have the same semantic meaning), and are converted to one another if you normalize the string. You definitely need to have some understanding of normalization when comparing strings in Unicode or when dealing with Unicode filenames across filesystems.
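
For instance, in Python (the standard library’s `unicodedata` module; this is just an illustration, other languages offer the same normalization forms):

```python
import unicodedata

precomposed = "\u00f6"   # ö as a single code point, U+00F6
decomposed  = "o\u0308"  # o followed by U+0308 COMBINING DIAERESIS

print(precomposed == decomposed)  # False: the code-point sequences differ

# Canonical normalization converts one form into the other:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```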

(I don’t think this has anything to do with support for ö in README.md, so I’ve moved it to a new topic.)

This sounds implausible to me. There is no Unicode normalization that should convert the N’Ko combining character into an umlaut or vice versa.

Probably you are misremembering and actually encountered U+0308, which indeed could have gotten normalized away (or normalized into existence) in a filename transferred across systems.

In particular, macOS HFS+ filenames are NFD-normalized, so ö will get converted into o followed by the combining character U+0308 when it is saved to an HFS+ filesystem. Most Linux filesystems, by contrast, perform no normalization by default, IIRC.
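
To make that concrete, here is a small Python sketch of what that decomposition does to a (made-up) filename; the stored form and the form in your code look identical on screen but are different strings:

```python
import unicodedata

name   = "ö.txt"                              # precomposed form, U+00F6
stored = unicodedata.normalize("NFD", name)   # roughly what HFS+ stores

print(" ".join(f"U+{ord(c):04X}" for c in name))    # U+00F6 U+002E U+0074 U+0078 U+0074
print(" ".join(f"U+{ord(c):04X}" for c in stored))  # U+006F U+0308 U+002E U+0074 U+0078 U+0074

# This is exactly what breaks byte-for-byte filename comparisons
# after a file has passed through an NFD-normalizing filesystem.
print(name == stored)  # False
```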

(It would make a lot more sense to me for a filesystem to be normalization-insensitive and normalization-preserving, but I don’t know if there is any filesystem that does this? Update: it appears that ZFS can do this, but only if you set the normalization=formD property when the filesystem is created … but this is not the default, probably because it forces filenames to be UTF-8 and can conflict with legacy filenames.)

I seriously doubt that any modern browser will be confused by Unicode normalization when it comes to displaying text. They’ve had to deal with Unicode display for decades now; the main limitation is that they can’t render glyphs that don’t exist in the installed font(s).

Thanks for the detailed answer. I am not sure which conversion it was, but yours certainly sounds more plausible. It happened while moving files from macOS to Linux (via SMB), and afterwards the files were no longer found by my scripts. The Unicode in the scripts was of course not changed; it was the file names that got normalized.
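
In case it helps anyone who runs into the same thing: a minimal, purely illustrative Python workaround is to normalize both sides before comparing, e.g. a lookup helper (the name `find_file` is made up) like this:

```python
import os
import unicodedata

def find_file(dirpath, wanted_name):
    """Find `wanted_name` in `dirpath`, ignoring Unicode normalization
    differences in how the filesystem stored the name."""
    wanted = unicodedata.normalize("NFC", wanted_name)
    for entry in os.listdir(dirpath):
        # Compare in NFC so 'ö' (U+00F6) matches 'o' + U+0308
        if unicodedata.normalize("NFC", entry) == wanted:
            return os.path.join(dirpath, entry)
    return None
```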

Whether browsers struggle with that, I do not know; you are correct on that point.