Symbol to String

I’m trying to convert a special Symbol to a String, but I’m having some trouble. Is the following behavior usual and why?

julia> String(:Ω) == "Ω"
false

Greek omega and Ohm sign have different unicode characters so they’re not equal under == even though they look the same.

I understand that, but why are the Unicode characters different in those two cases?

The Unicode Consortium says:

For compatibility purposes, a few Greek letters are separately encoded as symbols in other character blocks. Examples include U+00B5 μ micro sign in the Latin-1 Supplement character block and U+2126 Ω ohm sign in the Letterlike Symbols character block. The ohm sign is canonically equivalent to the capital omega, and normalization would remove any distinction. Its use is therefore discouraged in favor of capital omega. The same equivalence does not exist between micro sign and mu, and use of either character as micro sign is common; for Greek text, only the mu should be used.

which I think is saying “Don’t use U+2126 Ω (\ohm in Julia); use U+03A9 Ω (\Omega in Julia) instead”.
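Here is a quick way to check this, as a minimal sketch that assumes only the Unicode standard library (the variable names are just for illustration):

```julia
using Unicode

ohm   = "\u2126"  # OHM SIGN (what \ohm<tab> inserts)
omega = "\u03a9"  # GREEK CAPITAL LETTER OMEGA (what \Omega<tab> inserts)

# The two strings hold different code points, so == is false.
ohm_equals_omega = (ohm == omega)

# NFC normalization replaces the ohm sign with capital omega.
normalized_equals_omega = (Unicode.normalize(ohm, :NFC) == omega)

println(ohm_equals_omega)         # false
println(normalized_equals_omega)  # true
```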

Do they also say anything about why we can’t have subscript \_b, \_c, or \_d, but instead we can have \:turtle: (🐢) and \:person_in_steamy_room: (🧖)?

The reason why :\ohm<tab> gives the Omega symbol instead of ohm is that symbols are treated like variable names in Julia, and thus it makes sense to normalize them.

The thing to realize is that the Unicode strings stored internally for Julia symbols (e.g. variable names or quoted symbols) are automatically normalized to canonical form.

So, even if you type :Ω using the Ohm symbol U+2126 (e.g. via :\ohm<tab>), it will get normalized to Omega U+03A9:

julia> collect(String(:Ω)) # Ω is Ohm U+2126 (\ohm<tab>)
1-element Vector{Char}:
 'Ω': Unicode U+03A9 (category Lu: Letter, uppercase)

Another example would be accented Latin characters like ë, which often have two canonically equivalent representations (either a single special character or an unaccented character followed by a “combining” accent character), but you don’t want that to correspond to different variable names depending on how you type it (e.g. different input systems). Canonicalization (technically, NFC normalization) removes that distinction.
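For the accented-character case, a small sketch (again assuming only the Unicode standard library; the names are illustrative) shows the two representations and what NFC does to them:

```julia
using Unicode

precomposed = "\u00eb"   # ë as a single code point
decomposed  = "e\u0308"  # e followed by COMBINING DIAERESIS

# As raw strings they differ: one code point versus two.
same_before = (precomposed == decomposed)

# After NFC normalization the decomposed form becomes the precomposed one.
same_after = (Unicode.normalize(decomposed, :NFC) == precomposed)

println(length(precomposed), " ", length(decomposed))  # 1 2
println(same_before)  # false
println(same_after)   # true
```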

This is explained in the Julia manual:

Some Unicode characters are considered to be equivalent in identifiers. Different ways of entering Unicode combining characters (e.g., accents) are treated as equivalent (specifically, Julia identifiers are NFC-normalized). Julia also includes a few non-standard equivalences for characters that are visually similar and are easily entered by some input methods. The Unicode characters ɛ (U+025B: Latin small letter open e) and µ (U+00B5: micro sign) are treated as equivalent to the corresponding Greek letters. The middle dot · (U+00B7) and the Greek interpunct · (U+0387) are both treated as the mathematical dot operator (U+22C5). The minus sign (U+2212) is treated as equivalent to the hyphen-minus sign - (U+002D).
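The micro sign is a useful contrast with the ohm sign. The sketch below assumes a Julia version whose parser implements the equivalences quoted above; it uses Meta.parse only as a convenient way to run a string through the same identifier handling as typed source code:

```julia
using Unicode

micro = "\u00b5"  # MICRO SIGN
mu    = "\u03bc"  # GREEK SMALL LETTER MU

# NFC alone leaves the micro sign untouched, since it is only
# compatibility-equivalent (not canonically equivalent) to mu.
nfc_keeps_micro = (Unicode.normalize(micro, :NFC) == micro)

# The parser's extra, Julia-specific equivalence maps it to mu in identifiers.
parser_maps_micro = (Meta.parse(micro) === Symbol(mu))
```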

@stevengj Any progress on your proposal?

However, this does not happen if the Symbol constructor is used (instead of typing :Ω):

julia> collect(String(Symbol(Char(0x2126))))
1-element Vector{Char}:
 'Ω': Unicode U+2126 (category Lu: Letter, uppercase)

That’s right: if you use the Symbol constructor, you can make a Symbol from any Julia string, even one that is not a valid identifier, such as Symbol(" ") (as long as the string doesn’t contain \0). Because of this, the constructor takes the string literally, with no normalization.
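If you build symbols from strings and want them to match what the parser produces, one option is to normalize the string yourself first. This is only a sketch: it applies plain NFC via Unicode.normalize, which covers the ohm/omega case but not Julia's extra non-standard equivalences, and the variable names are illustrative.

```julia
using Unicode

ohm   = "\u2126"  # OHM SIGN
omega = "\u03a9"  # GREEK CAPITAL LETTER OMEGA

from_constructor = Symbol(ohm)      # taken literally, stays U+2126
from_parser      = Meta.parse(ohm)  # parsed as an identifier, so normalized
by_hand          = Symbol(Unicode.normalize(ohm, :NFC))

constructor_matches = (from_constructor === Symbol(omega))
parser_matches      = (from_parser === Symbol(omega))
by_hand_matches     = (by_hand === Symbol(omega))

println(constructor_matches)  # false
println(parser_matches)       # true
println(by_hand_matches)      # true
```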