Converting a Symbol containing Unicode to a String does not yield the expected string

Hi, I often convert a Symbol to a String. For example,
String(:a) == "a" # true

However, when I add \dot, the result differs from the string I intended. For example,

String(:a\dot) == "a\dot" # with tab completion, false

I have no idea how to get the intended result. Why are they different?

The answer lies in Unicode:

julia> "ȧ" |> collect
2-element Array{Char,1}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 '̇': Unicode U+0307 (category Mn: Mark, nonspacing)

julia> String(:ȧ) |> collect
1-element Array{Char,1}:
 'ȧ': Unicode U+0227 (category Ll: Letter, lowercase)

julia> String(Symbol("ȧ")) |> collect
2-element Array{Char,1}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 '̇': Unicode U+0307 (category Mn: Mark, nonspacing)

These two seemingly equivalent things are represented differently, though I cannot tell you why they differ when entered this way. Presumably there’s some magic in the tab completion that simplifies/normalizes the representation to make it smaller?
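Since the two forms render identically, it can help to write them with explicit escapes instead of typed characters. A minimal sketch (the variable names and the `show_codepoints` helper are made up for illustration):

```julia
# Two ways to spell "a with dot above"; both display as ȧ.
decomposed = "a\u0307"   # 'a' followed by U+0307 (combining dot above)
composed   = "\u0227"    # the single precomposed code point U+0227

# Hypothetical helper: print each character's code point in hex.
function show_codepoints(s::AbstractString)
    for c in s
        println("U+", string(codepoint(c), base = 16, pad = 4))
    end
end

show_codepoints(decomposed)
show_codepoints(composed)

println(length(decomposed))        # 2
println(length(composed))          # 1
println(decomposed == composed)    # false: == compares code points, not appearance
```

String equality in Julia does not normalize, so the two spellings compare as different strings even though they look the same on screen.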


Oh… I see. But that’s quite undesirable for my use case.
Do you have any idea how to avoid this? :slight_smile:

Avoid hardcoding symbols via :a and instead do Symbol("a") - that stays consistent, as you can see from the third test.
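As a rough check of that advice, the sketch below round-trips both spellings through `Symbol(...)`; the strings are written with escapes so the code points are unambiguous, and the variable names are only illustrative:

```julia
decomposed = "a\u0307"   # 'a' + combining dot above
composed   = "\u0227"    # precomposed code point

# Symbol(str) keeps the string exactly as given, so the round trip is lossless.
for s in (decomposed, composed)
    println(String(Symbol(s)) == s)   # true for both spellings
end

# The flip side: the two spellings produce two distinct symbols.
println(Symbol(decomposed) === Symbol(composed))   # false
```

So `Symbol(str)` is predictable, but it also means two symbols can print identically and still be different objects.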

I’m not sure if these two are supposed to be equivalent, so it’s also a good idea to open an issue about this :slight_smile:


Identifiers in Julia are NFC-normalized. From the documentation:

Some Unicode characters are considered to be equivalent in identifiers. Different ways of entering Unicode combining characters (e.g., accents) are treated as equivalent (specifically, Julia identifiers are NFC-normalized). The Unicode characters ɛ (U+025B: Latin small letter open e) and µ (U+00B5: micro sign) are treated as equivalent to the corresponding Greek letters, because the former are easily accessible via some input methods.
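One way to observe this normalization without relying on how characters were typed is to hand escaped strings to `Meta.parse`, which runs the same parser that handles source code. This is only a sketch, and the variable names are illustrative:

```julia
# Parsing the decomposed spelling as source text yields the NFC (precomposed) symbol.
dotted = Meta.parse("a\u0307")
println(dotted === Symbol("\u0227"))   # true

# The micro sign U+00B5 is parsed as the Greek letter mu U+03BC.
micro = Meta.parse("\u00b5")
println(micro === Symbol("\u03bc"))    # true

# Symbol(...) bypasses the parser, so no normalization happens there.
println(Symbol("a\u0307") === dotted)  # false
```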

See also this issue: canonicalize unicode identifiers (Issue #5434, JuliaLang/julia on GitHub)

I don’t know if the different behaviors of :... and Symbol(...) are intended. I think Symbol is meant to allow creating symbols for invalid identifiers; see for example this part of the documentation:

The syntax var"#example#" refers to a variable named Symbol("#example#"), even though #example# is not a valid Julia identifier name.

It seems reasonable that symbols declared with : would be normalized but maybe it’s worth filing an issue to clarify/document this potential gotcha?


This is very hard to search for, but I’m sure I’ve seen multiple discussions where it was made clear that : is not meant to be equivalent to Symbol. I think : is basically “turn the following into an expression”, which just happens to sometimes coincide with Symbol or other types:

julia> typeof(:1)
Int64

julia> typeof(:a)
Symbol

julia> typeof(:(a+b))
Expr

So if you want a symbol, you should probably use Symbol.
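For contrast, a small sketch of what `Symbol(...)` does with the same kind of input; the names are illustrative, and `Meta.parse` stands in for quoting source text with `:`:

```julia
# Symbol(...) always returns a Symbol, whatever the string contains.
plain = Symbol("a")
odd   = Symbol("a+b")       # not a valid identifier, but still a single Symbol
println(typeof(plain))      # Symbol
println(typeof(odd))        # Symbol

# Parsing the same text as code gives an expression instead.
parsed = Meta.parse("a+b")
println(typeof(parsed))     # Expr

# Symbol also concatenates its arguments, which is handy for generated names.
println(Symbol("a", 1) === :a1)   # true
```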


You can use Unicode.normalize (from the Unicode standard library, loaded with using Unicode) to get the normalized form:

julia> "ȧ" |> collect
2-element Array{Char,1}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 '̇': Unicode U+0307 (category Mn: Mark, nonspacing)

julia> Unicode.normalize("ȧ") |> collect
1-element Array{Char,1}:
 'ȧ': Unicode U+0227 (category Ll: Letter, lowercase)
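Applied to the original question, a minimal sketch might look like the following. It assumes the string comes from tab completion (decomposed form) and uses `Meta.parse` as a stand-in for the literal `:ȧ`; the variable names are illustrative:

```julia
using Unicode

typed = "a\u0307"          # what "a\dot<TAB>" inserts: 'a' + combining dot above
sym   = Meta.parse(typed)  # stands in for writing :ȧ in source code

# The raw comparison fails because the parser NFC-normalized the symbol.
println(String(sym) == typed)                                # false

# Normalizing the string to NFC makes the comparison succeed.
println(String(sym) == Unicode.normalize(typed, :NFC))       # true

# Going the other way, NFD decomposes the symbol's name again.
println(Unicode.normalize(String(sym), :NFD) == typed)       # true
```

Normalizing both sides to the same form before comparing avoids having to know which spelling each value started with.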