Question About Unicode and Symbols

Hello

I’ve been experimenting with overloading getproperty for some structs in a personal project of mine.

I noticed the following throws an error (using getfield instead of getproperty in this MWE).

Does anyone know why? I type the u-bar in the REPL by typing u\bar then pressing tab.


struct MyStruct
    ū::Vector{Float64}
end

tmp = MyStruct(rand(5))

tmp.ū # no error
getfield(tmp, :ū) # no error
getfield(tmp, Symbol("ū")) # throws error, type MyStruct has no field ū

So it seems that :ū == Symbol("ū") is false.


This has to do with Unicode normalization, and it might be a bug. Specifically,

julia> codeunits(String(:ū))
2-element Base.CodeUnits{UInt8, String}:
 0xc5
 0xab

julia> codeunits("ū")
3-element Base.CodeUnits{UInt8, String}:
 0x75
 0xcc
 0x84

Julia normalizes symbols written with the :symbol syntax, but apparently doesn’t do so when you call Symbol(::String):

julia> codeunits(String(Symbol("ū")))
3-element Base.CodeUnits{UInt8, String}:
 0x75
 0xcc
 0x84
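
If the symbol has to be built from a string, one workaround is to normalize the string yourself first. The following is only a sketch: nfc_symbol is a hypothetical helper name, and it assumes plain NFC is close enough to what the parser does (the parser also applies a few extra character mappings that this does not reproduce).

```julia
using Unicode  # standard library; provides Unicode.normalize

struct MyStruct
    ū::Vector{Float64}   # the parser stores this field name in normalized (NFC) form
end

# Hypothetical helper: NFC-normalize a string before making a Symbol from it.
nfc_symbol(s::AbstractString) = Symbol(Unicode.normalize(s, :NFC))

decomposed  = "u\u0304"   # 'u' + U+0304 COMBINING MACRON: bytes 0x75 0xcc 0x84
precomposed = "\u016b"    # single code point U+016B: bytes 0xc5 0xab

tmp = MyStruct(rand(5))

@show Symbol(decomposed) == Symbol(precomposed)        # false: different bytes
@show nfc_symbol(decomposed) == Symbol(precomposed)    # true after normalization
@show getfield(tmp, nfc_symbol(decomposed)) === tmp.ū  # true: field is found
```

With this, getfield(tmp, nfc_symbol("ū")) works no matter which of the two byte sequences the string literal contains.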

That’s not a bug. Symbol(::String) intentionally allows you to make a symbol out of any string as-is (as long as it does not contain '\0'), and is intentionally not restricted to valid Julia identifiers. See the discussion in julia#5462 (at which point in time the constructor was called symbol(::String)).

If you want to ensure a valid Julia identifier, use Meta.parse or, better yet, the :symbol syntax.
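
A small sketch of the difference, using the same decomposed u-bar string as in the original question. The variable names are just for illustration. Note that Meta.parse returns a Symbol only when the string is a single identifier, so real code should probably check the result with isa Symbol.

```julia
decomposed = "u\u0304"   # 'u' + combining macron, as typed with u\bar<tab>

from_string = Symbol(decomposed)       # string taken as-is, no normalization
from_parser = Meta.parse(decomposed)   # goes through the parser, so it is normalized

@show from_parser isa Symbol                 # true
@show ncodeunits(String(from_string))        # 3
@show ncodeunits(String(from_parser))        # 2
@show from_parser === Symbol("\u016b")       # true: same as the precomposed form
```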

That being said, I think there may be a bug in Symbol printing stemming from a bug in Base.isidentifier, which does not check normalization:

julia> "e\u0301" # e with acute accent, not NFC normalized
"é"

julia> Symbol("e\u0301") == :é   # correct: :é is normalized
false

julia> Base.isidentifier(Symbol("e\u0301"))   # incorrect: should check normalization
true

julia> Symbol("e\u0301")  # incorrect display: should check normalization
:é

See JuliaLang/julia issue #52641, "Base.isidentifier(::Symbol) should check normalization".
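
In the meantime, a check along these lines could be done by hand. This is a rough sketch, not what Base does: is_nfc_identifier is a hypothetical name, and it only tests plain NFC, whereas the parser's normalization includes a few additional character mappings.

```julia
using Unicode  # standard library; provides Unicode.normalize

# Hypothetical helper (not part of Base): a valid identifier that is also
# unchanged by NFC normalization.
is_nfc_identifier(s::AbstractString) =
    Base.isidentifier(s) && Unicode.normalize(s, :NFC) == s
is_nfc_identifier(s::Symbol) = is_nfc_identifier(String(s))

@show is_nfc_identifier(Symbol("e\u0301"))   # false: decomposed, not NFC
@show is_nfc_identifier(Symbol("\u00e9"))    # true: precomposed é
@show is_nfc_identifier(:x)                  # true: plain ASCII identifier
```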