How to get the tab-completion sequence of a Unicode symbol (as entered in the REPL) from Julia code?

I want to write a function like this (or use one, if it already exists):

function unicodesymbol2string(us::Symbol)
    # do something to find s
    return s
end

which behaves like

julia> unicodesymbol2string(:β)
beta
julia> unicodesymbol2string(:Σ)
Sigma
julia> unicodesymbol2string(:÷)
div

The returned string should correspond to the tab sequence listed at https://docs.julialang.org/en/v1/manual/unicode-input/

Just paste the symbol at the help> prompt (type ? in the REPL) and it will tell you the tab completion.


I want to know dynamically in my Julia code :slight_smile:

Use case: my software will accept a Unicode symbol entered by the user, and I need to convert it to plain ASCII for use in filenames, directory names, etc.


Many filesystems support Unicode names for files and directories.


you can do a reverse search from here:
https://github.com/JuliaLang/julia/blob/0336f672db739c784e2ebfc4d6c3dab8ba713611/stdlib/REPL/src/latex_symbols.jl#L95
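
For illustration, a reverse search over that table could look roughly like this. It is only a sketch: `REPL.REPLCompletions.latex_symbols` is an internal table (tab sequence => symbol) rather than public API, and `tab_sequences` is a made-up helper name.

```julia
using REPL

# Hypothetical helper: collect every tab sequence that produces `sym`.
# REPL.REPLCompletions.latex_symbols is internal and maps, for example,
# "\\beta" => "β", so the reverse search is a linear scan over its pairs.
function tab_sequences(sym::AbstractString)
    table = REPL.REPLCompletions.latex_symbols
    return sort!([k for (k, v) in table if v == sym])
end

tab_sequences("β")
```

Some symbols have more than one tab sequence, which is why the sketch returns a vector instead of a single string.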


Thank you for pointing me in the right direction. I found exactly the function I was looking for:

julia> using REPL
julia> REPL.symbol_latex("β")
"\\beta"

With this function, my function becomes

function unicodesymbol2string(us::Symbol)
    return REPL.symbol_latex(String(us))[2:end]
end

(Which common filesystems these days, not including FAT, don’t support Unicode filenames?)


Julia has better support for math symbols than, I believe, any other language, and it also supports a lot of emoji, e.g. \:beer: (beer mug) and \:baby_bottle: (do not mix together).

That said, what’s the use case? I would be a bit annoyed to see, e.g., my name Páll shown as Pll, dropping a letter (Palli is a nickname; I also use it to help foreigners, as it is easier to say), or Icelandic Þ and Ð shown as \\TH and \\DH. The scheme works only partially for some names (and fully for only a few, apart from ASCII-only ones): Þórður, for example, becomes \\THr\\dhur, missing the ó.


I’ve never encountered a filename with Unicode in my work experience. My colleagues never use it. It may cause many problems, who knows? For example, what if they don’t know how to type a Unicode filename? Using plain ASCII avoids such headaches.


It is not uncommon to use a Unicode filename in Korea. (Programmers naturally try to avoid it, but sometimes it happens.)

And it causes headaches when the OS or a third-party app uses the Unicode name.

For example, we use Google Drive File Stream in our work environment. And the default path for the Google Drive File Stream drive is…

[image: default drive path]
Unicode and whitespace :cry:


I think you mean, you never use fullwidth Unicode (e.g. Chinese characters) vs halfwidth (e.g. ASCII but not only), understandable, still interesting.

Unicode, i.e. UTF-8, is ubiquitous on Linux filesystems (and Unicode, there UTF-16, on Windows). I’ve probably used only UTF-8 for well over a decade, and filenames in Icelandic or e.g. German work fine. I often use the ASCII subset for work reasons (and personally, often out of habit), since I work for a British (actually Hong Kong) company and English is the default language. I use Julia at work, and I guess I could use Unicode (math symbols) in the Julia code; so far I haven’t. It should work in filesystems, though I’m more worried about Linux and Windows servers cooperating well.

Should work fine on any recent (< 10 yo) OS. If not, the problem can easily be remedied.

They can copy-paste or open the file using an interactive file dialog.

There are exceptions (ancient mainframes etc), but generally a custom ad-hoc workaround (like the transcription proposed above) is likely to take more effort and be less robust than setting up a filesystem that can handle Unicode.


We have many legacy C++ libraries, and I actually don’t know whether Unicode filenames work out of the box with C++ iostream. Probably none of us wants to update that C++ code. Until we port all of it to Julia, we had better stick to filenames with pure ASCII characters. That is another reason I don’t like to save a data file with a name such as “ϕ0.1_α0.5.dat”.
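
For a filename like that one, a per-character conversion built on `REPL.symbol_latex` might look like the sketch below. `ascii_filename` is a hypothetical helper name, and the sketch assumes every non-ASCII character in the name has a tab completion; characters without one are passed through unchanged, so the result is not guaranteed to be pure ASCII.

```julia
using REPL

# Hypothetical helper: replace each non-ASCII character that has a REPL tab
# completion with the completion name (without the leading backslash) and
# keep every other character as it is.
function ascii_filename(name::AbstractString)
    io = IOBuffer()
    for c in name
        # symbol_latex returns "" when no completion is known
        latex = isascii(c) ? "" : REPL.symbol_latex(string(c))
        print(io, isempty(latex) ? c : lstrip(latex, '\\'))
    end
    return String(take!(io))
end

ascii_filename("ϕ0.1_α0.5.dat")
```

Checking the result with `isascii` before using it as a filename would catch the characters that were passed through.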


It will on Mac and Linux, where filenames are all UTF-8 encoded — all of the C/C++ libraries treat filenames as an opaque collection of bytes and don’t care about the encoding, and so they handle arbitrary Unicode filenames automatically.

On Windows, unfortunately, you need to use special UTF-16 or wchar_t filesystem APIs (wiostream) to access Unicode filenames from C/C++. (The Win32 API took a wrong turn early in the development of Unicode and never recovered. However, there is hope that this will change soon and UTF-8 usage will become widespread on Windows — it’s apparently become possible to use the ordinary C/C++ APIs with UTF-8 encoded filenames, and this may yet become the default.)
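
To illustrate the Mac/Linux case from the Julia side, here is a small sketch. It assumes a filesystem that accepts UTF-8 names and uses a throwaway temporary directory; it says nothing about how a given C++ library behaves.

```julia
name = "ϕ0.1_α0.5.dat"

# The name is stored as UTF-8: each of the two Greek letters takes two bytes,
# so the byte count is larger than the character count.
println(ncodeunits(name), " bytes for ", length(name), " characters")

# Create the file in a temporary directory and list the directory again.
dir = mktempdir()
write(joinpath(dir, name), "test")
println(readdir(dir))
```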


I use it very often, since the Serbian language has the letters š, đ, ž, č, ć, which aren’t ASCII. Naming my files and folders that way is much more readable and intuitive. I still haven’t encountered any problems with that, but maybe it’s because I always name my software development files in English.

People even use UTF-8 on (recent) mainframes (UTF-EBCDIC never got popular), while probably not for most programs, and not sure about for actual filenames: IBM Documentation

and on OS/2 (I see UTF-8 and UCS-2 at Alex Taylor: OS/2 Universal Language Support maybe only for REXX).

REPL.symbol_latex("ϵ") works but REPL.symbol_latex("ϵ̂") doesn’t.

Any hint as to how to deal with the latter case?

Is it going to get any better with Julia 1.10 and a parser written in Julia?

My use-case is translating equations to Matlab/dynare (no unicode).

ϵ̂ doesn’t have a single tab completion (it’s \epsilon<tab>\hat<tab>) so symbol_latex won’t work.
You have to break it up into characters similar to this logic.
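
A minimal sketch of that idea: after NFD normalization the string is a base character followed by a combining mark, and each of those can be looked up on its own. The variable names are made up for illustration.

```julia
using REPL, Unicode

s = "\u03f5\u0302"   # ϵ followed by a combining circumflex, i.e. ϵ̂

# The combined string has no single tab completion ...
whole = REPL.symbol_latex(s)

# ... but each character of the NFD-normalized string does.
chars = collect(Unicode.normalize(s, :NFD))
seqs = [REPL.symbol_latex(string(c)) for c in chars]
```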


so in my case this would be a solution:

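using REPL, Unicode

function translate_to_ascii(x::Symbol)
    # Decompose into base characters and combining marks (NFD)
    s = Unicode.normalize(string(x), :NFD)
    # Look up each character's tab sequence and drop the leading backslash
    latex = [REPL.symbol_latex(string(i))[2:end] for i in s]
    join(latex,"_")
end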

translate_to_ascii(:α̂ₗ)
# "alpha_hat__l"

thanks for the prompt reply

Not quite, because REPL.symbol_latex returns "" for things that don’t have tab completions (e.g. ASCII characters).

something like this then:

using REPL, Unicode

function translate_to_ascii(x::Symbol)
    s = Unicode.normalize(string(x), :NFD)

    latex = String[]

    for i in s
        out = REPL.symbol_latex(string(i))[2:end]
        # No tab completion (e.g. an ASCII character): keep the character itself
        if out == ""
            out = string(i)
        end
        push!(latex,out)
    end

    join(latex,"_")
end

translate_to_ascii(:l1α̂ₗ)