Source files with non-ascii names: how?

Is there a nice escaping rule that allows avoiding non-ascii characters in source-files?

That is, some ancient tools and people cannot deal with anything that is not 7-bit ASCII in source files, but still want to interact with packages that define non-ASCII names.

First, this is not a problem for literals, as there is a plethora of escaping options (\u… or just binary blob literals).

How do I deal with symbols? The obvious approach would be:

struct foo  # spelled `immutable foo` before Julia 0.6
    µ::Int
end
A = foo(4)

const mu_sym = Symbol('\u03bc') #not '\u00b5', beware!

foo_mu0(A::foo) =  getfield(A,mu_sym)
foo_mu1(A::foo) =  getfield(A,Symbol('\u03bc'))
foo_mu2(A::foo) =  A.µ
foo_mu3(A::foo) =  getfield(A,:μ)

Now, all four ways of getting the value work, but the first two are slow (presumably the field/offset lookup is not resolved at compile time?). How do I get the fast behavior for them as well?

Second, how do I access methods with non-ascii names?

Third, is there maybe some syntactic sugar for all of this? So that people without IDE/julia-mode can access e.g. infix-xor, and a minor refactor or automatic tool could get rid of all non-ascii chars in source-files (while preserving identical machine code and names of imports/exports)?

A concrete example could lead to a concrete suggestion for a solution.

Don’t put user-facing unicode in your API.

In this day and age, it is perfectly reasonable to tell people to use an editor that supports Unicode and the UTF-8 encoding.

(Julia doesn’t even run on ancient systems where a lack of Unicode support might be hard to circumvent. And the only still-extant text editors I can find that don’t support unicode are ed and pico, and no sane programmer would use either of these…are there any other examples?)

Correction: apparently there may be one popular programming text editor that still doesn’t properly support Unicode, and that is the Matlab editor. But if you are using the Matlab editor to write Julia code then you may need to re-think your life choices.

Some popular terminals can have issues with Unicode. I ran into this when using MobaXterm with the Comet XSEDE cluster. I’m not sure where along the way the Unicode handling broke, since this terminal supposedly supports it, but I ended up with boxes instead of symbols. (There might be a setting to fix this with Bash terminals that’s not enabled by default?)

If you end up with boxes □ instead of symbols (one box per symbol), then it is handling Unicode fine. You just need a better font.

(If you see multiple � per symbol, or mojibake, you might instead need to set your terminal to the UTF-8 encoding. Especially on Windows. e.g. you mentioned MobaXterm: they have a setting to use UTF-8.)

Obviously I would never do such a silly thing. Unfortunately, this ship has already sailed in Base; for example, infix bitwise xor became ⊻ after the deprecation of $.

Regarding life-choices, UTF8 handling, and fonts:

I misspoke. Most tools are capable of handling UTF-8 fine; however, often there are no sensible fonts installed, which leads to the nice boxes □. This bites me when trying to do some quick fixes via ssh on less-than-optimally-configured machines.

My preferred approach would be to begin my own source files with a couple of aliases of the style “const mu_symb = Symbol('\u03bc')”, and then just use these aliases whenever accessing functionality with non-ascii names (either in base or in packages). This way, I don’t need unicode fonts everywhere, and don’t need to remember ways of inputting symbols missing on my keyboard (even though you are of course right that proper UTF-8 handling is still necessary).

I would be even happier if I could use something like “const mu_symb = parsed_symbol('\u00b5')” (because mu in the symbol table appears to end up being \u03bc, even if the source file uses \u00b5 - in other words, the parser normalizes different “unicode spellings” of mu, while the Symbol constructor does not, afaik).
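
A minimal sketch of what such a helper could look like, assuming Julia 0.7 or later, where the parser is reachable as Meta.parse. The name parsed_symbol is the hypothetical one from the paragraph above, not a Base function. Because the string goes through the real parser, the resulting Symbol should carry the same normalization that identifiers in source files get:

```julia
# Hypothetical helper: `parsed_symbol` is not part of Base.
# Meta.parse runs the actual parser, so identifier normalization is applied.
function parsed_symbol(s::AbstractString)
    ex = Meta.parse(s)
    ex isa Symbol || throw(ArgumentError("not a plain identifier: $s"))
    return ex
end

# ASCII-only source line: micro sign (U+00B5) goes in as an escape.
const mu_symb = parsed_symbol("\u00b5")

# Comparison against the Greek letter mu (U+03BC), the spelling the parser uses.
same_as_greek_mu = (mu_symb === Symbol("\u03bc"))

# The plain constructor keeps the two code points distinct.
ctor_normalizes = (Symbol("\u00b5") === Symbol("\u03bc"))
```

This only moves the normalization question into one place; the alias is still a runtime value, so it does not by itself address the speed difference.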

Infix xor and dot-syntax for accessing badly-named fields are only non-essential syntactic sugar; however, the discrepancy in the compiled code is rather brutal, hence my question of how to do better.
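
One possible way to do better, sketched here for a recent Julia (1.x), is to build the Symbol at macro-expansion time instead of at run time. The macro name @sym and the type Params are made up for illustration. The escaped string literal is already a String when the macro sees it, so the macro can return a quoted Symbol, and the expanded code is then the same expression as getfield(A, :μ) written by hand:

```julia
# Hypothetical macro: turns an ASCII-escaped string literal into a quoted Symbol
# during macro expansion, so no Symbol is constructed at run time.
macro sym(s::String)
    return QuoteNode(Symbol(s))
end

# Stand-in for a type from some package that uses a non-ASCII field name.
struct Params
    μ::Int
end

# ASCII-only accessor; after expansion this reads getfield(p, :μ).
get_mu(p::Params) = getfield(p, @sym("\u03bc"))

p = Params(4)
value = get_mu(p)
```

Whether this matters on a given Julia version is best checked with @code_native or @code_typed; the point of the sketch is only that the field name is a literal by the time the compiler sees it.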

For badly-named methods, I know of no pure-ASCII way at all; however, aliases for methods can simply be defined in a single separate source file that quarantines problematic names.
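
As a rough sketch of such a quarantine file, a function binding can be looked up by an escaped name once at top level and bound to a const ASCII alias. The alias name bitxor is an assumption chosen for illustration. Since a const alias refers to the very same function object, calls through it should compile like calls through the original name. For infix xor specifically, Base also exports the ASCII name xor, so the lookup below mainly demonstrates the pattern for names that have no ASCII spelling:

```julia
# ascii_aliases.jl (hypothetical file name): the only place that needs to
# know about non-ASCII names, and it spells them with ASCII escapes.

# U+22BB is the infix xor operator that replaced `$`.
const bitxor = getfield(Base, Symbol("\u22bb"))

# Use the alias like any other function.
result = bitxor(0b1100, 0b1010)

# In Base the operator and `xor` are the same function.
same_function = (bitxor === xor)
```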