Some people have a perennial problem with unicode in source files and would prefer to simply never deal with it.
We have two interesting precedents inside the julia language for dealing with this:
- String literals permit unicode escapes, like `"\u00b5 is a greek letter"`. In other words, unicode haters never need to fight their text editor when writing string literals; they can simply write the corresponding unicode escape sequence and be happy that the parser fixes it up (identical syntax tree!).
- Many symbols and operators have unicode aliases. For example, `\u00b5` (MICRO SIGN) and `\u03bc` (GREEK SMALL LETTER MU) are two different unicode characters that both render as a mu. They really are different in strings! But the parser maps both of them to the same symbol: if you use different unicode spellings of mu in symbol position, the parser normalizes them. For example:
```julia
julia> Symbol("\u00b5")
:µ

julia> :µ == Symbol("\u00b5") # the left-hand side was copy-pasted from the REPL output one line above
false

julia> using JuliaSyntax

julia> parsestmt(Expr, "ab\u03bc123") == parsestmt(Expr, "ab\u00b5123")
true
```
I think this is very pragmatic.
Now I'd also like to get a non-unicode escape hatch for symbol and operator positions, just as we have for string literals, even at the price of the same kind of inconsistencies we have with unicode aliases / normalization.
In other words, my input keyboard should not constrain the set of ASTs I can generate.
The way of doing that would be to find some gap in the existing julia syntax, i.e. something that is currently invalid, and then use it for this.
The first, simplest goal should be completeness. This means unicode escaping: for every string `str` that produces a result with `parsestmt(Expr, str)`, there must exist an ascii string which produces an identical AST (on the expression level).
The second goal, which can come afterwards, is to introduce "nice" aliases for us unicode haters.
Ok, so let's identify a gap. I think that `\\u03bc` is a gap? I.e. a double backslash followed by a lower-case `u` is probably never valid syntax (outside of string literals)? Then we have our inconvenient escape hatch. Suppose I need to use some library that has a keyword-only parameter or a function name containing a greek letter, and I can't get the autocomplete / IDE to work with that? No problem, I just call `someAnnoyingFunction(; \\u03bc=1.2)`.
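To make the idea concrete, here is a rough sketch of the proposed escape hatch as a source-to-source preprocessing step. A real implementation would live in the lexer; the name `unescape_source` and the fixed four-hex-digit form are my own assumptions, not a concrete syntax proposal.

```julia
# Hypothetical sketch: resolve \\uXXXX escapes in source text *before*
# parsing. This is only an illustration of the proposed behavior.
function unescape_source(src::AbstractString)
    replace(src, r"\\\\u([0-9a-fA-F]{4})" => (m -> begin
        c = Char(parse(UInt32, m[4:end]; base = 16))
        # Avoid the Java mistake: refuse escapes that could change token
        # structure (comment markers, quotes, newlines, backslashes).
        c in ('#', '"', '\'', '`', '\n', '\r', '\\') &&
            error("escape for $(repr(c)) is not allowed")
        string(c)
    end))
end
```

With that, `unescape_source(raw"ab\\u03bc123")` gives `"abμ123"`, and `Meta.parse` of the unescaped text yields the same AST as the plain unicode spelling, which is exactly the completeness goal above.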
The second step, convenient shortcuts, should probably use the same latex-style tab completions that the REPL already has. So I can write `a \\xor b` or `a \\u22bb b` instead of `xor(a, b)` or `a ⊻ b`.
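For the alias step, the REPL's existing completion table could in principle be reused directly. A hypothetical sketch, again as preprocessing: `resolve_aliases` is a made-up name, and I'm assuming `REPL.REPLCompletions.latex_symbols`, the (internal) stdlib table behind tab completion, which maps e.g. `\xor` to `⊻`.

```julia
using REPL  # stdlib; provides REPL.REPLCompletions.latex_symbols

# Hypothetical alias resolver reusing the REPL's latex completion table.
function resolve_aliases(src::AbstractString)
    replace(src, r"\\\\[A-Za-z]+" => (m -> begin
        key = m[2:end]  # drop one backslash: source "\\xor" -> table key "\xor"
        # Unknown aliases are left untouched rather than rejected here.
        get(REPL.REPLCompletions.latex_symbols, key, m)
    end))
end
```

Then `resolve_aliases(raw"a \\xor b")` gives `"a ⊻ b"`, which parses to the same AST as the unicode spelling. Leaning on the REPL table has the nice property that the set of aliases is already familiar to every Julia user.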
Since the normalization happens early (during lexing / tokenization) there is no need to spend any thought on things like operator arity or precedence.
The only special thought to give is to not repeat the mistake of Java's source-file unicode escapes: we must make sure that e.g. `\\u0023` (which names `#`) is invalid instead of starting a comment (same with newlines, quotes, etc.).
The Java trouble is that unicode escapes in source files are resolved before the parser runs and can therefore do things like terminate comment blocks. Most IDE parsers and syntax highlighters don't see this, but javac does. That makes for fun in generated code, and for very fun obfuscated-code contest entries.
The general questions would be:
- What do you think about the general goal (have an escape hatch)?
- What are the complexity costs in terms of implementation?
- What are the complexity costs for the general ecosystem? If something is valid syntax, like emoji variable names, then you will encounter it in places like here. Adding `a \\xor b` is extra syntax you will need to mentally parse.
More technically:
- Is the identified gap in the syntax really a gap or do we need to trawl deeper?
- Does the identified gap make for nice syntax?
PS. It's not so different from latex. Modern latex supports UTF8 input, yet many people still write e.g. `Danke sch\"on` in German latex, basically because latex is by construction very US-keyboard centric. Typing latex on a German layout will damage your wrists; backslash is a very painful hand movement on a German layout, so most people type latex on a US layout and therefore prefer escapes. (Maybe Polish or Czech users could chime in? You also have lots of non-ascii glyphs in everyday texts; how do you input them in latex?)