Syntax: Escape hatch for unicode haters

Note that there are currently over 3600 tab completions for Unicode symbols in the REPL, and similarly in editor plugins that borrow the REPL completions list. Since they are not part of the language, just a UI feature, this list can be freely edited over time. However, I think there’s no way that Julia will ever add so many new keywords to the parser itself, which would have to be supported indefinitely. Of course, you could prune down the list to the most commonly used symbols, but that would clearly be a contentious choice.
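For anyone unfamiliar with the mechanism, this is roughly what it looks like in practice; the REPL will also tell you how to type a symbol pasted into help mode (output abbreviated, and the exact wording may differ across Julia versions):

julia> μ = 0.5        # typed as \mu followed by TAB
0.5

help?> μ              # pasting a symbol into help mode shows its completion
"μ" can be typed by \mu<tab>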

And as for your other suggestion of allowing \u03BC for μ, I don’t understand how that would be easier than copy-pasting while editing someone else’s code, unless you’ve memorized the Unicode codepoint table.

You can copy-paste without touching the mouse: use the arrow keys to move to the symbol, hold down shift while using the arrow keys to select, then ctrl-c/v (or whatever the copy/paste shortcut is on your system).

But by your own admission this is a need that arises for you once in a blue moon: when you are replying to a Discourse post that uses a symbol like μ and you need to edit their code in ways that introduce new uses of μ. You are asking for a whole new feature of the Julia parser so that you can avoid copy-pasting on Discourse every once in a while (a few times a month at most?), in order to reply with an easier-to-write but more-difficult-to-read response that uses \mu (or worse, \u03BC).

It doesn’t seem like the solution here (a contentious new parser feature) is proportionate to the problem (a minor inconvenience on rare occasions when editing other people’s code on platforms that don’t support shortcuts).

9 Likes

To me that’s wild! I mean the kerning may be worse than latex output but the symbols are still beautiful compared to latex code. Badly kerned

∀x ∈ ℝ⁺, ∃n ∈ ℕ : n ≤ x < n+1

is still more readable than

\forall x \in\mathbb{R}^+, \exists n\in\mathbb{N} : n \leq x < n+1

I would say even for latex users, but many of my correspondents don’t even know latex.

I would refactor that email ASAP and submit a PR :laughing:

4 Likes

There’s a fundamental issue with the \pi = 3.14 unicode-alternate syntax idea that has not been discussed yet (though Steven mentions it as I compose this), and that’s the naming of the alternates themselves. Only two languages have a similar feature, JavaScript and C, and both only allow codepoint identification with codepoint escapes like \u. And while I still think that is itself a misfeature, it doesn’t have half the difficulties that a named unicode synonym would have.
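For reference, Julia does already accept codepoint escapes inside character and string literals, just not in identifiers, which is the part such a proposal would change; e.g. in a recent 1.x REPL:

julia> '\u03BC'
'μ': Unicode U+03BC (category Ll: Letter, lowercase)

julia> "\u2192"
"→"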

Let’s say we did use the current list of tab-completion identifiers as it stands. How would you write μm? It’s not \mum. What about \taurus? Is that τrus or ♉? So at a minimum this would require yet another syntactic element to delineate the end of a token. But the exact mapping of names to symbols is also fraught. HTML and LaTeX disagree in their entity names at times. And there can be, and are, multiple synonyms in our tab completions. Surely it’d be even more confusing to be able to write and refer to → as any of \to, \rightarrow, →, or \u2192. It’s not a big deal as a tab completion feature (you do see what you get!), but it is a big deal when it becomes part of a source code file.
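To make the delimitation point concrete (the escape spelling below is imaginary, purely for illustration):

μm = 1.0e-6         # valid Julia today: μm is a single two-character identifier
μ = 1.0e-6; m = 3   # μ and m are also perfectly good identifiers on their own
# so a hypothetical \mum would be ambiguous between "\mu followed by m" and a
# completion literally named \mum, hence the need for an explicit end-of-token marker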

Naming is such a fundamental part of a programming language. In fact, it’s pretty much all a programming language does for us. Changing how names work isn’t a small thing.

8 Likes

I made a brief aside as an example of why better tooling around unicode should be widespread, and to explain why I find the aggressively dismissive attitude to “weird foreign characters” annoying.

You decided to run with it.

1 Like

This doesn’t seem like something that should be part of Julia. We already have tooling that solves that problem: git hooks. You can set up a post-checkout hook that converts all unicode to some ascii-equivalent when you check out a repository and a pre-commit hook that converts it back when you commit. In the early days, this was commonly used to normalize Windows/Unix line-endings, before they built that into git as a feature.
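For the curious, here is a minimal sketch of the idea, assuming a hypothetical helper script (call it hook.jl) invoked from both hooks and a deliberately tiny, collision-free substitution table; it is not an existing tool, and a real pre-commit hook would also need to re-stage the rewritten files with git add:

#!/usr/bin/env julia
# Hypothetical hook helper: swaps a few Unicode symbols for ASCII placeholders
# in tracked .jl files. Placeholders are chosen so the mapping is reversible.
const TABLE = ["μ" => "__mu__", "∈" => "__in__", "≤" => "__le__"]

function rewrite(pairs)
    for file in readlines(`git ls-files '*.jl'`)
        src = read(file, String)
        for (from, to) in pairs
            src = replace(src, from => to)
        end
        write(file, src)
    end
end

# post-checkout would run `julia hook.jl ascii`; pre-commit, `julia hook.jl unicode`
if ARGS[1] == "ascii"
    rewrite(TABLE)                                 # Unicode -> placeholders
else
    rewrite([to => from for (from, to) in TABLE])  # placeholders -> Unicode
end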

5 Likes

I propose \mu<TAB>m

:rofl:

5 Likes

The second proposal I’d fully support. The number of times I’ve seen, say, for i ∈ array in non-mathematical contexts over the years is pretty annoying, not only because it provides zero benefit for readability, but also in large part because, from a new user’s perspective, many may not understand what ∈ even is and whether it’s different from the keyword in.

It really is a social/cultural problem at the end of the day - for example, programmers writing in a language without convenient Unicode support would happily write um for micrometers and not have to bother with any of these questions.

I recall a post by ninjaaron from years back which I didn’t think much of at the time but which perhaps has some meaning now. The preponderance of mathematical symbols in a community presents a barrier to understanding for someone from a non-scientific background.

5 Likes

and code will be all the prettier for it!

1 Like

What does ‘synonym’ mean in this case? Is it a way to simplify input? What would be displayed, and is that user-configurable?

\le<TAB> doesn’t really have much, if any, advantage over <=. There should be a significant improvement in readability before unicode symbols are used.

Fwiw the Julia syntax I don’t like is multi-character infix operators like <= and in, because they are so strange.

in looks like a variable name, so why is it sitting in the middle of an expression?

Surely x <= y means x = x < y following the standard OP= pattern?

By my lights, using these is “gratuitous”; Unicode names are normal and simple.

2 Likes

:thinking:
I checked more than twenty more-or-less common languages and all of them use <= for less than or equal to.

R, Python, Go, C, Java, JavaScript, C#, Haskell, Lua, Perl, Ruby, Tcl (which I thought was odd enough that it might be an outlier), Rust, Visual Basic, Pascal, Swift, PHP, SQL, Lisp (not infix), Elixir.

Even Fortran now allows <= as an alternative to .LE., as does “the gross verbose.”

I finally found an outlier: Erlang. It uses =< instead.

4 Likes

Several of those languages are older than Unicode; most couldn’t count on it being available. Only some of them have the OP= pattern that creates the inconsistency with += -= *= etc.

What I value much more than ASCII in programming is simple composable rules. For syntax, that means creating visual patterns that correspond with semantic patterns. See Ken Iverson’s 1979 Turing Award lecture, “Notation as a Tool of Thought.”

2 Likes

C and all of its derivatives have the OP= pattern, and all of them use <= for less than or equal to.

See languages listed at Augmented assignment - Wikipedia. Most of them are on my list above.

1 Like

I do like this argument, but <= meaning less or equal is so widespread and entrenched that it would be just too confusing to use it for anything else.

2 Likes

The longest operator is isa, and it doesn’t have a shorter Unicode version, if I’m not mistaken. The two-character pipe operator |> has no shortcut, either.

1 Like

Pardon the anecdote, but when I was a young child, a science teacher attempted to teach me how to calculate the final temperature of mixing two liquids of different amounts, initial temperatures, and heat capacities, using his algorithm that incrementally adjusts the temperatures until they almost match (think while tol > 0.1). I heeded his assumptions that energy is conserved and only transferred as heat between the liquids, and used elementary algebra to derive a one-step formula that computes the same answer to which his loop-based algorithm converges. Despite clearly writing down this derivation and correctly answering every problem, I was awarded a 0 because, in his words, “I don’t understand this, science is not math.”
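(For the curious, the two approaches look something like this, with made-up masses, heat capacities, and temperatures, and assuming the first liquid is the hotter one; the closed form drops straight out of the energy balance m₁c₁(T₁ - T) = m₂c₂(T - T₂).)

# the one-step formula
mix_exactly(m₁, c₁, T₁, m₂, c₂, T₂) =
    (m₁*c₁*T₁ + m₂*c₂*T₂) / (m₁*c₁ + m₂*c₂)

# the loop-based version: move small chunks of heat until the temperatures nearly match
function mix_iteratively(m₁, c₁, T₁, m₂, c₂, T₂; ΔQ = 1.0, tol = 0.1)
    a, b = T₁, T₂
    while a - b > tol
        a -= ΔQ / (m₁ * c₁)   # hot liquid loses a chunk of heat
        b += ΔQ / (m₂ * c₂)   # cold liquid gains the same chunk
    end
    return (a + b) / 2
end

mix_exactly(1.0, 4.18, 80.0, 2.0, 4.18, 20.0)      # 40.0
mix_iteratively(1.0, 4.18, 80.0, 2.0, 4.18, 20.0)  # ≈ 40.0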

Putting public school funding and teaching standards aside, I see the same anti-mathematical argument being made here against “Unicode”, or more accurately the subset excluding ASCII in the Basic Latin block. Programming is heavily based in mathematics, even if people would rather not recognize it, and it often involves mathematics directly. It is not perverse to use symbols from modern mathematics, itself a written lingua franca, when you program mathematics. The proposals aren’t truly excluding mathematics or Unicode; what they really do is perform an ASCII-only romanization (the correct linguistic term for the aforementioned “latinize”).

Despite previous arguments, romanization is not at all an assimilation into a lingua franca, nor are its purposes solely for people who only recognize the Latin alphabet. Romanization is also not a sufficient replacement for the written language; in other words, meaning and details are lost in transliteration. Romanization is also not as straightforward as portrayed: transliteration can be done in many different ways, and standardization requires effort, consensus, and learning. Should we force people to learn yet another standard with the sole promise of allowing certain people to pretend they are not writing mathematics, Unicode, a foreign language, or whatever makes them uncomfortable? My vote is no.

2 Likes

Indeed, one does not have to spend long on places like Hacker News or whatever to find the classic programmer take that modern mathematics needs to be ‘reformatted’ and rewritten with no special symbols and no single-letter names.

The anti-intellectual belief here is rooted in the idea that “I’m smart, but this topic confuses me, therefore there’s a problem with the topic, not that I actually need to put in work and engage with / learn the topic to understand it”. So they decide that the only reason they don’t understand modern advanced mathematics or physics or whatever is that the notation is bad, and if the notation was more like the way they typically write code, then everything would be clear as day.

6 Likes

I don’t recall making any sort of radical proposal like this.

Mathematics has a notation honed from centuries of tradition and practical use, built specifically for mathematics. Programming has had a similar trajectory for its use case, with symbols used specifically for programming. It just so happens that programming has borrowed certain symbols from mathematics, like = and < (for assignment and comparison, respectively), for their commonality and the convenience of being easily recognizable and typeable on a keyboard. (A symbol existing or not existing on a keyboard only reinforces its commonality or uncommonality.) Other notation is taken directly from English, including keywords. The history of programming and decades of experience have shown that English verbosity taken to an extreme, as in COBOL, is undesirable, but that it is reasonable in moderation, as in Python.

It is not an original idea to espouse that an expression like for i ∈ array presents a barrier to understanding for programmers who are familiar with the idea of a for loop but less familiar with the “element of” symbol taken from mathematics, which they rarely encounter in code from other programming languages. Even those who know what the symbol is may ask, “is ∈ the ‘same’ as in, or are they subtly ‘different’?”, for two reasons: one being that there already exist symbols like = having different meanings between math and programming, and the other being that it is not obvious whether ∈ in Julia is being used in a mathematically rigorous way.
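(For what it’s worth, there is no subtle difference in Julia: ∈ is simply another spelling of in, both as a function call and in for loops. A quick REPL check, for illustration:)

julia> 2 in [1, 2, 3], 2 ∈ [1, 2, 3]   # same operation, two spellings
(true, true)

julia> for i ∈ [1, 2, 3]               # `for` accepts `in`, `∈`, or `=`
           print(i, " ")
       end
1 2 3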

Perhaps upon examining code out in the wild they may get the idea that ∈ should be used in “idiomatic” Julia code. Perhaps they may have a negative first impression and be discouraged by the barrier to entry, if Unicode uncommonly used elsewhere shows up in such a mundane context. I honestly don’t think these two scenarios are that far-fetched.

Mathematics may be a universal language, but it is not a language universally understood. We should be cognizant of the fact that not all programmers have had the opportunity to take higher-level mathematics courses where they could become familiar enough with such notation to treat it as second nature, nor do they necessarily have the time or motivation to self-learn. If there are enough people out there who have decided that they can’t be bothered to put in extra work to learn the Unicode symbols encountered in Julia code, it should be acknowledged, and there should be ways to accommodate this so that it doesn’t happen as often, instead of just telling people to get good or to use the opportunity as a “learning experience.” I think the cleaner solution is not through escape hatches and new syntax, but simply to curtail the excesses of Unicode usage so that people coming from other programming languages aren’t intimidated. I’m personally not intimidated, but new users who are less enthusiastic about the language might be.

Moreover, I think the size of this “language” barrier is being underestimated here. The notation and symbols of mathematics are more akin to a logographic language like Chinese than to an alphabetic language like French. Chinese characters are pronunciation-agnostic, in the sense that a character gives a poor indication of its pronunciation and meaning. You can sometimes guess based on the radical, some radicals being more reliable than others, but even this has many exceptions, and the only real way to learn the characters and achieve reading fluency is to memorize them. You either know a character or you don’t. Obviously you can look it up, but every time someone does this they get more exasperated at having to do extra work, and productivity inevitably slows. By contrast, a learner of French may not understand the meaning of a specific unknown word, but after a year of courses is able to at least pronounce the word and perhaps even get some idea of its general meaning through context clues. As others have alluded to, mathematical symbols are unpronounceable to the uninitiated. Like ∀. Except perhaps in a silly way like “V with a line” or “upside-down A.”

And, of course, as with any language, one can only achieve fluency through repeated exposure to all kinds of content via reading, writing, listening, and speaking. A programmer who has infrequent exposure to mathematical notation may find it more difficult to maintain their level of understanding, and they’ll find it more difficult to transfer this skill to other languages because those languages don’t use Unicode this liberally in the first place.

I suspect the general programming wisdom of descriptive, informative variable names comes in part from this idea from language learning. Perhaps the reader may not yet understand how a function is used, but if its name is informative enough, he’ll at least get an idea of its general meaning. Since most code out there is not explicitly mathematical in nature, this wisdom ended up winning out. The fact that there is some connection to mathematics is, practically speaking, a moot point for the vast majority who have been taught these two as separate fields, and that part of the style guide is a testament to that. (On the other hand, the idea that single letter variable names are reasonable in code closely representing mathematical formulas is due in large part to the information that the code is conveying being explicitly mathematical in nature, and therefore the style of compact, elegant notation from mathematics holds.)

4 Likes

That’s true, but the target audience for many Julia packages is people with university degrees in a technical field. If you’re writing a package for, say, number theory, or in my case, quantum control, any user or contributor is going to be very well-versed in mathematical notation. In fact, they might be more comfortable with math than with general programming. In that context, unicode will make the code more accessible.

On the other hand, if the target audience for a Julia package is not someone with an advanced degree, they might want to avoid math-y syntax. As has been pointed out before in this thread, the judicious use of unicode is all about communicating with your target audience in the way that is most natural for them.

So I’d probably write code like all([isdir(target) for target in deploy_targets]) in one context, but amplitudes = [abs2(vₙ) for vₙ ∈ V] in another. Both in and ∈ have their justification.

8 Likes

As long as we can’t do this, I don’t see any problem with a few Unicode symbols used here and there … might prevent adoption of Julia in the Vatican though.

3 Likes