Syntax: Escape hatch for unicode haters

I’m reeling from the sheer amount of Anglo-centrism here. What’s standard and convenient is of course defined purely from the standpoint of the English alphabet.

There’s no transliteration happening in the system; if there were, I wouldn’t mind. I am simply not allowed to enter my name correctly, or I have to enter my name differently in different places, which is also stressful.

If you want some re-coding/normalization internally, go ahead, but let me enter my actual name, not some garbled version.

3 Likes

That would be great! But there appears to be no practical consensus on that. For example, stdlib contains a gem like

function consumed!(buffer::Buffer, n::Integer)
    @assert n ≤ buffer.size
    buffer.offset += n
    buffer.size -= n
end

This use of unicode is completely gratuitous. n <= buffer.size would do the same job. Base64 encoding is not scientific code.
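
For the record, the two spellings aren’t merely similar; Base defines const ≤ = <=, so they name the very same function and nothing but the glyph changes:

(≤) === (<=)          # true: Base aliases the Unicode spelling to the ASCII one
(3 ≤ 4) == (3 <= 4)   # true: identical behavior, either way of writing it works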

If there were no non-scientific uses of unicode identifiers/operators in Base+stdlib, then I’d be much happier! (would require infix xor, though)

I deliberately brought up a diverse list of languages / charsets, in the hope that at least one of them is non-standard to you, personally. So that you could empathize with the poor hypothetical baggage handler standing in the rain with two soaked printed passenger lists they have to cross names off, because the airport was hit by yet another ransomware attack, three months after management laid off half the sysadmins.

You yourself brought up the example of air traffic.

Do you know that air traffic control is all in English? In fact, it’s not really English; it’s a strictly specified technical dialect of English.

It is absolutely necessary, and without alternative, to have some common way of vocal communication between pilots and ATC in different countries. This is not Anglo-centrism.

Historical developments led to this language ending up as an English-based technical dialect. This is where you can complain about Anglo-centrism.

A lingua franca is necessary. It used to be Latin; some specific (ecclesiastical) domains still use Latin; afaiu some specific domains use a lot of French; most of the world uses “broken English”.

Go on and lobby that all worldwide ATC should be done in a different language, but don’t come up with stupid ideas like “use the local language”.

Go on and lobby that scientific papers should again be written in Latin instead of English, because 1. fuck Anglo-centrism, and 2. moving technical communications from Latin to local vernaculars was a giant historical mistake fueled by the Reformation and nationalism.

You would be in good company, e.g. Gauss preferred to write his papers in Latin instead of his native German.

I would say: We have a lingua franca again, thank god for Anglo-centrism, and I prefer my technical communications in English as opposed to my native German.

It is so cool that I can read French or Japanese authors without pain because humanity managed to standardize again!

(Untranslated Russian papers from Soviet times are a pain, though. Russian is the most important non-English language in my corner of maths, and it is a big failure on my part that I never got around to learning it.)

It is important to the system that you can recognize and answer to your transliterated name: e.g. there may come a moment where you are called over loudspeakers by an announcer who has never in their life seen any of the glyphs making up your name, and who quite possibly cannot produce or distinguish the sounds that make up your name. In that moment, the announcer needs a pronunciation, some garbled combination of syllables, that you are able to recognize as “oh, they are calling for me to go to check XYZ”.

Making everybody learn the pronunciation of all languages scales quadratically; making everybody learn some snippets of a lingua franca, currently broken English, scales linearly.

2 Likes

I think this is a red herring.

Sure, technically a very wide range of characters can be used in Julia code, but practically it is usually math operators, Greek letters, and generally symbols that are used in mathematics. Given Julia’s target audience, those are not so foreign to most Julia users.

Also, even if people used all sorts of Unicode letters, I imagine that would be self-limiting. Yes, I can write a package which exports the generic function 🍕, and may even register it. But I imagine that no one in their right mind would use it (other than as a joke).

I don’t understand what this has to do with coding in Julia. Please try to keep the discussion specific to Julia code; this forum is not the right place to discuss Unicode in general.

Generally Julia syntax has only a few infix operators composed of letters, eg in and isa; I can’t recall any others. IMO the chances of adding extra ones are slim.

3 Likes

5 posts were split to a new topic: Efficiency of parsing ASCII vs. Unicode

Context: upthread, @DNF complained about airlines refusing to accept his non-English name for bookings. I pointed out that the issue is human problems, not (just) unicode handling by (very) legacy airline computer systems.

Yes.

But something that is quite foreign to many users is ≤ as a mono-spaced operator. It is not part of a standard keyboard layout. It is common in handwriting, on printed paper, in PDF, or maybe in HTML if compiled from some other language; but in mono-spaced contexts the standard spelling / transliteration is <= (or \le if you’re writing latex, or &le; in html).

Julia is an extreme outlier among programming languages in using ≤ (thank god <= is also valid! I would have hoped that all sane people use <=, but Base/stdlib make gratuitous use of ≤).

To be clear, that’s not a Unicode-only API (because there is an ASCII syntax xor(a,b)), so you’re not forced to use Unicode. You’re now raising a different complaint — that programmers using Unicode, thanks to its wider range of symbols, can write more compact and elegant-looking code than ASCII-only programmers.

(And you wanted to fix this by making the parser support Unicode escapes? e.g. a \u22BB b is somehow better than xor(a,b).)
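
(For context: the two are literally the same function. Base defines const ⊻ = xor, so the Unicode spelling only adds infix syntax:)

(⊻) === xor           # true
xor(0b1100, 0b1010)   # 0x06 (== 0b0110)
0b1100 ⊻ 0b1010       # same call, written with the infix operator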

You mean, foreign to type, not foreign to read. ≤ is a symbol that most people learn long before <=. But, of course, they aren’t required to type it, so what’s the problem? (People keep hypothetically mentioning editor support, but so far I haven’t heard of one common programming editor that can’t display ≤ in 2024.)

What I am sympathetic to is new users seeing people post code that uses these characters and thinking that they are required to use it too — getting flashbacks to APL with its special keyboard — or thinking that editors won’t be able to handle it (especially Anglo-centric users not realizing that Unicode support is virtually everywhere these days).

So, there is an education issue: making it clear that the base language and stdlib functionality is fully accessible with ASCII, even if it has become easy and attractive for many people to use a wider range of symbols.

5 Likes

For the specific xor issue, I proposed a \\xor b as syntax that would tokenize the same way as a \\u22BB b and a ⊻ b.

The a \\u22BB b spelling is for completeness, not convenience: to enable automatic tooling / code formatting that transliterates code. Such automatic tooling must preserve ASTs (because macros can operate on them!) and must be complete, i.e. be able to handle all possible valid julia files. Ideally such automatic tooling would also not need to introduce parentheses, as an infix-to-function transformation would necessitate.
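
To make the AST point concrete (a quick REPL check): the infix and function-call spellings already parse to different ASTs today, which is why a transliteration tool cannot simply rewrite ⊻ into xor(...):

Meta.parse("a ⊻ b")                              # :(a ⊻ b), a call to the symbol ⊻
Meta.parse("xor(a, b)")                          # :(xor(a, b)), a call to the symbol xor
Meta.parse("a ⊻ b") == Meta.parse("xor(a, b)")   # false – a macro can tell them apart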

I think we would need the double-backslash to avoid syntax collisions with the matrix division operator in contexts where u22BB or xor are identifiers.

That’s probably the core of our disagreement: for me it’s not just a saving grace, it’s the whole reason why I don’t really see the problem. I mean I understand that Unicode can be annoying depending on your tastes, but I see it as a trade-off more than a problem.

I understand you’re giving an analogy where it’s 100 times worse to make a point. But I think most people already agree that Unicode (like many other things) can be annoying. Of course if you take a small annoyance and make it 100 times worse it can become a problem…

OK good example: how is this particular piece of Unicode a problem in practice? What’s the concrete scenario here?

I feel it’s a bit like the passionate arguments against 1-based indexing. I’ve heard many people say they won’t use Julia just because of that. They give various technical and subjective justifications and insist that Julia is losing many developers because of this choice. It’s probably true unfortunately. But there are also many people that like 1-based indexing (especially coming from R, Matlab, Fortran, etc.), maybe it’s a good fit for Julia, arguably with more upsides than downsides. And anyway most people don’t care that much because it doesn’t matter much in practice.

3 Likes

It is perfectly valid Julia syntax and has been from the very beginning. Calling it “gratuitous” is just another way of saying that you don’t like it. Naturally you are entitled to this opinion, but at this point in Julia’s life cycle Unicode symbols are pretty much a take-it-or-leave-it feature: one can

  1. use them sparingly when they make sense,
  2. go overboard with them (this is a spectrum, up to and including using emojis in exported APIs),
  3. not use them at all in one’s own code, but accept the fact that others do, with all that this entails (eg if I am making a PR to a library that uses Unicode symbols, I need to set up the tooling on my machine).

I guess (3) is the limit of how much you can practically opt out of Unicode in Julia. You may also contribute to style guides to limit (2) [which, again, I don’t think you need to do because in practice it does not happen]. But (1) is here to stay and it is pointless to argue about ≤.

4 Likes

Note that “set up the tooling” at minimum involves (a) an editor that can display Unicode characters (i.e. every common editor these days) and (b) the ability to copy-paste the occasional symbol. Who doesn’t have this?

You only really want tab completions or other shortcuts (which are also easy to set up) if you are using them really extensively, and it’s very rare that you would need to do this even if you are editing someone else’s project that uses such characters fairly often — most of their code will still be ASCII.
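
(For the curious: the REPL’s latex-to-unicode completion table is just a dictionary inside the REPL stdlib. It is an internal detail, so it may change, but it shows how little machinery “setting up the tooling” actually involves:)

using REPL
REPL.REPLCompletions.latex_symbols["\\leq"]   # "≤"
REPL.REPLCompletions.latex_symbols["\\xor"]   # "⊻"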

4 Likes

The real question though is why wasn’t julia designed with first-class support for punchcards?

Julia code makes gratuitous use of spilling past the 80 character limit, so when I print out julia code onto my punch cards things break all the time! If I want to manually copy someone’s code, I have to insert line breaks with #= and =#, and those not only take up valuable characters, but they also make it harder for me to compare line numbers.

It’d be really great if we could force people to stop writing more than 80 characters in a line, and have an escape hatch that lets us avoid using wacky exotic characters like {, }, and !.

And don’t get me started on case sensitivity!

9 Likes

So to qualify: this particular piece of unicode is not a big problem. Rather, it demonstrates that many julia developers and projects tend to gratuitously use unicode operators outside of scientific contexts.

Base/stdlib sets a standard. People read that code and learn appropriate julia style from that. This piece of code will ingrain the habit “gratuitous use of unicode operators is acceptable”. New users will learn to use \le<TAB> instead of <=.

Due to the verbosity of documentation/specs, people are even more dependent on reading julia Base than e.g. java devs are on reading SDK sources: most julia devs code against the implementation of Base APIs. (Only mediocre java devs like me code against the SDK implementation, while both bad and excellent devs code against the spec; insert the troglodyte-jedi meme here. Even worse in C! Coding against a spec is so much harder than coding against your specific compiler version!)

This was in response to an upthread claim that unicode was mostly used in julia in specific scientific domains with established maths-based conventions (e.g. \eta<TAB> for learning rate). “Gratuitous” in this context means unforced. Use of \xor<TAB> would not be gratuitous – I still don’t like it, but it is strongly encouraged by the lack of infix alternatives in the language.

I actually mostly agree with you on all that! The main thing I’m arguing for here is to make life easier for people in camp (3) by giving them some syntactic sugar: something ugly-but-complete to enable automatic code transliteration (that needs a \\u22BB b) and something less ugly like a \\xor b for actual human use.

In some side-conversation I am also arguing that Base is actually going overboard with unicode (\le<TAB>, \ge<TAB>, \in<TAB> and \notin<TAB>); I’d be happier if @goerz’s claim were actually true that such use was in practice restricted to scientific contexts.

But the debate on what style guide is desirable is not for discourse but rather for the julialang github, and it would need to come with one giant, semi-auto-generated PR (goal: minimize unicode; only use it where ASCII alternatives would make the code significantly harder to read).

For example, I would never suggest a PR for the single above use of \le<TAB> in stdlib/base64 – the overhead of reviewing and commit history pollution is not worth it. But I would strongly applaud a style guide that restricts use of \le<TAB> in Base, in favor of <=, if it came together with a mammoth PR that changes all uses in Base.

Why is it a problem if someone else’s code uses ≤? Suppose you come across another package that uses ≤. Or someone who uses ≤ in example code for a post on discourse. How does that pose a practical difficulty for you?

The only problem that arises in my mind is if a newcomer doesn’t know that they can also type <=, i.e. if they see that example and think that they must type ≤. That could indeed be off-putting to some people! But it’s also a confusion that can be cleared up in a few seconds. (It’s not like ≤ is a symbol whose meaning is unfamiliar!)

7 Likes

I think having two versions of an operator (ASCII and Unicode, like xor and ⊻) is already a lot. I hope we don’t make it three or four…

I’m still not sold on there being a real problem here. Going back to @jkopper’s argument:

You can do that with Julia most of the time. In some cases you might need to copy-paste a character from the screen, a small price to pay. The typical thing that wouldn’t work is the workflow of SSHing into a random machine to hack on some mathematical code that’s rich in Unicode characters. I hope we don’t double the ways of writing operators just to support this use case.

To me it still feels that it’s more about a subjective dislike than a practical issue, and the cost of addressing this dislike would be too high.

2 Likes

If I want to engage with that code, I might eventually normalize/refactor it to <=, to avoid eye-bleed on a ≤ b && c <= a.

This is a very minor annoyance, due to the excellent interop support between unicode-lovers and unicode-haters with respect to that operator. I still would prefer it if the community standard “don’t go overboard with unicode” covered such use. Community standards on code style are embodied in Base.

If I want to engage with code that uses μ as an identifier, then I either immediately refactor the code, or I choose not to engage at all.

This is due to the absolutely atrocious interop between unicode-lovers and unicode-haters for general code. I would not be forced into that choice if I could write e.g. μ ≤ b && c <= \\mu in replies to code containing μ ≤ b.

I have a feeling that many unicode-lovers here are making my and @jkopper’s point, with respect to the amount of non-trivial customization you personally needed to make the unicode portions of julia palatable (amazing custom keyboard layouts, or vim integration, or emacs integration into the web browser, or using an IDE to compose replies to discourse posts – the bad smell of APL).

Yep, I occasionally write my discourse answers in vim. I think it’s a minor inconvenience for the nice result of beautiful and readable math symbols in mathematical code.

By the way, I sometimes do the same thing when writing emails with math in the text (using the Julia REPL for a quick symbol or a Julia editor for longer text). Thanks Julia for the nice latex-to-unicode converters :slight_smile:

5 Likes

I think that would be firm julia 2.0 territory due to compat, and nobody is proposing that?

I think the real proposals are:

  1. something like \\alpha as a synonym for \alpha<TAB> (note that \alpha<TAB> already has unicode synonyms that the parser normalizes, so this proposal would not make grepping harder! See the snippet after this list.)
  2. Maybe try to softly discourage use of \le<TAB> et al, in favor of <= et al? Starting with Base and stdlib.
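
On the normalization aside in (1): if I read the parser behavior right, it already treats some look-alike code points as the same identifier, e.g. MICRO SIGN vs. Greek mu:

Meta.parse("µ") === :μ   # true: µ (U+00B5) is normalized to μ (U+03BC) while parsing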

The two proposals are pretty much separate. The second point mostly came up because some people incorrectly believed that unicode use was in practice restricted to scientific code, which is not true (imo unfortunately, but that is my personal preference).

Luckily, the kind of unicode that sees widespread use in non-scientific code is also the kind that does not need new syntax, and where new users cannot end up desperately wondering how to input it in their git mergetool or browser window, thanks to the excellent US-keyboard-only alternatives (except for infix xor).

This makes no sense to me. You called a ≤ b && c <= a “eye-bleed”, but code like μ ≤ b && c <= \\mu is okay?

4 Likes

I tend to simply write \mu, since all people who know maths know latex as well :wink:

I sometimes make the mistake of ending a German plaintext email with Mit freundlichen Gr\"u\ss{}en (“Kind regards”) to recipients who don’t know latex or maths…

This is a viewpoint I cannot understand. Beautiful math to me means latex output, with the nice kerning, inter-word spacing, etc. Editable maths for display to me means latex source code with e.g. \mu. Editable math code for programming means mu instead. The translation between \mu and μ is automatic for latex users.

Sorry for being unclear.

a ≤ b && c <= a is something I would refactor in my projects, because of minor eye-bleed. Good enough for discourse though, hence only a very minor annoyance. If everything became that nice, I would be content.

μ ≤ b && c <= \\mu is uglier, but much, much better than the alternative of having to do a refactor for discourse, which is in turn much better for me than typing unicode, given the specific shortcomings of my tooling (discourse in browser, not an IDE; no vim/emacs integration into the browser; no custom keyboard layout; no touching the mouse while typing).

Isn’t that effectively the same situation, only reversed? You’re making assumptions about what your audience understands and are asking them to cater to your communication style. And there certainly exist folks who wouldn’t know that μm is {\mu}m. In fact, plain old ASCII backslashes themselves are fraught to enter in many text widgets!

Here’s where it’s different: unicode is designed for people to communicate. Yes, it might take some effort to write unfamiliar characters. Yes, you might want to take care around confusables. But code is read far more than it’s written. So write for your audience.

9 Likes