Syntax: Escape hatch for unicode haters

Mason · January 13, 2024, 10:19pm

Sorry, I didnt mean to accuse you of that, but I can see how it’d seem that way.

From my point of view, there’s a spectrum of positions between the one you expressed and the very extreme one I described.

Benny · January 14, 2024, 12:03am

Programming isn’t nearly as consistent as mathematics is. Even for loops have several semantics, ours is really a for-each loop. in is common, but it’s hardly the only one. C++ and Java use :, MATLAB uses = (which we also support), Javascript uses of (also in, but specifically for object properties, not iterable values).

I’ve seen many people have bad first impressions of Julia, which is sometimes valid because Julia doesn’t suit their needs. I’ve also seen people ask why ∈ exists if in already does. I have never seen these people intersect. People seem very satisfied with being able to use in as they are used to, and they aren’t bothered by being told the equivalent ∈ is only “more like” an infix operator i.e. how it’s dotted in broadcasting. People ask about anonymous function syntax far more often, and they also aren’t weirded out by the unfamiliar syntax when function () ... end already exists. Despite preferring “one way to do things” more than most, I’m not bothered by either. In either case, help mode would’ve helped.

Sorry, this just doesn’t make any sense to me. It is far easier to learn the handful of tab-completed mathematical operators (some of which do have ASCII equivalents, though the infix property may change) and perhaps Google tidbits of their origin, than it is to learn Julia itself (or mathematics for that matter). This feels like saying someone shouldn’t need to learn how to tie their shoes while training for a marathon; anyone would be tripped up if they undertake a large task but entertain a phobia of a particularly small aspect of it. I think the argument for Julia being 0-based is more compelling just because it demonstrably would accommodate more people who don’t want to learn, and I don’t even think that argument is compelling to begin with.

Going beyond primarily writing the built-in English and occasional mathematical symbols, Julia’s Unicode support allows people to write code and communicate with more people in their native languages, and they won’t need much tab-completion on their different keyboards. It’d be irrational to suggest “write α as alpha because I suspect people hate Unicode” to someone on a Greek Polytonic keyboard, or to suggest they stop writing in Greek for their peers.

Considering that literacy rates are high among the global mathematical community and among nations that use Han characters, I have doubts this is actually true. This is mainly based in anecdotal difficulty learning a language as adults and baselessly attributing it to the language’s characteristics. Logograms also have an advantage to serving as a lingua franca; the written form and the meaning can remain smoothly if speaking diverges or was different to begin with. On the other hand, alphabetical languages would either shift in spellings to reflect the phonetic divergence, or they would intentionally sacrifice orthographic transparency e.g. the written silent letters in English and French that were pronounced very long ago, which conversely is a chief gripe of adults learning to speak these languages (“I looked it up, now I have to learn IPA too?”) if they’re used to logograms associated with language-specific and perfectly phonetic alphabets. Children don’t often struggle with any of these difficulties.

Ininterrompue · January 14, 2024, 4:39am

I can be satisfied using only in if I only ever interacted with my own code, but it can present a problem while reading others’ code, especially if Unicode is frequently used elsewhere in there. The problem is compounded by the fact that there are multiple choices you could make in Julia, in, =, or ∈, and not everyone is immediately aware that these three don’t mean different things in a for loop. Plus, those who shy away from Unicode and use in are necessarily going to be less familiar with code that has Unicode.

I personally agree that it’s not too difficult. But I also wouldn’t brush off other people’s impatience that easily.

It’s far easier to become acquainted with 1-indexing as well, yet there are plenty out there that shy away from any 1-indexed language for illogical reasons. There is some self-selection bias, in that those who continue to use Julia are necessarily fine with 1-indexing. But indexing is a fixed, permanent part of the language – you have to learn 1-indexing if you want to use Julia. By contrast, most Unicode usage comes down to stylistic choice and exists on a spectrum. It’s not inevitable in the sense that 1-indexing is. If they don’t use it but are forced to learn it, it’s because other people are using it in code they have to read.

Given that English is the lingua franca of programming code, and given that Julia is written in English, such code would be restricted to those who understand Greek and would probably be a pain for them to read, based on anecdotes I’ve seen from non-native English speakers. You’d still have to use Latin characters for keywords and defined functions anyway. Nor would any non-Greek speakers want to interact with such code.

Just because you can use Greek or Cyrillic or Chinese characters in the code doesn’t mean you should.

However, a native Greek speaker could find utility in being able to write comments and documentation in Greek for his team who all speak Greek. It is here where one can reasonably afford to write in natural language. Still, I think it is more of a “nice thing to have” than something whose absence would significantly impair productivity or comprehension. The syntax, resources, learning materials, etc. for most programming languages out there are overwhelmingly in English. To be a competent programmer, you’re almost certainly going to have to know some English.

So, it depends on what the alpha is used for, but I doubt this hypothetical Greek programmer would be seriously deprived from having to write alpha instead of α because he would most likely in all ignorance of Julia use a Latin keyboard by default when getting started with the language, as he would if learning any other commonly used language out there.

It is generally accepted that logographic languages are more difficult to learn for Western speakers. Not just Westerners, really – anyone who is not familiar with the character set. There are of course other factors here, but the writing system is certainly playing a role. I know Mandarin, so I can attest to its learning process. Pronunciation-agnosticism is just one of the many difficulties – if you forget how to write a character, you cannot use its pronunciation as a hint to write it. You have to use a dictionary every single time. It’s true that English and French have some unreliable spelling rules, but at least you have an idea of what it looks like. Maybe -ough is tough to remember, but you can still write something on the paper. With Chinese, you don’t even know where to start. You cannot learn Chinese characters without rote memorization no matter if you’re a child or an adult. (Adults will struggle with other things more, like tones.)

Chinese schoolchildren are forced to learn Mandarin in school, so it’s not a surprise that literacy rates are high in China and Taiwan. The literacy rate doesn’t say much about the difficulty of the language, but rather points to other socioeconomic factors.

The point remains, you either know it or you don’t. Everyone knows the symbols on their keyboard, and not everyone knows Unicode. Therefore, if a developer has a choice, it is most accessible for others to read if one sticks with characters readily typeable on a keyboard. That’s the default, “when in doubt” position I’d take when writing new code that isn’t explicitly mathematical. I went back to the code MilesCranmer referenced in the other thread and found the same rendering difficulties on Firefox, Windows 10. But moreover, it became increasingly obvious to me that names consisting of just Latin characters would’ve perfectly sufficed where Unicode was used instead.

This is getting off-topic as is, so I won’t press the issue any further.

Benny · January 14, 2024, 6:39am

I agree, but this isn’t different from learning the semantics of a for block in other languages and whatever symbols they use. The documentation on loops in Julia explains it all plainly for anyone willing to do a Google search. If they won’t do that much, they will have issues learning any language.

That assumes that the people you are communicating with all know English, and perhaps in a team of programmers with several programming languages under their belts, that is a fair assumption. But programming doesn’t only show up in bilingual dev teams, it can show up in contexts where not everyone is a programmer. This has been especially true for academia, and this influences the programmers’ choices of languages and how they present their work. Being able to make code more aligned with the associated text or more presentable to a non-programming audience is not a quirk, it’s a feature. I’ll concede that English speakers would not understand a Jupyter notebook written in another language, but it’s fine if we’re not the intended audience.

Again, most Unicode usage by English speakers has been symbols, some from non-Latin alphabets, that are widely used in the mathematical or physical contexts the code involves. π is not merely “style”, it is the most widely used symbol for a mathematical constant, far more common than the romanization pi with the exception of contexts that limit writable symbols. TeX exists because that limitation is untenable. I found Julia’s non-ASCII support much easier to adapt to (if I can even call it that, I actually found it pleasant to write familiar symbols that mirror my text) than switching from 0-based to 1-based indexing, so I cannot relate to your opinion that avoiding 1-based indexing is illogical but avoiding non-ASCII symbols isn’t.

That link only mentions English speakers, not “Western speakers,” and it does not contradict anything I said about adults having difficulty learning a second language. Your particular complaints about a linguistic tradeoff isn’t really indicative of anything about the languages, let alone justify atypical usage of the Latin alphabet in inappropriate contexts.

bertschi · January 14, 2024, 9:23am

Well, in most cases I have seen Greek characters are used in math heavy code. In particular, if algorithms from a certain paper or community with an specific mathematical notation convention are implemented. In this case, Greek letters are nice as the code stays closer to the original formulae in the paper or book and anyone who wants to understand what the code does in depth needs to know these symbols anyways, i.e., when relating to the original paper or book.

stevengj · January 14, 2024, 1:00pm

This thread was originally about alternative input/parser methods for Unicode characters, and has now become a general debate over the utility of non-ASCII symbols in code. I think @mbauman’s comment from Warning against Unicode confusables - #29 applies here:

cormullion · January 14, 2024, 2:29pm

Can you post an image showing what Iosevka has for that? I’ll add it to JuliaMono for the next release.

ffevotte · January 14, 2024, 8:22pm

Screenshot_2024-01-14_21-21-57

cormullion · January 14, 2024, 9:20pm

That looks like Noto Emoji, not Iosevka,

ffevotte · January 14, 2024, 9:31pm

Ah, sorry. You must be right. It looks like a fallback has been used in this case… Maybe Iosevka does not support after all?

cormullion · January 14, 2024, 9:37pm

Pity - I thought it might have been some unusual Linear Algebra operator that I’d not heard of and omitted by mistake…

Tamas_Papp · January 15, 2024, 8:58am

Yes, I misspoke. I just looking at my Emacs config and realized that I set up Noto Color Emoji as a fallback ages ago.

Generally, I don’t think fonts should support every single UTF8 glyph under the sun, that’s why we have fallbacks.

foobar_lv2 · January 15, 2024, 2:27pm

Yes, this is an issue Ialso came up with when sketching the necessary changes in JuliaSyntax.jl: For tokens that contain both escaped and unescaped chars, one needs separators. A simple one could be \\:\mu:m, i.e. \\ opens a trigraph-sequence; if followed by colon, then it is multi-char, and followed by a sequence of blocks separated by colons that are either single-char escapes or sequences of plain characters. I’m not very sure about what detailed choice looks nice in the end – that kind of bikeshedding should probably come later.

I’d rather try to focus on: Does a syntax / tokenizer escape hatch make sense? What are the downsides and upsides of it?

We already support unicode synonyms like multiple different spellings of \mu<TAB> that the tokenizer normalizes (and that the Symbol constructor does not normalize).

When I originally used the word “synonym”, I gave the very technical and narrow definition of “produces identical ASTs after running through the parser”.

This is still what I believe is appropriate handling of such trigraph/escape sequences.

About what would be displayed, that’s up for syntax highlighters and IDEs to decide, not for the parser. Ideally IDE maintainers would make this configurable.

My concern is with improving support for tooling that is geared towards plain-text-files, not julia-specific tooling. Such non-julia tooling would of course display whatever is in the plain-text-file, i.e. \\alpha.

Multi-character operators and names and keywords are standard in structured computer-parsed languages – not just programming in some narrow sense referring to Turing, but also things like markdown. This is a lineage and precedent that is significantly more important for design-space than mathematical language.

FWIW, mathematicians have converged on latex for this purpose.

My goal in this thread was to make julia more accessible for people from exactly this tradition – i.e. people who would name their small guys epsilon or \varepsilon and would recoil in horror at seeing ε in code. (monospaced source-code and typeset maths are separate magisteria, connected by a command-line tool)

Thank you for the clear terminology! And it is really not about ascii, it is about “intersection of most common keyboard layout typeable characters”. This intersection happens to be roughly US-keyboard.

I am not really intending to force people to learn yet another standard or something.

But do you agree that romanization / transliteration is extremely advantageous to users who are unfamiliar with the unicode characters used in text they are handling?

Capability of visually distinguishing glyphs
Subvocalization, and capability of mentally referring to unfamiliar symbols
Capability of technically referring to unfamiliar symbols
Capability of dealing with the text in a wide range of not necessarily julia-aware tools

To expand on (3): Suppose you deal with code using variable names like ε̂. You want to refer to it. You may of course copy-paste that symbol into the repl to get help ("ε̂" can be typed by \varepsilon<tab>\hat<tab>). But you will need to keep that in your feeble brain!

Now imagine this was transliterated for you as \\varepsilon\hat. Now every single occurrence of this variable will remind you how to type it (importantly: the \hat is a suffix in julia, not a prefix as in latex. This would fuck me up without constant reminders). It will remind you how to pronounce it during subvocalization when mentally referring to it. The learning curve for dealing with that code base has been cut.

I am not trying to be an anti-intellectual brogrammer here. I am a mathematician. I like my non-latin symbols, and the depth of connotations they carry!

And I really love the latex model of dealing with these symbols in the contexts of keyboards and terminals. The latex model contains a strict separation between monospaced editing, and the output artifact of typesetting.

Ininterrompue:

Moreover, I think the size of this “language” barrier is being underestimated here. The notation and symbols of mathematics are more akin to a logogramic language like Chinese then an alphabetic language like French.
[…]
the only real way to learn the characters and achieve fluency of reading is to memorize them. You either know it or you don’t. Obviously you can look it up, but every time someone does this they get more exasperated having to do extra work, and productivity inevitably slows. By contrast, a learner of French may not understand the meaning of a specific unknown word, but after a year of courses, is able to at least pronounce the word and perhaps even get some idea of its general meaning through context clues. As others have alluded to, mathematical symbols to the uninitiated are unable to be pronounced. Like ∀. Perhaps except in a silly way like “V with a line” or “upside-down A.”

A big part in favor of the mathematics-inspired names is to carry over the deep connotations from years of study. ϱ as a density, ϑ as a homotopy parameter, that kind of thing.

However, these connotations translate just as well when you call these variables rho and theta, which are perfectly acceptable names for the non-math-programmer crowd, and are functionally short one-letter-names for physicists or mathematicians.

I am not proposing to take anything away from the language or forbid something. That would be julia 2.0, I’m not opening that can of worms.

I am proposing to have a third way, specifically for the benefit of the in crowd when interacting with the \in<TAB> crowd.

I am somewhat astounded by the kind of push-back I’m receiving here.

Ininterrompue:

Given that English is the lingua franca of programming code, and given that Julia is written in English, such code would be restricted to those who understand Greek and would probably be a pain for them to read, based on anecdotes I’ve seen from non-native English speakers. You’d still have to use Latin characters for keywords and defined functions anyway. Nor would any non-Greek speakers want to interact with such code.

Just because you can use Greek or Cyrillic or Chinese characters in the code doesn’t mean you should.

However, a native Greek speaker could find utility in being able to write comments and documentation in Greek for his team who all speak Greek. It is here where one can reasonably afford to write in natural language. Still, I think it is more of a “nice thing to have” than something whose absence would significantly impair productivity or comprehension. The syntax, resources, learning materials, etc. for most programming languages out there are overwhelmingly in English. To be a competent programmer, you’re almost certainly going to have to know some English.

So, it depends on what the alpha is used for, but I doubt this hypothetical Greek programmer would be seriously deprived from having to write alpha instead of α because he would most likely in all ignorance of Julia use a Latin keyboard by default when getting started with the language, as he would if learning any other commonly used language out there.
[…]
Everyone knows the symbols on their keyboard, and not everyone knows Unicode. Therefore, if a developer has a choice, it is most accessible for others to read if one sticks with characters readily typeable on a keyboard.

What?

I am advocating for learning from the excellent user-experience of TeX, where source-code is typically romanized and symbols are readily available on a keyboard, and π can be represented as \pi. But for e.g. greek people who want to typeset text in their native language, they can of course also write unicode π in their text! (otherwise they would presumably get very annoyed when using latex to e.g. typeset a CV or cover letter for some position)

You’re right that this thread has derailed too much and closing is probably the right call

Since you took the moderation decision to close this thread, could you maybe comment on how one could discuss the topic in a less derailed manner?

I.e.

Can we reasonably address the issues of unicode-in-source haters? By helping these people like me with their needs, not by depriving others of the things they love in julia? (however misguided I personally might think α in code is, that would be a very different discussion, and that would be strictly julia 2.0)
Should we maybe do something about discouraging the prevalence of non-ascii unicode in the general style guides? If so, we should also address their prevalence in Base / stdlib – that must be the paragon of julia coding style.

adienes · January 15, 2024, 2:39pm

well…

anyway, I don’t think comparisons to latex are very useful, since nobody but the author reads latex source code, but many people read Julia source code

stevengj · January 15, 2024, 2:56pm

For the reasons @mbauman and I noted above, I suspect that adding a huge number of new keywords like \\alpha to the Julia parser seems unlikely to happen, though of course someone can always propose a pull request.

A package (e.g. TextUniVars.jl) that provides e.g. uvar"\alpha\hat" variable names can happen now, as I noted in my very first response on this thread. Other possibilities include someone writing a new editor plugin that displays α as {\alpha} or something, i.e which transparently translates under the hood. Both of these things can be done by any sufficiently motivated person without any changes to Julia itself.

For occasional replies to discourse posts with Unicode identifiers, if you’re not willing to copy-paste a few times a year(?), I suppose you could always include using TextUniVars in your reply.

I suggested some possible actionable items here: Warning against Unicode confusables - #36 by stevengj … of course, people can propose whatever PRs to the style guide or to Julia base that they wish, as long as they can accept that the answer might be “no”.

Benny · January 15, 2024, 3:25pm

You’re not, really, because there’s a mismatch between source files in Julia, a programming language, and in TeX, a typesetting system. The TEX file isn’t the endgame, it’s intended to be rendered into a document. A popular practice is to preview parts of that document while you’re editing the TEX file. There’s no equivalent rendering of .jl files into a readable format because that’s what the .jl file is for. Unicode support isn’t nearly as complicated as TEX typesetting, so we can just write Unicode directly into the .jl file with ASCII and tab completion.

And how is that different from using TeX? People aren’t using LaTeX so they can ignore non-ASCII symbols, they are memorizing ASCII sequences for the exact purpose of representing those symbols to be rendered in a document. I doubt they would do that if they felt disgust seeing those symbols.

Breaking the parser to read this like Unicode symbols would in fact be imposing a new standard on people. Like it or not, \\alpha is just not parsed as α, and it’s not even the same keystrokes as tab completion. There’s also no point in repeated visual reminders of how to type tab completion sequences for Unicode if that Unicode never shows up, it makes far more sense just to use or make ASCII aliases that the parser can actually read. If the Unicode does show up, it’s even worse; you have \\alpha and α show up in the same source file, and our eyes are just supposed to recognize them both as the same name.

Taking that all into account, this option cannot be incorporated into Julia code directly. I see 2 options:

a precursor format that is rendered into .jl files, if you’re serious about emulating TeX. However, an unprecedented pre-source file for a programming language is a really steep ask. People would far prefer hitting the TAB button than more than double the size of their source folders.
A string macro that maps these untabbed sequences to their Unicode characters before the string is provided to the parser. If the macro throws an error upon any non-ASCII characters, it can perfectly isolate the untabbed sequences in the string away from Unicode usage in the normal Julia code. I’m fairly confident people would hate reading (and using up space) for Unicode"\mu" far more than µ, but at least you could pull this off now. Not sure off the top of my head how you’d intersperse actual ASCII characters, especially numbers, in those names but I’m sure that can be arranged.

DNF · January 15, 2024, 3:33pm

I think this must be the first time I’ve ever heard the user-experience of TeX (or LaTeX) being called ‘excellent’. I use LaTeX everywhere, if I can (even our wedding table layout plan(!)), because TeX is awsome. But the user experience is horrible. It’s something to use because it feels amazing when you make it work, but it’s a miserable journey, most of the time.

And I should definitely change my setup so I can use unicode in the LaTeX source code, that would be a welcome change. I’ve been meaning to look into that for years.

Benny · January 15, 2024, 3:45pm

XeTeX?

DNF · January 15, 2024, 5:23pm

Yes, I know, I just have to get it done.

foobar_lv2 · January 15, 2024, 5:37pm

That’s exactly my proposal! But such ASCII aliases that the parser can actually read must live in currently empty syntax-space, otherwise they could clash in code that contains both \alpha<TAB> and alpha.

In this case, my proposal would be to have both representations available, in the sense of “parser accepts it”.

This would be paired with command-line tools that replace trigraph-style backslash representations with unicode representations, and vice versa.

These command-line tools would need to be reversible / idempotent, i.e. julia-format --ihateunicode would be do nothing on source files that contain no non-ascii-encoded identifiers or operators, and julia-format --iloveunicode would do nothing on sourcefiles that contain none of the new escape sequences / trigraphs; and julia-format --ihateunicode | julia-format --iloveunicode would be equivalent to julia-format --iloveunicode, and julia-format --iloveunicode | julia-format --ihateunicode would be equivalent to julia-format --ihateunicode.

With that, it would become possible to use git-hooks to never again see unicode identifiers / operators, or to never ever see escapes / trigraphs, unless working with code that uses a mix (which is extra-ugly and should ideally be normalized by code-formatting / linting into either direction; but it should be valid parser-accepted julia in my proposal).

There is a tiny wrinkle with respect to denormalized unicode, like e.g. code files that use different spellings of \mu. Such code-files should be normalized by code-formatting / linter tools to U+03BC anyways – otherwise users who want to grep will be disappointed because they won’t find occurrences of the denormalized U+00B5 spelling!

Just like for mixed code-files, we probably should not attempt to guarantee that julia-format --ihateunicode | julia-format --iloveunicode is effect-free on such files; instead that would normalize the spelling.

They would sure do. I absolutely would. uvar"\mu" is a closer call; and \\mu would be a clear winner over µ for me. Typing is a different thing than reading.

But if that was e.g. a Chinese / kanji character, then I would prefer the uvar variant. People with at least a passing knowledge of these characters would almost surely prefer the real unicode, and this is a very valid preference.

So the question and my goal for this thread was: Can we make it easier for these groups to interact, to satisfy both their preferences?

Topic		Replies	Views
Non-unicode versions of unicode functions in base/stdlib? Internals & Design	10	1313	May 16, 2021
Warning against Unicode confusables Internals & Design unicode	51	1933	January 13, 2024
Fun with Unicode: TemplateᐸTᐳ syntax and more General Usage syntax , unicode	4	94	August 9, 2024
Rationale behind excluding some unicode characters from identifiers Internals & Design	10	390	March 3, 2023
String conversion from Symbol with Unicode does not yield a string, which is intended to be the same New to Julia question , bug	6	767	December 5, 2020

Syntax: Escape hatch for unicode haters

Related topics