Syntax: Escape hatch for unicode haters

Note that this has only been supported by default since 2018. Before that, you had to add \usepackage[utf8]{inputenc} to your preamble; see the TeX - LaTeX Stack Exchange question “What is the function \usepackage[utf8]{inputenc} used for and why we should add this in latex document?”. Either way, latex’s font selection scheme limits the possible input glyphs quite severely; for example, it won’t work with combining accents. You need xelatex or lualatex for full unicode support. This is just to say that the flexibility to choose between escape sequences and unicode glyphs isn’t inherent in (la)tex’s user-experience design, but more of an afterthought that has evolved over time. (In case that matters for your comparison/argument.)


This only confirms for me that this is a usability nightmare. Mixing the ASCII-only way of writing and the normal way should be avoided at all costs; this may be more forgivable in a typesetting format where everything will be rendered for readability, but not in source code that may be fed into macros, which would all need to recognize two different substrings or subexpressions* as the same name. *(I’m not actually certain whether you intend the parser to expand the ASCII sequences so that Exprs contain the Unicode characters; I’d hope so, but printing and copy-pasting would become issues for these writers. In either case, string macros all face a new difficulty.) Modifying or copying source files is risky when you scale up, especially when neither the file extension nor the source header lines give any easy indication of which way the text was written. A dedicated string macro would be a lot smoother than these tools.

I should clarify that the string macro I had in mind would process a string containing the entire source file’s text, not each variable. A macro on each variable would easily allow mixing the two ways of writing, e.g. Unicode"\mu" === µ, which again hurts readability and metaprogramming. Instead, a Unicodify""" #= entire source file =# """ would be expanded before any macro calls inside the text, and while it wouldn’t need a command-line setting to be rendered or executed, @macroexpand1 or something similar could be used to roughly expand that outer macro call for anyone who wants to read the Unicode directly.
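
Roughly, and purely as a hypothetical sketch (the macro name is mine, and it only handles fixed four-digit \u escapes, sidestepping the digit-count pitfall I get to below), such a macro could look like:

```julia
# Hypothetical sketch: a string macro that unescapes \uXXXX sequences in the
# wrapped source text, then parses the whole thing, so that the resulting
# Exprs (and any macro calls inside) only ever see the real Unicode names.
macro Unicodify_str(src)
    unescaped = replace(src,
        # exactly four hex digits here; a real version would also handle \U
        r"\\u([0-9a-fA-F]{4})" => m -> string(Char(parse(UInt32, m[3:end]; base=16))))
    # Meta.parseall returns an Expr(:toplevel, ...), which a macro may return
    # when the macro call itself sits at top level
    esc(Meta.parseall(unescaped))
end
```

String macros receive their literal unprocessed, so \u00b5 inside the triple-quoted block stays as raw text until this macro rewrites it.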

A minor(?) concern about appropriating the \u and \U escape sequences in strings: they are allowed to vary in the number of hexadecimal digits, e.g. "\u00b5m" === "\ub5m" === "µm". This poses an issue when the escape is followed by more hexadecimal characters, e.g. "µ1" needs to be written with the full number of digits, "\u00b51", not "\ub51". As long as people take care to write the full 4 or 6 digits in these cases, I don’t think it’ll be a problem for the Unicodify processing. Still, being able to write the same name with different sequences, even while avoiding direct Unicode, poses a metaprogramming problem again.
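
In ordinary string literals, the behavior looks like this (REPL sketch):

```julia
julia> "\u00b5m" === "\ub5m"   # \u accepts a variable number of hex digits
true

julia> "\u00b51"               # full four digits: µ followed by the digit 1
"µ1"

julia> length("\ub51")         # b51 is swallowed as one code point, U+0B51
1
```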

Even given such a string macro to discourage mixing the two ways of writing within a source file, this is not a good goal. The fact remains that you are imposing an unprecedented standard on everyone, because the ASCII-sequence writers still cannot avoid Unicode files written by others, and vice versa, especially if they share a code base. Communication will be incredibly disharmonious, e.g. when editing or copying from each other’s source files. Ironically, in an attempt to address the evidently rare complaint of having to type too many characters, this escape-hatch idea introduces difficult and perhaps intractable problems by allowing people to inscribe a name in more than one way. Taking an earlier complaint as an example, this wouldn’t reduce the ways we can write for-loops, but would add for x \\in array as fully equivalent to for x ∈ array, on top of the alternate versions with in and =. There’s merit to sticking to fundamental decisions, even when a change would technically not be breaking, because some accommodations add unnecessary complexity for everyone.


That’s why I am so insistent on this living in the tokenizer, and on ASTs being equivalent between the ascii-only and unicode spellings, the same as for current unicode synonyms. Macros only see ASTs, so this is a complete non-issue.

Conversion between uvar"\mu" and \mu<TAB> is not code-formatting. It cannot reasonably be done automatically without human oversight, because macros can see the difference, and that could change the semantics of the program.

This breaks IDEs and also causes escaping trouble in code that uses backslash as the left-division operator.

Something that could be done today, in a package, is includeWithUnicodeEscapes("somefile.jl"), where the modified include would replace unicode-escapes before invoking the standard runtime parser. But this again breaks existing IDEs and syntax highlighting, and would introduce an incompatible dialect of julia.
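
To make the shape concrete, a deliberately naive sketch (the name comes from above; a real implementation would have to skip string/character literals, which this one does not):

```julia
# Naive sketch: read the file, unescape \uXXXX sequences everywhere, and feed
# the result to the standard runtime parser via include_string. It ignores
# the hard part: leaving string and character literals untouched.
function includeWithUnicodeEscapes(m::Module, path::AbstractString)
    src = read(path, String)
    src = replace(src,
        r"\\u([0-9a-fA-F]{4})" => s -> string(Char(parse(UInt32, s[3:end]; base=16))))
    include_string(m, src, path)
end

# usage: includeWithUnicodeEscapes(Main, "somefile.jl")
```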

See above: no meta-programming problems. My proposal would use the exact same mechanism that already avoids meta-programming problems with the difference between 'μ' (U+03BC) and 'µ' (U+00B5): the tokenizer normalizes both to 'μ' (U+03BC) before constructing the AST.
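
Concretely, this is what the existing normalization already does in the REPL:

```julia
julia> µ = 3    # typed as U+00B5 MICRO SIGN (what most keyboard layouts produce)
3

julia> μ        # U+03BC GREEK SMALL LETTER MU: the same binding after normalization
3

julia> :µ === :μ   # the parser maps both spellings to a single Symbol
true
```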

The ship “only one way to write for-loops” has sailed long ago. I don’t see a big issue with one more way that the parser technically accepts, especially as it would not be expected to see widespread use. (The for x \\in array form is still essential in my proposal: otherwise, round-tripping code-formatting from for x ∈ array could not end up at the same spelling! And round-tripping must end up at bitwise-identical files, otherwise it messes up diffs.)

I don’t see it that way. People who use “weird” (non-US-keyboard) unicode in their code impose their standard on everyone who wants to interact with them. Sometimes this makes sense, e.g. Chinese coders who mainly write for a Chinese audience.

“Unprecedented standard”? Is U+00B5 vs U+03BC not a precedent in julia? Is ∈ vs in in for-loop headers not a precedent?

∈ vs in outside of for-loop headers is not a precedent: :(a ∈ b) and :(a in b) are different ASTs that only happen to have the same semantics in the default namespace, due to a const ∈ = in in Base. This causes all the hypothetical macro havoc that you are so worried about. Outside of loop headers, where the tokenizer does not normalize, the choice between ∈ and in is not mere code-formatting, because there they are not synonyms.
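
This is easy to check (REPL sketch):

```julia
julia> Meta.parse("for x ∈ a end") == Meta.parse("for x in a end") ==
       Meta.parse("for x = a end")   # loop headers: normalized by the parser
true

julia> :(a ∈ b) == :(a in b)   # call position: two different ASTs
false
```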

And this “unprecedented standard” would be imposed on people only in the same way that reading and writing Chinese characters is imposed on them today.

(you would have a valid point if you complained that this annoying, more complex standard gets imposed on IDE authors!)

Anybody can write a simple search-and-replace script in a few minutes that round-trips between unicode and some escaped representation (or even a representation that still parses as valid Julia). There is absolutely zero reason for anything like that to be incorporated into julia or the Julia parser.
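
For illustration only, such a few-minutes script might look like this sketch (I picked braced \u{...} escapes here to dodge the digit-count ambiguity mentioned upthread; it makes no attempt to skip string literals):

```julia
# Round-trip between raw unicode and an escaped ASCII representation.
escape_unicode(src) = replace(src,
    r"[^\x00-\x7f]" => c -> "\\u{" * string(codepoint(only(c)); base=16) * "}")

unescape_unicode(src) = replace(src,
    r"\\u\{([0-9a-fA-F]+)\}" => m -> string(Char(parse(UInt32, m[4:end-1]; base=16))))

# round-tripping must be bitwise identical, or it messes up diffs
src = read("somefile.jl", String)
@assert unescape_unicode(escape_unicode(src)) == src
```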

P.S.: If you need some help: ChatGPT :wink:


Wouldn’t the shortest way be to fork JuliaSyntax.jl? We don’t want to touch string/character literals, after all.

That still parses as valid julia and generates the same AST (to avoid interactions with metaprogramming)? I don’t see that.

Well, one would also want REPL support, IDE support, and syntax highlighting, hence many projects that would need to standardize on a single escaped representation. And maybe one would like to avoid headaches à la “which of these files need pre-processing to unescape unicode” by simply asking the julia parser to do that. After the more involved REPL / IDE / JuliaSyntax.jl changes, that would be a very tiny code change, and thus it would be incorporated into the language.

It likely will not. I mean, given the opposition, I have no say in this.

You could, or anyone could, and it could be a non-default parser; even before it was merged, it could still be used. I’m still amazed at how dynamic Julia is, that you can switch out the parser. It’s not that this is entirely unheard of: I guess it’s possible in Lisp (Racket?), though given Lisp’s “non-syntax” I’m not sure it’s done. I just think it’s unheard of in most other languages, e.g. C/C++.
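
For reference, swapping the parser is exactly what JuliaSyntax.jl does to install itself (assuming current versions still expose this entry point):

```julia
using JuliaSyntax
JuliaSyntax.enable_in_core!()   # replaces the runtime parser used by include/eval
```

A fork that pre-processed unicode escapes could in principle hook in the same way.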

But would you want to spend time on it, even if you could and assuming it would not be merged? Maybe, maybe not; you will convince more people with a ready PR. But at least I don’t want the parser bloated: I want it as fast as possible. Maybe there’s not a huge risk of it becoming (much) larger, but I’m skeptical even of the gain. Maybe I didn’t read your proposal well enough…

My guidance here would be the same as another recent syntax idea:


^ This. Maybe I will submit a PoC, or maybe I’ll give up on that.

This thread was conclusive in the sense that significant opposition exists. That still surprises me, and I still cannot really understand it: “meh, too much work” is the kind of opposition I had anticipated.

However, most of this thread appeared to focus on denying my lived experience that “unicode in source code really sucks for me, and IDE + tab-completion is a different ballgame than bring-your-own-editor”.

I also think that I am not alone in that, but as others mentioned upthread, maybe many such people don’t use julia. I can confirm that the first reaction of many programmers from outside julia to unicode operators is “are you batshit insane?”.

I think that’s a pity: stuff like unicode, begin-end syntax, or one-based indexing is just surface syntax.

I don’t particularly like julia’s surface syntax, but I really adore the dynamic multiple dispatch and the corresponding compilation + specialization model + type system; that, to me, is the heart of the language. So I can tolerate some annoying surface syntax.

This topic was automatically closed after 2 days. New replies are no longer allowed.