Some people have a perennial problem with unicode in source files and would prefer to simply never deal with it.
We have two interesting precedents inside the julia language for dealing with this:
- String literals permit unicode escapes, like `"\u00b5 is a greek letter"`. In other words, unicode haters never need to fight their text editor when writing string literals; they can simply write the corresponding unicode escape sequence and be happy that the parser fixes it up (identical syntax tree!).
- Many symbols and operators have unicode aliases. For example, `\u00b5` (MICRO SIGN) and `\u03bc` (GREEK SMALL LETTER MU) are two different unicode characters that both render as a mu. They really are different in strings! But the parser maps both of them to the same symbol: if you use different unicode spellings of mu in symbol position, the parser normalizes them. For example:
```julia
julia> Symbol("\u00b5")
:µ

julia> :µ == Symbol("\u00b5") # the left-hand side was copy-pasted from the REPL output one line above
false

julia> using JuliaSyntax

julia> parsestmt(Expr, "ab\u03bc123") == parsestmt(Expr, "ab\u00b5123")
true
```
I think this is very pragmatic.
Now I'd also like to get a non-unicode escape hatch for symbol and operator positions, just as we have for string literals, even at the price of the same kind of inconsistencies we have with unicode aliases / normalization.
In other words, my input keyboard should not constrain the set of ASTs I can generate.
The way of doing that would be to find some gap in the existing julia syntax, i.e. something that is currently invalid, and then use it for this.
The first, simplest goal should be completeness. This means unicode escaping: for every string `str` that produces a result with `parsestmt(Expr, str)`, there must exist an ascii string which produces an identical AST (on the expression level).
The second goal, which can come afterwards, is to introduce "nice" aliases for us unicode haters.
Ok, so let's identify a gap. I think that `\\u03bc` is a gap? I.e. a double backslash followed by a lower-case `u` is probably never valid syntax (outside of string literals)? Then we have our inconvenient escape hatch. Suppose I need to use some library that has a keyword-only parameter or a function name containing a greek letter, and I can't get the autocomplete / IDE to work with that? No problem, I just call `someAnnoyingFunction(; \\u03bc=1.2)`.
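To make the idea concrete, here is a rough sketch of the proposed escape hatch as a source-to-source preprocessing step. A real implementation would live in the lexer; the name `unescape_source` and the fixed four-hex-digit form are my own assumptions, not a concrete syntax proposal.

```julia
# Hypothetical sketch: resolve \\uXXXX escapes in source text *before*
# parsing. This is only an illustration of the proposed behavior.
function unescape_source(src::AbstractString)
    replace(src, r"\\\\u([0-9a-fA-F]{4})" => (m -> begin
        c = Char(parse(UInt32, m[4:end]; base = 16))
        # Avoid the Java mistake: refuse escapes that could change token
        # structure (comment markers, quotes, newlines, backslashes).
        c in ('#', '"', '\'', '`', '\n', '\r', '\\') &&
            error("escape for $(repr(c)) is not allowed")
        string(c)
    end))
end
```

With that, `unescape_source(raw"ab\\u03bc123")` gives `"abμ123"`, and `Meta.parse` of the unescaped text yields the same AST as the plain unicode spelling, which is exactly the completeness goal above.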
The second step, convenient shortcuts, should probably use the same latex-style tab completions that the REPL already has. So I can write `a \\xor b` or `a \\u22bb b` instead of `xor(a, b)` or `a ⊻ b`.
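For the alias step, the REPL's existing completion table could in principle be reused directly. A hypothetical sketch, again as preprocessing: `resolve_aliases` is a made-up name, and I'm assuming `REPL.REPLCompletions.latex_symbols`, the (internal) stdlib table behind tab completion, which maps e.g. `\xor` to `⊻`.

```julia
using REPL  # stdlib; provides REPL.REPLCompletions.latex_symbols

# Hypothetical alias resolver reusing the REPL's latex completion table.
function resolve_aliases(src::AbstractString)
    replace(src, r"\\\\[A-Za-z]+" => (m -> begin
        key = m[2:end]  # drop one backslash: source "\\xor" -> table key "\xor"
        # Unknown aliases are left untouched rather than rejected here.
        get(REPL.REPLCompletions.latex_symbols, key, m)
    end))
end
```

Then `resolve_aliases(raw"a \\xor b")` gives `"a ⊻ b"`, which parses to the same AST as the unicode spelling. Leaning on the REPL table has the nice property that the set of aliases is already familiar to every Julia user.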
Since the normalization happens early (during lexing / tokenization) there is no need to spend any thought on things like operator arity or precedence.
The only special thought to give is to not repeat the mistake of Java's source-file unicode escapes: we must make sure that e.g. `\\u0023` (which names `#`) is invalid instead of starting a comment (same with newlines, quotes, etc.).
The Java trouble is that unicode escapes in source files are resolved before the parser runs and can therefore do things like terminate comment blocks. Most IDE parsers and syntax highlighters don't see this, but javac does. That makes for fun in generated code, and for very fun obfuscated-code contest entries.
The general questions would be:
- What do you think about the general goal (have an escape hatch)?
- What are the complexity costs in terms of implementation?
- What are the complexity costs for the general ecosystem? If something is valid syntax, like emoji variable names, then you will encounter it in places like here. Adding `a \\xor b` is extra syntax you will need to mentally parse.
More technically:
- Is the identified gap in the syntax really a gap or do we need to trawl deeper?
- Does the identified gap make for nice syntax?
PS. It's not so different from latex. Modern latex supports UTF8 input, yet many people still write e.g. `Danke sch\"on` in German latex, basically because latex is by construction very US-keyboard centric. Typing latex on a German layout will damage your wrists; backslash is a very painful hand movement on a German layout, so most people type latex on a US layout and therefore prefer escapes. (Maybe Polish or Czech users could chime in? You also have lots of non-ascii glyphs in everyday texts; how do you input them in latex?)