It seems the Greek ano teleia (U+0387) is one of the “Punctuation, other” characters (along with the primes) permitted as part of identifiers. What is the rationale here? Semantically, it doesn’t seem to make sense, and it looks a bit confusing, since (at least in many fonts) it’s very similar to the dot operator (U+22c5), i.e., \cdot. Also, it’s essentially the same as – i.e., normalizes to – the middle dot (U+00b7), which is not permitted. In fact, it is normalized to the middle dot when parsed as a symbol, even though that symbol cannot be written directly using the middle dot!
julia> x·y = 42 # Using U+0387
42
julia> x⋅y = 42 # Using U+22c5
⋅ (generic function with 1 method)
julia> x·y = 42 # Using U+00b7
ERROR: syntax: invalid character "·"
julia> name = "x·y"
julia> name[2]
'·': Unicode U+0387 (category Po: Punctuation, other)
julia> String(Meta.parse(name))[2]
'·': Unicode U+00b7 (category Po: Punctuation, other)
… and just to underscore the point:
julia> Meta.parse(String(Meta.parse(name)))
ERROR: Base.Meta.ParseError("invalid character \"·\"")
Given that it seems to have been singled out for inclusion (at least in the pure-Julia rewrite of the parser, JuliaParser.jl, it’s listed alongside the three primes), I guess there may be a reason for including it, but … it seems it might be just as sensible to disallow it, just as, say, the (canonical) middle dot character is?
While we’re at it, I wouldn’t mind permitting the middle dot (and the ano teleia, for that matter), normalizing it (or them) to the dot operator. On a Mac, you get the middle dot (at least on my Norwegian keyboard) by pressing alt-shift-., which is easier than using \cdot-tab (and it’s what I usually use to indicate the dot operator outside Julia or TeX).
That codepoint has the property Other_ID_Continue, and UAX#31 recommends that it be allowed in programming-language identifiers (mainly for backward compatibility, it says). For example, x·y is also a valid identifier in Python 3 for the same reason.
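The Python side of this can be checked with a short sketch like the one below. It uses only the standard library; the helper name `describe` and the constant names are made up for illustration, and the code points are written as escapes so the three look-alike dots cannot be confused.

```python
import unicodedata

# The three look-alike dots discussed in this thread, written as escapes.
MIDDLE_DOT = "\u00b7"    # \cdotp
ANO_TELEIA = "\u0387"    # Greek ano teleia
DOT_OPERATOR = "\u22c5"  # \cdot


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch)


for ch in (MIDDLE_DOT, ANO_TELEIA, DOT_OPERATOR):
    # str.isidentifier() applies Python's own identifier rules,
    # which are based on UAX#31 (XID_Start / XID_Continue).
    print(describe(ch), f"x{ch}y".isidentifier())
```

Both U+00B7 and U+0387 should be reported as category Po and accepted inside an identifier, while the dot operator (category Sm) is not. This says nothing about what Julia's parser does; it only illustrates the UAX#31 behaviour mentioned above.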
Julia allows identifiers in a large superset of UAX#31, but I think it makes sense to continue to accept UAX#31 identifiers as a subset where possible, if only to ease interoperability with other languages.
Ah, I see.
Yes, indeed, following UAX#31 does seem to make sense – though I guess that would mean that this decision depends on the Unicode version in use? (It says “The exact list of characters covered by the Other_ID_Start and Other_ID_Continue properties depends on the version of Unicode.”)
Either way, the examples given in UAX#31 are:
U+1369 ETHIOPIC DIGIT ONE…U+1371 ETHIOPIC DIGIT NINE
U+00B7 ( · ) MIDDLE DOT
U+0387 ( · ) GREEK ANO TELEIA
U+19DA ( ᧚ ) NEW TAI LUE THAM DIGIT ONE
Rather than using the Other_ID_Continue property, these (seemingly arbitrary) examples are explicitly listed as the ones accepted in JuliaParser.jl. Or do these in fact cover all the cases (despite the phrasing “… includes characters such as the following”)?
Anyway: One issue here is that we don’t cover the ones listed – notably U+00b7 (which is part of the issue in my example, above). I came across the pull request you referenced earlier, but didn’t at the time realize that \cdotp was, in fact, U+00b7. So … making \cdot equal to \cdotp would seem to muddle things even further, then, since we’re currently normalizing the ano teleia into \cdotp (which, according to UAX#31, should be permitted in identifiers, but currently isn’t).
So despite my initially preferring to make \cdotp equal to \cdot, I guess this would be an argument against that, and for rather making \cdotp an identifier character, in accordance with UAX#31. I’m not sure I like it, but it does seem more consistent – unless we either ban the ano teleia or stop normalizing it to \cdotp in names.
Or…?
On the Spanish keyboard layout, U+00B7 is typed with SHIFT + 3. I think it would be very convenient to identify U+00B7 with some common operator, say \cdot[TAB].
Is U+0387 as accessible on the Greek layout? If it is, I would map it to \cdot[TAB] too, at least if there is no other character on that layout that produces U+22C5.
[UPDATE] This discussion is now linked with the GitHub issue, thanks to @mlhetland. I would rather continue there than discuss it here.
I agree that U+00b7 is convenient to type, and I’ve wanted to use it as \cdot before as well. I don’t know how easily U+0387 is accessible on Greek keyboards – but I agree that if it is, we might want to map it similarly. The fact that U+00b7 and U+0387 are “canonically the same,” Unicode-wise (as I understand it – and according to current Julia normalization) argues in favor of treating them the same way. Officially (according to UAX#31), that seems to mean that they should both be identifier characters, but since only one of them is, we’re not really in compliance with that, anyway, I guess?
At the very least, it seems odd to me that we normalize identifiers written with a valid identifier character into one containing an invalid identifier character – especially when it seems to be in violation of the motivation behind it (i.e., to comply with UAX#31).
Oh, good catch. This is because Julia symbols are NFC-normalized, and the NFC of U+0387 is U+00b7. Given this, it is crazy to treat the two points differently in identifiers.
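The normalization step itself is a Unicode-level fact, so it can be reproduced outside Julia. Here is a minimal Python sketch using only the standard `unicodedata` module; the helper names `nfc` and `codepoints` are made up for illustration, and this does not model Julia's parser, only the NFC mapping it relies on.

```python
import unicodedata

ANO_TELEIA = "\u0387"  # Greek ano teleia
MIDDLE_DOT = "\u00b7"  # middle dot (\cdotp)


def nfc(s):
    """Return the NFC-normalized form of s."""
    return unicodedata.normalize("NFC", s)


def codepoints(s):
    """List the code points of s in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]


name = "x" + ANO_TELEIA + "y"
normalized = nfc(name)

print(codepoints(name))        # the name as typed, with U+0387
print(codepoints(normalized))  # after NFC, the dot has become U+00B7
```

U+0387 has a canonical (singleton) decomposition to U+00B7, so any NFC-normalizing symbol table will store the middle dot even when the ano teleia was typed.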