Why is Greek ano teleia a valid identifier character?

mlhetland · July 17, 2018, 6:07pm

It seems Greek ano teleia (U+0387) is one of the “Punctuation, other” characters (along with primes) permitted as part of identifiers. What is the rationale here? Semantically, it doesn’t seem to make sense, and it looks a bit confusing, as it’s (at least in many fonts) very similar to the dot operator (U+22c5), i.e., \cdot. Also, it’s essentially the same as the middle dot (U+00b7), in that it normalizes to it, and the middle dot is not permitted. In fact, the ano teleia is normalized to the middle dot when parsed as a symbol, even though that symbol cannot be written directly using the middle dot!

julia> x·y = 42         # Using U+0387
42

julia> x⋅y = 42         # Using U+22c5
⋅ (generic function with 1 method)

julia> x·y = 42         # Using U+00b7
ERROR: syntax: invalid character "·"

julia> name = "x·y"

julia> name[2]
'·': Unicode U+0387 (category Po: Punctuation, other)

julia> String(Meta.parse(name))[2]
'·': Unicode U+00b7 (category Po: Punctuation, other)

… and just to underscore the point:

julia> Meta.parse(String(Meta.parse(name)))
ERROR: Base.Meta.ParseError("invalid character \"·\"")

Given that it seems to have been singled out for inclusion (at least in the pure-Julia rewrite of the parser, JuliaParser.jl, it’s listed alongside the three primes), I guess there may be a reason for including it, but … it seems it might be just as sensible to disallow it, just as the (canonical) middle dot character is disallowed?

mlhetland · July 17, 2018, 6:17pm

While we’re at it, I wouldn’t mind permitting the middle dot (and the ano teleia, for that matter), normalizing it (or them) to the dot operator. On a Mac, you get the middle dot (at least on my Norwegian keyboard) by pressing alt-shift-., which is easier than using \cdot-tab (and it’s what I usually use to indicate the dot operator outside Julia or TeX).

stevengj · July 17, 2018, 7:41pm

That codepoint has the property Other_ID_Continue, and UAX#31 recommends that it be allowed in programming-language identifiers (mainly for backward compatibility, it says). For example, x·y is also a valid identifier in Python 3 for the same reason.
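The Python 3 behavior mentioned above can be checked with a small sketch. It assumes a standard CPython 3 interpreter; the constant and function names are illustrative only and do not come from UAX#31 or from the Julia parser.

```python
# Sketch: ask Python 3 which of the three look-alike dots it accepts in a name.
# All names here are hypothetical helpers chosen for this example.
ANO_TELEIA = "\u0387"    # GREEK ANO TELEIA
MIDDLE_DOT = "\u00b7"    # MIDDLE DOT
DOT_OPERATOR = "\u22c5"  # DOT OPERATOR, i.e. \cdot


def accepted_in_identifier(ch: str) -> bool:
    """Return True if Python accepts the name 'x' + ch + 'y' as an identifier."""
    return f"x{ch}y".isidentifier()


for label, ch in [
    ("ano teleia", ANO_TELEIA),
    ("middle dot", MIDDLE_DOT),
    ("dot operator", DOT_OPERATOR),
]:
    print(f"U+{ord(ch):04X} {label}: {accepted_in_identifier(ch)}")
```

Python should accept both U+0387 and U+00B7 here and reject U+22C5, which is a math symbol rather than an identifier character. Note that this says nothing about how Julia's parser treats the same characters.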

Julia allows identifiers in a large superset of UAX#31, but I think it makes sense to continue to accept UAX#31 identifiers as a subset where possible, if only to ease interoperability with other languages.

stevengj · July 17, 2018, 7:48pm

mlhetland · July 18, 2018, 11:09am

Ah, I see.

Yes, indeed, following UAX#31 does seem to make sense – though I guess that would mean that this decision depends on the Unicode version in use? (It says “The exact list of characters covered by the Other_ID_Start and Other_ID_Continue properties depends on the version of Unicode.”)

Either way, the examples given in UAX#31 are:

U+1369 ETHIOPIC DIGIT ONE…U+1371 ETHIOPIC DIGIT NINE
U+00B7 ( · ) MIDDLE DOT
U+0387 ( · ) GREEK ANO TELEIA
U+19DA ( ᧚ ) NEW TAI LUE THAM DIGIT ONE

Rather than using the Other_ID_Continue property, these (seemingly arbitrary) examples are explicitly listed as the ones accepted in JuliaParser.jl. Or do these in fact cover all the cases (despite the phrasing “… includes characters such as the following”)?

Anyway: One issue here is that we don’t cover the ones listed – notably U+00b7 (which is part of the issue in my example, above). I came across the pull request you referenced earlier, but didn’t at the time realize that \cdotp was, in fact, U+00b7. So … making \cdot equal to \cdotp would seem to muddle things even further, then, since we’re currently normalizing the ano teleia into \cdotp (which, according to UAX#31, should be permitted in identifiers, but currently isn’t).

So despite my initially preferring to make \cdotp equal to \cdot, I guess this would be an argument against that, and for rather making \cdotp an identifier character, in accordance with UAX#31. I’m not sure I like it, but it does seem more consistent – unless we either ban the ano teleia or stop normalizing it to \cdotp in names.

Or…?

Iagoba_Apellaniz · July 18, 2018, 11:28am

On the Spanish keyboard layout, U+00B7 is typed with SHIFT + 3. I think it would be very convenient to identify U+00B7 with some common operator, say \cdot[TAB].

Is U+0387 as accessible on the Greek layout? If it is, I would map it to \cdot[TAB] too, at least if there is no other character that prints like U+22C5.

[UPDATE] This discussion is now linked with the GitHub issue, thanks to @mlhetland. I would rather continue there instead of discussing it here.

mlhetland · July 18, 2018, 11:34am

I agree that U+00b7 is convenient to type, and I’ve wanted to use it as \cdot before as well. I don’t know how easily U+0387 is accessible on Greek keyboards – but I agree that if it is, we might want to map it similarly. The fact that U+00b7 and U+0387 are “canonically the same,” Unicode-wise (as I understand it – and according to current Julia normalization) argues in favor of treating them the same way. Officially (according to UAX#31), that seems to mean that they should both be identifier characters, but since only one of them is, we’re not really in compliance with that, anyway, I guess?

At the very least, it seems odd to me that we normalize identifiers written with a valid identifier character into one containing an invalid identifier character – especially when it seems to be in violation of the motivation behind it (i.e., to comply with UAX#31).

stevengj · July 18, 2018, 2:33pm

Oh, good catch. This is because Julia symbols are NFC-normalized, and the NFC of U+0387 is U+00b7. Given this, it is crazy to treat the two points differently in identifiers.
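The normalization step itself can be reproduced outside Julia. The sketch below uses Python's standard unicodedata module as a stand-in (an assumption made for illustration; it is not the normalization code Julia uses), and the helper name nfc is hypothetical.

```python
import unicodedata

ANO_TELEIA = "\u0387"    # GREEK ANO TELEIA
MIDDLE_DOT = "\u00b7"    # MIDDLE DOT
DOT_OPERATOR = "\u22c5"  # DOT OPERATOR, i.e. \cdot


def nfc(text: str) -> str:
    """Apply Unicode Normalization Form C to text."""
    return unicodedata.normalize("NFC", text)


# A name typed with the ano teleia ...
name = "x" + ANO_TELEIA + "y"
# ... comes out of NFC with the middle dot in its place.
normalized = nfc(name)

print([f"U+{ord(c):04X}" for c in name])
print([f"U+{ord(c):04X}" for c in normalized])
print(nfc(DOT_OPERATOR) == DOT_OPERATOR)  # the dot operator is left alone by NFC
```

U+0387 has a canonical singleton mapping to U+00B7, so any NFC-normalizing parser turns the first into the second. That is how a valid identifier ends up containing a character that is rejected when typed directly.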

mlhetland · July 18, 2018, 4:10pm

Solved in PR #28167.
