Even in the internal code (the guts; which you might have to look at, coding is not just about APIs), it is possible to have fun with Unicodes:
Happy hacking!
I certainly do not know that. You’ve been consistent over time in arguing that unicode characters in code are bad in general, and have repeatedly dismissed any advantage to using them. And now, tellingly, you are posting a table of characters instead of real-world code to prove your point.
The arguments you advance very clearly dismiss unicode per se, even to the point of saying any sort of contact with it is unacceptable to you.
To both of you: if you think that unicode is fine, but some people over-use it, you should make that clear. You’ve both been painting it in an aggressively negative light, even though almost all uses are of the sort I’m describing: a few Greek letters here or there.
There’s no disagreement that unicode can be used to write bad code. But if your goal is actually “to promote responsible use”, then you should both change tack: the current arguments and tone, and the repetitiveness of these attacks, are getting increasingly tiresome and aggravating, which is also why my hackles get raised.
Are these really problems you have ever encountered in practice? Poor naming of variables is certainly not a problem created by the use of unicode symbols. Just because you can create confusing situations using unicode doesn’t mean that all uses of unicode will. I’d say that the following variable names are at least equally bad and confusing, even though they are ASCII only. These names are obviously absurd, but no more so than your examples:
Il1li1il1il1il1i1li1l1ili1li
li1li1l1i1li1li1l1il1i1li1li
li1l1il1i1li1li1li1l1il1ili1
li1l1l1il1il1il1i1li1li1li1i
Il1li1il1il1il1i1li1l1ili1li
Il1li1il1il1il1i1li1l1ili1li
li1li1l1i1li1li1l1il1i1li1li
li1l1l1il1il1il1i1li1li1li1i
I generally visualize complex code with symbol-overlay in Emacs. Here is a screenshot with the above code highlighted:
Each variant is colored differently. I have this bound to C-s s s.
At the risk of repeating myself once too often: Unicode used to write source code (not as data or as commentary) is a BAD IDEA.
It makes source code NOT WYSIWYG. Recall that programming fonts go to great lengths to distinguish 1 and l and 0 and O etc. Why? Because it avoids confusion. With Unicode it is so easy to spoof by substituting look-alikes! That flies in the face of the need to read source code unambiguously.
I think you mean “Ambiguous Unicode characters used to write source code is a bad idea”
Because I don’t think there’s anything wrong with writing σ² = 1.2 in code.
Obviously some judgement is required to decide what could be ambiguous, and it’s easier to get wrong than if you restrict yourself to ASCII, but it’s a trade-off. Unicode does have benefits, so use it as long as the benefits are not outweighed by issues such as legibility.
There are at least three symbols σ in unicode that look the same:
https://util.unicode.org/UnicodeJsps/confusables.jsp?a=σ&r=None
I think we all agree that writing obfuscated code is a bad idea (except as a party trick), Petr. (As others pointed out, you can do that with ASCII too.) But it’s really a conceptually separate topic from the question of ruling out Unicode usage entirely, e.g. symbols like ≤ or α versus <= or alpha, or of providing parser support for typing \alpha for α (or whatever), which aren’t easily confusable. Please start new threads for new topics.
(A linting tool might well want to warn about confusable symbols, either ASCII or Unicode, used in the same scope. There are well-established tables and algorithms to identify these. Warn against confusable identifiers used in the same scope? · Issue #259 · JuliaTesting/Aqua.jl · GitHub)
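To make the "same scope" idea concrete, here is a minimal sketch of such a check. It is not what Aqua.jl or any existing linter does: the `LOOKALIKES` table, `skeleton`, and `confusable_groups` are hypothetical names, and the table is a tiny hand-written stand-in for the real Unicode confusables data.

```julia
# Hypothetical, hand-written table: a tiny stand-in for the Unicode confusables data.
const LOOKALIKES = Dict{Char,Char}(
    'I' => 'l',   # capital i
    '1' => 'l',   # digit one
    '|' => 'l',   # vertical bar
    'O' => '0',   # capital o
    '⍴' => 'ρ',   # APL rho (U+2374) vs Greek rho (U+03C1)
)

# Map every character to its representative, so look-alike names collide.
skeleton(name::AbstractString) = map(c -> get(LOOKALIKES, c, c), name)

# Return the groups of distinct names (assumed to share one scope)
# whose skeletons are identical.
function confusable_groups(names)
    groups = Dict{String,Vector{String}}()
    for name in unique(names)
        push!(get!(groups, skeleton(name), String[]), name)
    end
    return [sort(g) for g in values(groups) if length(g) > 1]
end

confusable_groups(["Il1", "ll1", "rho"])   # [["Il1", "ll1"]]
```

Note that the check only fires when two look-alike names actually appear together, which is the distinction being argued here: a lone ℓ or ρ would not be flagged.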
Good catch! I think the right conclusion here is to not use “MALAYALAM DIGIT ZERO” in Julia code.
I mean, two ambiguous characters are a problem if you use both. If it’s obvious that people use only one of them then there’s no problem.
If you disagree with me I guess you never use l in your code: Unicode Utilities: Confusables
(Moved to new thread since this is a different topic focus. See also previous discussion in Unicode: a bad idea, in general)
(I use ℓ = \ell TAB when I need a single-character integer-valued l identifier, just like I do in mathematical papers. Unicode to the rescue!)
Which has three lookalikes.
… that I don’t use, so no problem. (Whereas I use I and | and the number 1 fairly often, though to be fair I also try to use a programming font where these look distinct from l.)
Well, my point was that you used \ell in order to avoid confusion with I, 1, l. Yet at the same time you introduced possible confusion between ā, ā, and 𝓵.
Yes, and this illustrates the underlying weakness of your argument. I don’t use ā, ā, and 𝓵 in the same code, and I think this is true of most people. I do use I, 1, l in the same code, and I think this is true of most people. Why are you worried about the former more than the latter?
Unicode confusables indeed give new ways to write intentionally obfuscated code, and it would be great if linting tools like Aqua.jl warned about them. But you haven’t made a good case for them being a problem in ordinary (non-malicious) usage.
Steven, it seems to me you’re thinking of this problem in terms of “can I read my own code”. In that case, of course you can, because you know which letter you chose. However, reading someone else’s code is different: It is not obvious when reading the source code which of the unicode characters they chose.
The difference is that I can reasonably expect to encounter situations where someone has written |, I, l and 1 all in one codebase, but it’d be a deliberate attempt to obfuscate code if someone used ā, ā, and 𝓵 together in one codebase.
Unicode may create more theoretical opportunities to get confused when reading code, but in practice (especially due to the latex shortcuts) it reduces this confusion because it gives us better, more distinct choices than those that are available with just ASCII.
if someone used ā, ā, and 𝓵 together in one codebase.
It is not that likely. But my point was, they could choose any one of those letters. How is the reader to know which one it was, until they expend some of their time on analyzing it?
However, reading someone else’s code is different: It is not obvious when reading the source code which of the unicode characters they chose.
Assuming they aren’t trying to be intentionally perverse, it usually is obvious, because usually for mathematical symbols (which is what we are talking about here mostly) there is a single obvious choice (e.g. \ell or \rho etcetera).
In very rare cases where I’m not sure, I can always paste it into the REPL to find out. (Such cases are usually not confusables, though, but simply some symbol I haven’t used before.)
And, again, assuming the programmer is not malicious, they won’t use two easily confusable symbols in the same scope, so even if they make a weird choice (e.g. APL ⍴ U+2374 instead of Greek \rho = ρ U+03C1, which seems very unlikely and borderline malicious), if I try to edit the code and insert the wrong symbol it will normally give a runtime error. Linter support for confusable warnings will help here, too. (And yes, you can construct an example where it will still run without error. But this is straying pretty far into the hypothetical here. So far you haven’t given a single example of a problem occurring in the wild.)
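For the "paste it into the REPL" step, a minimal sketch of one way to tell two look-alikes apart is to print their code points. The helper name `describe` is made up for illustration; it only uses `UInt32`, `string`, and `uppercase` from Base.

```julia
# Format a character's code point in the usual U+XXXX notation.
describe(c::Char) = "U+" * uppercase(string(UInt32(c), base = 16, pad = 4))

describe('ρ')   # "U+03C1"  (Greek small letter rho, typed as \rho)
describe('⍴')   # "U+2374"  (APL rho)
describe('𝓵')   # "U+1D4F5" (typed as \bscrl)
```

Two symbols that render identically in a given font will still show different code points here.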
It is not that likely. But my point was, they could choose any one of those letters. How is the reader to know which one it was, until they expend some of their time on analyzing it?
Julia normalizes ā, and ā to the same letter, and 𝓵 can be figured out fairly easily, if one is confused, via copy-paste:
help?> 𝓵
"𝓵" can be typed by \bscrl<tab>
In almost all circumstances there’s a clear canonical unicode letter to use which is determined by the most favourable LaTeX completion.
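If you want to check whether two spellings really are treated as the same identifier, one option is to compare what the parser returns for each. This is only a sketch: `same_identifier` is a made-up helper, and the example uses the micro sign and Greek mu, a pair the Julia manual documents as equivalent. It does not verify the claim about any particular ℓ look-alike.

```julia
# Two source strings name the same variable if the parser
# turns them into the same Symbol.
same_identifier(a::AbstractString, b::AbstractString) =
    Meta.parse(a) === Meta.parse(b)

# Micro sign (U+00B5) vs Greek small letter mu (U+03BC):
same_identifier("\u00b5", "\u03bc")   # true

# Unrelated names stay distinct:
same_identifier("x", "y")             # false
```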