Warning against Unicode confusables

Even in internal code (the guts, which you might have to look at, since coding is not just about APIs), it is possible to have fun with Unicode:


Happy hacking!

2 Likes

I certainly do not know that. You’ve been consistent over time in arguing that unicode characters in code are bad in general, and have repeatedly dismissed any advantage to using them. And now, tellingly, you are posting a table of characters instead of real-world code to prove your point.

The arguments you advance very clearly dismiss unicode per se, even to the point of saying any sort of contact with it is unacceptable to you.

To both of you: if you think that unicode is fine, but some people over-use it, you should make that clear. You’ve both been painting it in an aggressively negative light, even though almost all uses are of the sort I’m describing: a few Greek letters here or there.

There’s no disagreement that unicode can be used to write bad code. But if your goals are actually “to promote responsible use”, then you should both change tack: the current arguments and tone, and the repetitiveness of these attacks, are getting increasingly tiresome and aggravating, which is also why my hackles get raised.

6 Likes

Are these really problems you have ever encountered in practice? Poor naming of variables is certainly not a problem created by the use of Unicode symbols. Just because you can create confusing situations using Unicode doesn’t mean that all uses of Unicode will. I’d say that the following variable names are at least equally bad and confusing, even though they are ASCII-only. These names are obviously absurd, but no more so than your examples:

Il1li1il1il1il1i1li1l1ili1li
li1li1l1i1li1li1l1il1i1li1li
li1l1il1i1li1li1li1l1il1ili1
li1l1l1il1il1il1i1li1li1li1i
Il1li1il1il1il1i1li1l1ili1li
Il1li1il1il1il1i1li1l1ili1li
li1li1l1i1li1li1l1il1i1li1li
li1l1l1il1il1il1i1li1li1li1i
10 Likes

I generally visualize complex code with symbol-overlay in Emacs. Here is a screenshot with the above code highlighted:

[Screenshot: the variable names above highlighted with symbol-overlay in Emacs]

Each variant is colored differently. I have this bound to C-s s s.

8 Likes

At the danger of repeating myself once too often: Unicode used to write source code (not as data or as commentary) is a BAD IDEA.

It makes source code NOT WYSIWYG. Recall that programming fonts go to great lengths to distinguish 1 and l and 0 and O etc. Why? Because it avoids confusion. With Unicode it is so easy to spoof by substituting look-alikes! That flies in the face of the need to read source code unambiguously.

2 Likes

I think you mean “Ambiguous Unicode characters used to write source code is a bad idea” :slight_smile:

Because I don’t think there’s anything wrong with writing σ² = 1.2 in code.

Obviously some judgement is required to decide what could be ambiguous, and it’s easier to get wrong than if you restrict yourself to ASCII, but it’s a trade-off. Unicode does have benefits, so use it as long as the benefits are not outweighed by issues such as legibility.

4 Likes

There are at least three symbols σ in Unicode that look the same:
https://util.unicode.org/UnicodeJsps/confusables.jsp?a=σ&r=None

2 Likes

I think we all agree that writing obfuscated code is a bad idea (except as a party trick), Petr. (As others pointed out, you can do that with ASCII too.) But it’s really a conceptually separate topic from the question of ruling out Unicode usage entirely, e.g. symbols like ≤ or α versus <= or alpha, or of providing parser support for typing \alpha for α (or whatever), which aren’t easily confusable. Please start new threads for new topics.

(A linting tool might well want to warn about confusable symbols, either ASCII or Unicode, used in the same scope. There are well-established tables and algorithms to identify these. See: Warn against confusable identifiers used in the same scope? · Issue #259 · JuliaTesting/Aqua.jl · GitHub)
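To make the linting idea concrete, here is a minimal sketch of a same-scope confusables check. The `PROTOTYPE` table and the function names `skeleton` and `confusable_groups` are hypothetical and hand-picked for illustration; this is not how Aqua.jl or any existing linter does it, and a real tool would presumably load the published Unicode confusables data rather than a three-line table.

```julia
# Hypothetical, hand-picked prototype table (an assumption for illustration).
# A real linter would load the Unicode confusables data instead.
const PROTOTYPE = Dict{Char,Char}(
    'I' => 'l', '1' => 'l', '|' => 'l',  # ASCII look-alikes of lowercase l
    'O' => '0',                          # capital O vs digit zero
    '⍴' => 'ρ',                          # APL rho (U+2374) vs Greek rho (U+03C1)
)

# Replace every character by its prototype so that look-alike names collapse
# to the same "skeleton" string.
skeleton(name::AbstractString) = map(c -> get(PROTOTYPE, c, c), name)

# Return the groups of distinct identifiers that share a skeleton.
function confusable_groups(names)
    groups = Dict{String,Vector{String}}()
    for n in unique(names)
        push!(get!(groups, skeleton(n), String[]), n)
    end
    return [sort(g) for g in values(groups) if length(g) > 1]
end

# Identifiers collected from one scope (made-up example input).
for g in confusable_groups(["l1", "ll", "ρ", "⍴", "x"])
    println("possibly confusable: ", join(g, ", "))
end
```

Note that the check only fires when two look-alike names actually occur together, which matches the point made above: a single ℓ or ρ in a scope is not flagged.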

2 Likes

Good catch! I think the right conclusion here is to not use “MALAYALAM DIGIT ZERO” in Julia code.

I mean, two ambiguous characters are a problem if you use both. If it’s obvious that people use only one of them then there’s no problem.

If you disagree with me I guess you never use l in your code: Unicode Utilities: Confusables :slight_smile:

5 Likes

(Moved to new thread since this is a different topic focus. See also previous discussion in Unicode: a bad idea, in general)

4 Likes

(I use ℓ = \ell TAB when I need a single-character integer-valued l identifier, just like I do in mathematical papers. Unicode to the rescue! :wink: )

9 Likes

Which has three lookalikes.

… that I don’t use, so no problem. (Whereas I use I and | and the number 1 fairly often, though to be fair I also try to use a programming font where these look distinct from l.)

7 Likes

Well, my point was that you used \ell in order to avoid confusion with I, 1, l. Yet at the same time you introduced possible confusion between ℓ, ℓ, and 𝓵.

Yes, and this illustrates the underlying weakness of your argument. I don’t use ℓ, ℓ, and 𝓵 in the same code, and I think this is true of most people. I do use I, 1, l in the same code, and I think this is true of most people. Why are you worried about the former more than the latter?

Unicode confusables indeed give new ways to write intentionally obfuscated code, and it would be great if linting tools like Aqua.jl warned about them. But you haven’t made a good case for them being a problem in ordinary (non-malicious) usage.

9 Likes

Steven, it seems to me you’re thinking of this problem in terms of “can I read my own code”. In that case, of course you can, because you know which letter you chose. However, reading someone else’s code is different: It is not obvious when reading the source code which of the unicode characters they chose.

2 Likes

The difference is that I can reasonably expect to encounter situations where someone has written |, I, l and 1 all in one codebase, but it’d be a deliberate attempt to obfuscate code if someone used ℓ, ℓ, and 𝓵 together in one codebase.

Unicode may create more theoretical opportunities to get confused when reading code, but in practice (especially due to the latex shortcuts) it reduces this confusion because it gives us better, more distinct choices than those that are available with just ASCII.

5 Likes

It is not that likely. But my point was, they could choose any one of those letters. How is the reader to know which one it was, until they expend some of their time on analyzing it?

Assuming they aren’t trying to be intentionally perverse, it usually is obvious, because usually for mathematical symbols (which is what we are talking about here mostly) there is a single obvious choice (e.g. \ell or \rho etcetera).

In very rare cases where I’m not sure, I can always paste it into the REPL to find out. (Such cases are usually not confusables, though, but simply some symbol I haven’t used before.)

And, again, assuming the programmer is not malicious, they won’t use two easily confusable symbols in the same scope, so even if they make a weird choice (e.g. APL ⍴ U+2374 instead of Greek \rho = ρ U+03C1, which seems very unlikely and borderline malicious), if I try to edit the code and insert the wrong symbol it will normally give a runtime error. Linter support for confusable warnings will help here, too. (And yes, you can construct an example where it will still run without error. But this is straying pretty far into the hypothetical here. So far you haven’t given a single example of a problem occurring in the wild.)
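For the rare case where a pasted symbol is still unclear, a small helper can print the code points directly. This is only a sketch: `describe_chars` is a made-up name, not something Julia provides, and it relies solely on the `codepoint` function from Base.

```julia
# Hypothetical helper: list the code point of each character as "U+XXXX".
function describe_chars(s::AbstractString)
    return ["U+" * uppercase(string(codepoint(c), base = 16, pad = 4)) for c in s]
end

println(describe_chars("ρ"))  # Greek rho, typed as \rho<tab>
println(describe_chars("⍴"))  # APL rho, the "weird choice" mentioned above
```

The two rho look-alikes from the paragraph above come out as U+03C1 and U+2374 respectively, so a quick check like this settles which one a codebase uses.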

3 Likes

Julia normalizes ℓ and ℓ to the same letter, and 𝓵 can be figured out fairly easily if one is confused via copy-paste:

help?> 𝓵
"𝓵" can be typed by \bscrl<tab>

In almost all circumstances there’s a clear canonical unicode letter to use which is determined by the most favourable LaTeX completion.
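As a small illustration of identifier normalization, the sketch below uses a different pair than the one discussed above: the micro sign µ (U+00B5) and the Greek letter μ (U+03BC), which the Julia manual documents as equivalent in identifiers. It assumes that `Symbol(...)` builds a name verbatim while `Meta.parse` goes through the parser's normalization; the variable names are just for this example.

```julia
micro = "\u00b5"  # µ MICRO SIGN
mu    = "\u03bc"  # μ GREEK SMALL LETTER MU

# Built verbatim, the two names are different symbols ...
raw_differs = Symbol(micro) !== Symbol(mu)

# ... but after parsing they refer to the same identifier.
parsed_equal = Meta.parse(micro) === Meta.parse(mu)

println((raw_differs, parsed_equal))
```

If both values are `true`, source code that mixes the two characters still refers to one variable, which removes that particular pair from the list of practical worries.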

1 Like