Warning against Unicode confusables

PetrKryslUCSD · January 9, 2024, 2:19am

Even in internal code (the guts, which you might have to look at, since coding is not just about APIs), it is possible to have fun with Unicode:

Happy hacking!

DNF · January 9, 2024, 6:22am

I certainly do not know that. You’ve been consistent over time in arguing that unicode characters in code are bad in general, and have repeatedly dismissed any advantage to using them. And now, tellingly, you are posting a table of characters instead of real-world code to prove your point.

The arguments you advance very clearly dismiss unicode per se, even to the point of saying any sort of contact with it is unacceptable to you.

To both of you: if you think that unicode is fine, but some people over-use it, you should make that clear. You’ve both been painting it in an aggressively negative light, even though almost all uses are of the sort I’m describing: a few Greek letters here or there.

There’s no disagreement that unicode can be used to write bad code. But if your goals are actually “to promote responsible use”, then you should both change tack. The current arguments and tone, and the repetitiveness of these attacks, are getting increasingly tiresome and aggravating, which is also why my hackles get raised.

baggepinnen · January 9, 2024, 2:05pm

Are these really problems you have ever encountered in practice? Poor naming of variables is certainly not a problem created by the use of unicode symbols. Just because you can create confusing situations using unicode doesn’t mean that all uses of unicode will. I’d say that the following variable names are at least equally bad and confusing, even though they are ASCII-only. These names are obviously absurd, but no more so than your examples:

Il1li1il1il1il1i1li1l1ili1li
li1li1l1i1li1li1l1il1i1li1li
li1l1il1i1li1li1li1l1il1ili1
li1l1l1il1il1il1i1li1li1li1i
Il1li1il1il1il1i1li1l1ili1li
Il1li1il1il1il1i1li1l1ili1li
li1li1l1i1li1li1l1il1i1li1li
li1l1l1il1il1il1i1li1li1li1i

Tamas_Papp · January 9, 2024, 2:45pm

baggepinnen:

Il1li1il1il1il1i1li1l1ili1li
li1li1l1i1li1li1l1il1i1li1li
li1l1il1i1li1li1li1l1il1ili1
li1l1l1il1il1il1i1li1li1li1i
Il1li1il1il1il1i1li1l1ili1li
Il1li1il1il1il1i1li1l1ili1li
li1li1l1i1li1li1l1il1i1li1li
li1l1l1il1il1il1i1li1li1li1i

I generally visualize complex code with symbol-overlay in Emacs. Here is a screenshot with the above code highlighted:

[Screenshot: screenshot_2024-01-09-154115]

Each variant is colored differently. I have this bound to C-s s s.

PetrKryslUCSD · January 12, 2024, 3:06pm

At the risk of repeating myself once too often: Unicode used to write source code (not as data or as commentary) is a BAD IDEA.

It makes source code NOT WYSIWYG. Recall that programming fonts go to great lengths to distinguish 1 and l and 0 and O etc. Why? Because it avoids confusion. With Unicode it is so easy to spoof by substituting look-alikes! That flies in the face of the need to read source code unambiguously.

sijo · January 12, 2024, 3:13pm

I think you mean “Ambiguous Unicode characters used to write source code is a bad idea”

Because I don’t think there’s anything wrong with writing σ² = 1.2 in code.

Obviously some judgement is required to decide what could be ambiguous, and it’s easier to get wrong than if you restrict yourself to ASCII, but it’s a trade-off. Unicode does have benefits, so use it as long as the benefits are not outweighed by issues such as legibility.

PetrKryslUCSD · January 12, 2024, 3:16pm

There are at least three symbols σ in unicode that look the same:
https://util.unicode.org/UnicodeJsps/confusables.jsp?a=σ&r=None

stevengj · January 12, 2024, 3:17pm

I think we all agree that writing obfuscated code is a bad idea (except as a party trick), Petr. (As others pointed out, you can do that with ASCII too.) But it’s really a conceptually separate topic from the question of ruling out Unicode usage entirely, e.g. symbols like ≤ or α versus <= or alpha, or of providing parser support for typing \alpha for α (or whatever), which aren’t easily confusable. Please start new threads for new topics.

(A linting tool might well want to warn about confusable symbols, either ASCII or Unicode, used in the same scope. There are well-established tables and algorithms to identify these. See “Warn against confusable identifiers used in the same scope?”, Issue #259 in JuliaTesting/Aqua.jl on GitHub.)
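As a rough illustration of the kind of check such a linter could perform, here is a minimal sketch. It is written in Python rather than Julia, purely for illustration. Each identifier is mapped to a "skeleton" by replacing look-alike characters with one representative, and identifiers in the same scope whose skeletons collide are reported. The `LOOKALIKES` table and the names `skeleton` and `confusable_groups` are hypothetical, and the table is deliberately tiny; a real tool would presumably use the published Unicode confusables data instead.

```python
from collections import defaultdict

# Hypothetical, hand-picked look-alike table, for illustration only.
# A real linter would load the full Unicode confusables data instead.
LOOKALIKES = {
    "1": "l",            # DIGIT ONE
    "I": "l",            # LATIN CAPITAL LETTER I
    "\u2113": "l",       # SCRIPT SMALL L
    "\U0001D4F5": "l",   # MATHEMATICAL BOLD SCRIPT SMALL L
    "0": "O",            # DIGIT ZERO
    "\u2374": "\u03c1",  # APL FUNCTIONAL SYMBOL RHO -> GREEK SMALL LETTER RHO
}


def skeleton(identifier):
    """Replace each look-alike character with its representative."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in identifier)


def confusable_groups(identifiers):
    """Return groups of distinct identifiers that share a skeleton."""
    groups = defaultdict(set)
    for name in identifiers:
        groups[skeleton(name)].add(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Identifiers assumed to come from a single scope.
scope = ["l", "\u2113", "x", "I1", "ll"]
for group in confusable_groups(scope):
    print("possibly confusable:", ", ".join(group))
```

Note that this sketch flags ASCII collisions (such as `I1` versus `ll`) in exactly the same way as Unicode ones, since the table makes no distinction between them.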

sijo · January 12, 2024, 3:21pm

Good catch! I think the right conclusion here is to not use “MALAYALAM DIGIT ZERO” in Julia code.

I mean, two ambiguous characters are a problem if you use both. If it’s obvious that people use only one of them then there’s no problem.

If you disagree with me I guess you never use l in your code: Unicode Utilities: Confusables

stevengj · January 12, 2024, 3:24pm

(Moved to a new thread since this is a different topic focus. See also the previous discussion in “Unicode: a bad idea, in general”.)

stevengj · January 12, 2024, 3:41pm

(I use ℓ = \ell TAB when I need a single-character integer-valued l identifier, just like I do in mathematical papers. Unicode to the rescue!)

PetrKryslUCSD · January 12, 2024, 3:43pm

Which has three lookalikes.

stevengj · January 12, 2024, 3:44pm

… that I don’t use, so no problem. (Whereas I use I and | and the number 1 fairly often, though to be fair I also try to use a programming font where these look distinct from l.)

PetrKryslUCSD · January 12, 2024, 4:26pm

Well, my point was that you used \ell in order to avoid confusion with I, 1, l. Yet at the same time you introduced possible confusion between ℓ, ℓ, and 𝓵.

stevengj · January 12, 2024, 4:31pm

Yes, and this illustrates the underlying weakness of your argument. I don’t use ℓ, ℓ, and 𝓵 in the same code, and I think this is true of most people. I do use I, 1, l in the same code, and I think this is true of most people. Why are you worried about the former more than the latter?

Unicode confusables indeed give new ways to write intentionally obfuscated code, and it would be great if linting tools like Aqua.jl warned about them. But you haven’t made a good case for them being a problem in ordinary (non-malicious) usage.

PetrKryslUCSD · January 12, 2024, 4:59pm

Steven, it seems to me you’re thinking of this problem in terms of “can I read my own code”. In that case, of course you can, because you know which letter you chose. However, reading someone else’s code is different: It is not obvious when reading the source code which of the unicode characters they chose.

Mason · January 12, 2024, 5:04pm

The difference is that I can reasonably expect to encounter situations where someone has written |, I, l and 1 all in one codebase, but it’d be a deliberate attempt to obfuscate code if someone used ℓ, ℓ, and 𝓵 together in one codebase.

Unicode may create more theoretical opportunities to get confused when reading code, but in practice (especially due to the latex shortcuts) it reduces this confusion because it gives us better, more distinct choices than those that are available with just ASCII.

PetrKryslUCSD · January 12, 2024, 5:08pm

It is not that likely. But my point was, they could choose any one of those letters. How is the reader to know which one it was, until they expend some of their time on analyzing it?

stevengj · January 12, 2024, 5:08pm

Assuming they aren’t trying to be intentionally perverse, it usually is obvious, because usually for mathematical symbols (which is what we are talking about here mostly) there is a single obvious choice (e.g. \ell or \rho etcetera).

In very rare cases where I’m not sure, I can always paste it into the REPL to find out. (Such cases are usually not confusables, though, but simply some symbol I haven’t used before.)

And, again, assuming the programmer is not malicious, they won’t use two easily confusable symbols in the same scope, so even if they make a weird choice (e.g. APL ⍴ U+2374 instead of Greek \rho = ρ U+03C1, which seems very unlikely and borderline malicious), if I try to edit the code and insert the wrong symbol it will normally give a runtime error. Linter support for confusable warnings will help here, too. (And yes, you can construct an example where it will still run without error. But this is straying pretty far into the hypothetical here. So far you haven’t given a single example of a problem occurring in the wild.)
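The "find out which character this is" step amounts to asking for a character's code point and official Unicode name. Here is a minimal sketch of that idea, in Python rather than the Julia REPL and using the standard `unicodedata` module, applied to the two rho look-alikes mentioned above. The helper name `describe` is hypothetical.

```python
import unicodedata


def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


# The Greek rho and the APL rho look alike but are distinct code points.
for ch in ("\u03c1", "\u2374"):
    print(ch, describe(ch))
```

The two characters print with different code points and names, so a reader who is unsure which one an author used can resolve it with a single lookup.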

Mason · January 12, 2024, 5:10pm

Julia normalizes ℓ and ℓ to the same letter, and 𝓵 can be figured out fairly easily via copy-paste if one is confused:

help?> 𝓵
"𝓵" can be typed by \bscrl<tab>

In almost all circumstances there’s a clear canonical unicode letter to use which is determined by the most favourable latex completion.
