Warning against Unicode confusables

Again, I repeat: With a good programming editor, I, l, 1 are distinguishable at first sight. No copying and pasting into the repl to figure it out is necessary. No malicious (or stupid) substitutions are possible.

1 Like

ℓ and ℓ are the same letter U+2113. Perhaps Petr meant \ell = ℓ = U+2113 and \scrl = 𝓁 = U+1D4C1, which are not normalized to equivalent identifiers (:ℓ != :𝓁).
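
A minimal sketch of why those two codepoints stay distinct, written in Python with the standard `unicodedata` module rather than in Julia. It assumes, as stated elsewhere in this thread, that Julia's identifier normalization is NFC-like; the variable and helper names are made up for illustration.

```python
import unicodedata

ell = "\u2113"        # SCRIPT SMALL L (the \ell character)
scr_l = "\U0001D4C1"  # MATHEMATICAL SCRIPT SMALL L (the \scrl character)

def nfc(s):
    """Canonical normalization (NFC)."""
    return unicodedata.normalize("NFC", s)

def nfkc(s):
    """Compatibility normalization (NFKC), shown only for contrast."""
    return unicodedata.normalize("NFKC", s)

# NFC leaves both characters untouched, so they remain different symbols.
distinct_under_nfc = nfc(ell) != nfc(scr_l)

# NFKC folds both of them to a plain ASCII "l".
same_under_nfkc = nfkc(ell) == nfkc(scr_l) == "l"
```

Under these assumptions, only a compatibility-style normalization would merge the two letters; a canonical one keeps them apart.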

1 Like

I think we need to be careful with statements like this. I believe you want to encourage programmers to rarely use obscure symbols.

However, you are also effectively saying that all programmers should only program in English as supported by ASCII or perhaps Latin-1 (ISO-8859-1). That’s a very western “European-origin” perspective that we need to be conscious about. While American English has become a de facto standard for international correspondence and thus programming, there is validity in people also writing programs in their native human languages. I know you likely did not intend the statement to discount this.

To the larger question, I think we should aim to make code as readable as possible as a general goal. There’s some question of audience as well.

6 Likes

Which might make it impossible for speakers of other languages to understand their programs. :wink:

Some browsers perform NFC Unicode normalization, much like Julia does, when sending Discourse replies. It’s likely that a third distinct codepoint was originally entered here… but that’d be a moot point, as Julia would normalize it similarly, avoiding the confusion for whatever that particular original codepoint was.
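
As an illustration of how NFC can silently replace a typed codepoint, here is a small Python sketch using the standard `unicodedata` module. The Ångström sign is only an example chosen for this sketch; it is not claimed to be the character that was lost in the post above.

```python
import unicodedata

angstrom_sign = "\u212B"  # ANGSTROM SIGN
a_with_ring = "\u00C5"    # LATIN CAPITAL LETTER A WITH RING ABOVE

# NFC maps the Angstrom sign onto the ordinary letter, so the original
# codepoint does not survive a round trip through an NFC-normalizing tool.
normalized = unicodedata.normalize("NFC", angstrom_sign)
survives_nfc = normalized == angstrom_sign

# NFC also composes a base letter plus combining accent into one codepoint.
composed = unicodedata.normalize("NFC", "e\u0301")
```

After normalization the two spellings are the same string, which is why such pairs cannot be told apart once the text has been posted.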

2 Likes

Yes, indeed. A copy-paste error. Sorry.

And to repeat myself: Unicode improves legibility and clarity of code, so it’s a good idea to use it responsibly.

7 Likes

I use vim, and if I hit ga it will show me detailed information about the character the cursor is on. I imagine other editors have similar features.

2 Likes

This seems likely to go in circles forever to no productive end. Unicode exists. It’s not going anywhere. Style guides can and do address the matter. You can choose to endorse or enforce a particular style guide for code you control. And you can kindly suggest others follow some style guides, but with the understanding that it’s an opinion-based choice. You can start Tooling topics about better linting/editing/refactoring tools.

Arguing on the internet isn’t going to change opinions on the matter… and there are plenty of other places to argue ad nauseam. Let’s set a timer to prevent this from going on indefinitely and potentially escalating unnecessarily here.

18 Likes

I apologize in advance for piling onto this, but I thought I’d add a point I didn’t see earlier. I guess the usual point of these types of threads is to push the needle on the “stylistic culture” of Julia in one particular direction, so technically I see this all as productive (in some sense of the word :)).

In general I worry the popular use of Unicode in Julia may be ever-so-slightly hurting Julia’s rate of adoption. While it may be supported in other languages, it is much more common in Julia, including in the standard library.

I remember when I was first starting out with Julia, I looked at some ODE solver examples and saw the use of Unicode everywhere (for variable names, and some API use like \epsilon). I remember this leaving me with a poor first impression of Julia based on my naive assumptions at that time: “How can I even write Julia code in my editor if it needs Unicode? I can’t just remember all the commands…” You have to keep in mind that a person encountering Julia for the first time is just going to skim some snippets and not really read deep into the complexities of the documentation and all the best tooling practices until they’ve already made the leap (if you’ve ever looked at Google analytics for a blog, it can be a bit depressing!). In my mind I only knew that ⌥+S is ß (which in retrospect is the German eszett rather than the math beta…), so I got initially discouraged.

Of course it’s clear we should be setting up our tooling correctly to work with this, but I do want to point out that seeing the use of Unicode for the first time in code can be a bit of a distraction from the best parts of Julia, and it might leave users with a negative impression. I think the vast majority of beginners haven’t coded in Unicode before, so it’s quite weird to see it the first time.

The only other language I can think of with such prevalent Unicode is Lean: 100 theorems in Lean, but this makes a bit more sense to me as it is exclusively about Maths and the visual presentation of theorems, whereas Julia is a more general programming language. Thus Julia should hope to attract a more general audience as well (who might similarly recoil at the sight of Unicode math).


Slightly tangential, but I was trying to read through the Julia source code today in the method abstract_eval_statement_expr: julia/base/compiler/abstractinterpretation.jl at 5b6a94da5af35a4aa91759cac2f8db7669a6ec2a · JuliaLang/julia · GitHub

This has Unicode which apparently failed to render on my very modern MacBook Pro in Firefox:

I have no idea why GitHub can’t render it, but stuff like this is always a bit of a negative for Unicode imo.

(Not to mention I still have no idea how to write Unicode math on my phone, which I will occasionally use for reviewing PRs)

10 Likes

As a counter-point, Unicode (as well as the other “controversial” feature of 1-based indexing) was among the things that very much attracted me to Julia. As a computational scientist, it was a strong signal that this language is designed for scientific computing and that I should invest in it.

10 Likes

Thanks, that is a useful perspective. I guess we could coarsely simplify this to the following inference problem:

  • P(starts using julia | likes unicode)
  • P(starts using julia | hates unicode)
  • P(likes unicode)
  • P(hates unicode)

If we only care about attracting more users (inb4 this is obviously not a complete objective, I am aware), then say we optimize P(starts using julia).
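
One way to tie the four quantities together is the law of total probability. This is only a sketch under a simplifying assumption not made in the post: every potential user either likes or hates Unicode, so that P(likes unicode) + P(hates unicode) = 1.

```latex
P(\text{starts using julia})
  = P(\text{starts using julia} \mid \text{likes unicode})\, P(\text{likes unicode})
  + P(\text{starts using julia} \mid \text{hates unicode})\, P(\text{hates unicode})
```

Under that assumption, the overall rate is a weighted average of the two conditional probabilities, with the weights given by how common each attitude is in the target audience.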

My intuition is that P(hates unicode) > P(likes unicode) among general programmers, but perhaps P(hates unicode) ~ P(likes unicode) among computational scientists like yourself?

The remaining question is P(starts using julia | likes unicode) and P(starts using julia | hates unicode). I guess, how much does the unicode really matter if you do like it? Does it really impact your decision to like Julia?

On the other hand, how much does the Unicode matter if you don’t like it?

My very rough guesstimation (which could be biased to my own taste!) is that hating Unicode lowers P(starts using julia) by more than liking it raises it… And therefore we should be more wary about using Unicode in APIs and examples, based on that group. But I really am not sure about any of this.

I guess the other question is how much we want to get more general programmers into Julia which is historically very heavy on the computational scientist side? (And would that change this analysisā€¦?)

2 Likes

I’ve always taken the “Julia is a general purpose language” claim with a grain of salt. It feels very much like a domain-specific language to me (for the domain of scientific computing, understood broadly). And that’s fine! I believe “embrace the niche” is a good motto.

I’m biased, of course.

I’m sure someone has pointed it out already, but the use of the term “Unicode” here seems to have adopted the meaning “the subset of Unicode that isn’t on my ASCII keyboard.” Even if we dismiss mathematics, physics, and non-Western languages (which we absolutely should not), keyboard layouts will differ even within Western languages. People have figured out how to avoid mixing visually similar Unicode characters for their symbols, and if someone doesn’t immediately recognize which is which in an unfamiliar context, they need only learn from other people. ASCII-ifying everything not only doesn’t make sense in many contexts, it’s not even feasible to do.

You can say the same about any language. Julia is undoubtedly based in English and requires some English knowledge to interact with, but even people who learned English may find it more feasible to program and communicate to their colleagues in their native language. With enough aliasing, even most variables can be renamed in other languages, which helps to align with docstrings and comments.

At its core, there’s really nothing special about Unicode here. It’s all about picking good names: names that your audience will understand. See Naming is hard, let’s do better.

I think everyone should be able to appreciate that audiences differ wildly throughout the Julia community, with many natural-language and domain-specific subgroups.

6 Likes

I agree that this is a real concern. As I wrote in another post, it’s important for newcomers to realize that they can use Unicode symbols in their code, but they are not required to do so in order to program in Julia.

The question is, what are actionable ways to help address this concern? Unicode isn’t going away: many people in computational science are not going to go back to writing alpha and beta as variable names once they realize that they can write α and β, nor should they have to. Some possible things to work on might be:

  • In tutorial materials aimed at beginners, encourage people to be cautious about Unicode symbols. (You should introduce them at some point, because people will see them in other Julia code, but make sure to emphasize that they are optional, and definitely don’t include them without explanation.) I recently submitted a PR to the Think Julia book to correct just such an issue (replace "🐢" with "turtle" in examples by stevengj · Pull Request #61 · BenLauwens/ThinkJulia.jl · GitHub), though I’m not sure when/whether it will be merged. Clarifying patches submitted to other tutorials might be helpful too.
  • Something in the Julia manual to emphasize this? A new FAQ, or edits somewhere else, a blog post? If you have a good idea for this, please feel free to submit a PR aimed at clarifying this issue to newcomers.
  • Maybe an addition to the style guide, suggesting that public APIs should typically be ASCII or have ASCII synonyms for accessibility. (Of course, people in specialized fields may choose to disregard this, just as people can disregard anything in the style guide if they wish. The point is not to shame people, but to make sure they appreciate the tradeoffs here.)
  • If there are isolated Unicode-only APIs in packages (i.e. some random little thing, not a package designed top-to-bottom to use Unicode symbols), a PR to add an ASCII symbol might well be welcome. (No, xor doesn’t count here: lack of an infix operator ≠ lack of an API.)
  • Implement warnings about confusable symbols in linting tools. (I opened an issue for Aqua.jl).
  • Others?
11 Likes

Since it hasn’t been mentioned before, I believe @PetrKryslUCSD’s point is what is known as a homograph attack in security circles. See also:

5 Likes

Note that the first examples of a homograph attack all involve mimicking ASCII characters. Confusability is not a reason to differentiate ASCII and the rest of Unicode. Barring malice, people have no issue avoiding similar symbols; e.g., typing Cyrillic doesn’t accidentally involve Latin.
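
To make the Cyrillic/Latin case concrete, here is a rough Python sketch of a mixed-script check of the kind a linter might run. It is not how any existing Julia tool works: the function names are hypothetical, and using the first word of a character's Unicode name as its "script" is a crude heuristic adopted only for this sketch.

```python
import unicodedata

def scripts_in(identifier):
    """Collect a rough script label for every letter in `identifier`.

    Heuristic (an assumption of this sketch): the first word of the
    Unicode character name, e.g. "LATIN", "CYRILLIC", "GREEK".
    """
    labels = set()
    for ch in identifier:
        if ch.isalpha():
            labels.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return labels

def is_mixed_script(identifier):
    """True when the letters of `identifier` carry more than one script label."""
    return len(scripts_in(identifier)) > 1

latin_word = "paypal"              # all Latin letters
spoofed_word = "p\u0430yp\u0430l"  # U+0430 is CYRILLIC SMALL LETTER A
```

Here `is_mixed_script(spoofed_word)` flags the look-alike while `is_mixed_script(latin_word)` does not. The heuristic mislabels mathematical letters (U+2113 would be labeled "SCRIPT"), so a real tool would more plausibly rely on the script and confusables data published with Unicode.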

My speculation is that, because Julia allows for such liberal usage of Unicode characters, it tends to attract an audience that uses them. Along with this, some can (and do) go overboard with their usage. It’s clear to me that not all usage improves readability. Some may dispute this, but it is absolutely true to me and many others. Already in that screenshot of yours I see problems for readability besides the failure to render: the blackboard L, easily confused with capital L, and the \subset symbol, which I recognize in isolation but have no idea what it’s doing in this context. There are plenty of other examples prone to confusion that I could list off the top of my head. Anyway, I think your suspicion that it’s hurting first impressions and adoption is right on point.

At the risk of injecting a bit of humor (philosophy?), my dad (born in 1923) would say “Never died when man walked on the moon!” I understand the statement came from people that said something to the effect of “that won’t happen until man walks on the moon.” And we all know what happened…

This Unicode discussion reminded me of his comment.

1 Like