Tab completion of \uXXXX in the REPL?

I would like to see better explicit support for UTF-8 encodings in Julia.

Unicode has been incredibly successful at extending character encodings
to pretty much all human languages. The ability to encode
glyphs into a sequence of bytes (i.e. a number) means that we can
unambiguously express glyphs in a form that can be exchanged
digitally.

However, my experience is that the encoding is the key here. While
the pictures/glyphs of the characters are good to read, for a computer
to say two characters are the same they need to have equal encodings.

If we were to add \u0041-style tab completions to the Julia REPL,
users could enter every possible character into Julia.

Similarly, I would like a way to extract the encoding from a visible
glyph or character so that it could be “directly” entered into julia via
the REPL mode. This could probably be implemented with the
existing Unicode package functionality and would help for cases where
your font is missing characters, etc.

It seems like you are using the word “encoding” in a non-standard way — it sounds like you mean the Unicode codepoint value. e.g. U+0041 is the codepoint for the ASCII character 'A', not an “encoding”.

In contrast, UTF-8 is an “encoding” of Unicode characters into byte sequences. For example, U+1F385 '🎅' is encoded as a sequence of 4 bytes in UTF-8:

julia> codeunits("🎅")
4-element Base.CodeUnits{UInt8, String}:
 0xf0
 0x9f
 0x8e
 0x85

whereas U+03B1 'α' is encoded as two bytes in UTF-8:

julia> codeunits("α")
2-element Base.CodeUnits{UInt8, String}:
 0xce
 0xb1
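To make the codepoint/encoding distinction concrete, here is a rough sketch of how a codepoint maps to UTF-8 bytes. `utf8_bytes` is a name made up for this post (it is not in Base), and it does no validation of surrogates or out-of-range values; in real code you would just call `codeunits`.

```julia
# Hypothetical helper (not part of Base): hand-encode one codepoint as UTF-8.
# No validation of surrogates or values above U+10FFFF; illustration only.
function utf8_bytes(cp::Integer)
    cp = UInt32(cp)
    if cp < 0x80            # 1 byte:  0xxxxxxx
        return UInt8[cp]
    elseif cp < 0x800       # 2 bytes: 110xxxxx 10xxxxxx
        return UInt8[0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]
    elseif cp < 0x10000     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return UInt8[0xe0 | (cp >> 12),
                     0x80 | ((cp >> 6) & 0x3f),
                     0x80 | (cp & 0x3f)]
    else                    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return UInt8[0xf0 | (cp >> 18),
                     0x80 | ((cp >> 12) & 0x3f),
                     0x80 | ((cp >> 6) & 0x3f),
                     0x80 | (cp & 0x3f)]
    end
end

utf8_bytes(0x1F385) == codeunits("\U1F385")  # same 4 bytes as shown above
utf8_bytes(0x03B1) == codeunits("\u03B1")    # same 2 bytes as shown above
```

The codepoint is the number that goes in; the bytes that come out are the encoding.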

I showed how to get the UTF-8 encoding bytes with codeunits above, but it sounds like you really want the codepoint value, which you can get easily by e.g.:

julia> '🎅' # display information about the character in the REPL
'🎅': Unicode U+1F385 (category So: Symbol, other)

julia> UInt32('🎅') # codepoint as an integer value
0x0001f385

Note that a “visible glyph” might be more than one character, e.g. α̂ (a single “grapheme”) is two characters:

julia> collect("α̂")
2-element Vector{Char}:
 'α': Unicode U+03B1 (category Ll: Letter, lowercase)
 '̂': Unicode U+0302 (category Mn: Mark, nonspacing)
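If you want to count what a reader would see rather than what is stored, the `Unicode` standard library can iterate graphemes. A small sketch, with the string written as escapes so that it does not depend on how your editor saved the file:

```julia
using Unicode

s = "\u03B1\u0302"          # α followed by a combining circumflex

n_chars     = length(s)              # number of characters (codepoints)
n_graphemes = length(graphemes(s))   # number of user-perceived glyphs
n_bytes     = ncodeunits(s)          # number of UTF-8 code units (bytes)
```

Here there are two characters and four bytes, but only one grapheme.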

and you can get information about how to type it easily by pasting it at the help?> prompt:

help?> α̂
"α̂" can be typed by \alpha<tab>\hat<tab>

You can also type codepoint values as \uXXXX escape sequences into a string and then copy-paste it:

julia> "\u03B1\u0302"
"α̂"
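Going the other way (from a pasted glyph to escapes you could type) is only a few lines. This is a sketch, and `codepoint_escapes` is a name invented here, not something in Base:

```julia
# Hypothetical helper (not part of Base): write every character of `s`
# as a \uXXXX escape (or \U... above U+FFFF) that a Julia string literal accepts.
function codepoint_escapes(s::AbstractString)
    io = IOBuffer()
    for c in s
        cp = UInt32(c)   # throws if `c` is malformed UTF-8
        if cp <= 0xffff
            print(io, "\\u", string(cp, base = 16, pad = 4))
        else
            print(io, "\\U", string(cp, base = 16))
        end
    end
    return String(take!(io))
end

codepoint_escapes("\u03B1\u0302")   # the text \u03b1\u0302
codepoint_escapes("\U1F385")        # the text \U1f385
```

Because every character is escaped, each escape is followed by a backslash or the end of the string, so the hex digits cannot run into the next character.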

You can also add custom tab completions to the REPL, e.g.

using REPL: REPLCompletions
REPLCompletions.latex_symbols["\\alphahat"] = "α̂" # or "\u03B1\u0302"

will let you tab-complete \alphahat to α̂. (And, of course, all modern operating systems provide a variety of input methods for Unicode characters.)
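Along the same lines, you could register completions keyed by codepoint yourself. This is only a sketch: `add_codepoint_completion!` is a made-up helper, `latex_symbols` is an internal table of the REPL stdlib that could change between versions, and I am assuming the REPL treats a key like `\u03b1` the same way it treats `\alphahat`.

```julia
using REPL: REPLCompletions

# Hypothetical helper (not part of the REPL stdlib): register a completion whose
# name is the lowercase hex codepoint, e.g. \u03b1 for U+03B1 or \U1f385 for U+1F385.
# Relies on the internal `latex_symbols` table, which may change between versions.
function add_codepoint_completion!(cp::Integer)
    c = Char(cp)
    isvalid(c) || throw(ArgumentError("not a valid Unicode scalar value: $cp"))
    hex = string(UInt32(cp), base = 16, pad = 4)
    key = (cp <= 0xffff ? "\\u" : "\\U") * hex
    REPLCompletions.latex_symbols[key] = string(c)
    return key
end

add_codepoint_completion!(0x03b1)    # registers \u03b1
add_codepoint_completion!(0x1f385)   # registers \U1f385
```

Put something like this in your startup.jl for the handful of codepoints you actually use; registering all of Unicode this way would bloat the table for little benefit.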

See this post for how to directly use codepoint values as variable names like uvar"\u03B1\u0302" in Julia (which in practice will probably be about as popular as trigraphs).

Note that the claim that two characters are the same only if they have equal encodings is not quite true, especially for strings (or “glyphs” or graphemes) that consist of multiple characters. Unicode equivalence generally involves some form of normalization to do comparisons. (And Julia provides facilities for this. For source-code identifiers, Julia does NFC normalization + some custom normalizations.)
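A quick illustration with the `Unicode` standard library, using "é" written two ways:

```julia
using Unicode

a = "\u00e9"     # é as a single precomposed character
b = "e\u0301"    # e followed by a combining acute accent

same_raw = (a == b)   # false: the codepoint sequences differ
same_nfc = Unicode.normalize(a, :NFC) == Unicode.normalize(b, :NFC)   # true
```

The two strings render identically but only compare equal after normalization.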

It’s 2024 — if your font won’t display characters that you want to use, get a better font. (And if your editor doesn’t support Unicode, stop using ed and get a better editor.)


I think OP means they want to tab complete \U1f385 into 🎅 or similar.


Thanks for the clarification. I was thinking in the context of Julia where its Unicode support
is UTF-8 encoding based. However, the Unicode codepoint value is best aligned with
what I meant.

This produces a string. You have to fiddle with it to get a symbol/variable name.
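For what it's worth, the fiddling is short. A sketch of one way to do it (the value `1.5` is just a placeholder):

```julia
# Build the variable name from escapes, then bind it at top level.
name = Symbol("\u03B1\u0302")   # the Symbol whose name is α followed by U+0302
@eval $name = 1.5               # defines a global variable with that name
```

After this, typing the name at the prompt (or `eval(name)`) gives back the value, but it is still a two-step process rather than something the REPL does for you.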

My point was to have this as a feature of the Julia environment via the REPL. From your
example it seems like it is definitely possible.

Yes, I followed that discussion with interest. However, if you were ever stuck with
a [really] old-style keyboard back in the day (as it were), you could actually need the
trigraphs.

Yes, this goes back to your earlier clarification: “Unicode codepoint” is the better term.

I’m using JuliaMono and when I type a string with “\U3060” I get this:

JuliaMono_U3060

and not this:

U3060

which demonstrates that it is not always simple/possible to “get a better font”.
This is a bit contrived with mixing languages but I hope you get the point.

:slightly_frowning_face: Snark much?

Yes, details in my reply to @stevengj

For this issue you do just need appropriate fonts installed.

image

For tab completion of code points, that seems like a reasonable idea to me, you could file an issue or make a PR.


This is what I call an “and then the magic happens” solution, as it doesn’t solve the
problem, just pushes it to some additional process to obtain the desired result. :slight_smile:

I’ve seen lots of examples on discourse here where users did not understand
that the problem was their font and how or where to find one. I find it unsatisfying
to have to fish around hand testing fonts by character until something works.

IDEA: It would be helpful to have a utility to check a font for code points that
match a template set of code points.

If you are using anything but the Windows Console (not the Windows Terminal), then your system should automatically use “fallback fonts” for characters that are missing from the current font, which should allow it to render basically any Unicode glyph that your system supports.

If you are a Japanese speaker who wants to use the U+3060 character, presumably your system is set up to render hiragana characters and they will exist in a fallback font.

If you are using the Windows Console, you should switch to the Windows Terminal — the primitive Unicode support in the Windows Console (e.g. the lack of fallback support) was an endless source of difficulty that led us to seriously consider bundling Julia with a better console on Windows, but thankfully it has now been obsoleted by the Windows Terminal.


There are about 150,000 Unicode glyphs at present, but unfortunately a font can hold only a maximum of 65,535 glyphs, so a font that contains every Unicode glyph is not technically possible.


I think this is more myth than reality — they were included defensively in the earliest C standards (and are finally being removed), but as far as I know, trigraphs never saw widespread use. If you had a system in the 70s that didn’t support the { character (e.g. EBCDIC only), you just didn’t program in C—it’s just too painful to write C code with ??< instead of {—and such systems rapidly disappeared.

In the same way, even though you can program Julia today with variables like uvar"\u3060" as I explained in my post, my guess is that this ability will virtually never be used. If their editor/terminal/font doesn’t support a variable name they want to use, people will either not use that variable name or switch to a better editor/terminal/font.


Modern text-rendering systems (used essentially everywhere but in the legacy Windows Console) support fallback fonts — a single font doesn’t need to cover everything. (But there are fonts that cover almost everything except for Asian languages.)

indeed I’m very grateful :grinning:


There is evidence that trigraphs were in use in C standards discussion documents.

Exactly my point. It is not possible that any font (default or otherwise) can include all Unicode glyphs.

You can install fonts from a large collection like Noto and get most of what you’ll need.

Which is why modern text-rendering systems support fallback fonts, as I said.

It would help if you could give a concrete practical problem you want to solve. For example, “I want to use character X extensively in my code, but I don’t know how to type it and/or my system can’t render it.” — for what value of X is this true?

(If you are using the Windows Console, as I said, Unicode is a real problem, but the solution is to switch to the Windows Terminal. The only other example you’ve given is a hypothetical Japanese speaker whose system can’t render Japanese characters, which I find hard to swallow.)

@stevengj You seem to have missed the point. The original discussion was
about better explicit unicode support in Julia and not about how any given
operating system should be able to support unicode or what changes need to be
made outside of Julia to “fix” things.

REPL completions allowing one to enter Unicode characters by codepoint would
address this in Julia without forcing users to any specific IDE, editor, operating
system, configuration, etc.

Again, what concrete problem would \uXXXX tab completion in the REPL solve? Are there users out there who have memorized the codepoint values for lots of Unicode characters and prefer to type them this way?

For the rare cases where someone is given a Unicode codepoint number like U+1F385, not a character that they can copy-paste, and wants to get it into the REPL somehow, they can always type it into a string and copy-paste from that, but how often does it happen that someone thinks “how do I type U+1F385 as a variable”?

It’s certainly possible to add new tab completions to the REPL, but if there is a symbol lots of people want to type it’s usually better to add a human-readable tab completion like \:santa: (which tab-completes to U+1F385) than a tab completion for \u1f385 that no one will be able to remember.

(And people who want to type things like Asian languages invariably have their own input methods that they are already used to.)


I have no administrator control on many computers that I work with.

I often am not permitted to use a desired editor or IDE.

I am not able to install my own fonts.

A very simple extension to the REPL, allowing a TAB completion fallback
for cases where one cannot control all the aspects of programming and must
only use approved software [Julia!], would allow programming to take place.

Personally, I prefer \alpha and similarly, I would rather type a \Uacodepoint
rather than use a GUI, take my hands off of the keyboard, or get on the internet
to program.

Can you give a concrete example of a \uXXXX that you actually want to use in programming, would be willing to type repeatedly in the awkward \uXXXX form to the point where you would memorize the codepoint XXXX (rather than something you need once in a blue moon and can copy-paste) that doesn’t have a human-readable tab-completion already in the REPL, and that you wouldn’t add your own REPLCompletions tab completion for to your startup.jl (e.g. if it’s an idiosyncratic symbol that appears a lot in a specific project)?

(Why would you even know the codepoint XXXX? If you are googling a symbol to find a codepoint value, wouldn’t you just copy-paste the symbol itself rather than the codepoint?)

Right now this all seems very hypothetical.
