Graphemes vs chars

Cthulhu has an internal TextWidthLimiter<:IO which allows you to print text while attempting (not always successfully) to limit output to a certain number of characters. I’m considering splitting it out into its own package so that it can be used more broadly (and hopefully be made more robust).

One point I’m unsure of is how to handle the distinction between graphemes and Chars: the issue is that single graphemes at least sometimes take up the space of two Chars on my screen. This seems to introduce some inconsistencies in terminal manipulations, and I’m unsure of whether there is even a way to handle this robustly.

Here’s a demo which walks through some of the issues I’ve discovered. Note that here on discourse “éé” prints with no space between the "é"s, but when I try it in my terminal there is a space between them.

julia> using Unicode

julia> str = "exposé"
"exposé"

julia> collect(str)     # collect will treat the é as two Chars
7-element Vector{Char}:
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
 'p': ASCII/Unicode U+0070 (category Ll: Letter, lowercase)
 'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)
 's': ASCII/Unicode U+0073 (category Ll: Letter, lowercase)
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 '́': Unicode U+0301 (category Mn: Mark, nonspacing)

julia> using Unicode

julia> g = collect(graphemes(str))   # graphemes treats the é as a single entity
6-element Vector{SubString{String}}:
 "e"
 "x"
 "p"
 "o"
 "s"
 "é"

julia> c = g[end]
"é"

Now let’s see what happens when we mix printing c with terminal manipulations. “\e[$(n)D” means “go back n” and “\e[K” means “kill to the end of the line”. Below, killstr is built to print a character n times, move back n columns, and then kill to the end of the line; if each grapheme (despite appearances) really has width 1, this should leave a blank line in all cases:

julia> n = displaysize(stdout)[2]    # current width of my terminal window
119

julia> killstr = repeat('x', n) * "\e[$(n)D\e[K"
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\e[119D\e[K"

julia> print(killstr)    # works as expected

julia> killstr = repeat(c, n) * "\e[$(n)D\e[K"
"ééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééé\e[119D\e[K"

julia> print(killstr)   # does not work as expected
ééééééééééééééééééééééééééééééééééééééééééééééééééééééééééée

So it works as expected when printing 'x' but not "é". Amusingly, note that the final character is 'e' and not "é", indicating that it stripped the accent mark.

Therefore, this is also a lie:

julia> textwidth(c)
1

This makes me think that when it comes to width-limited output, Char-iteration is to be preferred over graphemes, despite the current internal implementation of TextWidthLimiter. However, if this is a Julia bug (or a terminal setting issue) that should be fixed, it might be better to correct it first and then write the package with the correct implementation in mind.

I’d love any insights anyone wants to share.

I have another é:

julia> str = "é"
"é"

julia> collect(str)
1-element Vector{Char}:
 'é': Unicode U+00E9 (category Ll: Letter, lowercase)

Yours was:

julia> str = "é" 
"é"
# this shows in the REPL (Windows Terminal) with half a space 
# before and after the é , see picture below

julia> collect(str)
2-element Vector{Char}:
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 '́': Unicode U+0301 (category Mn: Mark, nonspacing)

[image: screenshot of the REPL in Windows Terminal]

My é is from german keyboard after pressing ´ and then e .

Glad :sweat_smile: to know it can be produced in different ways. But really my question is, “if there is a distinction between the grapheme and the Char, which is correct?” You’ve side-stepped the issue, but the issue remains.

The behaviour you see depends on how your particular terminal renders characters and handles ANSI escape characters. For example, using the Alacritty terminal, I can’t reproduce:

julia> print(repeat("\u65\u301", 17) * "\e[17D\e[K")

julia>

I tried to get to the bottom of this when writing a terminal application about two years ago, and I found out that there is no consistent behaviour you can exploit. Your best bet is to use something like textwidth, and hope your users have a terminal that behaves correctly.
A few fun edge cases:

  • is a single char, but is super wide. How wide it renders varies a lot: in my terminal it has a width of 8 characters; on Discourse it has a width of 1 when monospaced, but approximately the width of 7 m’s when not.
  • 👩🏼‍❤️‍💋‍👨🏻 is a single grapheme cluster (which is distinct from a grapheme!), but it is one of many non-standard emojis that are often not handled correctly, so the number of columns it occupies varies widely between renderers. Fun fact: this grapheme cluster has a width of 2, but is composed of 10 chars and 35 (!) codeunits.
  • There are Unicode symbols for “reverse text reading direction”, e.g. in the string “I enjoyed staying – באמת! – at his house.”, the ב is before the ת, since the string changes reading direction midway through. Try highlighting the sentence with your mouse by clicking and dragging through the sentence to see how it behaves. So, what should even happen if you ask your terminal to “go back one character”?
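
The different ways of counting mentioned in the list above can be checked directly in Julia. This is only a small sketch: the variable names are made up for illustration, and the emoji is written out with escapes on the assumption that it is the same ZWJ sequence as the one shown in the list. The grapheme count and textwidth of the emoji are deliberately not shown, since those can depend on the Julia version and its Unicode tables.

```julia
using Unicode

# 'e' followed by a combining acute accent (the decomposed form from the first post)
s = "e\u0301"
n_chars     = length(s)             # 2 Chars
n_units     = ncodeunits(s)         # 3 UTF-8 code units (1 + 2)
n_graphemes = length(graphemes(s))  # 1 grapheme
width       = textwidth(s)          # 1 column, according to Julia

# The kiss emoji written as escapes (assumed to match the sequence in the list):
# woman, skin tone, ZWJ, heart, variation selector, ZWJ, kiss mark, ZWJ, man, skin tone
kiss = "\U1F469\U1F3FC\u200D\u2764\uFE0F\u200D\U1F48B\u200D\U1F468\U1F3FB"
kiss_chars = length(kiss)       # 10 Chars
kiss_units = ncodeunits(kiss)   # 35 code units
```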

Also, the two different strings you talk about are “\u65\u301” and “\ue9”. The former can be normalized to the latter:

julia> only(Unicode.normalize("\u65\u301"))
'é': Unicode U+00E9 (category Ll: Letter, lowercase)

In conclusion, you have no hope.

:laughing: That’s kind of what I expected. So I take it this isn’t a Julia bug but instead one at the terminal level? It seems someone should just fix the terminal programs…

Yes that’s what I think, unfortunately. For what it’s worth, I would imagine these specialized cases I showed above are rare, so in 99% of cases you can get away with not handling them. At least, just relying on textwidth is still better than most naive solutions, so having a package that does that is still useful.

Now that we have Preferences maybe the new package should allow this to be configured by the user.

Maybe running Unicode.normalize on the input as an initial parsing step will avoid many edge cases? (or isascii(str) ? str : normalize(str) if performance is a concern)
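
A minimal sketch of that pre-pass might look like the following. The name prenormalize is hypothetical, and choosing the :NFC form (canonical composition) is an assumption; it will not help for sequences that have no precomposed equivalent.

```julia
using Unicode

# Hypothetical helper: compose characters where a precomposed form exists,
# skipping the work entirely for pure-ASCII input.
prenormalize(str::AbstractString) =
    isascii(str) ? str : Unicode.normalize(str, :NFC)

decomposed = "expose\u0301"            # 7 Chars: the final é is 'e' + U+0301
composed   = prenormalize(decomposed)  # 6 Chars: the final é is U+00E9
```

After this step, Char iteration and grapheme iteration agree for this particular string, though not for every possible input.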

See also the infamous issue: Julia doesn't like Pizza · Issue #3721 · JuliaLang/julia · GitHub

See also set east asian neutral width to 1 by joshuarubin · Pull Request #83 · JuliaStrings/utf8proc · GitHub

TLDR: the Unicode standard does not fully specify the width of characters in monospaced fonts. There are terminals that don’t respect what information Unicode does provide (because the OS charwidth tables are out of date), but there are also characters where Unicode is ambiguous and different fonts and terminals disagree. textwidth reports what information Unicode provides, and it is usually pretty good … but there will probably always be characters where it doesn’t match your terminal.

(textwidth does not interpret ANSI escape sequences for terminals, but you could certainly add this on top of textwidth. Note that you can call textwidth on an individual Char as well as on a string.)
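
One way to layer escape handling on top of textwidth, as a rough sketch: strip CSI sequences (ESC, "[", parameter bytes, a final byte) with a regex before measuring. The name ansi_textwidth and the regex are illustrative assumptions, and other kinds of escape sequences are not handled.

```julia
# Matches CSI sequences: ESC [ , parameter bytes 0x30-0x3F,
# intermediate bytes 0x20-0x2F, and one final byte 0x40-0x7E.
const CSI_REGEX = r"\e\[[0-?]*[ -/]*[@-~]"

# Hypothetical helper: textwidth of `str` with CSI sequences removed first.
ansi_textwidth(str::AbstractString) = textwidth(replace(str, CSI_REGEX => ""))

colored = "\e[31mred\e[0m"
w_colored = ansi_textwidth(colored)        # 3
w_kill    = ansi_textwidth("xxx\e[3D\e[K") # 3
```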

No. Use textwidth.

Just loop over chars and accumulate the textwidth of each character (+ special handling for ANSI escapes if needed).
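
As a sketch of what that loop could look like (this is not the TextWidthLimiter implementation from Cthulhu; truncate_to_width is a hypothetical name, and only CSI-style escapes are passed through):

```julia
# Hypothetical helper: keep as many Chars of `str` as fit in `limit` columns,
# measured by accumulating textwidth per Char. CSI escape sequences are copied
# through without counting toward the width.
function truncate_to_width(str::AbstractString, limit::Integer)
    io = IOBuffer()
    width = 0
    state = :text   # :text, :esc (just saw ESC), or :csi (inside ESC [ ...)
    for ch in str
        if state === :esc
            print(io, ch)
            state = ch == '[' ? :csi : :text
        elseif state === :csi
            print(io, ch)
            # A CSI sequence ends at its final byte, which lies in '@':'~'
            '@' <= ch <= '~' && (state = :text)
        elseif ch == '\e'
            print(io, ch)
            state = :esc
        else
            w = textwidth(ch)
            width + w > limit && break
            print(io, ch)
            width += w
        end
    end
    return String(take!(io))
end

full  = truncate_to_width("expose\u0301", 6)       # keeps the accent: it has width 0
short = truncate_to_width("expose\u0301", 5)       # "expos"
red   = truncate_to_width("\e[31mhello\e[0m", 3)   # "\e[31mhel"
```

Because a combining mark has textwidth 0, it stays attached to the Char before it and is never split off by the limit. Note that anything after the cutoff is dropped, including a trailing reset sequence such as "\e[0m", so a real implementation would probably want to re-emit it.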

(I doubt that the interaction of ANSI escapes with Unicode characters is reliable. The ANSI escape sequences were standardized by ECMA-48, whose current 5th edition was released in 1991, the same year as Unicode 1.0.)
