Indexing Unicode Strings

Why can’t array indexing check for valid indices automatically when dealing with Unicode strings?
It would be nice if Unicode string usage were seamless.

https://github.com/JuliaLang/julia/issues/40974
https://github.com/JuliaLang/julia/issues/41033

https://docs.julialang.org/en/v1/manual/strings/#Unicode-and-UTF-8

What do you want str[n] to give you? The n-th character? In that case, string indexing would be O(n) since Julia uses UTF-8.
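
To make the byte-index point concrete, here is a small sketch (the string and variable names are only illustrative) of how indices into a UTF-8 `String` refer to bytes rather than characters, and how `nextind` finds the index of the n-th character by scanning:

```julia
s = "αβγ"                         # three codepoints, each 2 bytes in UTF-8

nbytes = ncodeunits(s)            # number of UTF-8 code units (bytes)
nchars = length(s)                # number of codepoints; needs a scan of the string
second_is_valid = isvalid(s, 2)   # byte 2 lies in the middle of 'α'
third_index = nextind(s, 0, 3)    # byte index where the 3rd character starts
third_char = s[third_index]       # the 3rd character itself

# Indexing at an invalid byte index throws an error instead of returning garbage.
threw = try
    s[2]
    false
catch err
    err isa StringIndexError
end

println((nbytes, nchars, second_is_valid, third_index, third_char, threw))
```

Here `s[3]` is the second character, not the third, which is why "the n-th character" cannot be a constant-time lookup.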

What is your use case? It is quite rare to want to get exactly the 532nd character for example.

One use case I actually had yesterday was that I wanted to clip strings to 10 characters for compact printing.

My solution in the end was to use @printf("%.10s", text) instead of print(text[1:10]).

So maybe more an example of “if it doesn’t work, maybe you are using it wrong” :-)

“Character” (= codepoint) doesn’t mean what you think in Unicode, and your solution for this is buggy. Some Unicode codepoints are 0 characters wide, and some are 2 characters wide. For example, "föó" is 5 codepoints, 2 of which are “modifier” characters of 0 width (which add accents to the preceding character). (Moreover, the canonically equivalent string "föó" has 3 codepoints! See here if this confuses you.)

A better solution would be to use textwidth, which measures the width of strings and characters (approximately, because in some cases this depends on the font and the terminal). For example:

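# Clip `s` so that its displayed width does not exceed `maxwidth` columns.
function clipwidth(s::AbstractString, maxwidth::Integer)
    width = 0
    for (i, c) in pairs(s)
        width += textwidth(c)  # 0 for combining marks, 2 for wide characters
        # stop just before the first character that would exceed the limit
        width > maxwidth && return s[1:prevind(s, i)]
    end
    return s  # the whole string fits
end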
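
As a rough illustration of what `textwidth` reports (the sample characters are my own choice, and actual terminal rendering may differ), combining marks count as zero columns and CJK characters as two:

```julia
w_ascii     = textwidth('a')        # an ordinary Latin letter
w_combining = textwidth('\u0308')   # combining diaeresis, a zero-width modifier
w_cjk       = textwidth('中')       # a wide East Asian character

s = "fo\u0308o\u0301"               # 'f', then 'o' + diaeresis, then 'o' + acute
w_string = textwidth(s)             # sum of the per-character widths
n_codepoints = length(s)            # more codepoints than columns

println((w_ascii, w_combining, w_cjk, w_string, n_codepoints))
```

The string is written with `\u` escapes so that the decomposed form survives copy and paste.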

In general, thinking of “character indices” in Unicode is very often a sign of misunderstanding Unicode, and in that sense Julia’s string indexing has the helpful side effect of catching a lot of bugs.

PS. That being said, you can get the index of the n-th character of a string s with nextind(s,0,n), so you can do text[1:nextind(text,0,n)] to obtain the first n Unicode characters (codepoints) if that is really what you want.

PPS. Also, I think @printf got it wrong here: julia#41068.

Just to add, a shorter way to do this is with first:

julia> text = "föó"^5
"föóföóföóföóföó"

julia> text[1:nextind(text,0,10)]
"föóföó"

julia> first(text, 10)
"föóföó"

I believe this solution will break for Arabic and other languages where the text width of a character depends on context (in fact, I believe code like this was responsible for an iOS crash). I'm not 100% sure how to work around it and get something actually correct, though.

You don’t have to go to languages like Arabic. Different terminals disagree on something as simple as the width of an emoji character. See the infamous issue “Julia doesn't like Pizza” (issue #3721 in JuliaLang/julia). That’s why I wrote that textwidth measures width only “approximately, because in some cases this depends on the font and the terminal”.

This is the best you can do in general without rendering the text yourself (not using the terminal), or perhaps by coding for a specific terminal (and font).

In the worst case this will simply display an unexpected width — it cannot cause a crash (unless there is a bug in the OS text-rendering software), because Julia neither knows nor cares how terminals actually display strings. Crashes typically come (in unsafe languages) from code that confuses string sizes (in bytes) for something else (codepoint counts, etcetera).

Thanks! I didn’t know there could be zero-width Unicode characters; I only knew that they could need more than one array index.

In my case I actually didn’t care much about the exact position where it would crop, just that it would crop it to something much smaller than 200 characters :-). And of course that it won’t crash like the text[1:10] version.

This nextind() doesn’t seem to do what I’d expect. I think we’d all agree that ó is a single glyph. I think the behavior that I’d expect when I index into a Unicode string is:

julia> text = "föó"^5
"föóföóföóföóföó"

julia> text[1:nextind(text,0,10)]
"föóföóföóf"

That is, to get the first 10 complete glyphs.

I think the justification for making a change to get this is the same reason you want straightforward indexing for roman character strings: it’s just a basic feature.
I love that Julia supports Unicode and I think this is just part of full featured Unicode support.

However, maybe a fast solution is not self-evident.
Also, is this notion of “complete glyph” an abstraction that breaks down in any cases? Perhaps a different abstraction is necessary.

As Steven said, there is a difference between "föó" and "föó":

julia> str1 = "föó"
"föó"

julia> str2 = "föó"
"föó"

julia> collect(str1)
5-element Vector{Char}:
 'f': ASCII/Unicode U+0066 (category Ll: Letter, lowercase)
 'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)
 '̈': Unicode U+0308 (category Mn: Mark, nonspacing)
 'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)
 '́': Unicode U+0301 (category Mn: Mark, nonspacing)

julia> collect(str2)
3-element Vector{Char}:
 'f': ASCII/Unicode U+0066 (category Ll: Letter, lowercase)
 'ö': Unicode U+00F6 (category Ll: Letter, lowercase)
 'ó': Unicode U+00F3 (category Ll: Letter, lowercase)

You can normalize the first case:

julia> using Unicode

julia> Unicode.normalize(str1) |> collect
3-element Vector{Char}:
 'f': ASCII/Unicode U+0066 (category Ll: Letter, lowercase)
 'ö': Unicode U+00F6 (category Ll: Letter, lowercase)
 'ó': Unicode U+00F3 (category Ll: Letter, lowercase)
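
Since the two forms look identical on screen, here is the same comparison sketched with explicit `\u` escapes (the variable names are just for illustration), so the difference does not depend on how the text was pasted:

```julia
using Unicode  # standard library; provides Unicode.normalize

decomposed = "fo\u0308o\u0301"   # 'o' followed by combining marks (5 codepoints)
composed   = "f\u00f6\u00f3"     # precomposed 'ö' and 'ó' (3 codepoints)

same_raw = decomposed == composed                             # compared codepoint by codepoint
same_nfc = Unicode.normalize(decomposed, :NFC) == composed    # compose, then compare
same_nfd = Unicode.normalize(composed, :NFD) == decomposed    # decompose, then compare

println((length(decomposed), length(composed), same_raw, same_nfc, same_nfd))
```

Normalizing both sides to the same form before comparing is the usual way to treat canonically equivalent strings as equal.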

There are at least four units of abstraction:

  • Code unit: the components of a Unicode encoding, e.g. bytes for UTF-8 or 16-bit words for UTF-16, accessed by codeunits(string) in Julia. There can be multiple code units per…
  • Codepoint: these are Unicode “characters”, corresponding to Char in Julia, the units of string iteration. These include things like combining characters that modify other characters, so there can be multiple codepoints per…
  • Grapheme: the smallest unit of a writing system. (e.g. a Latin character plus all accents and other modifiers). These are accessible by the graphemes(string) iterator in Julia’s Unicode package, which returns sub-strings because graphemes can contain an arbitrary number of codepoints. Nevertheless, there can be multiple graphemes in a single…
  • Glyph: a single on-screen symbol in typography. These depend on the font, which can contain ligatures that combine multiple graphemes into a single glyph.
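
The first three units in the list above can be counted directly in Julia; a minimal sketch on the decomposed example string (glyphs are left out because they depend on the font):

```julia
using Unicode  # standard library; provides `graphemes`

s = "fo\u0308o\u0301"   # 'f', 'o' + combining diaeresis, 'o' + combining acute

n_codeunits  = ncodeunits(s)          # UTF-8 bytes
n_codepoints = length(s)              # Chars, the units of string iteration
n_graphemes  = length(graphemes(s))   # user-perceived characters

println((n_codeunits, n_codepoints, n_graphemes))  # prints (7, 5, 3)
```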

Because glyphs depend on the font, you can’t count glyphs using string processing alone — you have to be rendering the text yourself and have low-level access to the text-rendering system, e.g. by calling something like Harfbuzz. The closest you can get to glyphs at the string level is to iterate graphemes.
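
For the "first 10 complete glyphs" request earlier in the thread, iterating graphemes might look like the sketch below. `firstgraphemes` is a hypothetical helper name, not something in Base or the Unicode standard library, and it takes linear time in the length of the prefix:

```julia
using Unicode  # standard library; provides `graphemes`

# Hypothetical helper: return the first `n` graphemes of `s` as a String.
firstgraphemes(s::AbstractString, n::Integer) =
    join(Iterators.take(graphemes(s), n))

text = "fo\u0308o\u0301"^5           # decomposed form, 25 codepoints in total
clipped = firstgraphemes(text, 10)   # three full repetitions plus a trailing 'f'

println(length(graphemes(clipped)))  # graphemes kept
println(length(clipped))             # codepoints kept
```

Unlike `first(text, 10)`, this never separates a base letter from its combining marks, though it still says nothing about how wide the result is on screen.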

Julia has excellent Unicode support, but Unicode is more complicated than most people realize.
