Indexing Unicode Strings

“Character” (= codepoint) doesn’t mean what you think in Unicode, and your solution for this is buggy. Some Unicode codepoints are 0 columns wide when displayed, and some are 2 columns wide. For example, "föó" (written here as f, o, U+0308, o, U+0301) is 5 codepoints, 2 of which are “combining” characters of 0 width (which add accents to the preceding character). (Moreover, the canonically equivalent string "föó", written with the precomposed letters U+00F6 and U+00F3, has only 3 codepoints! See here if this confuses you.)
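To make those counts concrete, here is a small sketch using Julia's Unicode standard library. The variable names are only for illustration, and the escapes spell out the decomposed form assumed above (o followed by U+0308, then o followed by U+0301).

```julia
using Unicode

# Decomposed form: base letters followed by combining marks.
decomposed = "fo\u0308o\u0301"
# Canonically equivalent precomposed form (NFC normalization).
composed = Unicode.normalize(decomposed, :NFC)

n_decomposed = length(decomposed)             # codepoints: 5
n_composed   = length(composed)               # codepoints: 3
n_graphemes  = length(graphemes(decomposed))  # user-perceived characters: 3
mark_width   = textwidth('\u0308')            # a combining mark occupies 0 columns
wide_width   = textwidth('日')                # an East Asian wide character occupies 2
```

All three counts describe the same visible text, which is why "the n-th character" is ambiguous until you say which unit you mean.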

A better solution would be to use textwidth, which measures the width of strings and characters (approximately, because in some cases this depends on the font and the terminal). For example:

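# Return the longest prefix of `s` whose display width is at most `maxwidth`.
function clipwidth(s::AbstractString, maxwidth::Integer)
    width = 0
    for (i,c) in pairs(s)   # i is the index of character c in s
        width += textwidth(c)   # 0 for combining marks, 2 for wide characters
        # c no longer fits: return everything before it
        width > maxwidth && return s[1:prevind(s, i)]
    end
    return s   # the whole string fits
end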
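As a rough usage sketch, here is how clipwidth behaves on the kinds of strings discussed above. The function is repeated only so that the snippet runs on its own, and the sample strings are illustrative assumptions, not taken from the original question.

```julia
# Same definition as above, repeated so this snippet is self-contained.
function clipwidth(s::AbstractString, maxwidth::Integer)
    width = 0
    for (i, c) in pairs(s)
        width += textwidth(c)
        width > maxwidth && return s[1:prevind(s, i)]
    end
    return s
end

# Combining marks have width 0, so they stay attached to their base letter.
accented = clipwidth("fo\u0308o\u0301", 2)   # "fö" in decomposed form (3 codepoints)

# Wide characters count as 2 columns each, so only two of them fit in 5 columns.
wide = clipwidth("日本語テキスト", 5)          # "日本"

# A string that already fits is returned unchanged.
short = clipwidth("abc", 10)                 # "abc"
```

Note that the clipped result can be narrower than maxwidth (here 4 columns instead of 5), since a wide character is never split.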

In general, thinking of “character indices” in Unicode is very often a sign of misunderstanding Unicode, and in that sense Julia’s string indexing has the helpful side effect of catching a lot of bugs.

PS. That being said, you can get the index of the n-th character (codepoint) of a string s with nextind(s,0,n), so you can do text[1:nextind(text,0,n)] to obtain the first n Unicode characters (codepoints), if that is really what you want.
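A minimal sketch of that indexing, with made-up sample strings. It also shows first(text, n) from Base, which I believe gives the same prefix, and why cutting by codepoints can still produce surprising results.

```julia
text = "αβγδε"             # each of these letters takes 2 bytes in UTF-8
n = 3
i = nextind(text, 0, n)    # index at which the n-th codepoint starts: 5
head = text[1:i]           # "αβγ"
same = first(text, n)      # same prefix via Base.first: "αβγ"

# Cutting by codepoints can separate a letter from its accent:
decomposed = "fo\u0308o\u0301"
cut = first(decomposed, 4) # f, o, U+0308, o: the last o has lost its acute accent
```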

PPS. Also, I think @printf got it wrong here: julia#41068.
