Indexing Unicode Strings

stevengj · June 3, 2021, 12:11pm

“Character” (= codepoint) doesn’t mean what you think in Unicode, and your solution for this is buggy. Some Unicode codepoints are 0 characters wide, and some are 2 characters wide. For example, "föó" is 5 codepoints, 2 of which are “modifier” characters of 0 width (which add accents to the preceding character). (Moreover, the canonically equivalent string "föó" has 3 codepoints! See here if this confuses you.)
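To make the counts concrete, here is a small sketch. The variable names are mine, and I'm writing the combining accents as explicit \u escapes so the decomposed form survives copy-paste; the NFC normalization step is just one way to obtain the 3-codepoint form.

```julia
using Unicode  # standard library: Unicode.normalize, Unicode.graphemes

# "föó" spelled out: f, o, U+0308 (combining diaeresis), o, U+0301 (combining acute)
decomposed = "fo\u0308o\u0301"

# NFC normalization composes each base letter with its accent where possible
composed = Unicode.normalize(decomposed, :NFC)

@show length(decomposed)                      # number of codepoints in the decomposed form
@show length(composed)                        # number of codepoints in the composed form
@show length(Unicode.graphemes(decomposed))   # user-perceived characters
@show textwidth(decomposed)                   # approximate display width in columns
@show decomposed == composed                  # == compares codepoints, not canonical equivalence
```

Both strings render the same, but only the grapheme count and the text width agree between them.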

A better solution would be to use textwidth, which measures the width of strings and characters (approximately, because in some cases this depends on the font and the terminal). For example:

function clipwidth(s::AbstractString, maxwidth::Integer)
    width = 0
    for (i,c) in pairs(s)  # i is the index at which character c starts
        width += textwidth(c)
        # once c no longer fits, return everything before it
        width > maxwidth && return s[1:prevind(s, i)]
    end
    return s
end
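A few sample calls may help show how this behaves. The function is repeated so the snippet runs on its own, and the inputs are just ones I picked for illustration: a string with zero-width combining accents and one with double-width characters.

```julia
# clipwidth as defined above, repeated so this snippet is self-contained
function clipwidth(s::AbstractString, maxwidth::Integer)
    width = 0
    for (i,c) in pairs(s)
        width += textwidth(c)
        width > maxwidth && return s[1:prevind(s, i)]
    end
    return s
end

accented = "fo\u0308o\u0301"   # 5 codepoints, 2 of them zero-width accents
wide = "日本語"                 # 3 codepoints, each typically 2 columns wide

a = clipwidth(accented, 2)   # keeps "f", "o" and the accent attached to that "o"
b = clipwidth(wide, 4)       # two double-width characters fit in 4 columns
c = clipwidth(wide, 3)       # the second character would overflow, so only one is kept
d = clipwidth("abc", 10)     # already short enough, returned unchanged
```

Note that a zero-width accent following the last character that fits is kept, because adding it does not increase the running width.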

In general, thinking of “character indices” in Unicode is very often a sign of misunderstanding Unicode, and in that sense Julia’s string indexing has the helpful side effect of catching a lot of bugs.

PS. That being said, you can get the index of the n-th character of a string s with nextind(s,0,n), so you can do text[1:nextind(text,0,n)] to obtain the first n Unicode characters (codepoints), if that is really what you want.
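For instance, with a string of two-byte characters (the sample string and variable names are my own choice for illustration):

```julia
text = "αβγδε"   # each Greek letter occupies 2 code units in UTF-8
n = 3

idx = nextind(text, 0, n)   # index at which the n-th character starts
prefix = text[1:idx]        # first n codepoints

# Base also has first(text, n), which returns the first n characters directly
prefix2 = first(text, n)

@show idx prefix prefix2
```

The range form throws if n is larger than the number of characters in the string, whereas first(text, n) simply returns the whole string in that case.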

PPS. Also, I think @printf got it wrong here: julia#41068.
