Separate from the issue of UTF-8 indexing (as in `findlast`) vs. “character counting” (as in `length`), you should also be aware that Unicode is more complicated than you think. For example:
julia> length("fübâr")
7
julia> collect("fübâr")
7-element Vector{Char}:
'f': ASCII/Unicode U+0066 (category Ll: Letter, lowercase)
'u': ASCII/Unicode U+0075 (category Ll: Letter, lowercase)
'̈': Unicode U+0308 (category Mn: Mark, nonspacing)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'̂': Unicode U+0302 (category Mn: Mark, nonspacing)
'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)
This has nothing to do with Julia or how it encodes strings. It’s because a “character” like ü might actually be represented by multiple Unicode codepoints (a `u` followed by a “combining accent” in this case).
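One way to see this is to normalize the string to NFC (“composed”) form with the `Unicode` standard library, which merges such sequences into precomposed codepoints where they exist:

```julia
julia> using Unicode

julia> length(Unicode.normalize("fübâr", :NFC))  # u + combining ̈ becomes a single ü, etc.
5
```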
See also my answer in a previous thread: Substring function? - #31 by stevengj
In practice, you mostly find indices into strings by searching, e.g. by doing a regex search for `r", *Km *[0-9]+$"` in this case, and then these complications are mostly hidden.
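For instance, here is a rough sketch of that approach (with a made-up string, assuming the “, Km 123” suffix format implied by the regex):

```julia
julia> str = "Čachtice, Km 45";

julia> m = match(r", *Km *[0-9]+$", str)
RegexMatch(", Km 45")

julia> str[1:prevind(str, m.offset)]  # everything before the match, via valid byte indices
"Čachtice"
```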
But this can be confusing when slicing strings “visually” for things you enter by hand. For that kind of work, the closest thing to a human-perceived “character” is what Unicode calls a “grapheme”, and Julia 1.9 should have a function to slice strings based on grapheme counts.
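As a rough sketch (assuming the planned `graphemes(s, m:n)` method lands in Julia 1.9), you can count and slice by graphemes with the `Unicode` standard library:

```julia
julia> using Unicode

julia> s = "fübâr";

julia> length(graphemes(s))  # 5 human-perceived characters, despite 7 codepoints
5

julia> graphemes(s, 2:4)     # slice out the 2nd through 4th graphemes (Julia 1.9+)
"übâ"
```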