Counting special characters ü, å, ø, etc

Separate from the issue of UTF-8 indexing (as in findlast) vs “character counting” (as in length), you should also be aware that Unicode is more complicated than you think. For example:

julia> length("fübâr")
7

julia> collect("fübâr")
7-element Vector{Char}:
 'f': ASCII/Unicode U+0066 (category Ll: Letter, lowercase)
 'u': ASCII/Unicode U+0075 (category Ll: Letter, lowercase)
 '̈': Unicode U+0308 (category Mn: Mark, nonspacing)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 '̂': Unicode U+0302 (category Mn: Mark, nonspacing)
 'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)

This has nothing to do with Julia or how it encodes strings. It’s because a “character” like ü might actually be represented by multiple Unicode codepoints (a u followed by a “combining accent” in this case).
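You can see this directly by normalizing the string: Julia’s `Unicode` standard library can canonically compose the combining sequences (NFC), after which the codepoint count drops. (A sketch using explicit `\u` escapes so the combining characters are unambiguous.)

```julia
using Unicode

s = "fu\u0308ba\u0302r"   # "fübâr" written with combining accents (U+0308, U+0302)
length(s)                  # 7 codepoints: the accents count separately

t = Unicode.normalize(s, :NFC)  # canonical composition: ü and â become single codepoints
length(t)                  # 5 codepoints
```

Note that `s` and `t` display identically but compare unequal with `==`; `Unicode.isequal_normalized` (or normalizing both sides) is the way to compare such strings.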

See also my answer in a previous thread: Substring function? - #31 by stevengj

In practice, you mostly find indices in strings by searching, e.g. by doing a regex search for r", *Km *[0-9]+$" in this case, and then these complications are mostly hidden from you.
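For example (with a made-up input string, assuming the r", *Km *[0-9]+$" pattern from above), `match` and `findfirst` hand you valid byte indices directly, so you never have to count characters yourself:

```julia
s = "Some Road, Km 27"        # hypothetical input ending in ", Km <number>"

m = match(r", *Km *[0-9]+$", s)
m.match                        # the matched substring: ", Km 27"
m.offset                       # byte index where the match begins

r = findfirst(r", *Km *[0-9]+$", s)  # the index range of the match
s[1:first(r)-1]                # everything before it: "Some Road"
```

Indices obtained this way are always valid string indices, even when the string contains multi-byte or combining characters.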

But it can be confusing when slicing strings “visually” for things you enter by hand. For working “visually” with a string, the closest thing to a human-perceived “character” is actually something called a “grapheme” in Unicode, and Julia 1.9 should have a function to slice strings based on grapheme counts.
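The `Unicode` standard library already exposes grapheme iteration via `graphemes`, and on Julia 1.9 or later there is a `graphemes(s, m:n)` method to slice by grapheme counts (a sketch, again using `\u` escapes for the combining accents):

```julia
using Unicode

s = "fu\u0308ba\u0302r"        # "fübâr" with combining accents: 7 codepoints
collect(graphemes(s))          # 5 human-perceived characters (grapheme clusters)

# Julia ≥ 1.9: slice by grapheme counts rather than codepoints or bytes
graphemes(s, 1:3)              # the first three graphemes: "füb"
```

Each grapheme here is a `SubString`, so the `ü` cluster keeps its `u` and its combining accent together.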
