Counting special characters ü, å, ø, etc

Separate from the issue of UTF-8 indexing (as in findlast) vs “character counting” (as in length), you should also be aware that Unicode is more complicated than you think. For example:

julia> length("fübâr")
7

julia> collect("fübâr")
7-element Vector{Char}:
 'f': ASCII/Unicode U+0066 (category Ll: Letter, lowercase)
 'u': ASCII/Unicode U+0075 (category Ll: Letter, lowercase)
 '̈': Unicode U+0308 (category Mn: Mark, nonspacing)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 '̂': Unicode U+0302 (category Mn: Mark, nonspacing)
 'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)

This has nothing to do with Julia or how it encodes strings. It’s because a “character” like ü might actually be represented by multiple Unicode codepoints (a u followed by a “combining accent” in this case).
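You can see this directly by normalizing the string: Julia’s `Unicode` standard library can canonically compose the combining sequences (NFC), after which the codepoint count drops. (A sketch using explicit `\u` escapes so the combining characters are unambiguous.)

```julia
using Unicode

s = "fu\u0308ba\u0302r"   # "fübâr" written with combining accents (U+0308, U+0302)
length(s)                  # 7 codepoints: the accents count separately

t = Unicode.normalize(s, :NFC)  # canonical composition: ü and â become single codepoints
length(t)                  # 5 codepoints
```

Note that `s` and `t` display identically but compare unequal with `==`; `Unicode.isequal_normalized` (or normalizing both sides) is the way to compare such strings.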

See also my answer in a previous thread: Substring function? - #31 by stevengj

In practice, you mostly find indices in strings by searching, e.g. by doing a regex search for r", *Km *[0-9]+$" in this case, and then these complications are mostly hidden from you.
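For example (with a made-up input string, assuming the r", *Km *[0-9]+$" pattern from above), `match` and `findfirst` hand you valid byte indices directly, so you never have to count characters yourself:

```julia
s = "Some Road, Km 27"        # hypothetical input ending in ", Km <number>"

m = match(r", *Km *[0-9]+$", s)
m.match                        # the matched substring: ", Km 27"
m.offset                       # byte index where the match begins

r = findfirst(r", *Km *[0-9]+$", s)  # the index range of the match
s[1:first(r)-1]                # everything before it: "Some Road"
```

Indices obtained this way are always valid string indices, even when the string contains multi-byte or combining characters.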

But it can be confusing when slicing strings “visually” for things you enter by hand. For working “visually” with a string, the closest thing to a human-perceived “character” is actually something called a “grapheme” in Unicode, and Julia 1.9 should have a function to slice strings based on grapheme counts.
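The `Unicode` standard library already exposes grapheme iteration via `graphemes`, and on Julia 1.9 or later there is a `graphemes(s, m:n)` method to slice by grapheme counts (a sketch, again using `\u` escapes for the combining accents):

```julia
using Unicode

s = "fu\u0308ba\u0302r"        # "fübâr" with combining accents: 7 codepoints
collect(graphemes(s))          # 5 human-perceived characters (grapheme clusters)

# Julia ≥ 1.9: slice by grapheme counts rather than codepoints or bytes
graphemes(s, 1:3)              # the first three graphemes: "füb"
```

Each grapheme here is a `SubString`, so the `ü` cluster keeps its `u` and its combining accent together.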
