Counting special characters ü, å, ø, etc

Hi, I have some strings with a name and a distance, such as "Haugastøl, Km 0". I want to remove the km point with the chop() function, but what’s interesting is that chop() and length() count ø as one character, whereas findlast() does not.

julia> name = "Haugastøl"
"Haugastøl"

julia> length(name)
9

julia> findlast('l',name)
10

I don’t know if these results are expected. Maybe findlast() should count special characters as one?

The reason is that Julia strings are UTF-8 encoded, so a position in a string can be expressed either as a byte index or as a character count. Some functions use the first approach (findlast() and indexing) and some the second (length() and chop()). This is explained here in the Julia manual and additionally here in my blog. If some of the explanations are not clear, please comment and I can expand on them.
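To make the two counting schemes concrete, here is a minimal sketch using the string from the question. It only uses functions from Base; the comments show the values I would expect for this particular string.

```julia
name = "Haugastøl"

# Character count: ø is one character.
n_chars = length(name)            # 9

# Byte count (UTF-8 code units): ø occupies two bytes.
n_bytes = ncodeunits(name)        # 10

# Valid indices are byte positions where a character starts.
# Index 9 is missing because it falls in the middle of ø.
valid = collect(eachindex(name))  # [1, 2, 3, 4, 5, 6, 7, 8, 10]

# Search functions return byte indices, which is why 'l' is found at 10.
i = findlast('l', name)           # 10

# nextind steps from one valid index to the next one.
after_o = nextind(name, 8)        # 10
```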

Note what happens when you try to access the 9th element with the Array interface:

julia> name[8]
'ø': Unicode U+00F8 (category Ll: Letter, lowercase)

julia> name[9]
ERROR: StringIndexError: invalid index [9], valid nearby indices [8]=>'ø', [10]=>'l'
Stacktrace:
 [1] getindex(s::String, i::Int64)
   @ Base ./strings/string.jl:226
 [2] top-level scope
   @ REPL[14]:1

julia> name[10]
'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)

(actually my blog just reminded me that there is even a third option which is “number of characters displayed” which can be different from the two basic ones I have listed)

I’m certain this could be done more efficiently but here is a method which returns the number you expected

julia> findfirst(==(findfirst('l', name)), collect(eachindex(name)))
9
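A possibly cheaper way to get the same number, as a sketch: the three-argument form length(s, i, j) counts the characters between two byte indices, so no intermediate array is needed.

```julia
name = "Haugastøl"

# Byte index of the last 'l' (10 for this string).
i = findlast('l', name)

# Number of characters from byte index 1 through byte index i,
# i.e. the character position of that 'l'.
char_pos = length(name, 1, i)   # 9
```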

Thank you for your quick responses!

@josuagrw I was exactly wondering what would be the result for the 9th index :slight_smile:
Thanks for your solution, I will try it.

@etas - going back to your original question: do you know how to do what you wanted, or do you need a solution? If it is the latter, could you please define precisely what you need, so that an efficient solution can be proposed. Thank you!

If you want to do this sort of thing many times for the same string, look up the indexin() function.

I asked the question above because most likely what @etas needs is Regex matching, but I need to understand exactly what pattern should be identified (most likely not the l character, as that is specific to one given string and will not work in general).

Could also be done with type matching:

julia> foreach(display, name)
'H': ASCII/Unicode U+0048 (category Lu: Letter, uppercase)
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'u': ASCII/Unicode U+0075 (category Ll: Letter, lowercase)
'g': ASCII/Unicode U+0067 (category Ll: Letter, lowercase)
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
's': ASCII/Unicode U+0073 (category Ll: Letter, lowercase)
't': ASCII/Unicode U+0074 (category Ll: Letter, lowercase)
'ø': Unicode U+00F8 (category Ll: Letter, lowercase)
'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)

Separate from the issue of UTF-8 indexing (as in findlast) vs “character counting” (as in length), you should also be aware that Unicode is more complicated than you think. For example:

julia> length("fübâr")
7

julia> collect("fübâr")
7-element Vector{Char}:
 'f': ASCII/Unicode U+0066 (category Ll: Letter, lowercase)
 'u': ASCII/Unicode U+0075 (category Ll: Letter, lowercase)
 '̈': Unicode U+0308 (category Mn: Mark, nonspacing)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 '̂': Unicode U+0302 (category Mn: Mark, nonspacing)
 'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)

This has nothing to do with Julia or how it encodes strings. It’s because a “character” like ü might actually be represented by multiple Unicode codepoints (a u followed by a “combining accent” in this case).

See also my answer in a previous thread: Substring function? - #31 by stevengj

In practice, you mostly find indices in strings by searching, e.g. by doing a regex search for r", *Km *[0-9]+$" in this case, in which case these complications are mostly hidden.
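As a rough sketch of the search-based approach, assuming every string ends with a suffix of the form ", Km <digits>" like the example in the question: replace() with that regex removes the suffix without any manual index arithmetic. The second variant shows that the byte indices returned by a search are exactly what indexing expects, so they can be used for slicing directly.

```julia
s = "Haugastøl, Km 0"

# Variant 1: delete the trailing ", Km <digits>" part with a regex.
cleaned = replace(s, r", *Km *[0-9]+$" => "")   # "Haugastøl"

# Variant 2: search for the literal substring. findlast returns a range of
# byte indices (or nothing if there is no match, which this sketch ignores).
r = findlast(", Km", s)                         # 11:14

# prevind gives the valid index just before the match, so the slice ends
# on a character boundary even when the name contains non-ASCII letters.
prefix = s[1:prevind(s, first(r))]              # "Haugastøl"
```

If some strings might not contain the suffix, variant 1 simply returns them unchanged, whereas variant 2 would need a check for nothing first.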

But it can be confusing when slicing strings “visually” for things you enter by hand. For working “visually” with a string, the closest thing to a human-perceived “character” is actually something called a “grapheme” in Unicode, and Julia 1.9 should have a function to slice strings based on grapheme counts.
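A small sketch of grapheme-based counting with the Unicode standard library. The string is written with explicit escapes so that it is guaranteed to contain the combining accents; the range form graphemes(s, m:n) is the slicing function mentioned above and, as far as I know, requires Julia 1.9 or later.

```julia
using Unicode

# "fübâr" spelled with combining marks: u + U+0308 and a + U+0302.
s = "fu\u0308ba\u0302r"

# Codepoint count: the two combining marks are counted separately.
n_codepoints = length(s)                      # 7

# Grapheme count: what a reader would call "characters".
n_graphemes = length(graphemes(s))            # 5

# NFC normalization merges each letter with its accent into one codepoint.
n_nfc = length(Unicode.normalize(s, :NFC))    # 5

# Slice by grapheme positions (Julia 1.9+): the first two graphemes, "fü".
first_two = graphemes(s, 1:2)
```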

I completely forgot to use Regex! I will try that, since then we don’t need to know which counting method each function uses.
My strings come from a file and I want to remove the last part, beginning with ", Km". Regex can definitely do the job without complications.

Thank you @bkamins and @stevengj !