Counting special characters ü, å, ø, etc

etas · April 1, 2022, 9:49am

Hi, I have some strings with a name and a distance, such as "Haugastøl, Km 0". I want to remove the km marker with the chop() function, but interestingly chop() and length() count ø as one character while findlast() does not.

julia> name = "Haugastøl"
"Haugastøl"

julia> length(name)
9

julia> findlast('l',name)
10

I don’t know if these results are normal. Maybe findlast() should count special characters as one?

bkamins · April 1, 2022, 9:59am

The reason is that Julia strings are UTF-8 encoded, so a position in a string can be given either as a byte index or as a character count. Some functions use the first approach and some the second. This is explained here in the Julia manual and additionally here on my blog. If any of the explanations are unclear, please comment and I can expand on them.
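A minimal sketch of the two ways of counting, using the string from the question. It only uses Base functions (length, ncodeunits, eachindex, isvalid, nextind); the variable names are just for illustration.

```julia
name = "Haugastøl"

n_chars = length(name)          # 9: counts characters
n_bytes = ncodeunits(name)      # 10: counts UTF-8 code units (bytes); ø takes two

# Valid string indices are byte positions where a character starts.
valid = collect(eachindex(name))    # [1, 2, 3, 4, 5, 6, 7, 8, 10]

# Index 9 falls in the middle of ø, so it is not a valid index.
middle_ok = isvalid(name, 9)        # false

# nextind steps from one valid index to the next, skipping over the second byte of ø.
after_ø = nextind(name, 8)          # 10

# Search functions return byte indices, which is why 'l' is reported at 10.
pos_l = findlast('l', name)         # 10
```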

josuagrw · April 1, 2022, 10:00am

Note what happens when you try to access the 9th element with the Array interface:

julia> name[8]
'ø': Unicode U+00F8 (category Ll: Letter, lowercase)

julia> name[9]
ERROR: StringIndexError: invalid index [9], valid nearby indices [8]=>'ø', [10]=>'l'
Stacktrace:
 [1] getindex(s::String, i::Int64)
   @ Base ./strings/string.jl:226
 [2] top-level scope
   @ REPL[14]:1

julia> name[10]
'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)

bkamins · April 1, 2022, 10:01am

(actually my blog just reminded me that there is even a third option which is “number of characters displayed” which can be different from the two basic ones I have listed)

josuagrw · April 1, 2022, 10:13am

I’m certain this could be done more efficiently, but here is a method which returns the number you expected:

julia> findfirst(==(findfirst('l', name)), collect(eachindex(name)))
9
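One possibly cheaper alternative, offered as a sketch rather than a benchmarked answer: the three-argument form length(s, i, j) counts the characters between two byte indices without building an intermediate array. The byte index returned by a search can also be used directly for slicing, with prevind to step back one character. The variable names below are illustrative only.

```julia
name = "Haugastøl"

# Character position of 'l': count the characters from index 1 up to its byte index.
char_pos = length(name, 1, findfirst('l', name))    # 9

# Byte indices from search functions are valid for slicing as they are.
s = "Haugastøl, Km 0"
comma = findlast(',', s)            # byte index of the comma (11)
prefix = s[1:prevind(s, comma)]     # everything before the comma: "Haugastøl"

# first(s, n) takes a character count instead of a byte index.
same_prefix = first(s, 9)           # "Haugastøl"
```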

etas · April 1, 2022, 10:17am

Thank you for your quick responses!

@josuagrw I was wondering exactly what the result would be for the 9th index.
Thanks for your solution, I will try it.

bkamins · April 1, 2022, 10:24am

@etas - going back to your original question: do you know how to do what you wanted, or do you need a solution? If it is the latter, could you please define precisely what you need, so that an efficient solution can be proposed? Thank you!

josuagrw · April 1, 2022, 10:33am

If you want to do this sort of thing many times for the same string, look up the indexin() function.

bkamins · April 1, 2022, 10:36am

I asked the question above because most likely what @etas needs is regex matching, but I need to understand exactly which pattern should be identified (most likely not finding the l character, as that is specific to one given string and will not work in general).

josuagrw · April 1, 2022, 10:36am

Could also be done with type matching:

julia> foreach(display, name)
'H': ASCII/Unicode U+0048 (category Lu: Letter, uppercase)
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'u': ASCII/Unicode U+0075 (category Ll: Letter, lowercase)
'g': ASCII/Unicode U+0067 (category Ll: Letter, lowercase)
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
's': ASCII/Unicode U+0073 (category Ll: Letter, lowercase)
't': ASCII/Unicode U+0074 (category Ll: Letter, lowercase)
'ø': Unicode U+00F8 (category Ll: Letter, lowercase)
'l': ASCII/Unicode U+006C (category Ll: Letter, lowercase)

stevengj · April 1, 2022, 12:39pm

Separate from the issue of UTF-8 indexing (as in findlast) vs “character counting” (as in length), you should also be aware that Unicode is more complicated than you think. For example:

julia> length("fübâr")
7

julia> collect("fübâr")
7-element Vector{Char}:
 'f': ASCII/Unicode U+0066 (category Ll: Letter, lowercase)
 'u': ASCII/Unicode U+0075 (category Ll: Letter, lowercase)
 '̈': Unicode U+0308 (category Mn: Mark, nonspacing)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 '̂': Unicode U+0302 (category Mn: Mark, nonspacing)
 'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)

This has nothing to do with Julia or how it encodes strings. It’s because a “character” like ü might actually be represented by multiple Unicode codepoints (a u followed by a “combining accent” in this case).

See also my answer in a previous thread: Substring function? - #31 by stevengj

In practice, you mostly find indices in strings by searching, e.g. by doing a regex search for r", *Km *[0-9]+$" in this case, in which case these complications are mostly hidden.
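As a rough sketch of that regex approach: replace with the pattern above removes the trailing distance without any manual index arithmetic. The helper name strip_km and the second and third test strings are made up for illustration.

```julia
# Pattern from the post: a comma, optional spaces, "Km", optional spaces, digits, end of string.
const KM_SUFFIX = r", *Km *[0-9]+$"

# Hypothetical helper: drop the trailing ", Km <number>" part if it is present.
strip_km(s::AbstractString) = replace(s, KM_SUFFIX => "")

a = strip_km("Haugastøl, Km 0")     # "Haugastøl"
b = strip_km("Finse, Km 27")        # "Finse" (made-up example input)
c = strip_km("No distance here")    # unchanged, since the pattern does not match
```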

But it can be confusing when slicing strings “visually” for things you enter by hand. For working “visually” with a string, the closest thing to a human-perceived “character” is actually something called a “grapheme” in Unicode, and Julia 1.9 should have a function to slice strings based on grapheme counts.
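A small sketch of the difference between codepoints and graphemes, using the Unicode standard library. The string is written with explicit escapes for the combining marks, since copying and pasting text may change how accented characters are encoded. Unicode.normalize with :NFC is shown as one way to compose the marks where a precomposed character exists; it is not a general substitute for grapheme handling.

```julia
using Unicode

# "fübâr" in decomposed form: u + combining diaeresis, a + combining circumflex.
s = "fu\u0308ba\u0302r"

n_codepoints = length(s)                # 7: each combining mark counts separately
n_graphemes = length(graphemes(s))      # 5: what a reader perceives as characters

# NFC normalization composes each base letter and its mark into one codepoint.
t = Unicode.normalize(s, :NFC)
n_composed = length(t)                  # 5

# Taking the first two graphemes keeps each combining mark attached to its letter.
first_two = join(Iterators.take(graphemes(s), 2))   # "fü", still 3 codepoints long
```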

etas · April 1, 2022, 1:25pm

I completely forgot to use regex! I will try that, since then we don’t need to know which counting method each function uses.
My strings come from a file and I want to remove the last part, beginning with ", Km". Regex can definitely do the job without complications.

Thank you @bkamins and @stevengj!
