Hi, I have some strings with a name and a distance, such as "Haugastøl, Km 0". I want to remove the km point with the chop() function, but interestingly chop() and length() count ø as one character while findlast() does not.
The reason is that in Julia strings are UTF-8 encoded, and you can index into them using either a byte count or a character count. Some functions use the first approach (findlast() returns a byte index) and some the second (length() and chop() count characters). This is explained here in the Julia manual and additionally here in my blog. If some of the explanations are not clear, please comment and I can expand on them.
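To make the difference concrete, here is a small sketch using the string from the question. The variable names are mine, and the chop(s, tail = ...) call is only a guess at what the original attempt looked like:

```julia
s = "Haugastøl, Km 0"

n_chars = length(s)        # 15: counts characters
n_bytes = ncodeunits(s)    # 16: counts bytes, ø takes two in UTF-8
i_l     = findlast('l', s) # 10: a byte index, although 'l' is the 9th character

# Mixing the two units removes one character too few:
wrong = chop(s, tail = n_chars - i_l)                    # "Haugastøl,"

# Staying in byte indices works, because i_l is a valid index:
right = s[1:i_l]                                         # "Haugastøl"

# Alternatively, convert the byte index into a character count first:
also_right = chop(s, tail = n_chars - length(s, 1, i_l)) # "Haugastøl"
```

The three-argument length(s, i, j) counts the characters between byte indices i and j, which is one way to bridge the two conventions.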
(Actually, my blog just reminded me that there is even a third option, the “number of characters displayed”, which can differ from the two basic ones I have listed.)
@etas - going back to your original question: do you know how to do what you wanted, or do you need a solution? If it is the latter, could you please define precisely what you need, so that an efficient solution can be proposed? Thank you!
I asked the question above because most likely what @etas needs is regex matching, but I need to understand exactly which pattern should be identified (most likely not finding the l character, since that is specific to one given string and will not work in general).
Separate from the issue of UTF-8 indexing (as in findlast) vs “character counting” (as in length), you should also be aware that Unicode is more complicated than you think. For example:
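A minimal sketch of the kind of case meant here, assuming the decomposed spelling of ü (a plain u followed by the combining diaeresis U+0308); the variable names are only illustrative:

```julia
using Unicode

composed   = "\u00fc"   # ü stored as a single code point
decomposed = "u\u0308"  # u followed by a combining diaeresis; renders the same

len_composed   = length(composed)    # 1
len_decomposed = length(decomposed)  # 2, although one "character" is displayed

same_raw = composed == decomposed                           # false
same_nfc = Unicode.normalize(decomposed, :NFC) == composed  # true
```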
This has nothing to do with Julia or how it encodes strings. It’s because a “character” like ü might actually be represented by multiple Unicode codepoints (a u followed by a “combining accent” in this case).
In practice, you mostly find indices in strings by searching, e.g. by doing a regex search for r", *Km *[0-9]+$" here, and then these complications are mostly hidden.
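As a rough sketch of how that regex could be used, assuming the goal is simply to drop the trailing ", Km <number>" part (strip_km is a hypothetical helper name, not something from Base):

```julia
# Remove a trailing ", Km <digits>" suffix; strings without it are returned unchanged.
strip_km(s) = replace(s, r", *Km *[0-9]+$" => "")

name = strip_km("Haugastøl, Km 0")   # "Haugastøl"

# If the index itself is needed, match() reports it as a byte offset,
# which is directly usable for slicing:
s = "Haugastøl, Km 0"
m = match(r", *Km *[0-9]+$", s)
prefix = s[1:prevind(s, m.offset)]   # "Haugastøl"
```

Because the search itself returns a valid byte index, there is no need to count characters by hand.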
But it can be confusing when slicing strings “visually” for things you enter by hand. For working “visually” with a string, the closest thing to a human-perceived “character” is actually something called a “grapheme” in Unicode, and Julia 1.9 should have a function to slice strings based on grapheme counts.
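For illustration, a small sketch of counting and slicing by graphemes with the Unicode standard library. The range form graphemes(s, m:n) requires Julia 1.9 or later, and the example string is my own:

```julia
using Unicode

s = "u\u0308ber"   # "über" written with a decomposed ü

n_codepoints = length(s)             # 5: u, combining diaeresis, b, e, r
n_graphemes  = length(graphemes(s))  # 4: what a reader would count

first_two = graphemes(s, 1:2)        # "üb", sliced by grapheme position
```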
I completely forgot to use Regex! I will try that, since then we don’t need to know which counting method each function uses.
My strings come from a file and I want to remove the last part, beginning with ", Km". Regex can definitely do the job without complications.