Yes, that’s the problem.
You were probably trying to test things out by coming up with your indices “visually” on some test strings, and noticed that the indices did not coincide with your visual expectation for Unicode strings. But this would be a problem even if Julia used vectors of codepoints, as my `"ü"` example showed.
For example, if I gave you the string `"ü,é,â,ỳ"` and asked you to cut out the 3rd and 4th comma-separated fields, analogous to `cut -d, -f3-4`, what “characters” (codepoints) do you think that corresponds to? Probably you would guess “the 5th to 7th characters”. But no, look what your `substring` function returns:
```julia
julia> substring("ü,é,â,ỳ", 5, 7)
"́,a"
```
Whoops, is your `substring` function buggy? No, it’s just that “codepoints” in Unicode don’t necessarily correspond to what a human reader thinks of as a “character”.
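To make the mismatch concrete, here is a small sketch. It assumes the accented letters are in decomposed (NFD) form, i.e. a base letter followed by a combining accent, which is what makes each of them two codepoints; the string is written with explicit escapes so the form is unambiguous:

```julia
using Unicode  # for graphemes

# the same string, written with explicit combining accents (NFD form)
s = "u\u0308,e\u0301,a\u0302,y\u0300"   # displays as "ü,é,â,ỳ"

length(s)              # 11 codepoints
length(graphemes(s))   # 7 user-perceived characters
join(collect(s)[5:7])  # "́,a" — codepoints 5–7 grab é's accent but not â's
```

Slicing codepoints 5–7 splits the `é` (taking only its combining accent) and the `â` (taking only its base `a`), which is exactly the garbled result above.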
Whereas if you actually implemented a `cut` function, you would do a sequence of searches for the delimiter (e.g. with `findnext`), which would yield a sequence of string indices (≠ codepoint counts), slicing would work just fine, and it would be absolutely irrelevant how many codepoints occurred between one delimiter and the next.
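For illustration, here is a minimal sketch of such a `cut`-style function (`cutfields` is a hypothetical name, not anything in Base; error handling is omitted). It works entirely with string indices returned by `findnext` and stepped by `nextind`/`prevind`:

```julia
# Hypothetical sketch: extract comma (or other) delimited fields `first:last`
# by searching for the delimiter, never counting codepoints.
function cutfields(s::AbstractString, delim::Char, first::Int, last::Int)
    start = firstindex(s)
    field = 1
    # advance `start` past the first `first - 1` delimiters
    while field < first
        r = findnext(delim, s, start)
        r === nothing && return ""   # fewer fields than requested
        start = nextind(s, r)
        field += 1
    end
    stop = start
    # scan forward to the end of field `last`
    while true
        r = findnext(delim, s, stop)
        if r === nothing
            return s[start:end]            # last requested field runs to the end
        elseif field == last
            return s[start:prevind(s, r)]  # slice up to just before the delimiter
        end
        stop = nextind(s, r)
        field += 1
    end
end

cutfields("ü,é,â,ỳ", ',', 3, 4)  # "â,ỳ"
```

Because every index here comes from `findnext` and is stepped with `nextind`/`prevind`, the slicing is always valid, and the number of codepoints (or bytes) inside each field never matters.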
But because you hadn’t gotten that far, you jumped to the conclusion that Julia’s string handling is broken and we are missing extremely basic functionality like extracting substrings.
(The UTF-8 encoding that Julia employs is not unusual! It’s taking over most of the internet, it’s used in other modern languages like Go and Swift, and it’s been the subject of many, many discussions and revisions in Julia itself. This is not something we picked out of a hat because we hadn’t thought through basic functionality.)