Substring function?

stevengj · February 19, 2022, 2:28pm

Yes, that’s the problem.

You were probably trying to test things out by coming up with your indices “visually” on some test strings, and noticed that the indices did not coincide with your visual expectation for Unicode strings. But this would be a problem even if Julia used vectors of codepoints, as my "ü" example showed.

For example, if I gave you the string "ü,é,â,ỳ" and asked you to cut out the 3rd and 4th comma-separated fields, analogous to cut -d, -f3-4, what “characters” (codepoints) do you think that corresponds to? Probably you would guess “the 5th to 7th characters”. But no, look what your substring function returns:

julia> substring("ü,é,â,ỳ", 5,7)
"́,a"

Whoops, is your substring function buggy? No, it’s just that “codepoints” in Unicode don’t necessarily correspond to what a human reader thinks of as a “character”.
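To make the mismatch concrete, here is a minimal sketch that counts the same string both ways. It assumes the example string was typed with combining accents (which is what the output above implies), so it spells them out as escapes. `substring_by_codepoint` is a hypothetical stand-in for the function discussed in this thread, not something in Base.

```julia
using Unicode  # standard library, provides graphemes

# The example string with explicit combining accents (decomposed form).
# Assumption: this matches how the string above was typed.
s = "u\u0308,e\u0301,a\u0302,y\u0300"

# Hypothetical codepoint-based substring, standing in for the one in this thread.
substring_by_codepoint(str, i, j) = join(collect(str)[i:j])

ncodepoints = length(s)              # number of codepoints (Chars)
ngraphemes  = length(graphemes(s))   # number of user-perceived characters

# Codepoints 5 to 7: a bare combining accent, a comma, and an unaccented 'a'.
piece = substring_by_codepoint(s, 5, 7)
```

The two counts differ because each accented letter here is two codepoints but a single grapheme, so a codepoint offset chosen by eye lands in the wrong place.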

By contrast, if you actually implemented a cut function, you would do a sequence of searches for the delimiter (e.g. with findnext), which would yield a sequence of string indices (≠ codepoint counts). Slicing with those indices would work just fine, and it would be absolutely irrelevant how many codepoints occurred between one delimiter and the next.
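A rough sketch of that approach might look like the following. `cut_fields` is a hypothetical name, and the single-character delimiter and the range argument are simplifying assumptions; it is meant to illustrate the search-then-slice idea, not to be a full `cut` replacement.

```julia
# Hypothetical helper, loosely analogous to `cut -d, -f3-4`.
# Assumes a single-character delimiter and a non-empty range of field numbers.
function cut_fields(s::AbstractString, delim::AbstractChar, fields::UnitRange{Int})
    # Collect the string index of every delimiter by repeated searching.
    delims = Int[]
    i = findnext(==(delim), s, firstindex(s))
    while i !== nothing
        push!(delims, i)
        i = findnext(==(delim), s, nextind(s, i))
    end

    nfields = length(delims) + 1
    (1 <= first(fields) <= last(fields) <= nfields) ||
        throw(ArgumentError("fields out of range"))

    # Field boundaries are string indices next to the delimiters;
    # no codepoints are ever counted.
    start = first(fields) == 1 ? firstindex(s) : nextind(s, delims[first(fields) - 1])
    stop  = last(fields) == nfields ? lastindex(s) : prevind(s, delims[last(fields)])
    return SubString(s, start, stop)
end

# Same example string, written with explicit combining accents.
s = "u\u0308,e\u0301,a\u0302,y\u0300"
result = cut_fields(s, ',', 3:4)
```

Because the indices come straight from the searches, the same call works whether the accented letters are stored as one codepoint or two.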

But because you hadn’t gotten that far, you jumped to the conclusion that Julia’s string handling is broken and we are missing extremely basic functionality like extracting substrings.

(The UTF-8 encoding that Julia employs is not unusual! It’s taking over most of the internet, it’s used in other modern languages like Go and Swift, and it’s been the subject of many, many discussions and revisions in Julia itself. This is not something we picked out of a hat because we hadn’t thought through basic functionality.)
