Just thought a substring function would be useful out of the box, with documentation about how it differs from the string[start:stop] form in terms of Unicode and performance.
Slicing a[m:n] always makes a copy in Julia (at least with the built-in types), whether for arrays or strings. If you want a view instead (i.e. a SubString object), the easiest way is to use @views on a block of code, e.g.
julia> s = "αβγł€đŧŧŋ"
"αβγł€đŧŧŋ"
julia> @views s[1:5]
"αβγ"
julia> typeof(ans)
SubString{String}
Slicing with @views works just fine for this.
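To make the copy-versus-view difference concrete, here is a minimal sketch using the same string as above. The variable names are only for illustration; SubString and view are the Base functions that @views ends up using for a string slice.

```julia
s = "αβγł€đŧŧŋ"

copied   = s[1:5]              # plain slicing allocates a new String
viewed   = @views s[1:5]       # same slice, but as a view into s
explicit = SubString(s, 1, 5)  # the same view, constructed directly
viaview  = view(s, 1:5)        # view() also works on strings

println(typeof(copied))   # String
println(typeof(viewed))   # SubString{String}
println(copied == viewed) # true: same contents, different storage
```

Note that 1 and 5 here are string indices (code units), not character counts: 5 is where γ starts.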
The real question is, where are you getting these character indices that you want to pass to your substring function? Usually you get indices to a substring from some previous iteration over the string, either from your own loop or from something like a findnext call, and these give you code unit indices that you can pass to s[m:n] directly.
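As a rough illustration of that point, the sketch below gets indices from a findnext call and from iterating over the string, then slices with them directly. The search pattern and variable names are made up for the example.

```julia
s = "αβγł€đŧŧŋ"

# findnext returns a range of string indices (or nothing if there is no match).
hit = findnext("βγ", s, 1)
found = hit === nothing ? "" : s[hit]
println(found)                 # βγ

# Indices collected while iterating are valid slice bounds too.
idx = collect(eachindex(s))    # string index of each character
firstthree = s[idx[1]:idx[3]]
println(firstthree)            # αβγ

# Not every integer is a valid string index: 2 falls inside the encoding of α.
println(isvalid(s, 2))         # false
```

No character counting is needed in either case, because the indices already point at the right code units.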
If you are counting codepoints as “characters”, e.g. you want the “first 3 characters” in a string, then the odds are high that you are making a mistake. For example, "ü" written as "u\u0308" is two codepoints (length("u\u0308") == 2) because it is u followed by the combining character U+0308. See also this explanation: Myth: Counting coded characters or code points is important.
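A small sketch of why codepoint counts are misleading, using the Unicode standard library. The string is written with an explicit escape so that the combining form is unambiguous.

```julia
using Unicode

s = "u\u0308"   # u followed by U+0308 (combining diaeresis), rendered as ü

ncodepoints = length(s)              # counts code points
nbytes      = ncodeunits(s)          # counts UTF-8 code units
ngraphemes  = length(graphemes(s))   # counts user-perceived characters
firstchar   = first(s, 1)            # "first character" by code point
composed    = Unicode.normalize(s, :NFC)

println(ncodepoints)       # 2
println(ngraphemes)        # 1
println(firstchar)         # u (the diaeresis has been cut off)
println(length(composed))  # 1 (NFC yields the precomposed U+00FC)
```

Taking the "first character" by codepoint silently changes what the reader sees, which is the kind of bug the paragraph above is warning about.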
Because of Unicode’s complexity, wanting a substring from the m-th codepoint (“character”) to the n-th codepoint, as opposed to between two string indices (= code units), is actually an extremely uncommon operation (in non-buggy code). This is why it’s not built-in.
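If you do decide you need it anyway, a codepoint-based substring can be sketched in a few lines on top of nextind. The name codepoint_substring is hypothetical (nothing in Base is called that), and the sketch assumes 1 <= m <= n <= length(s).

```julia
# Hypothetical helper: a view of the m-th through n-th code points of s.
# Assumes 1 <= m <= n <= length(s); each nextind call scans from the start.
function codepoint_substring(s::AbstractString, m::Integer, n::Integer)
    i = nextind(s, 0, m)   # string index of the m-th code point
    j = nextind(s, 0, n)   # string index of the n-th code point
    return SubString(s, i, j)
end

println(codepoint_substring("αβγł€đŧŧŋ", 2, 4))   # βγł

# It still counts code points, so it will happily split a combining sequence:
println(codepoint_substring("u\u0308x", 1, 1))    # u
```

Unlike s[m:n], this has to walk the string to find each index, and as the second call shows, it has the same "first character" hazard as any other codepoint counting.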