String slicing

I know that a lot has been written about safely slicing UTF-8 strings, where a single character can occupy multiple code units (bytes). We have thisind, prevind, nextind and searching as ways to get valid indices. I don’t want to stir up anything here. There are many times when it is conceptually easier to think of a string as an array of n characters (length(s) == n). We also know that lastindex(s) is often > length(s).
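For anyone new to this, a small illustration of the gap between character counts and valid indices. It uses only Base functions; the comments show what I expect for this particular string:

```julia
s = "αfoo"            # α occupies 2 code units (bytes) in UTF-8

length(s)             # 4 characters
ncodeunits(s)         # 5 code units
lastindex(s)          # 5, the index where the last character starts
collect(eachindex(s)) # [1, 3, 4, 5], the valid character indices

nextind(s, 1)         # 3, the first valid index after 1
prevind(s, 3)         # 1, the last valid index before 3
thisind(s, 2)         # 1, index 2 falls inside α, which starts at 1
isvalid(s, 2)         # false, so s[2] would throw a StringIndexError
```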

Here is a simple way to slice strings as if they are arrays of printable characters so that the actual valid index positions are opaque:

function cutstr(s::AbstractString, from::Int, to::Int)
    from < 1        && error("Character number $from out of bounds")
    to > length(s)  && error("Character number $to out of bounds")
    to < from       && error("to character number must be greater than or equal to from character number")
    last(first(s, to), to - from + 1)
end

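You could dispense with my error checking and rely on first and last alone, but be aware that they clamp rather than throw when asked for more characters than the string has, so an out-of-range to can silently return a shifted or shortened result. The extra error checking costs very little time.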
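A quick check of what the bare one-liner does without the guards. This is a sketch; the comments show the results I expect from first and last clamping their count:

```julia
s = "αfoo"

first(s, 10)                     # "αfoo": too large a count is clamped, no error
last(first(s, 2), 5)             # "αf": clamped again

# Unguarded slice with from = 2, to = 10 on a 4-character string:
from, to = 2, 10
last(first(s, to), to - from + 1) # "αfoo", which starts at character 1, not 2
```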

A trivial example:

julia> a = "α" * "foo"   # α entered by typing \alpha and pressing TAB
"αfoo"

julia> @btime cutstr(a,2,4)
  170.718 ns (2 allocations: 64 bytes)
"foo"

julia> @btime cutstr(a^5,2,14)
  474.153 ns (3 allocations: 144 bytes)
"fooαfooαfooαf"

It’s not genius, but it is short and obvious. It’s so short you don’t really need to wrap it in its own function; just use it inline. I tried writing a function that found the correct index for the starting character and then looped with nextind to concatenate the additional characters. Probably my bad coding, but it was slower and certainly too clumsy to use inline.
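For reference, the loop idea might look roughly like this. It is a reconstruction for illustration, not the code that was actually timed, and cutstr_loop is a made-up name:

```julia
# Hypothetical sketch of the nextind loop approach (no explicit bounds checks).
function cutstr_loop(s::AbstractString, from::Int, to::Int)
    i = nextind(s, 0, from)      # index of the first wanted character
    out = ""
    for _ in from:to
        out *= s[i]              # appending a Char builds a new String each time
        i = nextind(s, i)        # step to the next valid index
    end
    return out
end

cutstr_loop("αfoo", 2, 4)        # "foo"
```

The repeated string concatenation inside the loop is the likely reason a version like this ends up slower.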

Has anyone found something else, maybe even simpler and faster?

Here is another way, but it is twice as slow:

join(collect(s)[from:to])

A comprehension is marginally faster than collect.

And even worse:

reduce(*, collect(s)[from:to])

Both of these suffer from array slicing and either the join or lots of string concatenations.
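If the intermediate array is the main cost, one thing to try is the lazy iterators in Base.Iterators, which skip and take characters without materializing collect(s). This is only a sketch: cutstr_lazy is a made-up name, I have not timed it, and it may well be no faster:

```julia
# Hypothetical helper: drop the first from-1 characters, take the next
# to-from+1 characters, and join them into a String.
cutstr_lazy(s::AbstractString, from::Int, to::Int) =
    join(Iterators.take(Iterators.drop(s, from - 1), to - from + 1))

cutstr_lazy("αfoo", 2, 4)   # "foo"
```

Note that take stops early when the string runs out, so an out-of-range to is not reported as an error here.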

https://github.com/JuliaLang/julia/pull/29796 might help a bit?

Yes. Thanks. I went and looked at the source code for first and used that. A bit different than what you suggest…

function cutstr3(s::AbstractString, from::Int, to::Int)
    from < 1        && error("Character number $from out of bounds")
    to > length(s)  && error("Character number $to out of bounds")
    to < from       && error("to character number must be greater than or equal to from character number")

    s[nextind(s, 0, from):nextind(s, 0, to)]
end

A bit faster, and only 1 allocation, which does not depend on the length of the string:

julia> biga
"αfooαfooαfooαfooαfoo"

julia> @btime cutstr3(biga, 2,14)
  246.384 ns (1 allocation: 48 bytes)
"fooαfooαfooαf"

Yours is better yet and doesn’t need explicit error checks. But as you provided it, there is a logic error in calculating up: prevind counts backward from its second argument. I think it should be:

up = max(1, prevind(s, ncodeunits(s)+1, length(s) + 1 - to))
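To make the counting direction concrete, here is how I understand the three-argument forms: nextind(s, 0, n) walks forward n characters from just before the start, and prevind(s, ncodeunits(s)+1, n) walks backward n characters from just past the end. A small check on the example string:

```julia
s = "αfoo"                        # valid indices are 1, 3, 4, 5

nextind(s, 0, 2)                  # 3: the 2nd character counting from the front
prevind(s, ncodeunits(s) + 1, 1)  # 5: the 1st character counting from the back

# The to-th character from the front is the (length(s) + 1 - to)-th from the back:
to = 2
prevind(s, ncodeunits(s) + 1, length(s) + 1 - to)  # 3, same as nextind(s, 0, to)
```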

To conclude: with that logic change, your suggestion is the fastest:

function cutstr4(s::AbstractString, from::Int, to::Int)
    lo = min(lastindex(s), nextind(s, 0, from))
    up = max(1, prevind(s, ncodeunits(s)+1, length(s) + 1 - to))
    return s[lo:up]
end
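A cheap way to confirm that the variants agree is to compare each one against slicing collect(s) for every valid (from, to) pair. The definitions below are copied from this thread minus the explicit error checks, and the test string is assumed to be "αfoo"^5:

```julia
cutstr(s, from, to)  = last(first(s, to), to - from + 1)
cutstr3(s, from, to) = s[nextind(s, 0, from):nextind(s, 0, to)]
function cutstr4(s, from, to)
    lo = min(lastindex(s), nextind(s, 0, from))
    up = max(1, prevind(s, ncodeunits(s) + 1, length(s) + 1 - to))
    return s[lo:up]
end

biga  = "αfoo"^5
chars = collect(biga)            # reference: one Char per array element

# Exhaustively compare every valid character range against the reference.
for from in 1:length(biga), to in from:length(biga)
    expected = String(chars[from:to])
    @assert cutstr(biga, from, to)  == expected
    @assert cutstr3(biga, from, to) == expected
    @assert cutstr4(biga, from, to) == expected
end
```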

Timing:

julia> @btime cutstr4(biga, 2,14)
  174.033 ns (1 allocation: 48 bytes)
"fooαfooαfooαf"

Let’s call it done. Learned more about using nextind and prevind with their built-in iterations. Looking at Julia source is a great way to learn.