String slicing

lewis · October 25, 2018, 12:43am

I know that a lot has been written about safely slicing UTF-8 strings, in which a single character can occupy multiple code units (bytes). We have thisind, prevind, nextind and the search functions as ways to get valid indices. I don’t want to stir up anything here. There are many times when it is conceptually easier to think of a string as an array of n printable characters (length(s) = n). We also know that lastindex(s) is often greater than length(s).
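
To make the gap between character positions and byte indices concrete, here is a small sketch using the same "αfoo" string that appears later in the thread. It only calls Base functions; the variable names are just for illustration.

```julia
s = "αfoo"                      # α occupies two bytes in UTF-8

n_chars = length(s)             # number of characters
last_ix = lastindex(s)          # byte index where the last character starts
valid   = collect(eachindex(s)) # every valid character start index

println((n_chars, last_ix, valid))

# Index 2 falls inside α, so s[2] would throw a StringIndexError.
println(isvalid(s, 2))
println(thisind(s, 2))   # start of the character covering byte 2
println(nextind(s, 1))   # start of the character after the one at index 1
println(prevind(s, 3))   # start of the character before the one at index 3
```

For this string the valid indices are 1, 3, 4 and 5, so length(s) is 4 while lastindex(s) is 5.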

Here is a simple way to slice strings as if they are arrays of printable characters so that the actual valid index positions are opaque:

function cutstr(s::AbstractString, from::Int, to::Int)
    # Validate the character positions before slicing.
    from < 1        && error("Character number $from out of bounds")
    to > length(s)  && error("Character number $to out of bounds")
    to < from       && error("to character number must be greater than or equal to from character number")
    # Take the first `to` characters, then keep the last (to - from + 1) of them.
    last(first(s, to), to - from + 1)
end

You could dispense with my error checking and rely on first and last alone, but as far as I can tell they clamp an out-of-range count to the length of the string rather than throw, so a bad from or to could go unnoticed. The extra error checking costs very little time.
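
It may be worth seeing what first and last do with out-of-range counts before dropping the checks. The sketch below uses a hypothetical cutstr_unchecked, which is just the one-liner from cutstr without the guards; it is not a function from the thread.

```julia
s = "αfoo"

# Hypothetical name: the one-liner from cutstr with the explicit checks removed.
cutstr_unchecked(s, from, to) = last(first(s, to), to - from + 1)

println(first(s, 10))                 # count exceeds length(s): no error
println(last(s, 10))
println(cutstr_unchecked(s, 2, 4))    # in-range request
println(cutstr_unchecked(s, 2, 10))   # `to` past the end
println(cutstr_unchecked(s, 0, 3))    # `from` below 1
```

Going by the Base definitions of first and last for strings, the two out-of-range calls return a clamped string instead of raising an error, which is an argument for keeping the explicit checks.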

A trivial example:

julia> a = "α" * "foo"  # α typed as \alpha followed by TAB
"αfoo"

julia> @btime cutstr(a,2,4)
  170.718 ns (2 allocations: 64 bytes)
"foo"

julia> @btime cutstr(a^5,2,14)
  474.153 ns (3 allocations: 144 bytes)
"fooαfooαfooαf"

It’s not genius, but it is short and obvious. It’s so short you don’t really need to wrap it in its own function; just use it inline. I tried writing a function that found the correct index for the starting character and then looped with nextind to concatenate the additional characters. Probably my bad coding, but it was slower and certainly too clumsy to use inline.

Has anyone found something else, maybe even simpler and faster?

lewis · October 25, 2018, 2:23am

Here is another way, but it is twice as slow:

join(collect(s)[from:to])

A comprehension is marginally faster than collect.

And even worse:

reduce(*, collect(s)[from:to])

Both of these suffer from array slicing and from either the join or lots of string concatenations.
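
For comparison, the array-based variants mentioned above can be written out as small functions. The names here are hypothetical, and the last one (a lazy Iterators.drop/Iterators.take pipeline) is my own addition rather than something proposed in the thread; it is only a sketch of how to avoid the intermediate character array, with no timing claims.

```julia
# All function names below are illustrative, not from the thread.

# Materialize every character, slice the array, then join.
cut_collect(s, from, to) = join(collect(s)[from:to])

# Same idea with a comprehension instead of collect.
cut_comprehension(s, from, to) = join([c for c in s][from:to])

# Repeated concatenation of characters with *.
cut_reduce(s, from, to) = reduce(*, collect(s)[from:to])

# Assumption: a lazy variant that skips from - 1 characters and takes the rest,
# so no intermediate Vector{Char} is built.
cut_lazy(s, from, to) = join(Iterators.take(Iterators.drop(s, from - 1), to - from + 1))

biga = "αfoo"^5
for f in (cut_collect, cut_comprehension, cut_reduce, cut_lazy)
    println(f(biga, 2, 14))
end
```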

kristoffer.carlsson · October 25, 2018, 2:26am

https://github.com/JuliaLang/julia/pull/29796 might help a bit?

lewis · October 25, 2018, 3:03am

Yes. Thanks. I went and looked at the source code for first and used that. A bit different than what you suggest…

function cutstr3(s::AbstractString, from::Int, to::Int)
    from < 1        && error("Character number $from out of bounds")
    to > length(s)  && error("Character number $to out of bounds")
    to < from       && error("to character number must be greater than or equal to from character number")

    # nextind(s, 0, n) returns the byte index where the n-th character starts.
    s[nextind(s, 0, from):nextind(s, 0, to)]
end

A bit faster, with only one allocation, and the number of allocations does not depend on the length of the string:

julia> biga
"αfooαfooαfooαfooαfoo"

julia> @btime cutstr3(biga, 2,14)
  246.384 ns (1 allocation: 48 bytes)
"fooαfooαfooαf"
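
The three-argument nextind is what does the work in cutstr3. As a rough illustration (variable names are mine), this shows which byte indices it produces for the example above and checks them against the list of valid character starts.

```julia
biga = "αfoo"^5

# nextind(s, 0, n) walks n characters forward from index 0,
# landing on the byte index where character n starts.
from, to = 2, 14
lo = nextind(biga, 0, from)
hi = nextind(biga, 0, to)
println((lo, hi))
println(biga[lo:hi])

# The same indices, read off the list of valid character starts.
starts = collect(eachindex(biga))
println(starts[from] == lo && starts[to] == hi)
```

Here characters 2 and 14 start at byte indices 3 and 18, so the slice is biga[3:18].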

Yours is better yet and doesn’t need explicit error checks. But as you provided it, there is a logic error in calculating up: prevind counts backward from its second argument. I think it should be:

up = max(1, prevind(s, ncodeunits(s)+1, length(s) + 1 - to))

To conclude: with that logic change, your suggestion is the fastest:

function cutstr4(s::AbstractString, from::Int, to::Int)
    lo = min(lastindex(s), nextind(s, 0, from))
    up = max(1, prevind(s, ncodeunits(s)+1, length(s) + 1 - to))
    return s[lo:up]
end

Timing:

julia> @btime cutstr4(biga, 2,14)
  174.033 ns (1 allocation: 48 bytes)
"fooαfooαfooαf"
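
As a sanity check on the backward count, the sketch below repeats cutstr4 and compares it with a slow reference built from an array of characters (reference is a hypothetical helper, not from the thread). It also confirms that counting length(s) + 1 - to characters back from one past the end lands on the same index as counting to characters forward from 0.

```julia
biga = "αfoo"^5

function cutstr4(s::AbstractString, from::Int, to::Int)
    lo = min(lastindex(s), nextind(s, 0, from))
    up = max(1, prevind(s, ncodeunits(s) + 1, length(s) + 1 - to))
    return s[lo:up]
end

# Hypothetical reference: slice an array of characters, then join.
reference(s, from, to) = join(collect(s)[from:to])

# The k-th character from the end is character number length(s) + 1 - k
# from the front, so both walks should meet at the same byte index.
to = 14
k = length(biga) + 1 - to
println(prevind(biga, ncodeunits(biga) + 1, k) == nextind(biga, 0, to))

# Compare every in-range (from, to) pair against the reference.
n = length(biga)
agree = all(cutstr4(biga, f, t) == reference(biga, f, t) for f in 1:n for t in f:n)
println(agree)
```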

Let’s call it done. I learned more about using nextind and prevind with their built-in character counting. Looking at the Julia source is a great way to learn.
