Efficient way to split string at specific index

I’m currently trying to parse a file whose data lines are composed of 8 character columns without delimiter. Such as

<.. 1..><.. 2..><.. 3..><.. 4..><.. 5..><.. 6..><.. 7..><.. 8..><.. 9..><..10..>
11111111222222223333333344444444555555556666666677777777888888889999999911111111

So far, I’m working with indices and comprehension to do that with

split8_1(str::String) = [str[i+1:8] for i in 0:8:length(str)]

However, I wonder if there is a more elegant or maybe built-in way to do that. Currently split only works by specifying a delimiter, not an index/indices.

I’ve found a potential solution with Iterators.partition under the form of

split8_2(str::String) = join.(Iterators.partition(str, 8))
split8_3(str::String) = Iterators.partition(str, 8)

The join is necessary to get the substrings and not vectors of characters, but give a huge performance hit.

Here are the benchmarks for the three implementations

julia> @benchmark split8_2($str)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  3.311 μs …  1.391 ms  ┊ GC (min … max): 0.00% … 99.59%
 Time  (median):     3.956 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.413 μs ± 17.370 μs  ┊ GC (mean ± σ):  5.50% ±  1.41%

           █▆▅         
  ▁▁▁▁▁▁▂▄████▇▅▄▃▃▃▃▃▃▃▃▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  3.31 μs        Histogram: frequency by time        6.59 μs <

 Memory estimate: 3.16 KiB, allocs estimate: 53.

julia> @benchmark split8_3($str)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.000 ns … 3.100 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.200 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.187 ns ± 0.075 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                       █ 
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▂
  1 ns           Histogram: frequency by time        1.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark split8_1($str)
BenchmarkTools.Trial: 10000 samples with 870 evaluations.
 Range (min … max):  140.805 ns …  12.287 μs  ┊ GC (min … max): 0.00% … 98.36%
 Time  (median):     193.333 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   214.173 ns ± 359.497 ns  ┊ GC (mean ± σ):  6.23% ±  3.66%

                  ▁▂█▆▂▁          ▄▃▂
  ▂▂▂▂▂▃▅▂▂▃▂▂▃▃▅▅██████▅▃▃▂▂▃▄▅▇▇████▇▅▄▃▃▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁ ▃
  141 ns           Histogram: frequency by time          282 ns <

 Memory estimate: 176 bytes, allocs estimate: 2.

You can use @view in there to get the substrings directly, which should be fastest.

Just tested split8_5(str::String) = [@view str[i+1:8] for i in 0:8:length(str)], it actually is slower (in terms of meand and median) and takes more memory.

julia> @benchmark split8_5($str)
BenchmarkTools.Trial: 10000 samples with 737 evaluations.
 Range (min … max):  139.349 ns …  10.731 μs  ┊ GC (min … max): 0.00% … 95.73%
 Time  (median):     250.204 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   265.516 ns ± 457.921 ns  ┊ GC (mean ± σ):  8.00% ±  4.56%

      ▅    ▅█▆          ▁▁▆▆▁ ▃▄▃
  ▂▂▃▄█▇▅▆████▆▃▄▄▄▄▅█▆▇█████████▅▅▄▃▂▂▂▂▂▂▂▂▁▂▁▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁ ▃
  139 ns           Histogram: frequency by time          448 ns <

 Memory estimate: 336 bytes, allocs estimate: 1.

You should test the code:

julia> split8_5(str)
11-element Vector{SubString{String}}:
 "11111111"
 ""
 ""
 ""
 ""
 ""
 ""
 ""
 ""
 ""
 ""

Notice that i+1:8 is parsed as (i+1):8 which means you very quickly get 9:8, which is empty.

Here’s a fix:

split8_6(str::String) = [@view str[i.+(1:8)] for i in 0:8:length(str)-1]

Notice the parens and dot in i.+(1:8) and that i should only go to length(str)-1 (or actually, up to length(str)-8).

Performance looks extremely good to me, what are you unhappy about? 140ns is super fast. Of course, the correct version is slower, but it’s still just 200+ ns. Do you expect it to be faster?

2 Likes

Ouch, my bad, thanks for catching this.
I am actually not unhappy about the performance of the comprehension itself, I was just expecting a built-in function for that would be more efficient. The Iterators.partition feels very elegant and efficient, but once I use join or string or String there is a direct heavy penalty.

First of all, in Julia, your own functions are as fast as the ‘built-in’ ones, as long as you write them in a reasonable way, and often they are faster, because the built-in ones must be more general. This is one of the main attractions of Julia.

Secondly, the Iterators.partition is tricking you, it is a lazy data structure that doesn’t do any real work, until you start collecting or joining it.

6 Likes

your own functions are as fast as the ‘built-in’ ones, as long as you write them in a reasonable way

I just trust that people developing core Julia are a bit more reasonable than I am :slight_smile: . But you are right.

Secondly, the Iterators.partition is tricking you, it is a lazy data structure that doesn’t do any real work, until you start collecting or joining it.

Given what you said, I can restate my question : Why does collect(Iterators.partition(str::String, n::Integer))) not return Vector{String} instead of Vector{Vector{Char}}

EDIT: Took @DNF comment into account

Technically, it returns a lazy iterator, not a Vector. But I don’t know why you don’t get a vector of strings when collecting. I agree that would be less surprising.

1 Like

Another option would be something like:

julia> io = IOBuffer(randstring(8*10))
IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=80, maxsize=Inf, ptr=1, mark=-1)

julia> data = read!(io, Matrix{UInt8}(undef, 8,10));

julia> mapslices(String, data; dims=1)
1×10 Matrix{String}:
 "cpMu5B2W"  "y6GglTbq"  "EkOKs5ZG"  "gnDhdJht"  "KItFYlGS"  "hQUfsLDB"  "4GBWzDeo"  "Q3obXdad"  "FJ1qtfKn"  "OkSSd44Y"

Thanks for the proposal, I just benchmarked it (adding a vec to the result to obtain the same output) and it seems much less efficient unfortunately. See the benchmark below (I am not sure this is the correct way to benchmark this given the IOBuffer though)

function split8_7(str::String)
    io = IOBuffer(str)
    data = read!(io, Matrix{UInt8}(undef, 8, 10))
    return vec(mapslices(String, data; dims=1))
end
julia> @benchmark split8_7($str)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  7.000 μs …  2.903 ms  ┊ GC (min … max): 0.00% … 99.17%
 Time  (median):     8.900 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.829 μs ± 40.202 μs  ┊ GC (mean ± σ):  5.74% ±  1.40%

  ▃▃▂▃▄▄▄▅▆███▅▄▄▃▂▁▁▁▁                                      ▂
  ███████████████████████▇███▇█████▇█▇▇▇█▇▇▇▇▇▆▆▆▅▅▄▆▄▄▄▄▃▄▄ █
  7 μs         Histogram: log(frequency) by time     17.5 μs <

 Memory estimate: 4.70 KiB, allocs estimate: 130.

The IOBuffer was used as an example in place of a file handle (from open(filename)), so you could avoid creating the initial string in the first place; but I’m not sure if/how much that would change the performance.

It’s because String is an iterable object over individual Char, and Iterators.partition partitions an iterable into sections of its eltype. Returning a String would break that contract, as well as force an allocation for each individual String, since they are immutable objects.

The fastest way to use Iterators.partition is

split8_3(str::String) = String.(Iterators.partition(str, 8))

but it still takes 1μs

I should have said SubString.

Since you have this:

julia> collect(Iterators.partition(1:5, 2))
3-element Vector{UnitRange{Int64}}:
 1:2
 3:4
 5:5

and

julia> collect(Iterators.partition([1,2,3,4,5], 2))
  3-element Vector{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}:
   [1, 2]
   [3, 4]
   [5]

why can’t a partition over String return a Substring? A SubString is to a String as a SubArray is to an Array, isn’t it?

The docstring of partition just says

Iterate over a collection n elements at a time.

which is vague enough as far as I can tell.

1 Like

Substring would be a good specialization here, yes.

Just for the sake of the exercise (I do not know if this is really relevant) I wanted to assess the relative performance of the current split implementation in two cases :

  • A space is added between initial blocks of characters (resulting in a longer string)
  • The same but with one character block missing (resulting in an identical initial string, but one less item in the output vector)
str = join([repeat(letter, 8) for letter in ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j')])
str_spaced = join([repeat(letter, 8) for letter in ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j')], " ")
str_spaced_short = join([repeat(letter, 8) for letter in ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i')], " ")
@benchmark split($str_spaced)
@benchmark split($str_spaced_short)

Which gives

aaaaaaaabbbbbbbbccccccccddddddddeeeeeeeeffffffffgggggggghhhhhhhhiiiiiiiijjjjjjjj
aaaaaaaa bbbbbbbb cccccccc dddddddd eeeeeeee ffffffff gggggggg hhhhhhhh iiiiiiii jjjjjjjj
aaaaaaaa bbbbbbbb cccccccc dddddddd eeeeeeee ffffffff gggggggg hhhhhhhh iiiiiiii

julia> @benchmark split($str_spaced)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.270 μs …  1.429 ms  ┊ GC (min … max): 0.00% … 99.78%
 Time  (median):     1.910 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.171 μs ± 14.289 μs  ┊ GC (mean ± σ):  6.57% ±  1.00%

         ▂▅█▆▃▂
  █▆▄▄▃▅███████▆▅▅▄▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂ ▃
  1.27 μs        Histogram: frequency by time        4.94 μs <

 Memory estimate: 1.25 KiB, allocs estimate: 3.

julia> @benchmark split($str_spaced_short)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.180 μs …  1.137 ms  ┊ GC (min … max): 0.00% … 99.80%
 Time  (median):     1.640 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.799 μs ± 11.361 μs  ┊ GC (mean ± σ):  6.31% ±  1.00%

    ▄▂     ▂ ▁▆█▃▆▄▅▂ 
  ▃▅██▄▄▃▃███████████▇▆▅▅▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  1.18 μs        Histogram: frequency by time        3.05 μs <

 Memory estimate: 1.25 KiB, allocs estimate: 3.

I expected that this would be more difficult than splitting on indices given that the delimiter(s) could be anywhere in the string, but I think it also shows that there is room for improvement for index splitting (except the comprehension)

@DNF @Sukera @halleysfifthinc @Jean_Michel Thanks for the discussion and proposals. I have opened an issue if you are interested Collecting an `Iterators.PartitionIterator{String}` should return a vector of `SubStrings` and not a `Vector{Vector{Char}}` · Issue #45768 · JuliaLang/julia · GitHub

You can write 'a':'j' instead.

1 Like

Thank you, I knew that this existed ! I’m sick today so it’s slightly difficult, I’ve been trying with "a":"j" and this fails, never ocurred to retry with chars :slight_smile:

1 Like
function spliteveryn(str,n)
    b, r = divrem(length(str),n)
    e=b*n
    vcat(getindex.(str,(:).(1:n:e,n:n:e)), str[end-r+1:end])
end

function spliteverynM(str,n)
    e=length(str)
    getindex.(str,(:).(1:n:e,n:n:e))
end