Efficient way to split string at specific index

BambOoxX · June 21, 2022, 11:04am

I’m currently trying to parse a file whose data lines are composed of 8 character columns without delimiter. Such as

<.. 1..><.. 2..><.. 3..><.. 4..><.. 5..><.. 6..><.. 7..><.. 8..><.. 9..><..10..>
11111111222222223333333344444444555555556666666677777777888888889999999911111111

So far, I’m working with indices and comprehension to do that with

split8_1(str::String) = [str[i+1:8] for i in 0:8:length(str)]

However, I wonder if there is a more elegant or maybe built-in way to do that. Currently split only works by specifying a delimiter, not an index/indices.

I’ve found a potential solution with Iterators.partition under the form of

split8_2(str::String) = join.(Iterators.partition(str, 8))
split8_3(str::String) = Iterators.partition(str, 8)

The join is necessary to get the substrings and not vectors of characters, but give a huge performance hit.

Here are the benchmarks for the three implementations

julia> @benchmark split8_2($str)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  3.311 μs …  1.391 ms  ┊ GC (min … max): 0.00% … 99.59%
 Time  (median):     3.956 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.413 μs ± 17.370 μs  ┊ GC (mean ± σ):  5.50% ±  1.41%

           █▆▅         
  ▁▁▁▁▁▁▂▄████▇▅▄▃▃▃▃▃▃▃▃▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  3.31 μs        Histogram: frequency by time        6.59 μs <

 Memory estimate: 3.16 KiB, allocs estimate: 53.

julia> @benchmark split8_3($str)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.000 ns … 3.100 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.200 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.187 ns ± 0.075 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                       █ 
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▂
  1 ns           Histogram: frequency by time        1.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark split8_1($str)
BenchmarkTools.Trial: 10000 samples with 870 evaluations.
 Range (min … max):  140.805 ns …  12.287 μs  ┊ GC (min … max): 0.00% … 98.36%
 Time  (median):     193.333 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   214.173 ns ± 359.497 ns  ┊ GC (mean ± σ):  6.23% ±  3.66%

                  ▁▂█▆▂▁          ▄▃▂
  ▂▂▂▂▂▃▅▂▂▃▂▂▃▃▅▅██████▅▃▃▂▂▃▄▅▇▇████▇▅▄▃▃▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁ ▃
  141 ns           Histogram: frequency by time          282 ns <

 Memory estimate: 176 bytes, allocs estimate: 2.

Sukera · June 21, 2022, 1:19pm

You can use @view in there to get the substrings directly, which should be fastest.

BambOoxX · June 21, 2022, 1:42pm

Just tested split8_5(str::String) = [@view str[i+1:8] for i in 0:8:length(str)], it actually is slower (in terms of meand and median) and takes more memory.

julia> @benchmark split8_5($str)
BenchmarkTools.Trial: 10000 samples with 737 evaluations.
 Range (min … max):  139.349 ns …  10.731 μs  ┊ GC (min … max): 0.00% … 95.73%
 Time  (median):     250.204 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   265.516 ns ± 457.921 ns  ┊ GC (mean ± σ):  8.00% ±  4.56%

      ▅    ▅█▆          ▁▁▆▆▁ ▃▄▃
  ▂▂▃▄█▇▅▆████▆▃▄▄▄▄▅█▆▇█████████▅▅▄▃▂▂▂▂▂▂▂▂▁▂▁▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁ ▃
  139 ns           Histogram: frequency by time          448 ns <

 Memory estimate: 336 bytes, allocs estimate: 1.

DNF · June 21, 2022, 1:53pm

You should test the code:

julia> split8_5(str)
11-element Vector{SubString{String}}:
 "11111111"
 ""
 ""
 ""
 ""
 ""
 ""
 ""
 ""
 ""
 ""

Notice that i+1:8 is parsed as (i+1):8 which means you very quickly get 9:8, which is empty.

Here’s a fix:

split8_6(str::String) = [@view str[i.+(1:8)] for i in 0:8:length(str)-1]

Notice the parens and dot in i.+(1:8) and that i should only go to length(str)-1 (or actually, up to length(str)-8).

Performance looks extremely good to me, what are you unhappy about? 140ns is super fast. Of course, the correct version is slower, but it’s still just 200+ ns. Do you expect it to be faster?

BambOoxX · June 21, 2022, 2:06pm

Ouch, my bad, thanks for catching this.
I am actually not unhappy about the performance of the comprehension itself, I was just expecting a built-in function for that would be more efficient. The Iterators.partition feels very elegant and efficient, but once I use join or string or String there is a direct heavy penalty.

DNF · June 21, 2022, 2:08pm

First of all, in Julia, your own functions are as fast as the ‘built-in’ ones, as long as you write them in a reasonable way, and often they are faster, because the built-in ones must be more general. This is one of the main attractions of Julia.

Secondly, the Iterators.partition is tricking you, it is a lazy data structure that doesn’t do any real work, until you start collecting or joining it.

BambOoxX · June 21, 2022, 2:19pm

your own functions are as fast as the ‘built-in’ ones, as long as you write them in a reasonable way

I just trust that people developing core Julia are a bit more reasonable than I am . But you are right.

Secondly, the Iterators.partition is tricking you, it is a lazy data structure that doesn’t do any real work, until you start collecting or joining it.

Given what you said, I can restate my question : Why does collect(Iterators.partition(str::String, n::Integer))) not return Vector{String} instead of Vector{Vector{Char}}

EDIT: Took @DNF comment into account

DNF · June 21, 2022, 2:38pm

Technically, it returns a lazy iterator, not a Vector. But I don’t know why you don’t get a vector of strings when collecting. I agree that would be less surprising.

halleysfifthinc · June 21, 2022, 2:56pm

Another option would be something like:

julia> io = IOBuffer(randstring(8*10))
IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=80, maxsize=Inf, ptr=1, mark=-1)

julia> data = read!(io, Matrix{UInt8}(undef, 8,10));

julia> mapslices(String, data; dims=1)
1×10 Matrix{String}:
 "cpMu5B2W"  "y6GglTbq"  "EkOKs5ZG"  "gnDhdJht"  "KItFYlGS"  "hQUfsLDB"  "4GBWzDeo"  "Q3obXdad"  "FJ1qtfKn"  "OkSSd44Y"

BambOoxX · June 21, 2022, 4:06pm

Thanks for the proposal, I just benchmarked it (adding a vec to the result to obtain the same output) and it seems much less efficient unfortunately. See the benchmark below (I am not sure this is the correct way to benchmark this given the IOBuffer though)

function split8_7(str::String)
    io = IOBuffer(str)
    data = read!(io, Matrix{UInt8}(undef, 8, 10))
    return vec(mapslices(String, data; dims=1))
end

julia> @benchmark split8_7($str)
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  7.000 μs …  2.903 ms  ┊ GC (min … max): 0.00% … 99.17%
 Time  (median):     8.900 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.829 μs ± 40.202 μs  ┊ GC (mean ± σ):  5.74% ±  1.40%

  ▃▃▂▃▄▄▄▅▆███▅▄▄▃▂▁▁▁▁                                      ▂
  ███████████████████████▇███▇█████▇█▇▇▇█▇▇▇▇▇▆▆▆▅▅▄▆▄▄▄▄▃▄▄ █
  7 μs         Histogram: log(frequency) by time     17.5 μs <

 Memory estimate: 4.70 KiB, allocs estimate: 130.

halleysfifthinc · June 21, 2022, 4:09pm

The IOBuffer was used as an example in place of a file handle (from open(filename)), so you could avoid creating the initial string in the first place; but I’m not sure if/how much that would change the performance.

Sukera · June 21, 2022, 4:10pm

It’s because String is an iterable object over individual Char, and Iterators.partition partitions an iterable into sections of its eltype. Returning a String would break that contract, as well as force an allocation for each individual String, since they are immutable objects.

Jean_Michel · June 21, 2022, 4:36pm

The fastest way to use Iterators.partition is

split8_3(str::String) = String.(Iterators.partition(str, 8))

but it still takes 1μs

DNF · June 21, 2022, 4:49pm

I should have said SubString.

Since you have this:

julia> collect(Iterators.partition(1:5, 2))
3-element Vector{UnitRange{Int64}}:
 1:2
 3:4
 5:5

and

julia> collect(Iterators.partition([1,2,3,4,5], 2))
  3-element Vector{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}:
   [1, 2]
   [3, 4]
   [5]

why can’t a partition over String return a Substring? A SubString is to a String as a SubArray is to an Array, isn’t it?

The docstring of partition just says

Iterate over a collection n elements at a time.

which is vague enough as far as I can tell.

Sukera · June 21, 2022, 5:03pm

Substring would be a good specialization here, yes.

BambOoxX · June 21, 2022, 5:16pm

Just for the sake of the exercise (I do not know if this is really relevant) I wanted to assess the relative performance of the current split implementation in two cases :

A space is added between initial blocks of characters (resulting in a longer string)
The same but with one character block missing (resulting in an identical initial string, but one less item in the output vector)

str = join([repeat(letter, 8) for letter in ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j')])
str_spaced = join([repeat(letter, 8) for letter in ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j')], " ")
str_spaced_short = join([repeat(letter, 8) for letter in ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i')], " ")
@benchmark split($str_spaced)
@benchmark split($str_spaced_short)

Which gives

aaaaaaaabbbbbbbbccccccccddddddddeeeeeeeeffffffffgggggggghhhhhhhhiiiiiiiijjjjjjjj
aaaaaaaa bbbbbbbb cccccccc dddddddd eeeeeeee ffffffff gggggggg hhhhhhhh iiiiiiii jjjjjjjj
aaaaaaaa bbbbbbbb cccccccc dddddddd eeeeeeee ffffffff gggggggg hhhhhhhh iiiiiiii

julia> @benchmark split($str_spaced)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.270 μs …  1.429 ms  ┊ GC (min … max): 0.00% … 99.78%
 Time  (median):     1.910 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.171 μs ± 14.289 μs  ┊ GC (mean ± σ):  6.57% ±  1.00%

         ▂▅█▆▃▂
  █▆▄▄▃▅███████▆▅▅▄▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂ ▃
  1.27 μs        Histogram: frequency by time        4.94 μs <

 Memory estimate: 1.25 KiB, allocs estimate: 3.

julia> @benchmark split($str_spaced_short)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.180 μs …  1.137 ms  ┊ GC (min … max): 0.00% … 99.80%
 Time  (median):     1.640 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.799 μs ± 11.361 μs  ┊ GC (mean ± σ):  6.31% ±  1.00%

    ▄▂     ▂ ▁▆█▃▆▄▅▂ 
  ▃▅██▄▄▃▃███████████▇▆▅▅▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  1.18 μs        Histogram: frequency by time        3.05 μs <

 Memory estimate: 1.25 KiB, allocs estimate: 3.

I expected that this would be more difficult than splitting on indices given that the delimiter(s) could be anywhere in the string, but I think it also shows that there is room for improvement for index splitting (except the comprehension)

BambOoxX · June 21, 2022, 5:37pm

@DNF @Sukera @halleysfifthinc @Jean_Michel Thanks for the discussion and proposals. I have opened an issue if you are interested Collecting an `Iterators.PartitionIterator{String}` should return a vector of `SubStrings` and not a `Vector{Vector{Char}}` · Issue #45768 · JuliaLang/julia · GitHub

DNF · June 21, 2022, 5:51pm

You can write 'a':'j' instead.

BambOoxX · June 21, 2022, 6:04pm

Thank you, I knew that this existed ! I’m sick today so it’s slightly difficult, I’ve been trying with "a":"j" and this fails, never ocurred to retry with chars

rocco_sprmnt21 · June 24, 2022, 3:02pm

function spliteveryn(str,n)
    b, r = divrem(length(str),n)
    e=b*n
    vcat(getindex.(str,(:).(1:n:e,n:n:e)), str[end-r+1:end])
end

function spliteverynM(str,n)
    e=length(str)
    getindex.(str,(:).(1:n:e,n:n:e))
end

Topic		Replies	Views
String slicing General Usage	3	2696	October 25, 2018
Optimize splitting vector of strings Performance strings	10	154	August 8, 2024
Performance: read data from ascii file, replace `split` General Usage performance	13	275	November 12, 2024
Performance of splitting string and parsing numbers Performance	29	836	December 29, 2022
Parse a string using multiple delimiters New to Julia	1	3414	July 22, 2017

Efficient way to split string at specific index

Related topics