Efficient way to split string at specific index

BambOoxX · June 21, 2022, 5:16pm

Just for the sake of the exercise (I do not know if this is really relevant), I wanted to assess the relative performance of the current split implementation in two cases:

A space is added between the initial blocks of characters (resulting in a longer string)
The same, but with one character block missing (resulting in a string of the same length as the initial one, but one fewer item in the output vector)

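using BenchmarkTools  # provides @benchmark

# Ten blocks of 8 identical letters, no separator (80 characters)
str = join([repeat(letter, 8) for letter in ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j')])
# The same ten blocks joined with a space (89 characters)
str_spaced = join([repeat(letter, 8) for letter in ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j')], " ")
# Nine blocks joined with a space (80 characters, same length as str)
str_spaced_short = join([repeat(letter, 8) for letter in ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i')], " ")
@benchmark split($str_spaced)
@benchmark split($str_spaced_short)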

Which gives

aaaaaaaabbbbbbbbccccccccddddddddeeeeeeeeffffffffgggggggghhhhhhhhiiiiiiiijjjjjjjj
aaaaaaaa bbbbbbbb cccccccc dddddddd eeeeeeee ffffffff gggggggg hhhhhhhh iiiiiiii jjjjjjjj
aaaaaaaa bbbbbbbb cccccccc dddddddd eeeeeeee ffffffff gggggggg hhhhhhhh iiiiiiii

julia> @benchmark split($str_spaced)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.270 μs …  1.429 ms  ┊ GC (min … max): 0.00% … 99.78%
 Time  (median):     1.910 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.171 μs ± 14.289 μs  ┊ GC (mean ± σ):  6.57% ±  1.00%

         ▂▅█▆▃▂
  █▆▄▄▃▅███████▆▅▅▄▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂ ▃
  1.27 μs        Histogram: frequency by time        4.94 μs <

 Memory estimate: 1.25 KiB, allocs estimate: 3.

julia> @benchmark split($str_spaced_short)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.180 μs …  1.137 ms  ┊ GC (min … max): 0.00% … 99.80%
 Time  (median):     1.640 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.799 μs ± 11.361 μs  ┊ GC (mean ± σ):  6.31% ±  1.00%

    ▄▂     ▂ ▁▆█▃▆▄▅▂ 
  ▃▅██▄▄▃▃███████████▇▆▅▅▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  1.18 μs        Histogram: frequency by time        3.05 μs <

 Memory estimate: 1.25 KiB, allocs estimate: 3.

I expected splitting on a delimiter to be more costly than splitting on indices, given that the delimiter(s) could be anywhere in the string, but I think this also shows that there is room for improvement for index splitting (except for the comprehension).
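
For comparison, index-based splitting of the unspaced string could be sketched as below. This is only a minimal sketch, not one of the implementations benchmarked earlier in the thread: `split_fixed` is a hypothetical name, and it assumes ASCII input so that every byte offset is a valid character index. It returns `SubString` views, the same element type that `split` produces for a `String`.

```julia
# Hypothetical helper (not from the thread): cut `s` into chunks of `width` code units.
# Assumes ASCII input, so every byte offset is a valid character index.
function split_fixed(s::AbstractString, width::Integer)
    n = ncodeunits(s)
    # SubString creates views into `s` instead of copying the characters.
    return [SubString(s, i, min(i + width - 1, n)) for i in 1:width:n]
end

# Same test strings as in the post, built from a character range.
str = join(repeat(letter, 8) for letter in 'a':'j')
str_spaced = join((repeat(letter, 8) for letter in 'a':'j'), " ")

chunks = split_fixed(str, 8)
```

With BenchmarkTools loaded, `@benchmark split_fixed($str, 8)` can be set against the `split` timings above. For non-ASCII strings the byte offsets would have to be replaced by valid character indices (for example via `nextind`), which this sketch does not handle.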
