Just for the sake of the exercise (I do not know if this is really relevant), I wanted to assess the relative performance of the current split implementation in two cases:
- A space is added between initial blocks of characters (resulting in a longer string)
- The same, but with one character block missing (resulting in a string of the same length as the initial one, 80 characters, but with one fewer item in the output vector)
using BenchmarkTools  # provides @benchmark

# Ten blocks of eight identical letters, 'a' through 'j'
blocks = [repeat(letter, 8) for letter in 'a':'j']
str = join(blocks)                          # 80 characters, no delimiter
str_spaced = join(blocks, " ")              # 89 characters, 10 items
str_spaced_short = join(blocks[1:9], " ")   # 80 characters, 9 items

@benchmark split($str_spaced)
@benchmark split($str_spaced_short)
Which gives
aaaaaaaabbbbbbbbccccccccddddddddeeeeeeeeffffffffgggggggghhhhhhhhiiiiiiiijjjjjjjj
aaaaaaaa bbbbbbbb cccccccc dddddddd eeeeeeee ffffffff gggggggg hhhhhhhh iiiiiiii jjjjjjjj
aaaaaaaa bbbbbbbb cccccccc dddddddd eeeeeeee ffffffff gggggggg hhhhhhhh iiiiiiii
julia> @benchmark split($str_spaced)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.270 μs … 1.429 ms ┊ GC (min … max): 0.00% … 99.78%
Time (median): 1.910 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.171 μs ± 14.289 μs ┊ GC (mean ± σ): 6.57% ± 1.00%
▂▅█▆▃▂
█▆▄▄▃▅███████▆▅▅▄▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂ ▃
1.27 μs Histogram: frequency by time 4.94 μs <
Memory estimate: 1.25 KiB, allocs estimate: 3.
julia> @benchmark split($str_spaced_short)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.180 μs … 1.137 ms ┊ GC (min … max): 0.00% … 99.80%
Time (median): 1.640 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.799 μs ± 11.361 μs ┊ GC (mean ± σ): 6.31% ± 1.00%
▄▂ ▂ ▁▆█▃▆▄▅▂
▃▅██▄▄▃▃███████████▇▆▅▅▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
1.18 μs Histogram: frequency by time 3.05 μs <
Memory estimate: 1.25 KiB, allocs estimate: 3.
I expected splitting on a delimiter to be more expensive than splitting on indices, given that the delimiter(s) could be anywhere in the string. Still, I think these timings also show that there is room for improvement in the index-splitting approaches (except the comprehension).
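
For reference, here is a minimal sketch of what I mean by a comprehension-based index split. The name split_fixed is hypothetical (it is not a Base function), and the sketch assumes ASCII input so that every byte offset is a valid character index:

```julia
# Hypothetical helper, not part of Base: cut `s` into chunks of `n` code units.
# Assumes ASCII input, so each byte offset is a valid character index.
split_fixed(s::AbstractString, n::Integer) =
    [SubString(s, i, min(i + n - 1, ncodeunits(s))) for i in 1:n:ncodeunits(s)]

# Same unspaced string as above: ten blocks of eight identical letters
str = join(repeat(letter, 8) for letter in 'a':'j')

# Returns SubStrings that share memory with `str`, like split does
chunks = split_fixed(str, 8)
```

With BenchmarkTools loaded, this could be timed the same way as the calls above, e.g. `@benchmark split_fixed($str, 8)`, to compare it against the delimiter-based split on a string of the same length.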