Replicate @tturbo performance

This is because you’re totally memory bound. Try @turbo with vectors of length 1024 or less.

Did you start julia with multiple threads?
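
In case it helps, here is a quick way to check this from the REPL. `Threads.nthreads()` is part of Base; the warning text is only illustrative.

```julia
# Number of threads this session was started with. It is fixed at startup,
# e.g. via `julia --threads 4` or the JULIA_NUM_THREADS environment variable.
n = Threads.nthreads()
println("Julia is running with ", n, " thread(s)")

# With a single thread, a threaded loop has nothing to spread work across.
n == 1 && @warn "Only one thread available; restart with `julia --threads N`"
```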

EDIT:
Interestingly, I hadn’t actually seen LLVM SIMD min/max functions before, but it does now.
LV (LoopVectorization) should still do better for sizes like 255, i.e. one less than a power of 2.

julia> using .Minmax

julia> x = se(6);

julia> @btime myminmax1_turbo($x)
  23.396 ns (0 allocations: 0 bytes)
(1, 64)

julia> @btime myminmax1_tturbo($x)
  33.715 ns (0 allocations: 0 bytes)
(1, 64)

julia> @btime myminmax1_basic($x)
  36.803 ns (0 allocations: 0 bytes)
(1, 64)

julia> @btime myminmax2_turbo($x)
  20.421 ns (0 allocations: 0 bytes)
(1, 64)

julia> @btime myminmax2_tturbo($x)
  23.642 ns (0 allocations: 0 bytes)
(1, 64)

julia> @btime myminmax2_basic($x)
  36.863 ns (0 allocations: 0 bytes)
(1, 64)

julia> @btime myminmax_mapreduce($x)
  40.438 ns (0 allocations: 0 bytes)
(1, 64)

Once upon a time, base Julia had trouble SIMDing this.

Even so, LV is a fair bit faster, even with the threading check (these vectors are too small to thread).
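
The definitions of `se` and the `myminmax*` functions are not shown in this thread, so the following is only a guess at what the base-Julia variants might look like. The names `se_sketch`, `minmax_basic_sketch` and `minmax_mapreduce_sketch` are hypothetical, and `se_sketch(n)` assumes the input is a shuffled `1:2^n` (which would be consistent with the `(1, 64)` results for `se(6)`, but is an assumption).

```julia
using Random

# Hypothetical stand-in for the thread's `se`: a shuffled vector of 1:2^n.
se_sketch(n) = shuffle(collect(1:2^n))

# Plain loop that tracks the minimum and maximum in a single pass.
# `@simd` allows the compiler to reorder the reduction so it can vectorize.
function minmax_basic_sketch(x::AbstractVector)
    isempty(x) && throw(ArgumentError("x must be non-empty"))
    mn = mx = first(x)
    @inbounds @simd for i in eachindex(x)
        mn = min(mn, x[i])
        mx = max(mx, x[i])
    end
    return (mn, mx)
end

# Same reduction written with mapreduce over (min, max) pairs.
minmax_mapreduce_sketch(x) =
    mapreduce(v -> (v, v), ((a, b), (c, d)) -> (min(a, c), max(b, d)), x)

x = se_sketch(6)
println(minmax_basic_sketch(x))
println(minmax_mapreduce_sketch(x))
```

The LV variants would presumably swap the loop annotation for `@turbo` or `@tturbo`; that is not shown here because the original code is not available.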

I got to length 65k before I noticed LV using 2 threads.


Indeed, you are correct. Comparing myminmax2_basic, myminmax2_turbo and myminmax2_tturbo across different input lengths, @turbo wins for the smallest inputs, and @tturbo takes the lead in the range from se(12) to se(13). Beyond that, myminmax2_basic is tied with myminmax2_turbo, but myminmax2_tturbo is suddenly much slower (comparing the median timings):

julia> @benchmark myminmax2_basic(i) setup = (Random.seed!(12345678); i = se(14))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.457 μs … 43.592 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      9.759 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.439 μs ±  1.824 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▆▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁  ▁                                ▂
  ████████████████████████████████▇▇▇▆▇▇▆▆▆▅▆▆▆▅▅▆▄▆▆▅▅▅▄▅▄▁▄ █
  9.46 μs      Histogram: log(frequency) by time      17.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_turbo(i) setup = (Random.seed!(12345678); i = se(14))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.418 μs … 93.085 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      9.709 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.176 μs ±  2.005 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▆▅▅▄▄▃▂▂▂▂▁▁                                              ▂
  █████████████████▇▇█▆▆▆▆▅▆▆▅▅▅▅▅▆▅▅▄▅▅▅▅▄▅▄▃▁▃▄▅▅▅▄▄▃▁▄▃▁▁▄ █
  9.42 μs      Histogram: log(frequency) by time      19.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_tturbo(i) setup = (Random.seed!(12345678); i = se(14))
BenchmarkTools.Trial: 9767 samples with 5 evaluations.
 Range (min … max):   6.448 μs … 26.342 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.241 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.036 μs ±  1.755 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                               ▂▂▃▅▆█▅▃▁
  ▁▁▁▁▂▂▃▃▂▂▂▂▁▂▁▁▂▂▂▃▃▄▅▆▆▅▆▆██████████▇▆▄▃▂▂▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁ ▃
  6.45 μs         Histogram: frequency by time        16.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_basic(i) setup = (Random.seed!(12345678); i = se(15))
BenchmarkTools.Trial: 9742 samples with 1 evaluation.
 Range (min … max):  18.815 μs … 119.003 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     23.825 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.777 μs ±   4.714 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▁▃▄▆█▇███▅▅▃▁▁
  ▄▅▅▇██████████████▇▆▅▅▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  18.8 μs         Histogram: frequency by time         44.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_turbo(i) setup = (Random.seed!(12345678); i = se(15))
BenchmarkTools.Trial: 9707 samples with 1 evaluation.
 Range (min … max):  18.815 μs … 106.610 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     23.704 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.672 μs ±   4.239 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁▃▆▇█▇▆▅▄▃▁
  ▁▂▃▄▃▆████████████▆▆▅▄▄▄▃▃▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  18.8 μs         Histogram: frequency by time           42 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_tturbo(i) setup = (Random.seed!(12345678); i = se(15))
BenchmarkTools.Trial: 5236 samples with 1 evaluation.
 Range (min … max):  13.044 μs … 130.655 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     43.958 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   44.657 μs ±   8.647 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                            ▂▃▅▆█▆▆▃▂
  ▁▁▁▁▁▁▁▁▂▂▂▂▂▂▁▂▃▃▂▂▂▂▃▃▅▇███████████▇▆▅▅▄▄▃▂▃▂▁▂▂▁▁▁▁▁▁▁▂▃▁ ▃
  13 μs           Histogram: frequency by time         72.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

This is with Julia started with four threads (same as above).

BTW, don’t think that I’m criticizing your work. I know that your packages are tremendously useful; it’s just that comparing microbenchmarks is something I enjoy very much :grin:. Also, thank you for being so instructive here.

FWIW, I got

julia> using Random

julia> @benchmark myminmax2_basic(i) setup = (Random.seed!(12345678); i = se(14))
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.611 μs …  16.944 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.013 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.156 μs ± 802.595 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▃▄▇█▆▃
  ████████▆▅▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂▂ ▃
  5.61 μs         Histogram: frequency by time        10.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_turbo(i) setup = (Random.seed!(12345678); i = se(14))
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.982 μs …  13.480 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.304 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.473 μs ± 687.470 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█▂▆▇▅▂
  ████████▆▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▄▃▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂ ▃
  3.98 μs         Histogram: frequency by time        8.16 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_tturbo(i) setup = (Random.seed!(12345678); i = se(14))
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  3.263 μs …  13.234 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.667 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.710 μs ± 791.557 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▂▃▂▂▂▃▂▄▄▂▄▆█▇▂▃▁ ▁ ▁▂  ▁         ▁                        ▂
  ▄██████████████████████▇▇█▇▅▄▆█▆▅████▆▆▆▄▅▄▆▇▆▆▅▄▄▃▃▁▁▁▁▃▃▆ █
  3.26 μs      Histogram: log(frequency) by time      9.35 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_basic(i) setup = (Random.seed!(12345678); i = se(15))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  17.183 μs … 67.228 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     22.326 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.755 μs ±  5.780 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▂▆██▆▄▁
  ▂▂▃▆███████▇▆▅▄▄▄▃▃▃▃▂▂▂▂▂▂▂▁▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  17.2 μs         Histogram: frequency by time        54.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_turbo(i) setup = (Random.seed!(12345678); i = se(15))
BenchmarkTools.Trial: 9785 samples with 3 evaluations.
 Range (min … max):  10.911 μs … 34.302 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     14.028 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.655 μs ±  2.598 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▃▆█▇▄▁
  ▂▂▂▂▃▄▇██████▇▅▄▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  10.9 μs         Histogram: frequency by time        28.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_tturbo(i) setup = (Random.seed!(12345678); i = se(15))
BenchmarkTools.Trial: 6380 samples with 7 evaluations.
 Range (min … max):  7.464 μs … 24.428 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.435 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.811 μs ±  1.682 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▄▅█▇▂ ▂▅▃
  ▂▃▄▅▆▇██████▇████▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▂▁▁▁▂ ▃
  7.46 μs        Histogram: frequency by time        18.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

So for se(15) threading helped, but I also saw the regressed mean time for se(14).

Looking at only minimum times tends to be misleading for multithreaded or allocating code.
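
To make that concrete, here is a small sketch that uses only numbers already posted in this thread: the minimum and median timings of the first set of se(15) runs (four threads). The variable names are illustrative.

```julia
# Timings in μs, copied from the first set of se(15) results in this thread.
turbo  = (min = 18.815, median = 23.704)
tturbo = (min = 13.044, median = 43.958)

# Ratio > 1 means @tturbo looks faster, ratio < 1 means it looks slower.
speedup_by_min    = turbo.min / tturbo.min
speedup_by_median = turbo.median / tturbo.median

println("by minimum: ", round(speedup_by_min; digits = 2), "x")
println("by median:  ", round(speedup_by_median; digits = 2), "x")
```

Judged by the minimum, the threaded version looks about 1.4x faster; judged by the median, it takes almost twice as long. The minimum only captures the rare runs where the threads happened to be ready immediately.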

“it’s just that comparing microbenchmarks is something I enjoy very much”

Feel free to make PRs that adjust the threading ramp up or heuristics.


You should try with x = rand(10^8) or more and you should see a difference in your benchmarks.
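
For a rough sense of scale, the sketch below computes the memory footprint of a few vector lengths mentioned in this thread. `rand(10^8)` is a `Vector{Float64}`; the 8-byte element size for the other lengths is an assumption, since the element type returned by `se` is not shown. The helper name `footprint` is made up for this example.

```julia
# Bytes occupied by n elements of type T (8 bytes each for Float64).
footprint(n, T = Float64) = n * sizeof(T)

for n in (1024, 2^15, 10^8)
    kib = footprint(n) / 1024
    println(rpad(n, 10), " elements -> ", round(kib; digits = 1), " KiB")
end
```

A 1024-element vector is only 8 KiB, which is small enough to stay in cache on typical CPUs, while `rand(10^8)` is 800 MB and has to be streamed from main memory.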