Replicate @tturbo performance

Elrod · August 23, 2022, 10:26am

This is because you’re totally memory bound. Try @turbo with vectors of length 1024 or less.

Did you start julia with multiple threads?
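A quick way to answer that question from inside the session is sketched below. It only uses `Threads.nthreads()` from Base; the thread count of 4 in the comment is just an example value, not a recommendation.

```julia
# Julia picks its thread count at startup, e.g. with
#   julia --threads 4
# or by setting the JULIA_NUM_THREADS environment variable beforehand.
nthreads = Threads.nthreads()
println("Julia threads: ", nthreads)

# With a single thread there is nothing for a threaded loop to spread work
# over, so a threaded variant should not be expected to beat the serial one.
if nthreads == 1
    println("Only one thread available; restart with --threads N to test threading.")
end
```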

EDIT:
Interestingly, I hadn’t actually seen LLVM SIMD min/max functions before, but it does so now.
LV should still do better for sizes like 255, i.e. one less than a power of 2.

julia> using .Minmax

julia> x = se(6);

julia> @btime myminmax1_turbo($x)
  23.396 ns (0 allocations: 0 bytes)
(1, 64)

julia> @btime myminmax1_tturbo($x)
  33.715 ns (0 allocations: 0 bytes)
(1, 64)

julia> @btime myminmax1_basic($x)
  36.803 ns (0 allocations: 0 bytes)
(1, 64)

julia> @btime myminmax2_turbo($x)
  20.421 ns (0 allocations: 0 bytes)
(1, 64)

julia> @btime myminmax2_tturbo($x)
  23.642 ns (0 allocations: 0 bytes)
(1, 64)

julia> @btime myminmax2_basic($x)
  36.863 ns (0 allocations: 0 bytes)
(1, 64)

julia> @btime myminmax_mapreduce($x)
  40.438 ns (0 allocations: 0 bytes)
(1, 64)

Once upon a time, base Julia had trouble SIMDing this.

LV is still a fair bit faster, even with the threading check (these vectors are too small to thread).

I got to length 65k before I noticed LV using 2 threads.
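For reference, the kind of loop being compared might look like the sketch below. The definitions of `myminmax1_basic` and `myminmax2_basic` are not shown in this part of the thread, so `minmax_basic` is a hypothetical stand-in written in plain Base Julia, with `@simd` giving the compiler permission to vectorize the reduction. The `_turbo` and `_tturbo` variants presumably place `@turbo` or `@tturbo` in front of a similar loop, but that is an assumption.

```julia
# Hypothetical stand-in for the thread's `*_basic` functions (not the original code).
# Computes the minimum and maximum of a non-empty vector in a single pass.
function minmax_basic(x::AbstractVector{T}) where {T<:Real}
    isempty(x) && throw(ArgumentError("x must be non-empty"))
    mi = ma = first(x)
    # @inbounds skips bounds checks; @simd allows the reduction to be reordered.
    @inbounds @simd for i in eachindex(x)
        v = x[i]
        mi = min(mi, v)
        ma = max(ma, v)
    end
    return mi, ma
end

x = collect(64:-1:1)
println(minmax_basic(x))  # prints (1, 64)
```

The result can be checked against Base with `minmax_basic(x) == extrema(x)`.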

nsajko · August 23, 2022, 11:13am

Indeed, you are correct. Comparing myminmax2_basic, myminmax2_turbo and myminmax2_tturbo across different input lengths (by median timings): @turbo wins for the smallest inputs; in the range from se(12) to se(13), @tturbo takes the lead; after that, myminmax2_basic is tied with myminmax2_turbo, but myminmax2_tturbo is suddenly much slower:

julia> @benchmark myminmax2_basic(i) setup = (Random.seed!(12345678); i = se(14))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.457 μs … 43.592 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      9.759 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.439 μs ±  1.824 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▆▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁  ▁                                ▂
  ████████████████████████████████▇▇▇▆▇▇▆▆▆▅▆▆▆▅▅▆▄▆▆▅▅▅▄▅▄▁▄ █
  9.46 μs      Histogram: log(frequency) by time      17.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_turbo(i) setup = (Random.seed!(12345678); i = se(14))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.418 μs … 93.085 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      9.709 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.176 μs ±  2.005 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▆▅▅▄▄▃▂▂▂▂▁▁                                              ▂
  █████████████████▇▇█▆▆▆▆▅▆▆▅▅▅▅▅▆▅▅▄▅▅▅▅▄▅▄▃▁▃▄▅▅▅▄▄▃▁▄▃▁▁▄ █
  9.42 μs      Histogram: log(frequency) by time      19.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_tturbo(i) setup = (Random.seed!(12345678); i = se(14))
BenchmarkTools.Trial: 9767 samples with 5 evaluations.
 Range (min … max):   6.448 μs … 26.342 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.241 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.036 μs ±  1.755 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                               ▂▂▃▅▆█▅▃▁                       
  ▁▁▁▁▂▂▃▃▂▂▂▂▁▂▁▁▂▂▂▃▃▄▅▆▆▅▆▆██████████▇▆▄▃▂▂▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁ ▃
  6.45 μs         Histogram: frequency by time        16.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_basic(i) setup = (Random.seed!(12345678); i = se(15))
BenchmarkTools.Trial: 9742 samples with 1 evaluation.
 Range (min … max):  18.815 μs … 119.003 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     23.825 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.777 μs ±   4.714 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▁▃▄▆█▇███▅▅▃▁▁                                            
  ▄▅▅▇██████████████▇▆▅▅▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  18.8 μs         Histogram: frequency by time         44.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_turbo(i) setup = (Random.seed!(12345678); i = se(15))
BenchmarkTools.Trial: 9707 samples with 1 evaluation.
 Range (min … max):  18.815 μs … 106.610 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     23.704 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.672 μs ±   4.239 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁▃▆▇█▇▆▅▄▃▁                                             
  ▁▂▃▄▃▆████████████▆▆▅▄▄▄▃▃▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  18.8 μs         Histogram: frequency by time           42 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_tturbo(i) setup = (Random.seed!(12345678); i = se(15))
BenchmarkTools.Trial: 5236 samples with 1 evaluation.
 Range (min … max):  13.044 μs … 130.655 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     43.958 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   44.657 μs ±   8.647 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                            ▂▃▅▆█▆▆▃▂                           
  ▁▁▁▁▁▁▁▁▂▂▂▂▂▂▁▂▃▃▂▂▂▂▃▃▅▇███████████▇▆▅▅▄▄▃▂▃▂▁▂▂▁▁▁▁▁▁▁▂▃▁ ▃
  13 μs           Histogram: frequency by time         72.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

This is with Julia started with four threads (same as above).

BTW, don’t think that I’m criticizing your work. I know that your packages are tremendously useful; it’s just that comparing microbenchmarks is something I enjoy very much. Also, thank you for being so instructive here.

Elrod · August 23, 2022, 12:19pm

FWIW, I got

julia> using Random

julia> @benchmark myminmax2_basic(i) setup = (Random.seed!(12345678); i = se(14))
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.611 μs …  16.944 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.013 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.156 μs ± 802.595 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▃▄▇█▆▃
  ████████▆▅▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂▂ ▃
  5.61 μs         Histogram: frequency by time        10.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_turbo(i) setup = (Random.seed!(12345678); i = se(14))
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.982 μs …  13.480 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.304 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.473 μs ± 687.470 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█▂▆▇▅▂
  ████████▆▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▄▃▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂ ▃
  3.98 μs         Histogram: frequency by time        8.16 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_tturbo(i) setup = (Random.seed!(12345678); i = se(14))
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  3.263 μs …  13.234 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.667 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.710 μs ± 791.557 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▂▃▂▂▂▃▂▄▄▂▄▆█▇▂▃▁ ▁ ▁▂  ▁         ▁                        ▂
  ▄██████████████████████▇▇█▇▅▄▆█▆▅████▆▆▆▄▅▄▆▇▆▆▅▄▄▃▃▁▁▁▁▃▃▆ █
  3.26 μs      Histogram: log(frequency) by time      9.35 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_basic(i) setup = (Random.seed!(12345678); i = se(15))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  17.183 μs … 67.228 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     22.326 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.755 μs ±  5.780 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▂▆██▆▄▁
  ▂▂▃▆███████▇▆▅▄▄▄▃▃▃▃▂▂▂▂▂▂▂▁▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  17.2 μs         Histogram: frequency by time        54.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_turbo(i) setup = (Random.seed!(12345678); i = se(15))
BenchmarkTools.Trial: 9785 samples with 3 evaluations.
 Range (min … max):  10.911 μs … 34.302 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     14.028 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.655 μs ±  2.598 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▃▆█▇▄▁
  ▂▂▂▂▃▄▇██████▇▅▄▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  10.9 μs         Histogram: frequency by time        28.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_tturbo(i) setup = (Random.seed!(12345678); i = se(15))
BenchmarkTools.Trial: 6380 samples with 7 evaluations.
 Range (min … max):  7.464 μs … 24.428 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.435 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.811 μs ±  1.682 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▄▅█▇▂ ▂▅▃
  ▂▃▄▅▆▇██████▇████▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▂▁▁▁▂ ▃
  7.46 μs        Histogram: frequency by time        18.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

So for se(15) threading helped, but I saw the regressed median and mean times for se(14) as well.

Looking at only minimum times tends to be misleading for multithreaded or allocating code.
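To make that concrete: in the se(15) run of myminmax2_tturbo posted earlier in the thread, the minimum was 13.044 μs while the median was 43.958 μs. A minimal sketch of collecting all three statistics by hand follows. `sample_times` is a hypothetical helper written for illustration, and Base's `extrema` stands in for the functions under test; BenchmarkTools reports the same statistics with far more care about timer overhead.

```julia
using Statistics

# Hypothetical helper: call `f(x)` `n` times and return the elapsed
# wall-clock time of each call in nanoseconds.
function sample_times(f, x; n = 1000)
    f(x)  # warm-up call so that compilation is not measured
    times = Vector{Float64}(undef, n)
    for k in 1:n
        t0 = time_ns()
        f(x)
        times[k] = time_ns() - t0
    end
    return times
end

times = sample_times(extrema, rand(2^14))

# The minimum is the best case; the median and mean show what a typical call costs.
stats = (min = minimum(times), median = median(times), mean = mean(times))
println(stats)
```

If the median sits far above the minimum, most calls are paying a cost that the best-case number hides.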

nsajko: “it’s just that comparing microbenchmarks is something I enjoy very much”

Feel free to make PRs that adjust the threading ramp up or heuristics.

gitboy16 · August 23, 2022, 12:53pm

You should try with x = rand(10^8) or more and you should see a difference in your benchmarks.
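For a sense of scale, the sketch below computes the working-set size of the vector lengths mentioned in this thread, which is the quantity that decides whether a loop runs out of cache or is memory bound. It assumes 8-byte elements: `rand(10^8)` does produce `Float64`, but the element type returned by `se` is not shown here, and `working_set_bytes` is a hypothetical helper.

```julia
# Hypothetical helper: bytes occupied by the data of a vector with `n`
# elements of type `T` (array header overhead is ignored).
working_set_bytes(n::Integer, T::Type = Float64) = n * sizeof(T)

for n in (2^6, 1024, 2^15, 10^8)
    kib = working_set_bytes(n) / 1024
    println(rpad(n, 10), " elements -> ", round(kib; digits = 1), " KiB")
end
```

A 1024-element vector is 8 KiB, while `rand(10^8)` is 800 MB and has to be streamed from main memory on every pass.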
