Replicate @tturbo performance

gitboy16 · August 22, 2022, 4:57pm

Hi,

Is there an easy way to replicate the performance of @tturbo but using just base Julia?

The following Julia code might be used if needed for this discussion:

function myminmax(x)
  a  = b = x[1]
  @tturbo for i in 2:length(x)
    b = b > x[i] ? b : x[i]
    a = a < x[i] ? a : x[i]
  end
  a, b
end

Thank you

Oscar_Smith · August 22, 2022, 4:58pm

Not easily (you would essentially have to write the threading and unrolled vectorized code yourself).

gitboy16 · August 22, 2022, 5:00pm

Thank you. Is there some doc I can read in order to learn how to write unrolled vectorized code?

gbaraldi · August 22, 2022, 8:26pm

Unfortunately it isn’t as simple as that so I will recommend two things
The first is a short video talking about what SIMD is and how it looks on julia:

The second is a whole MIT course that goes deep into how these things work:

I recommend both if you are interested, but the course is a bit of a time sink as expected

Elrod · August 22, 2022, 8:38pm

This should be slightly faster than your version:

function myminmax(x)
  a = typemax(eltype(x))
  b = typemin(eltype(x))
  @tturbo for i in eachindex(x)
    b = b > x[i] ? b : x[i]
    a = a < x[i] ? a : x[i]
  end
  a, b
end

I’m the author of LoopVectorization.jl
It is of course possible to replicate the performance using just Base Julia.
After all, all the packages involved are pure Julia libraries, and they work by generating Julia code (which includes llvmcalls.

But my intention is for it to be extremely difficult to replicate the performance.

gitboy16 · August 22, 2022, 8:39pm

Thank you very much @gbaraldi

gitboy16 · August 22, 2022, 8:49pm

@Elrod , it would be great if you could share some of your knowledge / tricks and educate me (us) on the topic. Thank you

DNF · August 22, 2022, 9:36pm

So is this faster than b = max(b, x[i])?

Edit: Some simple tests indicate that max is significantly faster, though the measurements are quite noisy.

Elrod · August 22, 2022, 10:13pm

That is a bug. It isn’t unrolling ifelse.

The problem is that LoopVectorization.jl doesn’t do instcombine, so it treats it as a separate comparison and selection when optimizing.
Later, LLVM optimizes them into the same thing, but by then LoopVectorization already made it’s optimization choices.
The rewrite is going to fix this by happening later, after instcombine is able to optimize them into the same operations, so it’d make the same decision in both cases.

I increased the latency of comparison instructions a bit on LoopVectorization main. It still doesn’t make the same decision for both versions, but it’s more similar and IMO pretty reasonable, so it should help.

lmiq · August 22, 2022, 10:32pm

Elrod · August 22, 2022, 10:39pm

Those videos may help to start with.
There are also lots of reading materials online that can help, e.g. Algorithms for Modern Hardware - Algorithmica or Agner Fog’s optimization resources Software optimization resources. C++ and assembly. Windows, Linux, BSD, Mac OS X

It’s a deep comment, so general open ended “how to write fast code” is going to necessarily end up being a book (like linked above); I may write something some day, but for now I’m still focusing on writing loop compilers or fast code instead of text.
But I’m happy to answer any specific questions.

cgeoga · August 23, 2022, 1:30am

@gitboy16, is there a reason you want to learn and re-implement all of the grinding @Elrod has already done? Is there something about LoopVectorization that isn’t working for you right now or something? I have to imagine that would be faster for you to try and discuss specific issues than to try and re-implement a macro like @tturbo.

rocco_sprmnt21 · August 23, 2022, 5:09am

I played around with various shapes of the loop a bit. Below are the results obtained for the different solutions.
I would have expected a better result for the mM2 function, as (in theory) it makes fewer comparisons than the others.
Even the mM, while making the same comparisons as the myminmax, makes fewer assignments as it avoids the reassignment of b = b and a = a.

julia> using  LoopVectorization, BenchmarkTools

julia>

julia> function myminmaxtt(x)
           a  = b = x[1]
           @tturbo for i in eachindex(x)
             b = b > x[i] ? b : x[i]
             a = a < x[i] ? a : x[i]
           end
           a, b
         end
myminmaxtt (generic function with 1 method)

julia> function myminmax(x)
           a  = b = x[1]
           for i in 2:length(x)
             b = b > x[i] ? b : x[i]
             a = a < x[i] ? a : x[i]
           end
           a, b
         end
myminmax (generic function with 1 method)

julia>   function mM(x)
             m= M  = x[1]
             for e in x
                 M =max(M,e)
                 m =min(m,e)
             end
             m, M
           end
mM (generic function with 1 method)

julia>   function mM1(x)
             m= M  = x[1]
             for e in x
                 if e>M
                     M=e
                     continue
                 end
                 if e<m 
                     m=e
                 end
             end
             m, M
           end
mM1 (generic function with 1 method)

julia>   function mM2(x)
             m= M  = x[1]
             for e in x
                 if e>M
                     M=e
                 elseif e<m
                     m=e
                 end
             end
             m, M
           end
mM2 (generic function with 1 method)

julia>   v=rand(1:10^5, 10^6)
1000000-element Vector{Int64}:

julia>   @btime myminmax(v);
  157.200 μs (1 allocation: 32 bytes)

julia>   @btime myminmaxtt(v);
  53.300 μs (1 allocation: 32 bytes)

julia>   @btime mM(v)
  157.000 μs (1 allocation: 32 bytes)
(1, 100000)

julia>   @btime mM1(v)
  695.400 μs (1 allocation: 32 bytes)

julia>   @btime mM2(v)
  696.300 μs (1 allocation: 32 bytes)
(1, 100000)
(1, 100000)

gitboy16 · August 23, 2022, 7:19am

@cgeoga , yes learning is great (at least in my view). Some people might simply like to drive a car (or not) and other prefers to know how the car works…let’s just say that I like both…byb6he way I am trying to re-implement anything, just trying to understand what is going under the hood. Usually people who know how a car works tend to know how to drive it better than people who don’t…

gitboy16 · August 23, 2022, 7:20am

Thank you @Elrod . It gives me plenty of materials to read during my holidays

nsajko · August 23, 2022, 8:27am

I think that this is because the calculations being more interdependent inhibits out-of-order execution and auto-vectorization.
BTW, I did what should be a more rigorous comparison with benchmarking between a couple of these different implementations with an interesting result, see my following post.

gitboy16 · August 23, 2022, 8:34am

@Elrod , maybe I can re-formulate my question; assuming you can only use base Julia, how would you re-write myminmax function to be the fastest possible?

nsajko · August 23, 2022, 8:35am

I tried comparing a couple of implementations, coming to the conclusion that LoopVectorization actually isn’t very useful for this example: all the non-threaded implementations seem to have the same throughput (except the mapreduce-based implementation, which is slightly slower than the others).

Minmax.jl

module Minmax

using LoopVectorization

export
  myminmax1_tturbo,
  myminmax1_turbo,
  myminmax1_basic,
  myminmax2_tturbo,
  myminmax2_turbo,
  myminmax2_basic,
  myminmax_mapreduce,
  se

function myminmax1_tturbo(x)
  a = b = first(x)
  @tturbo for i in eachindex(x)
    e = x[i]
    b = ifelse(b > e, b, e)
    a = ifelse(a < e, a, e)
  end
  a, b
end

function myminmax1_turbo(x)
  a = b = first(x)
  @turbo for i in eachindex(x)
    e = x[i]
    b = ifelse(b > e, b, e)
    a = ifelse(a < e, a, e)
  end
  a, b
end

function myminmax1_basic(x)
  a = b = first(x)
  for e in x
    b = ifelse(b > e, b, e)
    a = ifelse(a < e, a, e)
  end
  a, b
end

function myminmax2_tturbo(x)
  a = b = first(x)
  @tturbo for i in eachindex(x)
    e = x[i]
    b = max(b, e)
    a = min(a, e)
  end
  a, b
end

function myminmax2_turbo(x)
  a = b = first(x)
  @turbo for i in eachindex(x)
    e = x[i]
    b = max(b, e)
    a = min(a, e)
  end
  a, b
end

function myminmax2_basic(x)
  a = b = first(x)
  for e in x
    b = max(b, e)
    a = min(a, e)
  end
  a, b
end

f(a) =
  (a, a)

g(a, b) =
  let (q, w) = a, (e, r) = b
    (min(q, e), max(w, r))
  end

myminmax_mapreduce(x) =
  mapreduce(f, g, x)

se(n) =
  rand(1:2^n, 2^(n + 2))

end  # module Minmax

Benchmarking REPL session

[root@aceramd nsajko]# printf '%s' -1 > /proc/sys/kernel/sched_rt_runtime_us  # Enable real time scheduling.
[root@aceramd nsajko]# chrt -f 99 /home/nsajko/tmp/julia-df3da0582a/bin/julia -O3 --min-optlevel=3 -g 2 --threads 4  # Start with real time scheduling and four threads.
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.9.0-DEV.1171 (2022-08-23)
 _/ |\__'_|_|_|\__'_|  |  Commit df3da0582a7 (0 days old master)
|__/                   |

julia> include("Minmax.jl")
Main.Minmax

julia> using BenchmarkTools, .Minmax

julia> inp = se(6);

julia> myminmax1_tturbo(inp)
(1, 64)

julia> myminmax1_turbo(inp)
(1, 64)

julia> myminmax1_basic(inp)
(1, 64)

julia> myminmax2_tturbo(inp)
(1, 64)

julia> myminmax2_turbo(inp)
(1, 64)

julia> myminmax2_basic(inp)
(1, 64)

julia> myminmax_mapreduce(inp)
(1, 64)

julia> @benchmark myminmax1_tturbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 595 samples with 1 evaluation.
 Range (min … max):  862.772 μs …  1.145 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     997.426 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   998.703 μs ± 40.537 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                     ▁▁  ▃▅▁▃▆█▂▃█▃▄▄▂▇█▅▂▃▃▄▃▁▂▁               
  ▃▁▁▁▁▁▁▁▁▃▁▃▃▃▁▃▄▅▅████████████████████████████▅▇▇▄▆▆▃▃▃▄▄▄▄ ▅
  863 μs          Histogram: frequency by time          1.1 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax1_tturbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 594 samples with 1 evaluation.
 Range (min … max):  863.123 μs …  1.129 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     979.581 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   979.502 μs ± 41.878 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▂▇▅▅▃▃ ▅▄▁▁▆▇▆▅▄▄▇▁▆█▅▆▁▇▇▇▁▅                 
  ▄▁▃▁▃▁▃▃▃▃▄▄▆▃▄▅██████▇██████████████████████▆▇▆▇▄▃▃▆▃▃▄▄▃▃▃ ▅
  863 μs          Histogram: frequency by time         1.08 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax1_turbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 1029 samples with 1 evaluation.
 Range (min … max):  919.319 μs …  1.382 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):       1.081 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.082 ms ± 29.876 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                  ▂▂▃▆█▆▅▆▆▅▃▁                  
  ▂▁▁▁▁▂▂▁▁▁▁▁▂▁▂▂▁▂▁▁▁▂▂▂▃▂▃▃▄▃▅▇█████████████▇▄▃▃▃▂▃▂▂▂▂▁▂▁▂ ▄
  919 μs          Histogram: frequency by time         1.17 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax1_turbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 1031 samples with 1 evaluation.
 Range (min … max):  828.248 μs …  1.321 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):       1.079 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.080 ms ± 28.116 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                            ▃▆██▇▅▅▃▂           
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▂▁▁▂▂▁▁▁▁▁▁▂▂▂▃▄▄▇▇██████████▆▄▃▃▃▁▂▃▂ ▃
  828 μs          Histogram: frequency by time         1.16 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax1_basic(i) setup = (i = se(18))
BenchmarkTools.Trial: 1032 samples with 1 evaluation.
 Range (min … max):  887.068 μs …  1.291 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):       1.072 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.073 ms ± 27.482 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                        ▅▆██▄▇▇▄▅▄▁             
  ▂▁▁▁▁▁▁▁▂▁▁▁▂▁▂▁▁▁▂▁▁▁▂▁▁▁▁▁▂▃▃▄▄▄▆▆█████████████▆▄▄▃▃▃▃▁▂▂▂ ▄
  887 μs          Histogram: frequency by time         1.15 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax1_basic(i) setup = (i = se(18))
BenchmarkTools.Trial: 1032 samples with 1 evaluation.
 Range (min … max):  828.277 μs …  1.336 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):       1.072 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.073 ms ± 30.028 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                          ▄█▇█▇▇▄▃              
  ▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▂▁▁▂▁▁▂▂▂▂▂▂▃▃▄▅▆██████████▆▅▄▃▂▂▂▂▂▃▂ ▃
  828 μs          Histogram: frequency by time         1.16 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_tturbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 595 samples with 1 evaluation.
 Range (min … max):  747.014 μs …  1.169 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     992.245 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   989.631 μs ± 41.371 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                       ▂▂▂█▆▇▆▅▇█▅███▁▂▁        
  ▂▁▁▁▁▁▁▁▁▂▁▁▂▁▁▁▁▁▁▂▂▁▂▂▁▂▁▂▂▃▆▅▅▅▆█▆█████████████████▇▆▆▆▃▂ ▄
  747 μs          Histogram: frequency by time         1.07 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_tturbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 576 samples with 1 evaluation.
 Range (min … max):  812.738 μs …  1.111 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     997.535 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   996.026 μs ± 41.331 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                               ▄▂▂▄▆▂▄▆▄▃▃▄▁▂▃▆▆█▃▄▄▆▂▂▄        
  ▂▁▁▂▁▁▁▁▁▁▁▂▁▁▁▁▁▁▂▂▁▃▄▂▂▄▆▇██████████████████████████▆▅▆▆▃▄ ▄
  813 μs          Histogram: frequency by time         1.08 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_turbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 976 samples with 1 evaluation.
 Range (min … max):  895.303 μs …  1.279 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):       1.087 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.096 ms ± 35.680 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                    ▁▆█▇█▄▁                     
  ▂▁▁▁▁▁▁▁▁▂▁▁▁▁▂▁▁▂▁▁▁▁▁▁▁▂▂▂▂▃▄▄▆▇███████▆▅▄▃▄▅▅▆▄▅▃▃▃▃▃▄▃▂▂ ▃
  895 μs          Histogram: frequency by time          1.2 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_turbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 981 samples with 1 evaluation.
 Range (min … max):  841.272 μs …  1.302 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):       1.069 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.079 ms ± 37.738 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                       ▁▅█▆▁                    
  ▂▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▁▁▁▁▂▂▁▂▂▃▃▄▄▆███████▅▃▂▃▃▃▄▄▆▄▃▃▃▃▃▃ ▃
  841 μs          Histogram: frequency by time         1.18 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_basic(i) setup = (i = se(18))
BenchmarkTools.Trial: 980 samples with 1 evaluation.
 Range (min … max):  950.176 μs …  1.286 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):       1.070 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.082 ms ± 38.737 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                            ▃▅▅█▄▃                              
  ▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▃▂▂▃▂▃▄▄▄▆▇███████▇▆▄▄▃▃▃▂▃▂▂▃▃▄▄▅▅▃▃▃▃▃▃▃▃▃▃ ▃
  950 μs          Histogram: frequency by time         1.19 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_basic(i) setup = (i = se(18))
BenchmarkTools.Trial: 978 samples with 1 evaluation.
 Range (min … max):  819.450 μs …  1.285 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):       1.073 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.084 ms ± 42.823 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                       ▅█▄▁                     
  ▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▂▂▂▁▁▁▁▂▁▁▂▂▁▁▁▂▂▂▃▄▇█████▆▅▄▄▄▄▄▄▅▄▄▃▃▃▃▃▂▂▂ ▃
  819 μs          Histogram: frequency by time         1.21 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax_mapreduce(i) setup = (i = se(18))
BenchmarkTools.Trial: 977 samples with 1 evaluation.
 Range (min … max):  850.108 μs …  1.292 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):       1.088 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.099 ms ± 38.489 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                       ▁█▇▇▄                    
  ▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▂▁▁▂▁▁▂▁▁▁▂▂▂▂▂▃▃▃▆██████▇▅▄▃▂▂▃▃▄▆▅▃▃▃▃▃▃ ▃
  850 μs          Histogram: frequency by time          1.2 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax_mapreduce(i) setup = (i = se(18))
BenchmarkTools.Trial: 977 samples with 1 evaluation.
 Range (min … max):  924.488 μs …  1.435 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):       1.089 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.099 ms ± 37.189 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                   ▅▅██▄                        
  ▂▁▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▄▄▆███████▆▄▄▃▃▃▃▃▄▄▄▄▄▃▃▃▃▃▃▄▃ ▃
  924 μs          Histogram: frequency by time          1.2 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax1_tturbo(i) setup = (i = se(20))
BenchmarkTools.Trial: 144 samples with 1 evaluation.
 Range (min … max):  2.571 ms …  2.918 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.781 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.779 ms ± 30.961 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                        ▁    ▁█▂▆▄▁           
  ▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃▁▁▃▃▃██▃████████▇▅▃▄▄▁▁▁▃ ▃
  2.57 ms        Histogram: frequency by time        2.84 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax1_turbo(i) setup = (i = se(20))
BenchmarkTools.Trial: 205 samples with 1 evaluation.
 Range (min … max):  2.490 ms …  3.179 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.645 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.648 ms ± 59.019 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                        ▁ ▁▂▅▄█▄                              
  ▂▁▁▁▁▁▁▃▁▂▁▁▁▁▁▁▁▁▁▃▅▇█▇███████▄▃▂▂▂▂▁▃▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▂ ▃
  2.49 ms        Histogram: frequency by time        2.82 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax1_basic(i) setup = (i = se(20))
BenchmarkTools.Trial: 205 samples with 1 evaluation.
 Range (min … max):  2.355 ms …  3.367 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.629 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.629 ms ± 59.624 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                             ▄█▃▄             
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▃▂▂▂▄▃▃▅▇█▇█████▄▃▂▃▁▁▁▁▁▂ ▃
  2.36 ms        Histogram: frequency by time        2.71 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_tturbo(i) setup = (i = se(20))
BenchmarkTools.Trial: 144 samples with 1 evaluation.
 Range (min … max):  2.435 ms …  3.182 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.780 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.774 ms ± 60.085 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                    █▄                        
  ▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▁▃▃▄▄▆▅███▇▂▂▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  2.43 ms        Histogram: frequency by time        3.02 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_turbo(i) setup = (i = se(20))
BenchmarkTools.Trial: 205 samples with 1 evaluation.
 Range (min … max):  2.426 ms …  2.933 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.627 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.628 ms ± 35.231 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                   ▆▅█▇▇▄▅▁                   
  ▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▃▃▃▆██████████▅▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▃
  2.43 ms        Histogram: frequency by time        2.75 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_basic(i) setup = (i = se(20))
BenchmarkTools.Trial: 205 samples with 1 evaluation.
 Range (min … max):  2.497 ms …  2.743 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.627 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.626 ms ± 24.625 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                     ▂ ▂  ▃▅█▄▂▃▃▄▃           
  ▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▃▁▄▃▃▃▅▅▆▆█▅█▇▇█████████▅▅▆▆▃▄▁▃▃ ▄
  2.5 ms         Histogram: frequency by time        2.67 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax_mapreduce(i) setup = (i = se(20))
BenchmarkTools.Trial: 204 samples with 1 evaluation.
 Range (min … max):  2.342 ms …  2.939 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.671 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.667 ms ± 38.347 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                              ▁  ██▆▅         
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▁▁▁▂▁▃▁▂▁▃▆██▇████▅▄▁▂▂▁▃ ▃
  2.34 ms        Histogram: frequency by time        2.74 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

Summary of the timings:

median timings for smaller input:

myminmax1_tturbo: 979.581 μs
myminmax1_turbo: 1.079 ms
myminmax1_basic: 1.072 ms
myminmax2_tturbo: 992.245 μs
myminmax2_turbo: 1.069 ms
myminmax2_basic: 1.070 ms
myminmax_mapreduce: 1.088 ms

median timings for bigger input:

myminmax1_tturbo: 2.781 ms
myminmax1_turbo: 2.645 ms
myminmax1_basic: 2.629 ms
myminmax2_tturbo: 2.780 ms
myminmax2_turbo: 2.627 ms
myminmax2_basic: 2.627 ms
myminmax_mapreduce: 2.671 ms

Interestingly, it seems the threaded implementations are slower for longer input vectors?

NB: I ran Julia as a real time root process on Linux for greater benchmarking accuracy. Also, I ran this on a laptop, with the charger plugged in. Pretty sure that charging improves performance. Finally, the CPU is an AMD Ryzen 3.

PS: I’m working on a package which should enable nicer analysis and visualization of multi-way microbenchmark-based comparisons like this one.

nsajko · August 23, 2022, 8:39am

PPS: the myminmax_vmapreduce implementation is missing because of this: https://github.com/JuliaSIMD/LoopVectorization.jl/issues/423

nsajko · August 23, 2022, 10:11am

Just one small additional note: my above post seems to show that LoopVectorization doesn’t provide significant benefit over base Julia for the myminmax problem (although it’s a very simple problem). However, there’s also a downside to LoopVectorization: much worse effect inference:

julia> Base.infer_effects(myminmax1_tturbo, (Vector{Int},))
(!c,!e,!n,!t,!s,!m)′

julia> Base.infer_effects(myminmax1_turbo, (Vector{Int},))
(!c,!e,!n,!t,!s,!m)′

julia> Base.infer_effects(myminmax1_basic, (Vector{Int},))
(!c,+e,!n,!t,+s,?m)

julia> Base.infer_effects(myminmax2_tturbo, (Vector{Int},))
(!c,!e,!n,!t,!s,!m)′

julia> Base.infer_effects(myminmax2_turbo, (Vector{Int},))
(!c,!e,!n,!t,!s,!m)′

julia> Base.infer_effects(myminmax2_basic, (Vector{Int},))
(!c,+e,!n,!t,+s,?m)

julia> Base.infer_effects(myminmax_mapreduce, (Vector{Int},))
(!c,+e,!n,!t,+s,?m)

Topic		Replies	Views
Multithreading in LoopVectorization.jl General Usage	8	841	June 30, 2021
Can't understand what LoopVectorization is doing General Usage	7	755	September 1, 2021
@tturbo on function call Performance question , loopvectorization , tturbo	4	166	February 24, 2025
LoopVectorization multithreading for multidimensional arrays Numerics loopvectorization	24	1197	March 17, 2022
VectorizationBase seems to wrongly detect the number of the physical cores New to Julia question , package	5	373	January 5, 2023

Replicate @tturbo performance

Related topics