Replicate @tturbo performance

Hi,

Is there an easy way to replicate the performance of @tturbo but using just base Julia?

The following Julia code can serve as a starting point for this discussion:

function myminmax(x)
  a  = b = x[1]
  @tturbo for i in 2:length(x)
    b = b > x[i] ? b : x[i]
    a = a < x[i] ? a : x[i]
  end
  a, b
end

Thank you


Not easily (you would essentially have to write the threading and unrolled vectorized code yourself).
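To give a flavor of what that means, here is a hand-unrolled sketch in base Julia (my own illustration, not what @tturbo actually generates): keeping several independent accumulator pairs breaks the serial min/max dependency chain, so the CPU can work on them in parallel and LLVM has a chance to vectorize the loop.

```julia
# Sketch: four independent accumulator pairs so the min/max reductions
# do not form a single serial dependency chain.
function myminmax_unrolled(x)
    n = length(x)
    a1 = a2 = a3 = a4 = typemax(eltype(x))
    b1 = b2 = b3 = b4 = typemin(eltype(x))
    i = 1
    @inbounds while i + 3 <= n
        a1 = min(a1, x[i]);   b1 = max(b1, x[i])
        a2 = min(a2, x[i+1]); b2 = max(b2, x[i+1])
        a3 = min(a3, x[i+2]); b3 = max(b3, x[i+2])
        a4 = min(a4, x[i+3]); b4 = max(b4, x[i+3])
        i += 4
    end
    a, b = min(a1, a2, a3, a4), max(b1, b2, b3, b4)
    @inbounds while i <= n  # remainder loop for the leftover elements
        a = min(a, x[i]); b = max(b, x[i])
        i += 1
    end
    a, b
end
```

The threading part on top of this (what distinguishes @tturbo from @turbo) would additionally require chunking the array and combining per-chunk results.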

Thank you. Is there some doc I can read in order to learn how to write unrolled vectorized code?

Unfortunately it isn’t as simple as that, so I will recommend two things.
The first is a short video talking about what SIMD is and how it looks in Julia:

The second is a whole MIT course that goes deep into how these things work:

I recommend both if you are interested, but the course is a bit of a time sink as expected
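As a tiny taste of the topic those resources cover (my own toy example), Base’s @simd macro tells the compiler it may reassociate a reduction, which is what enables vectorization:

```julia
# @simd relaxes the evaluation order of the reduction, letting LLVM
# emit vector instructions; @inbounds removes bounds checks.
function simd_sum(x)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    s
end
```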


This should be slightly faster than your version, since initializing with typemax/typemin sentinels lets the loop run over all of eachindex(x) instead of 2:length(x):

function myminmax(x)
  a = typemax(eltype(x))
  b = typemin(eltype(x))
  @tturbo for i in eachindex(x)
    b = b > x[i] ? b : x[i]
    a = a < x[i] ? a : x[i]
  end
  a, b
end

I’m the author of LoopVectorization.jl.
It is of course possible to replicate the performance using just Base Julia.
After all, all the packages involved are pure-Julia libraries, and they work by generating Julia code (which includes llvmcalls).

But my intention is for it to be extremely difficult to replicate the performance.
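For a taste of the llvmcall mechanism mentioned above (a toy example of mine mirroring the one in the Julia manual, not LoopVectorization’s actual output), Base.llvmcall lets pure-Julia packages emit hand-written LLVM IR:

```julia
# Add two Int32 values via hand-written LLVM IR.  The arguments arrive
# as %0 and %1; %2 is taken by the implicit entry-block label.
llvm_add(x::Int32, y::Int32) = Base.llvmcall(
    """
    %3 = add i32 %1, %0
    ret i32 %3
    """,
    Int32, Tuple{Int32, Int32}, x, y)
```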


Thank you very much @gbaraldi

@Elrod , it would be great if you could share some of your knowledge / tricks and educate me (us) on the topic. Thank you

So is this faster than b = max(b, x[i])?

Edit: Some simple tests indicate that max is significantly faster, though the measurements are quite noisy.

That is a bug. It isn’t unrolling ifelse.

The problem is that LoopVectorization.jl doesn’t do instcombine, so it treats this as a separate comparison and selection when optimizing.
Later, LLVM optimizes them into the same thing, but by then LoopVectorization has already made its optimization choices.
The rewrite is going to fix this by happening later, after instcombine is able to optimize them into the same operations, so it would make the same decision in both cases.

I increased the latency of comparison instructions a bit on LoopVectorization main. It still doesn’t make the same decision for both versions, but it’s more similar and IMO pretty reasonable, so it should help.

Those videos may help to start with.
There are also lots of reading materials online that can help, e.g. Algorithms for Modern Hardware (Algorithmica) or Agner Fog’s Software optimization resources (C++ and assembly; Windows, Linux, BSD, Mac OS X).

It’s a deep topic, so a general open-ended β€œhow to write fast code” answer is necessarily going to end up being a book (like those linked above); I may write something some day, but for now I’m still focusing on writing loop compilers and fast code rather than text.
But I’m happy to answer any specific questions.


@gitboy16, is there a reason you want to learn and re-implement all of the grinding @Elrod has already done? Is there something about LoopVectorization that isn’t working for you right now? I have to imagine it would be faster for you to try it and discuss specific issues than to re-implement a macro like @tturbo.

I played around with various shapes of the loop a bit. Below are the results obtained for the different solutions.
I would have expected a better result for the mM2 function, as (in theory) it makes fewer comparisons than the others.
Even mM, while making the same number of comparisons as myminmax, performs fewer assignments, since it avoids the reassignments b = b and a = a.

julia> using LoopVectorization, BenchmarkTools

julia> function myminmaxtt(x)
           a  = b = x[1]
           @tturbo for i in eachindex(x)
             b = b > x[i] ? b : x[i]
             a = a < x[i] ? a : x[i]
           end
           a, b
         end
myminmaxtt (generic function with 1 method)

julia> function myminmax(x)
           a  = b = x[1]
           for i in 2:length(x)
             b = b > x[i] ? b : x[i]
             a = a < x[i] ? a : x[i]
           end
           a, b
         end
myminmax (generic function with 1 method)

julia>   function mM(x)
             m = M = x[1]
             for e in x
                 M = max(M, e)
                 m = min(m, e)
             end
             m, M
           end
mM (generic function with 1 method)

julia>   function mM1(x)
             m = M = x[1]
             for e in x
                 if e > M
                     M = e
                     continue
                 end
                 if e < m
                     m = e
                 end
             end
             m, M
           end
mM1 (generic function with 1 method)

julia>   function mM2(x)
             m = M = x[1]
             for e in x
                 if e > M
                     M = e
                 elseif e < m
                     m = e
                 end
             end
             m, M
           end
mM2 (generic function with 1 method)

julia>   v = rand(1:10^5, 10^6)
1000000-element Vector{Int64}:

julia>   @btime myminmax(v);
  157.200 ΞΌs (1 allocation: 32 bytes)

julia>   @btime myminmaxtt(v);
  53.300 ΞΌs (1 allocation: 32 bytes)

julia>   @btime mM(v)
  157.000 ΞΌs (1 allocation: 32 bytes)
(1, 100000)

julia>   @btime mM1(v)
  695.400 ΞΌs (1 allocation: 32 bytes)
(1, 100000)

julia>   @btime mM2(v)
  696.300 ΞΌs (1 allocation: 32 bytes)
(1, 100000)

@cgeoga , yes, learning is great (at least in my view). Some people might simply like to drive a car and others prefer to know how the car works… let’s just say that I like both. By the way, I am not trying to re-implement anything, just trying to understand what is going on under the hood. Usually people who know how a car works tend to drive it better than people who don’t…

Thank you @Elrod . It gives me plenty of materials to read during my holidays :slightly_smiling_face:

I think this is because the calculations are more interdependent, which inhibits out-of-order execution and auto-vectorization.
BTW, I did what should be a more rigorous comparison with benchmarking between a couple of these different implementations with an interesting result, see my following post.

@Elrod , maybe I can re-formulate my question; assuming you can only use base Julia, how would you re-write myminmax function to be the fastest possible?

I tried comparing a couple of implementations, coming to the conclusion that LoopVectorization actually isn’t very useful for this example: all the non-threaded implementations seem to have the same throughput (except the mapreduce-based implementation, which is slightly slower than the others).

Minmax.jl
module Minmax

using LoopVectorization

export
  myminmax1_tturbo,
  myminmax1_turbo,
  myminmax1_basic,
  myminmax2_tturbo,
  myminmax2_turbo,
  myminmax2_basic,
  myminmax_mapreduce,
  se

function myminmax1_tturbo(x)
  a = b = first(x)
  @tturbo for i in eachindex(x)
    e = x[i]
    b = ifelse(b > e, b, e)
    a = ifelse(a < e, a, e)
  end
  a, b
end

function myminmax1_turbo(x)
  a = b = first(x)
  @turbo for i in eachindex(x)
    e = x[i]
    b = ifelse(b > e, b, e)
    a = ifelse(a < e, a, e)
  end
  a, b
end

function myminmax1_basic(x)
  a = b = first(x)
  for e in x
    b = ifelse(b > e, b, e)
    a = ifelse(a < e, a, e)
  end
  a, b
end

function myminmax2_tturbo(x)
  a = b = first(x)
  @tturbo for i in eachindex(x)
    e = x[i]
    b = max(b, e)
    a = min(a, e)
  end
  a, b
end

function myminmax2_turbo(x)
  a = b = first(x)
  @turbo for i in eachindex(x)
    e = x[i]
    b = max(b, e)
    a = min(a, e)
  end
  a, b
end

function myminmax2_basic(x)
  a = b = first(x)
  for e in x
    b = max(b, e)
    a = min(a, e)
  end
  a, b
end

f(a) =
  (a, a)

g(a, b) =
  let (q, w) = a, (e, r) = b
    (min(q, e), max(w, r))
  end

myminmax_mapreduce(x) =
  mapreduce(f, g, x)

se(n) =
  rand(1:2^n, 2^(n + 2))

end  # module Minmax
Benchmarking REPL session
[root@aceramd nsajko]# printf '%s' -1 > /proc/sys/kernel/sched_rt_runtime_us  # Enable real time scheduling.
[root@aceramd nsajko]# chrt -f 99 /home/nsajko/tmp/julia-df3da0582a/bin/julia -O3 --min-optlevel=3 -g 2 --threads 4  # Start with real time scheduling and four threads.
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.9.0-DEV.1171 (2022-08-23)
 _/ |\__'_|_|_|\__'_|  |  Commit df3da0582a7 (0 days old master)
|__/                   |

julia> include("Minmax.jl")
Main.Minmax

julia> using BenchmarkTools, .Minmax

julia> inp = se(6);

julia> myminmax1_tturbo(inp)
(1, 64)

julia> myminmax1_turbo(inp)
(1, 64)

julia> myminmax1_basic(inp)
(1, 64)

julia> myminmax2_tturbo(inp)
(1, 64)

julia> myminmax2_turbo(inp)
(1, 64)

julia> myminmax2_basic(inp)
(1, 64)

julia> myminmax_mapreduce(inp)
(1, 64)

julia> @benchmark myminmax1_tturbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 595 samples with 1 evaluation.
 Range (min … max):  862.772 ΞΌs …  1.145 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     997.426 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   998.703 ΞΌs Β± 40.537 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                     ▁▁  β–ƒβ–…β–β–ƒβ–†β–ˆβ–‚β–ƒβ–ˆβ–ƒβ–„β–„β–‚β–‡β–ˆβ–…β–‚β–ƒβ–ƒβ–„β–ƒβ–β–‚β–               
  β–ƒβ–β–β–β–β–β–β–β–β–ƒβ–β–ƒβ–ƒβ–ƒβ–β–ƒβ–„β–…β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–‡β–‡β–„β–†β–†β–ƒβ–ƒβ–ƒβ–„β–„β–„β–„ β–…
  863 ΞΌs          Histogram: frequency by time          1.1 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax1_tturbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 594 samples with 1 evaluation.
 Range (min … max):  863.123 ΞΌs …  1.129 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     979.581 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   979.502 ΞΌs Β± 41.878 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                  β–‚β–‡β–…β–…β–ƒβ–ƒ β–…β–„β–β–β–†β–‡β–†β–…β–„β–„β–‡β–β–†β–ˆβ–…β–†β–β–‡β–‡β–‡β–β–…                 
  β–„β–β–ƒβ–β–ƒβ–β–ƒβ–ƒβ–ƒβ–ƒβ–„β–„β–†β–ƒβ–„β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–‡β–†β–‡β–„β–ƒβ–ƒβ–†β–ƒβ–ƒβ–„β–„β–ƒβ–ƒβ–ƒ β–…
  863 ΞΌs          Histogram: frequency by time         1.08 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax1_turbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 1029 samples with 1 evaluation.
 Range (min … max):  919.319 ΞΌs …  1.382 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):       1.081 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):     1.082 ms Β± 29.876 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                  β–‚β–‚β–ƒβ–†β–ˆβ–†β–…β–†β–†β–…β–ƒβ–                  
  β–‚β–β–β–β–β–‚β–‚β–β–β–β–β–β–‚β–β–‚β–‚β–β–‚β–β–β–β–‚β–‚β–‚β–ƒβ–‚β–ƒβ–ƒβ–„β–ƒβ–…β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–„β–ƒβ–ƒβ–ƒβ–‚β–ƒβ–‚β–‚β–‚β–‚β–β–‚β–β–‚ β–„
  919 ΞΌs          Histogram: frequency by time         1.17 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax1_turbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 1031 samples with 1 evaluation.
 Range (min … max):  828.248 ΞΌs …  1.321 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):       1.079 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):     1.080 ms Β± 28.116 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                            β–ƒβ–†β–ˆβ–ˆβ–‡β–…β–…β–ƒβ–‚           
  β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–‚β–β–β–‚β–‚β–β–β–β–β–β–β–‚β–‚β–‚β–ƒβ–„β–„β–‡β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–ƒβ–ƒβ–ƒβ–β–‚β–ƒβ–‚ β–ƒ
  828 ΞΌs          Histogram: frequency by time         1.16 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax1_basic(i) setup = (i = se(18))
BenchmarkTools.Trial: 1032 samples with 1 evaluation.
 Range (min … max):  887.068 ΞΌs …  1.291 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):       1.072 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):     1.073 ms Β± 27.482 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                        β–…β–†β–ˆβ–ˆβ–„β–‡β–‡β–„β–…β–„β–             
  β–‚β–β–β–β–β–β–β–β–‚β–β–β–β–‚β–β–‚β–β–β–β–‚β–β–β–β–‚β–β–β–β–β–β–‚β–ƒβ–ƒβ–„β–„β–„β–†β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–„β–ƒβ–ƒβ–ƒβ–ƒβ–β–‚β–‚β–‚ β–„
  887 ΞΌs          Histogram: frequency by time         1.15 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax1_basic(i) setup = (i = se(18))
BenchmarkTools.Trial: 1032 samples with 1 evaluation.
 Range (min … max):  828.277 ΞΌs …  1.336 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):       1.072 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):     1.073 ms Β± 30.028 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                          β–„β–ˆβ–‡β–ˆβ–‡β–‡β–„β–ƒ              
  β–‚β–β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–‚β–β–β–‚β–β–β–‚β–‚β–‚β–‚β–‚β–‚β–ƒβ–ƒβ–„β–…β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–„β–ƒβ–‚β–‚β–‚β–‚β–‚β–ƒβ–‚ β–ƒ
  828 ΞΌs          Histogram: frequency by time         1.16 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_tturbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 595 samples with 1 evaluation.
 Range (min … max):  747.014 ΞΌs …  1.169 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     992.245 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   989.631 ΞΌs Β± 41.371 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                       β–‚β–‚β–‚β–ˆβ–†β–‡β–†β–…β–‡β–ˆβ–…β–ˆβ–ˆβ–ˆβ–β–‚β–        
  β–‚β–β–β–β–β–β–β–β–β–‚β–β–β–‚β–β–β–β–β–β–β–‚β–‚β–β–‚β–‚β–β–‚β–β–‚β–‚β–ƒβ–†β–…β–…β–…β–†β–ˆβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–†β–†β–ƒβ–‚ β–„
  747 ΞΌs          Histogram: frequency by time         1.07 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_tturbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 576 samples with 1 evaluation.
 Range (min … max):  812.738 ΞΌs …  1.111 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     997.535 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   996.026 ΞΌs Β± 41.331 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                               β–„β–‚β–‚β–„β–†β–‚β–„β–†β–„β–ƒβ–ƒβ–„β–β–‚β–ƒβ–†β–†β–ˆβ–ƒβ–„β–„β–†β–‚β–‚β–„        
  β–‚β–β–β–‚β–β–β–β–β–β–β–β–‚β–β–β–β–β–β–β–‚β–‚β–β–ƒβ–„β–‚β–‚β–„β–†β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–†β–†β–ƒβ–„ β–„
  813 ΞΌs          Histogram: frequency by time         1.08 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_turbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 976 samples with 1 evaluation.
 Range (min … max):  895.303 ΞΌs …  1.279 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):       1.087 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):     1.096 ms Β± 35.680 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                    β–β–†β–ˆβ–‡β–ˆβ–„β–                     
  β–‚β–β–β–β–β–β–β–β–β–‚β–β–β–β–β–‚β–β–β–‚β–β–β–β–β–β–β–β–‚β–‚β–‚β–‚β–ƒβ–„β–„β–†β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–„β–ƒβ–„β–…β–…β–†β–„β–…β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–„β–ƒβ–‚β–‚ β–ƒ
  895 ΞΌs          Histogram: frequency by time          1.2 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_turbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 981 samples with 1 evaluation.
 Range (min … max):  841.272 ΞΌs …  1.302 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):       1.069 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):     1.079 ms Β± 37.738 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                       β–β–…β–ˆβ–†β–                    
  β–‚β–β–β–β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–‚β–β–β–β–β–‚β–‚β–β–‚β–‚β–ƒβ–ƒβ–„β–„β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–ƒβ–‚β–ƒβ–ƒβ–ƒβ–„β–„β–†β–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒ β–ƒ
  841 ΞΌs          Histogram: frequency by time         1.18 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_basic(i) setup = (i = se(18))
BenchmarkTools.Trial: 980 samples with 1 evaluation.
 Range (min … max):  950.176 ΞΌs …  1.286 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):       1.070 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):     1.082 ms Β± 38.737 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                            β–ƒβ–…β–…β–ˆβ–„β–ƒ                              
  β–‚β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–ƒβ–‚β–‚β–ƒβ–‚β–ƒβ–„β–„β–„β–†β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–„β–„β–ƒβ–ƒβ–ƒβ–‚β–ƒβ–‚β–‚β–ƒβ–ƒβ–„β–„β–…β–…β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒ β–ƒ
  950 ΞΌs          Histogram: frequency by time         1.19 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_basic(i) setup = (i = se(18))
BenchmarkTools.Trial: 978 samples with 1 evaluation.
 Range (min … max):  819.450 ΞΌs …  1.285 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):       1.073 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):     1.084 ms Β± 42.823 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                       β–…β–ˆβ–„β–                     
  β–‚β–β–β–β–β–β–‚β–β–β–β–β–β–β–β–β–‚β–‚β–‚β–β–β–β–β–‚β–β–β–‚β–‚β–β–β–β–‚β–‚β–‚β–ƒβ–„β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–„β–„β–„β–„β–„β–„β–…β–„β–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–‚ β–ƒ
  819 ΞΌs          Histogram: frequency by time         1.21 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax_mapreduce(i) setup = (i = se(18))
BenchmarkTools.Trial: 977 samples with 1 evaluation.
 Range (min … max):  850.108 ΞΌs …  1.292 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):       1.088 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):     1.099 ms Β± 38.489 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                       β–β–ˆβ–‡β–‡β–„                    
  β–‚β–β–β–β–β–β–β–β–β–‚β–β–β–β–β–β–β–β–β–‚β–β–β–‚β–β–β–‚β–β–β–β–‚β–‚β–‚β–‚β–‚β–ƒβ–ƒβ–ƒβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–„β–ƒβ–‚β–‚β–ƒβ–ƒβ–„β–†β–…β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒ β–ƒ
  850 ΞΌs          Histogram: frequency by time          1.2 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax_mapreduce(i) setup = (i = se(18))
BenchmarkTools.Trial: 977 samples with 1 evaluation.
 Range (min … max):  924.488 ΞΌs …  1.435 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):       1.089 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):     1.099 ms Β± 37.189 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                   β–…β–…β–ˆβ–ˆβ–„                        
  β–‚β–β–β–β–β–β–β–β–β–β–‚β–‚β–β–β–β–β–β–β–β–‚β–‚β–‚β–‚β–‚β–‚β–ƒβ–ƒβ–ƒβ–ƒβ–„β–„β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–„β–„β–„β–„β–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–„β–ƒ β–ƒ
  924 ΞΌs          Histogram: frequency by time          1.2 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax1_tturbo(i) setup = (i = se(20))
BenchmarkTools.Trial: 144 samples with 1 evaluation.
 Range (min … max):  2.571 ms …  2.918 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     2.781 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   2.779 ms Β± 30.961 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                        ▁    β–β–ˆβ–‚β–†β–„β–           
  β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–ƒβ–β–β–ƒβ–β–β–ƒβ–ƒβ–ƒβ–ˆβ–ˆβ–ƒβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–ƒβ–„β–„β–β–β–β–ƒ β–ƒ
  2.57 ms        Histogram: frequency by time        2.84 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax1_turbo(i) setup = (i = se(20))
BenchmarkTools.Trial: 205 samples with 1 evaluation.
 Range (min … max):  2.490 ms …  3.179 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     2.645 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   2.648 ms Β± 59.019 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                        ▁ β–β–‚β–…β–„β–ˆβ–„                              
  β–‚β–β–β–β–β–β–β–ƒβ–β–‚β–β–β–β–β–β–β–β–β–β–ƒβ–…β–‡β–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–„β–ƒβ–‚β–‚β–‚β–‚β–β–ƒβ–‚β–β–β–β–β–β–β–‚β–β–β–β–β–β–β–β–β–β–β–‚ β–ƒ
  2.49 ms        Histogram: frequency by time        2.82 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax1_basic(i) setup = (i = se(20))
BenchmarkTools.Trial: 205 samples with 1 evaluation.
 Range (min … max):  2.355 ms …  3.367 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     2.629 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   2.629 ms Β± 59.624 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                             β–„β–ˆβ–ƒβ–„             
  β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–β–β–ƒβ–‚β–‚β–‚β–„β–ƒβ–ƒβ–…β–‡β–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–„β–ƒβ–‚β–ƒβ–β–β–β–β–β–‚ β–ƒ
  2.36 ms        Histogram: frequency by time        2.71 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_tturbo(i) setup = (i = se(20))
BenchmarkTools.Trial: 144 samples with 1 evaluation.
 Range (min … max):  2.435 ms …  3.182 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     2.780 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   2.774 ms Β± 60.085 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                    β–ˆβ–„                        
  β–‚β–β–β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–‚β–β–ƒβ–ƒβ–„β–„β–†β–…β–ˆβ–ˆβ–ˆβ–‡β–‚β–‚β–ƒβ–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚ β–‚
  2.43 ms        Histogram: frequency by time        3.02 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_turbo(i) setup = (i = se(20))
BenchmarkTools.Trial: 205 samples with 1 evaluation.
 Range (min … max):  2.426 ms …  2.933 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     2.627 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   2.628 ms Β± 35.231 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                   β–†β–…β–ˆβ–‡β–‡β–„β–…β–                   
  β–ƒβ–β–β–β–β–β–β–β–β–β–ƒβ–β–β–β–β–β–β–β–β–ƒβ–β–β–β–β–β–β–β–ƒβ–ƒβ–ƒβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–…β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒ β–ƒ
  2.43 ms        Histogram: frequency by time        2.75 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_basic(i) setup = (i = se(20))
BenchmarkTools.Trial: 205 samples with 1 evaluation.
 Range (min … max):  2.497 ms …  2.743 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     2.627 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   2.626 ms Β± 24.625 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                     β–‚ β–‚  β–ƒβ–…β–ˆβ–„β–‚β–ƒβ–ƒβ–„β–ƒ           
  β–ƒβ–β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„β–β–ƒβ–β–„β–ƒβ–ƒβ–ƒβ–…β–…β–†β–†β–ˆβ–…β–ˆβ–‡β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–…β–†β–†β–ƒβ–„β–β–ƒβ–ƒ β–„
  2.5 ms         Histogram: frequency by time        2.67 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax_mapreduce(i) setup = (i = se(20))
BenchmarkTools.Trial: 204 samples with 1 evaluation.
 Range (min … max):  2.342 ms …  2.939 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     2.671 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   2.667 ms Β± 38.347 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                              ▁  β–ˆβ–ˆβ–†β–…         
  β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–‚β–β–β–β–‚β–β–ƒβ–β–‚β–β–ƒβ–†β–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–…β–„β–β–‚β–‚β–β–ƒ β–ƒ
  2.34 ms        Histogram: frequency by time        2.74 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

Summary of the timings:

median timings for smaller input:

  • myminmax1_tturbo: 979.581 ΞΌs
  • myminmax1_turbo: 1.079 ms
  • myminmax1_basic: 1.072 ms
  • myminmax2_tturbo: 992.245 ΞΌs
  • myminmax2_turbo: 1.069 ms
  • myminmax2_basic: 1.070 ms
  • myminmax_mapreduce: 1.088 ms

median timings for bigger input:

  • myminmax1_tturbo: 2.781 ms
  • myminmax1_turbo: 2.645 ms
  • myminmax1_basic: 2.629 ms
  • myminmax2_tturbo: 2.780 ms
  • myminmax2_turbo: 2.627 ms
  • myminmax2_basic: 2.627 ms
  • myminmax_mapreduce: 2.671 ms

Interestingly, it seems the threaded implementations are slower for longer input vectors?

NB: I ran Julia as a real time root process on Linux for greater benchmarking accuracy. Also, I ran this on a laptop, with the charger plugged in. Pretty sure that charging improves performance. Finally, the CPU is an AMD Ryzen 3.

PS: I’m working on a package which should enable nicer analysis and visualization of multi-way microbenchmark-based comparisons like this one.

PPS: the myminmax_vmapreduce implementation is missing because of this: https://github.com/JuliaSIMD/LoopVectorization.jl/issues/423
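For completeness, here is what a base-Julia threaded variant could look like (my own sketch with a hypothetical name, using Threads.@spawn over chunks; not one of the benchmarked implementations above). It would let one compare the threading overhead directly against @tturbo:

```julia
# Sketch: split the (nonempty) array into one chunk per thread, reduce
# each chunk with a branchless min/max loop, then combine the results.
function myminmax_threads(x)
    nt = max(1, Threads.nthreads())
    chunks = collect(Iterators.partition(eachindex(x), cld(length(x), nt)))
    tasks = map(chunks) do idxs
        Threads.@spawn begin
            a = typemax(eltype(x))
            b = typemin(eltype(x))
            @inbounds for i in idxs
                a = min(a, x[i])
                b = max(b, x[i])
            end
            (a, b)
        end
    end
    results = fetch.(tasks)
    minimum(first, results), maximum(last, results)
end
```

Given the timings above, spawning tasks only pays off once the per-chunk work outweighs the scheduling overhead.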

Just one small additional note: my above post seems to show that LoopVectorization doesn’t provide significant benefit over base Julia for the myminmax problem (although it’s a very simple problem). However, there’s also a downside to LoopVectorization: much worse effect inference:

julia> Base.infer_effects(myminmax1_tturbo, (Vector{Int},))
(!c,!e,!n,!t,!s,!m)β€²

julia> Base.infer_effects(myminmax1_turbo, (Vector{Int},))
(!c,!e,!n,!t,!s,!m)β€²

julia> Base.infer_effects(myminmax1_basic, (Vector{Int},))
(!c,+e,!n,!t,+s,?m)

julia> Base.infer_effects(myminmax2_tturbo, (Vector{Int},))
(!c,!e,!n,!t,!s,!m)β€²

julia> Base.infer_effects(myminmax2_turbo, (Vector{Int},))
(!c,!e,!n,!t,!s,!m)β€²

julia> Base.infer_effects(myminmax2_basic, (Vector{Int},))
(!c,+e,!n,!t,+s,?m)

julia> Base.infer_effects(myminmax_mapreduce, (Vector{Int},))
(!c,+e,!n,!t,+s,?m)