I compared several implementations and came to the conclusion that LoopVectorization actually isn't very useful for this example: all the non-threaded implementations seem to have about the same throughput (except the mapreduce-based implementation, which is slightly slower than the others).
Minmax.jl
module Minmax

using LoopVectorization

export
    myminmax1_tturbo,
    myminmax1_turbo,
    myminmax1_basic,
    myminmax2_tturbo,
    myminmax2_turbo,
    myminmax2_basic,
    myminmax_mapreduce,
    se

function myminmax1_tturbo(x)
    a = b = first(x)
    @tturbo for i in eachindex(x)
        e = x[i]
        b = ifelse(b > e, b, e)
        a = ifelse(a < e, a, e)
    end
    a, b
end

function myminmax1_turbo(x)
    a = b = first(x)
    @turbo for i in eachindex(x)
        e = x[i]
        b = ifelse(b > e, b, e)
        a = ifelse(a < e, a, e)
    end
    a, b
end

function myminmax1_basic(x)
    a = b = first(x)
    for e in x
        b = ifelse(b > e, b, e)
        a = ifelse(a < e, a, e)
    end
    a, b
end

function myminmax2_tturbo(x)
    a = b = first(x)
    @tturbo for i in eachindex(x)
        e = x[i]
        b = max(b, e)
        a = min(a, e)
    end
    a, b
end

function myminmax2_turbo(x)
    a = b = first(x)
    @turbo for i in eachindex(x)
        e = x[i]
        b = max(b, e)
        a = min(a, e)
    end
    a, b
end

function myminmax2_basic(x)
    a = b = first(x)
    for e in x
        b = max(b, e)
        a = min(a, e)
    end
    a, b
end

f(a) =
    (a, a)

g(a, b) =
    let (q, w) = a, (e, r) = b
        (min(q, e), max(w, r))
    end

myminmax_mapreduce(x) =
    mapreduce(f, g, x)

se(n) =
    rand(1:2^n, 2^(n + 2))

end  # module Minmax
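A quick way to sanity-check any of these implementations is to compare the result against Base's `extrema`. The following is only a minimal sketch: it uses a plain-loop variant so that it runs without LoopVectorization, and `minmax_plain` and `sample_input` are names made up for this sketch, not part of the module above.

```julia
# Plain-loop min/max, written along the lines of myminmax2_basic above.
function minmax_plain(x)
    a = b = first(x)
    for e in x
        b = max(b, e)
        a = min(a, e)
    end
    a, b
end

# Same input shape as se(n): 2^(n + 2) integers drawn from 1:2^n.
sample_input(n) = rand(1:2^n, 2^(n + 2))

x = sample_input(6)
# Base.extrema returns the (minimum, maximum) tuple of a collection.
@assert minmax_plain(x) == extrema(x)
```

The same comparison should work for the exported functions once the module is loaded, for example `myminmax1_turbo(x) == extrema(x)`.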
Benchmarking REPL session
[root@aceramd nsajko]# printf '%s' -1 > /proc/sys/kernel/sched_rt_runtime_us # Enable real time scheduling.
[root@aceramd nsajko]# chrt -f 99 /home/nsajko/tmp/julia-df3da0582a/bin/julia -O3 --min-optlevel=3 -g 2 --threads 4 # Start with real time scheduling and four threads.
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.9.0-DEV.1171 (2022-08-23)
 _/ |\__'_|_|_|\__'_|  |  Commit df3da0582a7 (0 days old master)
|__/                   |
julia> include("Minmax.jl")
Main.Minmax
julia> using BenchmarkTools, .Minmax
julia> inp = se(6);
julia> myminmax1_tturbo(inp)
(1, 64)
julia> myminmax1_turbo(inp)
(1, 64)
julia> myminmax1_basic(inp)
(1, 64)
julia> myminmax2_tturbo(inp)
(1, 64)
julia> myminmax2_turbo(inp)
(1, 64)
julia> myminmax2_basic(inp)
(1, 64)
julia> myminmax_mapreduce(inp)
(1, 64)
julia> @benchmark myminmax1_tturbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 595 samples with 1 evaluation.
Range (min … max): 862.772 μs … 1.145 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 997.426 μs ┊ GC (median): 0.00%
Time (mean ± σ): 998.703 μs ± 40.537 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
863 μs Histogram: frequency by time 1.1 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark myminmax1_tturbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 594 samples with 1 evaluation.
Range (min … max): 863.123 μs … 1.129 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 979.581 μs ┊ GC (median): 0.00%
Time (mean ± σ): 979.502 μs ± 41.878 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
863 μs Histogram: frequency by time 1.08 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark myminmax1_turbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 1029 samples with 1 evaluation.
Range (min … max): 919.319 μs … 1.382 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.081 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.082 ms ± 29.876 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
919 μs Histogram: frequency by time 1.17 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark myminmax1_turbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 1031 samples with 1 evaluation.
Range (min … max): 828.248 μs … 1.321 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.079 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.080 ms ± 28.116 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
828 μs Histogram: frequency by time 1.16 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark myminmax1_basic(i) setup = (i = se(18))
BenchmarkTools.Trial: 1032 samples with 1 evaluation.
Range (min … max): 887.068 μs … 1.291 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.072 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.073 ms ± 27.482 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
887 μs Histogram: frequency by time 1.15 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark myminmax1_basic(i) setup = (i = se(18))
BenchmarkTools.Trial: 1032 samples with 1 evaluation.
Range (min … max): 828.277 μs … 1.336 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.072 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.073 ms ± 30.028 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
828 μs Histogram: frequency by time 1.16 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark myminmax2_tturbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 595 samples with 1 evaluation.
Range (min … max): 747.014 μs … 1.169 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 992.245 μs ┊ GC (median): 0.00%
Time (mean ± σ): 989.631 μs ± 41.371 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
747 μs Histogram: frequency by time 1.07 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark myminmax2_tturbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 576 samples with 1 evaluation.
Range (min … max): 812.738 μs … 1.111 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 997.535 μs ┊ GC (median): 0.00%
Time (mean ± σ): 996.026 μs ± 41.331 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
813 μs Histogram: frequency by time 1.08 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark myminmax2_turbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 976 samples with 1 evaluation.
Range (min … max): 895.303 μs … 1.279 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.087 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.096 ms ± 35.680 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
895 μs Histogram: frequency by time 1.2 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark myminmax2_turbo(i) setup = (i = se(18))
BenchmarkTools.Trial: 981 samples with 1 evaluation.
Range (min … max): 841.272 μs … 1.302 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.069 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.079 ms ± 37.738 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
841 μs Histogram: frequency by time 1.18 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark myminmax2_basic(i) setup = (i = se(18))
BenchmarkTools.Trial: 980 samples with 1 evaluation.
Range (min … max): 950.176 μs … 1.286 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.070 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.082 ms ± 38.737 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
950 μs Histogram: frequency by time 1.19 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark myminmax2_basic(i) setup = (i = se(18))
BenchmarkTools.Trial: 978 samples with 1 evaluation.
Range (min … max): 819.450 μs … 1.285 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.073 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.084 ms ± 42.823 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
819 μs Histogram: frequency by time 1.21 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark myminmax_mapreduce(i) setup = (i = se(18))
BenchmarkTools.Trial: 977 samples with 1 evaluation.
Range (min … max): 850.108 μs … 1.292 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.088 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.099 ms ± 38.489 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
850 μs Histogram: frequency by time 1.2 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark myminmax_mapreduce(i) setup = (i = se(18))
BenchmarkTools.Trial: 977 samples with 1 evaluation.
Range (min … max): 924.488 μs … 1.435 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.089 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.099 ms ± 37.189 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
924 μs Histogram: frequency by time 1.2 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark myminmax1_tturbo(i) setup = (i = se(20))
BenchmarkTools.Trial: 144 samples with 1 evaluation.
Range (min … max): 2.571 ms … 2.918 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.781 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.779 ms ± 30.961 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
2.57 ms Histogram: frequency by time 2.84 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark myminmax1_turbo(i) setup = (i = se(20))
BenchmarkTools.Trial: 205 samples with 1 evaluation.
Range (min … max): 2.490 ms … 3.179 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.645 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.648 ms ± 59.019 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
2.49 ms Histogram: frequency by time 2.82 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark myminmax1_basic(i) setup = (i = se(20))
BenchmarkTools.Trial: 205 samples with 1 evaluation.
Range (min … max): 2.355 ms … 3.367 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.629 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.629 ms ± 59.624 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
2.36 ms Histogram: frequency by time 2.71 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark myminmax2_tturbo(i) setup = (i = se(20))
BenchmarkTools.Trial: 144 samples with 1 evaluation.
Range (min … max): 2.435 ms … 3.182 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.780 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.774 ms ± 60.085 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
2.43 ms Histogram: frequency by time 3.02 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark myminmax2_turbo(i) setup = (i = se(20))
BenchmarkTools.Trial: 205 samples with 1 evaluation.
Range (min … max): 2.426 ms … 2.933 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.627 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.628 ms ± 35.231 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
2.43 ms Histogram: frequency by time 2.75 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark myminmax2_basic(i) setup = (i = se(20))
BenchmarkTools.Trial: 205 samples with 1 evaluation.
Range (min … max): 2.497 ms … 2.743 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.627 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.626 ms ± 24.625 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
2.5 ms Histogram: frequency by time 2.67 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark myminmax_mapreduce(i) setup = (i = se(20))
BenchmarkTools.Trial: 204 samples with 1 evaluation.
Range (min … max): 2.342 ms … 2.939 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.671 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.667 ms ± 38.347 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
2.34 ms Histogram: frequency by time 2.74 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
Summary of the timings:
Median timings for the smaller input (se(18), i.e. 2^20 elements; the lower median of the two runs):
- myminmax1_tturbo: 979.581 μs
- myminmax1_turbo: 1.079 ms
- myminmax1_basic: 1.072 ms
- myminmax2_tturbo: 992.245 μs
- myminmax2_turbo: 1.069 ms
- myminmax2_basic: 1.070 ms
- myminmax_mapreduce: 1.088 ms
Median timings for the bigger input (se(20), i.e. 2^22 elements; a single run each):
- myminmax1_tturbo: 2.781 ms
- myminmax1_turbo: 2.645 ms
- myminmax1_basic: 2.629 ms
- myminmax2_tturbo: 2.780 ms
- myminmax2_turbo: 2.627 ms
- myminmax2_basic: 2.627 ms
- myminmax_mapreduce: 2.671 ms
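To make the comparison easier to read, the medians can be expressed relative to one of the plain loops. This is just a small sketch over the numbers listed above; the named-tuple keys and the helper `relative` are made up here for illustration.

```julia
# Median timings in milliseconds, copied from the summary lists.
small = (tturbo1 = 0.979581, turbo1 = 1.079, basic1 = 1.072,
         tturbo2 = 0.992245, turbo2 = 1.069, basic2 = 1.070, mapreduce = 1.088)
big   = (tturbo1 = 2.781, turbo1 = 2.645, basic1 = 2.629,
         tturbo2 = 2.780, turbo2 = 2.627, basic2 = 2.627, mapreduce = 2.671)

# Ratio of each median to a baseline; a value below 1 means faster than the baseline.
relative(t, baseline) = map(v -> round(v / baseline; digits = 3), t)

println(relative(small, small.basic1))
println(relative(big, big.basic1))
```

Relative to myminmax1_basic, the @tturbo variants come out at roughly 0.91 to 0.93 for the smaller input and at about 1.06 for the bigger one, while every single-threaded variant stays within about 2% of the baseline.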
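Interestingly, the threaded (@tturbo) implementations are the fastest for the smaller input but the slowest for the bigger one, so it seems threading actually hurts for longer input vectors?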
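One possible reading (an assumption on my part, not something measured here) is that the loop is limited by memory bandwidth rather than by arithmetic, which would also fit with @turbo not helping. A rough sketch of the effective read bandwidth implied by the medians, assuming 64-bit integers; `bytes` and `bandwidth_gbs` are names made up for this sketch.

```julia
# Assumption: Int is 64-bit, so each element occupies 8 bytes.
# se(n) produces 2^(n + 2) elements.
bytes(n) = 2^(n + 2) * sizeof(Int64)

# Effective read bandwidth in GB/s for a median time given in milliseconds.
bandwidth_gbs(n, t_ms) = bytes(n) / (t_ms * 1e-3) / 1e9

println(bandwidth_gbs(18, 1.072))  # myminmax1_basic, smaller input
println(bandwidth_gbs(20, 2.629))  # myminmax1_basic, bigger input
```

With the medians above this gives roughly 7.8 GB/s for the smaller input and 12.8 GB/s for the bigger one. Whether those figures are close to what this laptop's memory can sustain is not something the benchmark shows.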
NB: I ran Julia as a real-time root process on Linux for greater benchmarking accuracy. I also ran this on a laptop with the charger plugged in, since I'm pretty sure that being on the charger improves performance. Finally, the CPU is an AMD Ryzen 3.
PS: I'm working on a package which should enable nicer analysis and visualization of multi-way microbenchmark-based comparisons like this one.