Elrod
August 23, 2022, 10:26am
21
This is because you're totally memory bound. Try @turbo with vectors of length 1024 or less.
Did you start julia with multiple threads?
EDIT:
Interestingly, I hadn't actually seen LLVM emit SIMD min/max functions before, but it is now.
LV should still do better for sizes like 255, i.e. a power of 2 minus 1.
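For context, the kernels being benchmarked below aren't shown in this excerpt; they come from the thread's Minmax module. A hypothetical reconstruction of the pattern under discussion (a single loop tracking both a running minimum and maximum, once with plain @inbounds/@simd and once with LoopVectorization's @turbo; `se(n)` is assumed to build a test vector whose extrema are (1, 64)):

```julia
using LoopVectorization  # provides @turbo (and @tturbo for the threaded variant)

# Plain-Julia baseline: one pass, two reduction accumulators.
function myminmax2_basic(x::AbstractVector{T}) where {T}
    lo, hi = typemax(T), typemin(T)
    @inbounds @simd for i in eachindex(x)
        lo = min(lo, x[i])  # running minimum
        hi = max(hi, x[i])  # running maximum
    end
    (lo, hi)
end

# Same loop body, but let LoopVectorization vectorize the twin reductions.
function myminmax2_turbo(x::AbstractVector{T}) where {T}
    lo, hi = typemax(T), typemin(T)
    @turbo for i in eachindex(x)
        lo = min(lo, x[i])
        hi = max(hi, x[i])
    end
    (lo, hi)
end
```

Replacing @turbo with @tturbo would give the threaded variant benchmarked below.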
julia> using .Minmax
julia> x = se(6);
julia> @btime myminmax1_turbo($x)
23.396 ns (0 allocations: 0 bytes)
(1, 64)
julia> @btime myminmax1_tturbo($x)
33.715 ns (0 allocations: 0 bytes)
(1, 64)
julia> @btime myminmax1_basic($x)
36.803 ns (0 allocations: 0 bytes)
(1, 64)
julia> @btime myminmax2_turbo($x)
20.421 ns (0 allocations: 0 bytes)
(1, 64)
julia> @btime myminmax2_tturbo($x)
23.642 ns (0 allocations: 0 bytes)
(1, 64)
julia> @btime myminmax2_basic($x)
36.863 ns (0 allocations: 0 bytes)
(1, 64)
julia> @btime myminmax_mapreduce($x)
40.438 ns (0 allocations: 0 bytes)
(1, 64)
Once upon a time, base Julia had trouble SIMDing this.
Still, LV is a fair bit faster, even with the threading check (these vectors are too small to thread).
I got to length 65k before I noticed LV using 2 threads.
nsajko
August 23, 2022, 11:13am
22
Indeed you are correct. Comparing the performance of myminmax2_basic, myminmax2_turbo and myminmax2_tturbo across different input lengths: @turbo wins for the smallest inputs, then in the range from se(12) to se(13) @tturbo takes the lead, and after that myminmax2_basic is tied with myminmax2_turbo while myminmax2_tturbo is suddenly much slower (comparing the median timings):
julia> @benchmark myminmax2_basic(i) setup = (Random.seed!(12345678); i = se(14))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.457 μs … 43.592 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      9.759 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.439 μs ±  1.824 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  9.46 μs   Histogram: log(frequency) by time   17.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_turbo(i) setup = (Random.seed!(12345678); i = se(14))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.418 μs … 93.085 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):      9.709 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.176 μs ±  2.005 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  9.42 μs   Histogram: log(frequency) by time   19.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_tturbo(i) setup = (Random.seed!(12345678); i = se(14))
BenchmarkTools.Trial: 9767 samples with 5 evaluations.
 Range (min … max):   6.448 μs … 26.342 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.241 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.036 μs ±  1.755 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  6.45 μs   Histogram: frequency by time   16.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_basic(i) setup = (Random.seed!(12345678); i = se(15))
BenchmarkTools.Trial: 9742 samples with 1 evaluation.
 Range (min … max):  18.815 μs … 119.003 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     23.825 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.777 μs ±   4.714 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  18.8 μs   Histogram: frequency by time   44.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_turbo(i) setup = (Random.seed!(12345678); i = se(15))
BenchmarkTools.Trial: 9707 samples with 1 evaluation.
 Range (min … max):  18.815 μs … 106.610 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     23.704 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.672 μs ±   4.239 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  18.8 μs   Histogram: frequency by time   42 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_tturbo(i) setup = (Random.seed!(12345678); i = se(15))
BenchmarkTools.Trial: 5236 samples with 1 evaluation.
 Range (min … max):  13.044 μs … 130.655 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     43.958 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   44.657 μs ±   8.647 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  13 μs   Histogram: frequency by time   72.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
This is with Julia started with four threads (same as above).
BTW, don't think that I'm criticizing your work, I know that your packages are tremendously useful, it's just that comparing microbenchmarks is something I enjoy very much. Also thank you for being so instructive here.
Elrod
August 23, 2022, 12:19pm
23
FWIW, I got
julia> using Random
julia> @benchmark myminmax2_basic(i) setup = (Random.seed!(12345678); i = se(14))
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.611 μs …  16.944 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.013 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.156 μs ± 802.595 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  5.61 μs   Histogram: frequency by time   10.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_turbo(i) setup = (Random.seed!(12345678); i = se(14))
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.982 μs …  13.480 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.304 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.473 μs ± 687.470 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  3.98 μs   Histogram: frequency by time   8.16 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_tturbo(i) setup = (Random.seed!(12345678); i = se(14))
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  3.263 μs …  13.234 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.667 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.710 μs ± 791.557 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  3.26 μs   Histogram: log(frequency) by time   9.35 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_basic(i) setup = (Random.seed!(12345678); i = se(15))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  17.183 μs … 67.228 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     22.326 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.755 μs ±  5.780 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  17.2 μs   Histogram: frequency by time   54.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_turbo(i) setup = (Random.seed!(12345678); i = se(15))
BenchmarkTools.Trial: 9785 samples with 3 evaluations.
 Range (min … max):  10.911 μs … 34.302 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     14.028 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.655 μs ±  2.598 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  10.9 μs   Histogram: frequency by time   28.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark myminmax2_tturbo(i) setup = (Random.seed!(12345678); i = se(15))
BenchmarkTools.Trial: 6380 samples with 7 evaluations.
 Range (min … max):  7.464 μs … 24.428 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.435 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.811 μs ±  1.682 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  7.46 μs   Histogram: frequency by time   18.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
So for se(15) it helped, but I saw the regressed mean time for se(14) as well.
Looking at only minimum times tends to be misleading for multithreaded or allocating code.
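To illustrate the point: BenchmarkTools keeps every per-sample timing, and for threaded code the minimum captures only the luckiest run, so medians and means are the fairer comparison. A minimal sketch (using `sum` as a stand-in workload, not the minmax kernels from this thread):

```julia
using BenchmarkTools, Statistics

# Run a benchmark; `trial.times` holds the per-sample timings in nanoseconds.
trial = @benchmark sum(x) setup = (x = rand(2^14))

# The minimum is the best case only; for multithreaded or allocating code,
# scheduling jitter and GC show up in the median and mean, not the minimum.
println("min    = ", minimum(trial.times), " ns")
println("median = ", median(trial.times), " ns")
println("mean   = ", mean(trial.times), " ns")
```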
it's just that comparing microbenchmarks is something I enjoy very much
Feel free to make PRs that adjust the threading ramp up or heuristics.
You should try with x = rand(10^8) or more, and you should see a difference in your benchmarks.