# How to make LoopVectorization work with computing Euclidean distance for every column?

Consider the code below. It works, but when I add `@avx` it gives this error:

``````
true && (view)(a, :, i)
ERROR: "Expression not recognized."
Stacktrace:
 add_operation!(::LoopVectorization.LoopSet, ::Symbol, ::Expr, ::Int64, ::Int64) at C:\Users\RTX2080\.julia\packages\LoopVectorization\OZUlx\src\graphs.jl:607
``````

How do I make LoopVectorization work for me in this case?

``````
a = rand(Float32, 128, 4153344)
c = rand(Float32, 128)

function comp(a, c)
    dist = zeros(Float32, size(a, 2))

    @inbounds @fastmath for i in axes(a, 2)
        dist[i] = sum(c .* @view(a[:, i]) .^ 2)
    end
    dist
end

@time comp(a, c)

using LoopVectorization
function comp_avx(a, c)
    dist = zeros(Float32, size(a, 2))

    @avx for i in axes(a, 2)
        dist[i] = sum(c .* @view(a[:, i]) .^ 2)
    end
    dist
end

@time comp_avx(a, c)
``````
``````
using Tullio
@tullio dist[i] = c[j] * a[j,i]
``````

It uses `@avx` internally.
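`@tullio` sums over any index that appears only on the right-hand side, so `j` above is reduced automatically. A minimal sketch of what that means, with made-up small sizes (using `:=` so the output array is created for us):

```julia
using Tullio, LinearAlgebra

a = rand(Float32, 4, 10)
c = rand(Float32, 4)

# j appears only on the right, so Tullio sums over it;
# := allocates `dist` with the right size (here, 10).
@tullio dist[i] := c[j] * a[j,i]

dist ≈ vec(c' * a)   # same result as a matrix-vector product
```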


Do I need?

`@tullio dist[i] = sum((c[j] * a[j,i]) ^2)`

I think I just need

`@tullio dist[i] = (c[j] - a[j,i])^2`

Wow! It’s nearly 5 times faster! Thanks!

``````
using Tullio
function comp_tullio(a, c)
    dist = zeros(Float32, size(a, 2))
    @tullio dist[i] = (c[j] - a[j,i])^2

    dist
end
@time comp_tullio(a, c)

@benchmark comp_tullio(a, c)

all(comp_tullio(a, c) .≈ comp(a, c))
``````

For interest, I have posted on SO for a faster Python solution I can compare to. I am sure R doesn’t have anything like it.

Does anyone have matlab and can do a benchmark?

The cost of your first function is all those slices: while the view is cheap, `c .* view` allocates a new array for every column. Writing it out, something like this, avoids them:

``````
acc = 0f0
for k in axes(a, 1)
    acc += (c[k] * a[k,i])^2
end
dist[i] = acc
``````

This is what `@tullio` writes, and also the form `@avx` is happiest to act on. This one is simple enough that it gets optimised well by Julia alone, I don’t actually see a speedup from `@avx`.
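The allocation is easy to see directly; a small check (the helper `slicesum` is just for illustration, and the first call warms up compilation so `@allocated` measures only the runtime allocation):

```julia
a = rand(Float32, 128, 100)
c = rand(Float32, 128)

slicesum(c, v) = sum(c .* v .^ 2)   # the broadcast materialises a temporary vector

v = @view a[:, 1]           # the view itself does not copy
slicesum(c, v)              # warm-up call
@allocated slicesum(c, v)   # > 0 bytes: one temporary per column
```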

Note that this is still squared Euclidean distance. And that Distances.jl is pretty fast at these things. I thought these should agree but perhaps I need another coffee:

``````
julia> d2 = Distances.colwise(Euclidean(), c, a);

julia> d2 ≈ sqrt.(comp(a, c))
false

julia> d2 ≈ @tullio d[i] := sqrt <| (c[k] * a[k,i])^2
false
``````
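For what it’s worth, the disagreement above seems to come from multiplying `c` by the column instead of subtracting it; with subtraction the two do line up in a quick check (small, made-up sizes):

```julia
using Tullio, Distances

a = rand(Float32, 8, 100)
c = rand(Float32, 8)

d2 = Distances.colwise(Euclidean(), c, a)

# subtract instead of multiply, then take sqrt of the reduction
d3 = @tullio d[i] := sqrt <| (c[k] - a[k,i])^2

d2 ≈ d3
```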

Finally note that you can use `:=` with `@tullio` to make a new array the right size internally.
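A minimal sketch of the `=` vs `:=` distinction, with made-up sizes:

```julia
using Tullio

a = rand(Float32, 4, 10)
c = rand(Float32, 4)

dist = zeros(Float32, size(a, 2))
@tullio dist[i] = (c[j] - a[j,i])^2        # = writes into the existing array

dist2 = @tullio d[i] := (c[j] - a[j,i])^2  # := allocates a new array internally

dist ≈ dist2
```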


Does anyone know how I can get this working on the GPU? It seems like Tullio.jl generates some index-accessing code.

I might write a proper GPU kernel for this, but the code below doesn’t work:

``````
using Tullio

a = rand(Float32, 128, 4153344)
c = rand(Float32, 128)
function comp_tullio(a, c)
    dist = CuArray(zeros(Float32, size(a, 2)))
    @tullio dist[i] = (c[j] - a[j,i])^2
    dist
end

@time d = comp_tullio(a, c)

using CUDA
CUDA.allowscalar(false)

gpu_a = CuArray(a)
gpu_c = CuArray(c)

@time comp_tullio(gpu_a, gpu_c)
``````

You will need KernelAbstractions.jl for this to work, after which it should write a GPU version.

Otherwise the usual matlab-esque way of doing this is something like this (should agree with `Distances.pairwise(SqEuclidean(), x, y; dims=2)`):

``````
pairwise2(x::AbstractMatrix, y::AbstractMatrix) = sum(x .* x; dims=1)' .+ sum(y .* y; dims=1) .- 2 .* x'*y
pairwise2(x::AbstractVector, y::AbstractMatrix) = vec(pairwise2(reshape(x, :, 1), y))
``````
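A quick check of the one-liner against a direct loop over columns (squared distances; the definitions are repeated so the snippet stands alone):

```julia
using LinearAlgebra

pairwise2(x::AbstractMatrix, y::AbstractMatrix) =
    sum(x .* x; dims=1)' .+ sum(y .* y; dims=1) .- 2 .* x'*y
pairwise2(x::AbstractVector, y::AbstractMatrix) = vec(pairwise2(reshape(x, :, 1), y))

x = rand(Float32, 8, 100)
c = rand(Float32, 8)

d2 = pairwise2(c, x)
d2 ≈ [sum((c .- @view x[:, i]).^2) for i in axes(x, 2)]
```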

Performance is also a function of the sizes of the arrays you’re benchmarking.
Functions:

``````
using LoopVectorization

function comp!(dist, a, c)
    @avx for i in axes(a, 2)
        acc = zero(eltype(dist))
        for k in axes(a, 1)
            acc += (c[k] * a[k,i])^2
        end
        dist[i] = acc
    end
end

function compllvm!(dist, a, c)
    @inbounds @fastmath for i in axes(a, 2)
        acc = zero(eltype(dist))
        for k in axes(a, 1)
            acc += (c[k] * a[k,i])^2
        end
        dist[i] = acc
    end
end
``````
``````
julia> a = rand(Float32, M, N); c = rand(Float32, size(a,1)); d = similar(c, size(a,2));

julia> @benchmark comp!($d, $a, $c)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     162.401 ms (0.00% GC)
median time:      167.070 ms (0.00% GC)
mean time:        166.839 ms (0.00% GC)
maximum time:     167.810 ms (0.00% GC)
--------------
samples:          30
evals/sample:     1

julia> @benchmark compllvm!($d, $a, $c)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     145.021 ms (0.00% GC)
median time:      145.275 ms (0.00% GC)
mean time:        146.247 ms (0.00% GC)
maximum time:     151.646 ms (0.00% GC)
--------------
samples:          35
evals/sample:     1
``````

The LLVM version is faster here, because the `@avx` version moves across `size(a,2)` more quickly, which costs it more cache misses when that dimension is this large.
However, note that LLVM is very reliant on multiples of powers of 2. See what happens if we shrink `size(a,1)` by 1.

``````
julia> a = rand(Float32, M-1, N); c = rand(Float32, size(a,1)); d = similar(c, size(a,2));

julia> @benchmark comp!($d, $a, $c)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     169.176 ms (0.00% GC)
median time:      169.942 ms (0.00% GC)
mean time:        171.480 ms (0.00% GC)
maximum time:     176.977 ms (0.00% GC)
--------------
samples:          30
evals/sample:     1

julia> @benchmark compllvm!($d, $a, $c)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     273.203 ms (0.00% GC)
median time:      284.456 ms (0.00% GC)
mean time:        280.057 ms (0.00% GC)
maximum time:     285.689 ms (0.00% GC)
--------------
samples:          18
evals/sample:     1
``````

Back to the earlier point on quickly moving across `size(a,2)`, performance is also better when this is small:

``````
julia> a = rand(Float32, M, 2M); c = rand(Float32, size(a,1)); d = similar(c, size(a,2));

julia> @benchmark comp!($d, $a, $c)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     1.174 μs (0.00% GC)
median time:      1.181 μs (0.00% GC)
mean time:        1.187 μs (0.00% GC)
maximum time:     2.734 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     10

julia> @benchmark compllvm!($d, $a, $c)
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     1.689 μs (0.00% GC)
median time:      1.703 μs (0.00% GC)
mean time:        1.706 μs (0.00% GC)
maximum time:     4.101 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     10
``````

Of course, what matters is what performs best at the size range you’re actually running on.

Some interesting stats to look at (the first table is with LoopVectorization, the second is without):

``````
julia> using LinuxPerf

julia> a = rand(Float32, M, N); c = rand(Float32, size(a,1)); d = similar(c, size(a,2));

julia> foreachf!(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> f(args...), Base.OneTo(N))
foreachf! (generic function with 1 method)

4.842500 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.97e+10   60.0%  #  4.1 cycles per ns
┌ instructions             4.36e+09   60.0%  #  0.2 insns per cycle
│ branch-instructions      1.71e+08   60.0%  #  3.9% of instructions
└ branch-misses            7.44e+03   60.0%  #  0.0% of branch instructions
┌ task-clock               4.84e+09  100.0%  #  4.8 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

4.568975 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.86e+10   60.0%  #  4.1 cycles per ns
┌ instructions             8.22e+09   60.0%  #  0.4 insns per cycle
│ branch-instructions      6.23e+08   60.0%  #  7.6% of instructions
└ branch-misses            5.36e+03   60.0%  #  0.0% of branch instructions
┌ task-clock               4.57e+09  100.0%  #  4.6 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
``````

This was the original size where the llvm version was faster. 30 repetitions took 4.8 seconds with LoopVectorization vs 4.6 without.
But LoopVectorization calculated the answer with `4.36e+09` total instructions, while LLVM required nearly twice as many at `8.22e+09`.
Of course, what we really care about is runtime, not total instructions used.

This CPU can issue multiple instructions per clock cycle, but this code ran at only 0.2 and 0.4 instructions per cycle (IPC) with and without LV, respectively. That’s because performance was dominated by `L1-dcache-load-misses`, where LoopVectorization does a lot worse at the moment.

I plan on eventually having LoopVectorization optimize cache performance as well, but it doesn’t yet. So in problems dominated by cache misses, LoopVectorization might fare poorly even if it does well in reducing the number of instructions required to calculate the answer.

EDIT:
Just for fun, the `size(a,1) == 127` example (first table w/ lv, second w/out):

``````
julia> a = rand(Float32, M-1, N); c = rand(Float32, size(a,1)); d = similar(c, size(a,2));

julia> size(a)
(127, 4153344)

5.164595 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               2.10e+10   60.0%  #  4.1 cycles per ns
┌ instructions             4.42e+09   60.0%  #  0.2 insns per cycle
│ branch-instructions      1.87e+08   60.0%  #  4.2% of instructions
└ branch-misses            7.48e+03   60.0%  #  0.0% of branch instructions
┌ task-clock               5.17e+09  100.0%  #  5.2 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

8.513785 seconds (2 allocations: 64 bytes)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.64e+10   60.0%  #  4.3 cycles per ns
┌ instructions             5.38e+10   60.0%  #  1.5 insns per cycle
│ branch-instructions      8.47e+09   60.0%  # 15.7% of instructions
└ branch-misses            1.25e+08   60.0%  #  1.5% of branch instructions
┌ task-clock               8.51e+09  100.0%  #  8.5 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
``````

Here LoopVectorization performs much better (5.2 s vs 8.5 s), probably in large part because it uses over an order of magnitude fewer instructions…


The GPU version is only slightly faster, at `48ms`:

``````
using CUDA
CUDA.allowscalar(false)

function comp_gpu_kernel!(buffer, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x

    for j = i:stride:size(a, 2)
        for k in 1:128   # renamed from i to avoid shadowing the thread index
            buffer[j] += (a[k,j] - b[k])^2
        end
    end
    return
end

function comp_gpu(a, b)
    dist = CUDA.zeros(Float32, size(a, 2))
    CUDA.@sync @cuda threads = 256 blocks = 1024 comp_gpu_kernel!(dist, a, b)
    dist
end

gpu_a = CuArray(a)
gpu_c = CuArray(c)

@time cgpu = comp_gpu(gpu_a, gpu_c)
``````

Using Distances.jl I get similar speed to the others, at about 100ms:

``````
using Distances

@benchmark colwise(Euclidean(), a, c)
``````

Out of curiosity, what GPU do you have?
With my CPU:

``````
julia> comptullio!(d, a, c) = @tullio d[i] = (c[k] * a[k,i])^2;

julia> @benchmark comptullio!($d2, $a, $c)
BenchmarkTools.Trial:
memory estimate:  17.05 KiB
allocs estimate:  245
--------------
minimum time:     25.386 ms (0.00% GC)
median time:      25.565 ms (0.00% GC)
mean time:        25.573 ms (0.00% GC)
maximum time:     25.808 ms (0.00% GC)
--------------
samples:          196
evals/sample:     1

julia> size(a)
(128, 4153344)

julia> versioninfo()
Julia Version 1.5.1-pre.29
Commit 21e0f044b5* (2020-08-19 20:24 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
``````