I'm probably overly optimistic about the compiler/length of the vectors you may be working with.
But someone on Slack commented recently that LV often gave about a 2x speedup on many of the simple loops they were working with, and it seemed this was purely because it used larger vectors.
The problem with LLVM's vectorization is that it unrolls aggressively, and then doesn't vectorize the unroll × vector-width remainder. So using 512 bit vectors with Float64
means it will only vectorize blocks of 32. If your loop is 63 iterations, then with 512 bit vectors it will likely run 1 unrolled and vectorized iteration, followed by 31 scalar iterations.
With 256 bit vectors (blocks of 16), it'll run 3 unrolled and vectorized iterations, followed by only 15 scalar iterations, which is much faster.
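To make that arithmetic concrete, here is a small sketch (the unroll factor of 4 is my assumption; it is what LLVM typically picks for a reduction like this):

    # Float64 lanes per vector register, and the block size the
    # unrolled-and-vectorized loop consumes per iteration.
    lanes(bits) = bits ÷ 64
    blocksize(bits; unroll = 4) = unroll * lanes(bits)

    function remainder_split(N, bits)
        b = blocksize(bits)
        nblocks = N ÷ b              # unrolled-and-vectorized iterations
        nscalar = N - nblocks * b    # scalar epilogue LLVM leaves unvectorized
        (nblocks = nblocks, nscalar = nscalar)
    end

    remainder_split(63, 512)  # (nblocks = 1, nscalar = 31)
    remainder_split(63, 256)  # (nblocks = 3, nscalar = 15)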
LoopVectorization.jl does this better. I was hoping the @vp macro would get LLVM to do something more similar, but not quite.
The code below runs dot products at all lengths 1:1024 in a random order, and benchmarks how long it takes.
julia> @time using LoopVectorization, Random, BenchmarkTools
0.000189 seconds (470 allocations: 46.227 KiB)
julia> function dot_fast(x,y)
           s = zero(eltype(x))
           for i = eachindex(x)
               @inbounds @fastmath s += x[i]*y[i]
           end
           s
       end
dot_fast (generic function with 1 method)
julia> macro vp(expr)
           nodes = (Symbol("llvm.loop.vectorize.predicate.enable"), 1)
           if expr.head != :for
               error("Syntax error: loopinfo needs a for loop")
           end
           push!(expr.args[2].args, Expr(:loopinfo, nodes))
           return esc(expr)
       end
@vp (macro with 1 method)
julia> function dot_vp(x,y)
           s = zero(eltype(x))
           @vp for i = eachindex(x)
               @inbounds @fastmath s += x[i]*y[i]
           end
           s
       end
dot_vp (generic function with 1 method)
julia> function dot_turbo(x,y)
           s = zero(eltype(x))
           @turbo for i = eachindex(x)
               s += x[i]*y[i]
           end
           s
       end
dot_turbo (generic function with 1 method)
julia> x = rand(1024); y = rand(length(x)); Ns = randperm(length(x)); z = similar(x);
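A quick sanity check worth running before the timings (this uses the LinearAlgebra stdlib; all three implementations reassociate the sum, so compare with ≈ rather than ==):

    using LinearAlgebra
    # Should hold up to floating point rounding differences.
    dot_fast(x, y) ≈ dot_vp(x, y) ≈ dot_turbo(x, y) ≈ dot(x, y)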
julia> @benchmark map!(n -> dot_vp(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  61.602 μs … 101.234 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     61.731 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   61.795 μs ± 547.266 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  61.6 μs          Histogram: log(frequency) by time          63.3 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark map!(n -> dot_fast(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  41.530 μs … 71.679 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     41.659 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   41.735 μs ± 549.633 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  41.5 μs          Histogram: log(frequency) by time          43.4 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark map!(n -> dot_turbo(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  31.564 μs … 73.508 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     31.845 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.905 μs ± 547.370 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  31.6 μs            Histogram: frequency by time             33.5 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Reducing the length to 1:256:
julia> x = rand(256); y = rand(length(x)); Ns = randperm(length(x)); z = similar(x);
julia> @benchmark map!(n -> dot_vp(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.635 μs … 9.801 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.664 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.671 μs ± 65.712 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  4.63 μs           Histogram: frequency by time            4.89 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark map!(n -> dot_fast(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.650 μs … 10.561 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.678 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.688 μs ± 83.408 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  5.65 μs           Histogram: frequency by time            5.94 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark map!(n -> dot_turbo(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.181 μs … 8.153 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.194 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.199 μs ± 60.401 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  3.18 μs         Histogram: log(frequency) by time         3.39 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
and now the @vp version is faster than the default (dot_fast), but both are of course still slower than @turbo.
This was with native,-prefer-256-bit.
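For reference, that corresponds to starting Julia with the -C/--cpu-target option, something like the line below; the leading minus disables the prefer-256-bit feature, which is what allows full 512 bit vectors on this CPU:

    $ julia -C "native,-prefer-256-bit"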
With the default (native is the default):
julia> x = rand(256); y = rand(length(x)); Ns = randperm(length(x)); z = similar(x);
julia> @benchmark map!(n -> dot_vp(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):  8.562 μs … 9.660 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.587 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.601 μs ± 78.964 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  8.56 μs         Histogram: log(frequency) by time         9.05 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark map!(n -> dot_fast(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.940 μs … 7.153 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.961 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.969 μs ± 49.064 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  4.94 μs           Histogram: frequency by time            5.18 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark map!(n -> dot_turbo(@view($x[begin:n]),@view($y[begin:n])), $z, $Ns)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.740 μs … 9.553 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.751 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.759 μs ± 88.349 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  3.74 μs         Histogram: log(frequency) by time         3.95 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
@turbo doesn't care, and will use full sized vectors anyway. I'm not quite sure why it got worse performance than before; the assembly is actually exactly the same.
But you can see that the "normal" version of the dot product (dot_fast) was actually faster with 256 bit vectors than 512 bit vectors!
Using predicates, e.g. via @vp or @turbo, made the 512 bit code faster. And in the case of @turbo, it is significantly faster at most sizes over the range of 16 or so up until a few hundred.
But note that @vp with 512 bit vectors was faster than not having @vp with 256 bit.
I do think @vp + 512 bit vectors is a better default than not-@vp and 256 bit, unless the expected vector length is very long.
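If you want to check what your own machine does, one way to verify is to inspect the generated IR (via @code_llvm from InteractiveUtils, which the REPL loads by default):

    # Look for <8 x double> (512 bit) vs <4 x double> (256 bit),
    # and for llvm.masked.load calls in the predicated version.
    @code_llvm debuginfo=:none dot_vp(x, y)
    @code_llvm debuginfo=:none dot_fast(x, y)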
EDIT: Also, our results should be fairly comparable.
julia> versioninfo()
Julia Version 1.9.0-DEV.635
Commit 5ef75cbf5b (2022-05-24 19:07 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.3 (ORCJIT, cascadelake)
  Threads: 36 on 36 virtual cores