The first thing I would note here is the benchmark: notice how performance plummets beyond around 60×60?
This is because LoopVectorization does register tiling, but does not yet do cache tiling. I plan for this to change in the future, but for now, that means we must do the cache-tiling ourselves.
What does "tiling" mean? It means operating block-wise to get data re-use. "Cache-tiling" means re-use in the CPU's cache, and "register-tiling" means re-use in the CPU's registers.
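For intuition, here is a minimal sketch of what doing the cache-tiling ourselves could look like, using a matrix product C .+= A*B as the example; the function name and the block size bs are hypothetical illustrations, not anything LoopVectorization chooses:
function blocked_mul!(C, A, B, bs = 64)
    # bs is a hypothetical block size; ideally the bs×bs blocks fit in cache
    M, N = size(C); K = size(A, 2)
    @inbounds for jb in 1:bs:N, kb in 1:bs:K, ib in 1:bs:M
        for j in jb:min(jb + bs - 1, N), k in kb:min(kb + bs - 1, K)
            Bkj = B[k, j]
            @simd for i in ib:min(ib + bs - 1, M)
                C[i, j] += A[i, k] * Bkj   # the C block stays hot across k
            end
        end
    end
    C
end
The point is simply that each block of C is re-used across the whole kb sweep instead of being evicted between passes.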
@inbounds for j in 1:size(A,2)
    @simd for i in 1:size(A,1)
        y[i] += A[i,j] * x[j]
    end
end
When register-tiling here, note that as the loop above is written, we need to:
- load x[j] a total of size(A,2) times (the minimum possible).
- load A[i,j] a total of size(A,1) * size(A,2) times (the minimum possible).
- load and store y[i] a total of size(A,1) * size(A,2) times (an excess factor of size(A,2)).
Register tiling will attack that third point. To investigate what exactly it does:
julia> using LoopVectorization
julia> M = N = 500;
julia> y = Vector{Float64}(undef, M); A = rand(M,N); x = rand(N);
julia> ls = LoopVectorization.@avx_debug for j in 1:size(A,2), i in 1:size(A,1)
           y[i] += A[i,j] * x[j]
       end;
OPS = Tuple{:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x03),:numericconstant,Symbol("##reductzero#640"),LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x04),:LoopVectorization,:vfmadd_fast,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000001, 0x0000000000000000, 0x0000000000010204, LoopVectorization.compute, 0x00, 0x04),:LoopVectorization,:reduced_add,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000001, 0x0000000000000000, 0x0000000000000503, LoopVectorization.compute, 0x00, 0x03),:LoopVectorization,:setindex!,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000006, LoopVectorization.memstore, 0x03, 0x05)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:A,Symbol("##vptr##_A")}(0x0000000000000101, 0x0000000000000201, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:x,Symbol("##vptr##_x")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:y,Symbol("##vptr##_y")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000)}
AM = Tuple{0,Tuple{},Tuple{},Tuple{},Tuple{},Tuple{(4, LoopVectorization.IntOrFloat)},Tuple{}}
LPSYM = Tuple{:j,:i}
LB = Tuple{VectorizationBase.StaticLowerUnitRange{1},VectorizationBase.StaticLowerUnitRange{1}}
vargs = (VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00005605ec236e00, (400,)), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c620, ()), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c240, ()))
julia> LoopVectorization.choose_order(ls)
([:i, :j], :j, :i, :i, 1, 8)
The vector [:i, :j] gives the order of the loops from outer-most to inner-most. That is, it placed the i loop as outer-most, while the j loop is inner-most. The next two symbols give which loops are unrolled, by factors corresponding to the integers: j is unrolled by 1x (i.e., it isn't), and i by 8x. Given that j is now the inner-most loop, there's not much sense in unrolling it. The last symbol refers to which loop is actually vectorized.
Note that this can change as a function of types:
julia> B = A';
julia> lst = LoopVectorization.@avx_debug for j in 1:size(A,2), i in 1:size(A,1)
           y[i] += B[i,j] * x[j]
       end;
OPS = Tuple{:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x03),:numericconstant,Symbol("##reductzero#656"),LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x04),:LoopVectorization,:vfmadd_fast,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000001, 0x0000000000000000, 0x0000000000010204, LoopVectorization.compute, 0x00, 0x04),:LoopVectorization,:reduced_add,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000001, 0x0000000000000000, 0x0000000000000503, LoopVectorization.compute, 0x00, 0x03),:LoopVectorization,:setindex!,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000006, LoopVectorization.memstore, 0x03, 0x05)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:B,Symbol("##vptr##_B")}(0x0000000000000101, 0x0000000000000201, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:x,Symbol("##vptr##_x")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:y,Symbol("##vptr##_y")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000)}
AM = Tuple{0,Tuple{},Tuple{},Tuple{},Tuple{},Tuple{(4, LoopVectorization.IntOrFloat)},Tuple{}}
LPSYM = Tuple{:j,:i}
LB = Tuple{VectorizationBase.StaticLowerUnitRange{1},VectorizationBase.StaticLowerUnitRange{1}}
vargs = (VectorizationBase.RowMajorStridedPointer{Float64,1}(Ptr{Float64} @0x00005605ec236e00, (400,)), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c620, ()), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c240, ()))
julia> LoopVectorization.choose_order(lst) # j is now vectorized instead of `i`
([:i, :j], :j, :i, :j, 1, 8)
(With B = A', memory is contiguous along j rather than i, which is why j gets vectorized there.) Anyway, back to the original ([:i, :j], :j, :i, :i, 1, 8) example. We still load from A once for each (i, j) combination; we were already at the minimum, so no savings there. As i is now the outer-most loop, we now load and store from y[i] a total of size(A,1) times, the minimum possible. We do still need to load from x[j], but only cld(size(A,1),8)*size(A,2) times: we get a reducing factor of 8 because each load of x[j] is re-used across the 8 iterations by which the i loop is unrolled. 8x unrolling means we get an 8x reduction in how many times we have to load from this vector!
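To make that concrete, here is a scalar sketch of the register tiling described above, unrolled by 4 rather than 8 for brevity; it ignores vectorization and assumes 4 divides length(y), both of which the real generated code handles:
function gemv_unrolled!(y, A, x)
    M, N = size(A)
    @inbounds for i in 1:4:M
        # four accumulators stay in registers for the entire j loop
        y1 = y2 = y3 = y4 = zero(eltype(y))
        for j in 1:N
            xj = x[j]              # one load of x[j], re-used 4 times
            y1 += A[i,   j] * xj
            y2 += A[i+1, j] * xj
            y3 += A[i+2, j] * xj
            y4 += A[i+3, j] * xj
        end
        # y is loaded and stored once per element, outside the j loop
        y[i] += y1; y[i+1] += y2; y[i+2] += y3; y[i+3] += y4
    end
    y
end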
We can see some spectacular performance improvements from this:
julia> function gemvavx1!(y, A, x)
           @avx for j in axes(A,2), i in axes(A,1)
               y[i] += A[i,j] * x[j]
           end
       end
gemvavx1! (generic function with 1 method)
julia> function gemvavx2!(y, A, x)
           @avx for i in axes(A,1)
               yᵢ = zero(eltype(y))
               for j ∈ axes(A,2)
                   yᵢ += A[i,j] * x[j]
               end
               y[i] = yᵢ
           end
       end
gemvavx2! (generic function with 1 method)
julia> function gemv_simd!(y, A, x)
           @inbounds for j in axes(A,2); @simd ivdep for i in axes(A,1)
               y[i] += A[i,j] * x[j]
           end; end
       end
gemv_simd! (generic function with 1 method)
julia> M = N = 56;
julia> y1 = Vector{Float64}(undef, M); A = rand(M,N); x = rand(N); y2 = similar(y1); y3 = similar(y1);
julia> @benchmark gemv_simd!(fill!($y1, 0), $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 862.015 ns (0.00% GC)
median time: 867.697 ns (0.00% GC)
mean time: 867.474 ns (0.00% GC)
maximum time: 1.537 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 66
julia> @benchmark gemvavx1!(fill!($y2, 0), $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 92.937 ns (0.00% GC)
median time: 93.351 ns (0.00% GC)
mean time: 93.472 ns (0.00% GC)
maximum time: 147.397 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 954
julia> @benchmark gemvavx2!($y3, $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 76.973 ns (0.00% GC)
median time: 77.107 ns (0.00% GC)
mean time: 77.198 ns (0.00% GC)
maximum time: 130.935 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 969
julia> y1 ≈ y2 ≈ y3
true
But only at these tiny sizes.
Your CPU is way faster at computing than it is at moving memory around.
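A back-of-envelope count for M = N = 500 with Float64 (my numbers, not from the original benchmark) makes the point:
M = N = 500
flops = 2M*N               # one multiply and one add per element of A
bytes = 8 * (M*N + M + N)  # each element of A, x, and y moved at least once
flops / bytes              # ≈ 0.25 flops per byte: firmly memory-bound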
And A * x requires on the order of length(A) operations, and on the same order of memory. So in the end, it'll perform more or less the same as sum:
julia> function sumavx(A)
           s = zero(eltype(A))
           @avx for i ∈ eachindex(A)
               s += A[i]
           end
           s
       end
sumavx (generic function with 1 method)
julia> @btime sum($A)
192.643 ns (0 allocations: 0 bytes)
1578.41421948359
julia> @btime sumavx($A)
67.960 ns (0 allocations: 0 bytes)
1578.41421948359
This is because A * x and sum(A) require accessing roughly the same amount of memory.
And once these arrays stop fitting in our CPU's caches, memory bandwidth is all that determines performance. Make A 500×500, and gemv_simd!, which was so slow before, is now just as fast as everything else:
julia> M = N = 500;
julia> y1 = Vector{Float64}(undef, M); A = rand(M,N); x = rand(N); y2 = similar(y1); y3 = similar(y1);
julia> @benchmark gemv_simd!(fill!($y1, 0), $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 82.602 μs (0.00% GC)
median time: 85.719 μs (0.00% GC)
mean time: 85.867 μs (0.00% GC)
maximum time: 168.723 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark gemvavx1!(fill!($y2, 0), $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 82.080 μs (0.00% GC)
median time: 84.833 μs (0.00% GC)
mean time: 84.985 μs (0.00% GC)
maximum time: 257.037 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark gemvavx2!($y3, $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 83.944 μs (0.00% GC)
median time: 84.750 μs (0.00% GC)
mean time: 84.895 μs (0.00% GC)
maximum time: 258.025 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> y1 ≈ y2 ≈ y3
true
julia> @btime sumavx($A)
83.417 μs (0 allocations: 0 bytes)
125042.8745667071
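(sumsimd isn't defined in this excerpt; presumably it is just the @inbounds @simd analogue of sumavx, something like:)
function sumsimd(A)   # hypothetical reconstruction, not from the original post
    s = zero(eltype(A))
    @inbounds @simd for i in eachindex(A)
        s += A[i]
    end
    s
end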
julia> @btime sumsimd($A)
83.430 μs (0 allocations: 0 bytes)
125042.87456670718
So the best you can do is add multithreading and break up the i dimension across threads (you can't parallelize the j loop, or you'd have different threads reading and writing the same elements of y simultaneously):
function gemv_threads!(y, A, x)
    M = length(y)
    # at most one thread per 128 rows, and no more than nthreads()
    chunks = min(M >> 7, Base.Threads.nthreads())
    chunks > 1 || return gemvavx2!(y, A, x)
    # round the chunk length up so chunk boundaries are vector-aligned
    chunk_size = LoopVectorization.VectorizationBase.align(cld(M, chunks), eltype(y))
    Base.Threads.@sync for c in 1:chunks
        Threads.@spawn begin
            # each task owns a disjoint range of rows, so no two threads
            # ever touch the same elements of y
            b = chunk_size*(c-1) + 1
            e = min(M, chunk_size*c)
            r = b:e
            gemvavx2!(view(y, r), view(A, r, :), x)
        end
    end
    y
end
With this, for 500×500, I get:
julia> @benchmark gemv_threads!($y3, $A, $x)
BenchmarkTools.Trial:
memory estimate: 2.09 KiB
allocs estimate: 30
--------------
minimum time: 10.343 μs (0.00% GC)
median time: 35.046 μs (0.00% GC)
mean time: 35.778 μs (0.00% GC)
maximum time: 164.298 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> using LinearAlgebra; BLAS.set_num_threads(Base.Threads.nthreads());
julia> @benchmark mul!($y2, $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 12.815 μs (0.00% GC)
median time: 15.596 μs (0.00% GC)
mean time: 15.964 μs (0.00% GC)
maximum time: 82.304 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> y2 ≈ y3
true
Peak performance of gemv_threads! is similar to OpenBLAS here, but its average performance is much worse.
As @stillyslalom said, it does its own modeling to make decisions and then generates code with lots of llvmcall.