The first thing I would note here is the benchmark:

Notice how performance plummets beyond around `60`x`60`?

This is because LoopVectorization does register tiling, but does not yet do cache tiling. I plan for this to change in the future, but for now, that means we must do the cache-tiling ourselves.

What does "tiling" mean? It means operating block-wise to get data re-use. "Cache-tiling" means re-use in the CPU's cache, and "register-tiling" means re-use in the CPU's registers.

```
@inbounds for j in 1:size(A,2)
    @simd for i in 1:size(A,1)
        y[i] += A[i,j] * x[j]
    end
end
```

When register-tiling here, note that as the loop above is written, we need to:

- load `x[j]` a total of `size(A,2)` times, the minimum possible.
- load `A[i,j]` a total of `size(A,1) * size(A,2)` times, the minimum possible.
- load and store `y[i]` a total of `size(A,1) * size(A,2)` times, an excess factor of `size(A,2)` (see the quick tally after this list).
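To put concrete numbers on those counts, here is a quick back-of-the-envelope tally for the `500`x`500` size used in the next example (the variable names are just for illustration; the values are nothing but the products from the list above):

```
M = N = 500
x_loads = N          # 500: one load of x[j] per j iteration (already the minimum)
A_loads = M * N      # 250_000: one load of A[i,j] per (i,j) pair (already the minimum)
y_moves = M * N      # 250_000 loads plus 250_000 stores of y[i], vs. the ideal of just M = 500
```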

Register tiling will attack that third point. To investigate what exactly it does:

```
julia> using LoopVectorization
julia> M = N = 500;
julia> y = Vector{Float64}(undef, M); A = rand(M,N); x = rand(N);
julia> ls = LoopVectorization.@avx_debug for j in 1:size(A,2), i in 1:size(A,1)
y[i] += A[i,j] * x[j]
end;
OPS = Tuple{:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x03),:numericconstant,Symbol("##reductzero#640"),LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x04),:LoopVectorization,:vfmadd_fast,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000001, 0x0000000000000000, 0x0000000000010204, LoopVectorization.compute, 0x00, 0x04),:LoopVectorization,:reduced_add,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000001, 0x0000000000000000, 0x0000000000000503, LoopVectorization.compute, 0x00, 0x03),:LoopVectorization,:setindex!,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000006, LoopVectorization.memstore, 0x03, 0x05)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:A,Symbol("##vptr##_A")}(0x0000000000000101, 0x0000000000000201, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:x,Symbol("##vptr##_x")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:y,Symbol("##vptr##_y")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000)}
AM = Tuple{0,Tuple{},Tuple{},Tuple{},Tuple{},Tuple{(4, LoopVectorization.IntOrFloat)},Tuple{}}
LPSYM = Tuple{:j,:i}
LB = Tuple{VectorizationBase.StaticLowerUnitRange{1},VectorizationBase.StaticLowerUnitRange{1}}
vargs = (VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00005605ec236e00, (400,)), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c620, ()), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c240, ()))
julia> LoopVectorization.choose_order(ls)
([:i, :j], :j, :i, :i, 1, 8)
```

The vector `[:i, :j]` gives the order of the loops from outer-most to inner-most. That is, it placed the `i` loop as outer-most, while the `j` loop is inner-most.

The next two symbols give which loops are unrolled, by the factors given by the two integers: `j` is unrolled by `1`x (i.e., it isn't), and `i` by `8`x. Given that `j` is now the inner-most loop, there isn't much sense in unrolling it.

The last symbol refers to which loop is actually vectorized.
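To make that concrete, here is a hand-written sketch of the loop structure this order and unrolling imply. It is only an illustration, not the code `@avx` actually emits (the real thing works on SIMD vectors rather than scalars, uses the unroll factor of `8` reported above, and handles the remainder when `size(A,1)` isn't a multiple of the unroll); unrolling by 4 here just keeps it short:

```
# i outer-most and unrolled (by 4 for brevity), j inner-most.
# The accumulators live in registers for the whole j loop, and each
# x[j] load is shared by all of the unrolled i iterations.
for i in 1:4:size(A,1)              # assumes size(A,1) is a multiple of 4
    y1 = y2 = y3 = y4 = zero(eltype(y))
    for j in 1:size(A,2)
        xj = x[j]                   # one load, re-used 4 times
        y1 += A[i,   j] * xj
        y2 += A[i+1, j] * xj
        y3 += A[i+2, j] * xj
        y4 += A[i+3, j] * xj
    end
    y[i]   += y1                    # y[i] is touched once per i, not once per (i,j)
    y[i+1] += y2
    y[i+2] += y3
    y[i+3] += y4
end
```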

Note that this can change as a function of types:

```
julia> B = A';
julia> lst = LoopVectorization.@avx_debug for j in 1:size(A,2), i in 1:size(A,1)
y[i] += B[i,j] * x[j]
end;
OPS = Tuple{:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x03),:numericconstant,Symbol("##reductzero#656"),LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x04),:LoopVectorization,:vfmadd_fast,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000001, 0x0000000000000000, 0x0000000000010204, LoopVectorization.compute, 0x00, 0x04),:LoopVectorization,:reduced_add,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000001, 0x0000000000000000, 0x0000000000000503, LoopVectorization.compute, 0x00, 0x03),:LoopVectorization,:setindex!,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000006, LoopVectorization.memstore, 0x03, 0x05)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:B,Symbol("##vptr##_B")}(0x0000000000000101, 0x0000000000000201, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:x,Symbol("##vptr##_x")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:y,Symbol("##vptr##_y")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000)}
AM = Tuple{0,Tuple{},Tuple{},Tuple{},Tuple{},Tuple{(4, LoopVectorization.IntOrFloat)},Tuple{}}
LPSYM = Tuple{:j,:i}
LB = Tuple{VectorizationBase.StaticLowerUnitRange{1},VectorizationBase.StaticLowerUnitRange{1}}
vargs = (VectorizationBase.RowMajorStridedPointer{Float64,1}(Ptr{Float64} @0x00005605ec236e00, (400,)), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c620, ()), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c240, ()))
julia> LoopVectorization.choose_order(lst) # j is now vectorized instead of `i`
([:i, :j], :j, :i, :j, 1, 8)
```

(Because `B = A'` is a lazy transpose, stepping along `j` in `B[i,j]` walks down a column of `A`, so `j` is now the contiguous direction and is the one that gets vectorized.)

Anyway, back to the original `([:i, :j], :j, :i, :i, 1, 8)` example.

We still load from `A` once for each `(i,j)` combination. We were already at the minimum, so no savings there.

As `i` is now the outer-most loop, we load and store `y[i]` only `size(A,1)` times.

However, we now have to load `x[j]` a total of `cld(size(A,1),8)*size(A,2)` times. The division by `8` comes from the unrolling: each load of `x[j]` is re-used by all `8` unrolled iterations of the `i` loop, so `8`x unrolling means an `8`x reduction in how many times we have to load from this vector (compared to the `size(A,1)*size(A,2)` loads it would take otherwise)!
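Putting numbers on it for the `500`x`500` case (again just plugging into the expressions above, with the unroll factor of `8` that `choose_order` picked; the variable names are only for illustration):

```
M = N = 500; U = 8
y_moves_naive = M * N            # 250_000 loads and stores of y[i] with j as the outer loop
y_moves_tiled = M                # 500 after register tiling: the minimum
x_loads_naive = N                # 500 with j as the outer loop: the minimum
x_loads_tiled = cld(M, U) * N    # 63 * 500 = 31_500: more than before, but about 8x fewer
                                 # than the M*N it would cost without the unrolling
```

The extra `x[j]` loads are the price paid for cutting `y[i]` traffic by a factor of `size(A,2)`.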

We can see some spectacular performance improvements from this:

```
julia> function gemvavx1!(y, A, x)
           @avx for j in axes(A,2), i in axes(A,1)
               y[i] += A[i,j] * x[j]
           end
       end
gemvavx1! (generic function with 1 method)
julia> function gemvavx2!(y, A, x)
           @avx for i in axes(A,1)
               yᵢ = zero(eltype(y))
               for j ∈ axes(A,2)
                   yᵢ += A[i,j] * x[j]
               end
               y[i] = yᵢ
           end
       end
gemvavx2! (generic function with 1 method)
julia> function gemv_simd!(y, A, x)
           @inbounds for j in axes(A,2); @simd ivdep for i in axes(A,1)
               y[i] += A[i,j] * x[j]
           end; end
       end
gemv_simd! (generic function with 1 method)
julia> M = N = 56;
julia> y1 = Vector{Float64}(undef, M); A = rand(M,N); x = rand(N); y2 = similar(y1); y3 = similar(y1);
julia> @benchmark gemv_simd!(fill!($y1, 0), $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 862.015 ns (0.00% GC)
median time: 867.697 ns (0.00% GC)
mean time: 867.474 ns (0.00% GC)
maximum time: 1.537 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 66
julia> @benchmark gemvavx1!(fill!($y2, 0), $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 92.937 ns (0.00% GC)
median time: 93.351 ns (0.00% GC)
mean time: 93.472 ns (0.00% GC)
maximum time: 147.397 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 954
julia> @benchmark gemvavx2!($y3, $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 76.973 ns (0.00% GC)
median time: 77.107 ns (0.00% GC)
mean time: 77.198 ns (0.00% GC)
maximum time: 130.935 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 969
julia> y1 ≈ y2 ≈ y3
true
```

But only at these tiny sizes.

Your CPU is way faster at computing than it is at moving memory around.
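To put a rough number on that for this kernel (a back-of-the-envelope arithmetic-intensity estimate, nothing machine-specific): each element of `A` costs one multiply and one add, but also 8 bytes of traffic, so there is very little computation to hide the memory accesses behind.

```
M = N = 500
flops = 2 * M * N    # one multiply + one add per element of A
bytes = 8 * M * N    # each Float64 of A is read exactly once
flops / bytes        # 0.25 flops per byte: far less work per byte than the core can do
```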

So `A * x` requires on the order of `length(A)` operations while touching on the order of `length(A)` memory, and in the end it'll perform more or less the same as `sum`:

```
julia> function sumavx(A)
           s = zero(eltype(A))
           @avx for i ∈ eachindex(A)
               s += A[i]
           end
           s
       end
sumavx (generic function with 1 method)
julia> @btime sum($A)
192.643 ns (0 allocations: 0 bytes)
1578.41421948359
julia> @btime sumavx($A)
67.960 ns (0 allocations: 0 bytes)
1578.41421948359
```

This is because `A * x` and `sum(A)` require accessing roughly the same amount of memory.

And once these arrays stop fitting in our CPU's caches, memory bandwidth is all that determines performance. Make `A` `500`x`500`, and `gemv_simd!`, which was so slow before, is now just as fast as everything else:

```
julia> M = N = 500;
julia> y1 = Vector{Float64}(undef, M); A = rand(M,N); x = rand(N); y2 = similar(y1); y3 = similar(y1);
julia> @benchmark gemv_simd!(fill!($y1, 0), $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 82.602 μs (0.00% GC)
median time: 85.719 μs (0.00% GC)
mean time: 85.867 μs (0.00% GC)
maximum time: 168.723 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark gemvavx1!(fill!($y2, 0), $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 82.080 μs (0.00% GC)
median time: 84.833 μs (0.00% GC)
mean time: 84.985 μs (0.00% GC)
maximum time: 257.037 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark gemvavx2!($y3, $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 83.944 μs (0.00% GC)
median time: 84.750 μs (0.00% GC)
mean time: 84.895 μs (0.00% GC)
maximum time: 258.025 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> y1 ≈ y2 ≈ y3
true
julia> @btime sumavx($A)
83.417 μs (0 allocations: 0 bytes)
125042.8745667071
julia> @btime sumsimd($A)
83.430 μs (0 allocations: 0 bytes)
125042.87456670718
```
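As a rough sanity check on the bandwidth claim (back-of-the-envelope only; the exact figure depends on the machine):

```
# A alone is 500*500 Float64s ≈ 2 MB, too big for typical L1/L2 caches,
# so every call has to stream essentially all of A from L3/RAM.
bytes_streamed = 500 * 500 * 8    # 2_000_000 bytes
bytes_streamed / 83e-6 / 1e9      # ≈ 24 GB/s sustained just to read A once in ~83 μs
```

That is consistent with the point above: at this size, memory bandwidth rather than compute sets the pace.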

So the best you can do is add multithreading and break the work up along `i` (you can't parallelize the `j` loop, or you'd have different threads reading and writing `y` simultaneously):

```
function gemv_threads!(y, A, x)
    M = length(y)
    chunks = min(M >> 7, Base.Threads.nthreads())  # no more tasks than M ÷ 128, so each gets a decent chunk
    chunks > 1 || return gemvavx2!(y, A, x)        # too small to be worth threading
    chunk_size = LoopVectorization.VectorizationBase.align(cld(M, chunks), eltype(y))  # round chunk length up to an aligned boundary
    Base.Threads.@sync for c in 1:chunks
        Threads.@spawn begin
            b = chunk_size*(c-1) + 1
            e = min(M, chunk_size*c)
            r = b:e
            gemvavx2!(view(y, r), view(A, r, :), x)  # each task handles its own block of rows
        end
    end
    y
end
```

With this, for 500x500, I get:

```
julia> @benchmark gemv_threads!($y3, $A, $x)
BenchmarkTools.Trial:
memory estimate: 2.09 KiB
allocs estimate: 30
--------------
minimum time: 10.343 μs (0.00% GC)
median time: 35.046 μs (0.00% GC)
mean time: 35.778 μs (0.00% GC)
maximum time: 164.298 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> using LinearAlgebra; BLAS.set_num_threads(Base.Threads.nthreads());
julia> @benchmark mul!($y2, $A, $x)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 12.815 μs (0.00% GC)
median time: 15.596 μs (0.00% GC)
mean time: 15.964 μs (0.00% GC)
maximum time: 82.304 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> y2 ≈ y3
true
```

Peak (minimum-time) performance is similar to OpenBLAS here, but the threaded version's average performance is much worse.

As @stillyslalom said, LoopVectorization does its own modeling to make decisions and then generates code with lots of `llvmcall`.