Actually, I don't think this should necessarily be a problem here.
512:
julia> using LoopVectorization, BenchmarkTools
julia> function AmulB!(C, A, B)
           @avx for m ∈ axes(C,1), n ∈ axes(C,2)
               Cₘₙ = zero(eltype(C))
               for k ∈ axes(B,1)
                   Cₘₙ += A[m,k] * B[k,n]
               end
               C[m,n] = Cₘₙ
           end
       end
AmulB! (generic function with 1 method)
julia> M, K, N = 512, 72, 72;
julia> A = rand(M, K); B = rand(K, N); C1 = Matrix{Float64}(undef, M, N);
julia> @benchmark AmulB!($C1, $A, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 44.239 μs (0.00% GC)
median time: 44.365 μs (0.00% GC)
mean time: 44.423 μs (0.00% GC)
maximum time: 120.546 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> 2e-9M*K*N / 44.239e-6
119.9940324148376
vs 520:
julia> M, K, N = 520, 72, 72;
julia> A = rand(M, K); B = rand(K, N); C1 = Matrix{Float64}(undef, M, N);
julia> @benchmark AmulB!($C1, $A, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 43.211 μs (0.00% GC)
median time: 43.345 μs (0.00% GC)
mean time: 43.409 μs (0.00% GC)
maximum time: 104.449 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> 2e-9M*K*N / 43.211e-6
124.7682303117262
The CPU I ran this on (a 10980XE) has an 8-way associative, 32 KiB L1D cache per core:
julia> 32 * (2 ^ 10) ÷ (8 * 8)
512
Yielding the fabled 512 as the magic(-ally bad) number.
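To spell out that arithmetic: the critical stride is cache size ÷ associativity = 32768 ÷ 8 = 4096 bytes, which is 512 Float64s, so addresses 512 doubles apart all compete for the same 8 ways of a single cache set:
julia> (32 * 2^10 ÷ 8) ÷ sizeof(Float64) # (cache size ÷ ways) ÷ bytes per Float64
512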
But 119 (M = 512) vs 124 (M = 520) GFLOPS isn't a big deal.
Both are very good; the theoretical peak for this CPU is 131.2 GFLOPS.
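For reference, that peak presumably comes from the 10980XE's roughly 4.1 GHz single-core AVX-512 clock, with 2 fma units × 8 Float64s per 512-bit vector × 2 flops per fma = 32 flops/cycle:
julia> 2 * 8 * 2 * 4.1
131.2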
Agner Fog's example (see section 9.10, starting on page 103) is transposing a matrix.
We can try this:
julia> A = rand(512, 512); B = similar(A);
julia> @benchmark copyto!($B, $(A'))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 844.904 μs (0.00% GC)
median time: 850.551 μs (0.00% GC)
mean time: 850.523 μs (0.00% GC)
maximum time: 1.722 ms (0.00% GC)
--------------
samples: 5876
evals/sample: 1
julia> A = rand(520, 520); B = similar(A);
julia> @benchmark copyto!($B, $(A'))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 465.476 μs (0.00% GC)
median time: 470.590 μs (0.00% GC)
mean time: 470.852 μs (0.00% GC)
maximum time: 873.865 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Transposing a 512x512 matrix took about 80% longer than transposing a 520x520 matrix on my computer. Ouch!
(The problem is much less severe if you use FastTranspose.jl, where I get 285 microseconds for 512x512 vs 235 microseconds for 520x520.)
That is because what were contiguous loads (down a column) from A become stores into B that are 512 doubles apart, i.e. exactly one critical stride, so successive stores all map to the same cache set and keep evicting each other's cachelines before the neighboring elements in those lines get written.
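To make the access pattern concrete, here is a minimal hand-written transpose (transpose_naive! is just an illustrative name, and copyto! on an adjoint dispatches to its own method, so treat this as a sketch of the pattern rather than the code that ran above):
function transpose_naive!(B, A)
    @inbounds for j ∈ axes(A, 2), i ∈ axes(A, 1)
        # reads of A[i,j] walk contiguously down a column, but the stores
        # B[j,i] are stride(B,2) doubles apart: 512 here, exactly one
        # critical stride, so they all land in the same cache set
        B[j, i] = A[i, j]
    end
    return B
end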
So, to reproduce this in matrix multiplication, we should similarly expect pathological behavior when multiplying column-major A * B without packing into a tile-major format.
This is because (in column-major matmul) on each iteration of the k loop, we load vectors from a column of A, and a bunch of scalars from a row of B. So if B's stride is 512, each of these scalars will be from a different cacheline, and those cachelines all map to the same cache set.
Yet, performance is still fine:
julia> M, K, N = 72, 512, 72;
julia> A = rand(M, K); B = rand(K, N); C1 = Matrix{Float64}(undef, M, N); C2 = A * B;
julia> @benchmark AmulB!($C1, $A, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 43.920 μs (0.00% GC)
median time: 44.285 μs (0.00% GC)
mean time: 44.367 μs (0.00% GC)
maximum time: 120.774 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> 2e-9M*K*N / 43.920e-6
120.86557377049182
That's 120 GFLOPS. I'm happy to see great performance, but it seems like I need to understand this problem better.
@biona001
My matrices are typically 512 by 5~30k. I suppose this is too large for LoopVectorization?
At that size, why not multithread the BLAS and sum? How does ThreadsX.sum(A * A') or a simple sum(A * A') do?
Additionally, do they all really have 512 rows? If so, you can use a preallocated 512 x 512 C. Note that sum(A * A') ≠ sum(A' * A) in general, so it matters which one you compute; conveniently, A * A' is the one that makes the smaller dim the outer dim.
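A sketch of what that could look like (assuming ThreadsX.jl is installed; the 20_000 column count is just a stand-in for the 5~30k range):
using LinearAlgebra, ThreadsX

A = rand(512, 20_000)                 # stand-in for one of the 512 × 5~30k matrices
C = Matrix{Float64}(undef, 512, 512)  # preallocate once, reuse for every A

mul!(C, A, A')    # multithreaded BLAS gemm into the preallocated buffer
ThreadsX.sum(C)   # or plain sum(C); reducing a 512 x 512 matrix is cheap either way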
Returning to my example, I looked into why the simply-written version was slower, and it turns out the problem is with LLVM, so I've filed a bug there.
The simpler version gives LoopVectorization more freedom in choosing the order of the loops, in which case it chooses an order more like:
function sumAᵗA(A::AbstractMatrix{T}) where {T}
    s = zero(T)
    @avx for n ∈ axes(A,1), k ∈ axes(A,2)
        t = zero(T)
        for m ∈ axes(A,1)
            t += A[m,k] * A[n,k]
        end
        s += t
    end
    s
end
And this is indeed faster at many sizes than what I had written last time, and manages to work around that LLVM performance bug.
I'll consider trying to automatically implement some workarounds that hopefully have less runtime overhead than doing something like the above, which should be slower than the simple
for m ∈ axes(A,1), n ∈ axes(A,1), k ∈ axes(A,2); s += ... version.
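Written out, that simple version would look like this (sumAᵗA_simple is just a name for illustration; the elided body is the same product as in sumAᵗA above):
function sumAᵗA_simple(A::AbstractMatrix{T}) where {T}
    s = zero(T)
    @avx for m ∈ axes(A,1), n ∈ axes(A,1), k ∈ axes(A,2)
        s += A[m,k] * A[n,k]
    end
    s
end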
I suggest giving sumAᵗA a try; it seems to scale well without resorting to extra levels of tiling:
julia> A = rand(512, 1000);
julia> @btime sumAᵗA($A)
4.047 ms (0 allocations: 0 bytes)
6.5659906453390986e7
julia> 2e-9*512^2*1000 / 4.047e-3
129.54978996787744