Actually, I don't think this should necessarily be a problem here.
M = 512:
julia> using LoopVectorization, BenchmarkTools
julia> function AmulB!(C, A, B)
           @avx for m ∈ axes(C,1), n ∈ axes(C,2)
               Cₘₙ = zero(eltype(C))
               for k ∈ axes(B,1)
                   Cₘₙ += A[m,k] * B[k,n]
               end
               C[m,n] = Cₘₙ
           end
       end
AmulB! (generic function with 1 method)
julia> M, K, N = 512, 72, 72;
julia> A = rand(M, K); B = rand(K, N); C1 = Matrix{Float64}(undef, M, N);
julia> @benchmark AmulB!($C1, $A, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 44.239 μs (0.00% GC)
median time: 44.365 μs (0.00% GC)
mean time: 44.423 μs (0.00% GC)
maximum time: 120.546 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> 2e-9M*K*N / 44.239e-6
119.9940324148376
vs M = 520:
julia> M, K, N = 520, 72, 72;
julia> A = rand(M, K); B = rand(K, N); C1 = Matrix{Float64}(undef, M, N);
julia> @benchmark AmulB!($C1, $A, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 43.211 μs (0.00% GC)
median time: 43.345 μs (0.00% GC)
mean time: 43.409 μs (0.00% GC)
maximum time: 104.449 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> 2e-9M*K*N / 43.211e-6
124.7682303117262
The CPU I ran this on (10980XE) has an 8-way associative 32 KiB L1D cache per core:
julia> 32 * (2 ^ 10) ÷ (8 * 8)
512
Yielding the fabled 512 as the magic(-ally bad) number.
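To make that arithmetic reusable, here is a tiny sketch (the helper name and keyword defaults are mine, set to match this CPU's L1D):

function critical_stride(; cache_bytes = 32 * 2^10, ways = 8, elsize = 8)
    # cache_bytes ÷ ways is the critical stride in bytes: addresses that far
    # apart map to the same cache set. Dividing by elsize converts it to
    # elements (8-byte Float64 here), giving the 512 above.
    cache_bytes ÷ (ways * elsize)
end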
But 119 (M = 512) vs 124 (M = 520) GFLOPS isn't a big deal.
Both are very good; the theoretical peak for this CPU is 131.2 GFLOPS.
Agner Fog's example (see section 9.10, starting on page 103) is transposing a matrix.
We can try this:
julia> A = rand(512, 512); B = similar(A);
julia> @benchmark copyto!($B, $(A'))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 844.904 μs (0.00% GC)
median time: 850.551 μs (0.00% GC)
mean time: 850.523 μs (0.00% GC)
maximum time: 1.722 ms (0.00% GC)
--------------
samples: 5876
evals/sample: 1
julia> A = rand(520, 520); B = similar(A);
julia> @benchmark copyto!($B, $(A'))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 465.476 μs (0.00% GC)
median time: 470.590 μs (0.00% GC)
mean time: 470.852 μs (0.00% GC)
maximum time: 873.865 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Transposing a 512x512 matrix took about 80% longer than transposing a 520x520 matrix on my computer. Ouch!
(The problem is much less severe if you use FastTranspose.jl, where I get 285 microseconds for 512x512 vs 235 for 520x520.)
That is because what were contiguous loads (down a column) from A suddenly become 512 doubles apart when storing into B, so that each store evicts the cacheline of the previous one.
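The usual mitigation is to transpose in small tiles, so the strided stores stop thrashing a single cache set; a minimal sketch of the idea (my own loop and block size, not FastTranspose.jl's actual kernel), assuming B already has the transposed shape:

function transpose_blocked!(B, A, blk = 8)
    M, N = size(A)
    @inbounds for jb ∈ 1:blk:N, ib ∈ 1:blk:M
        # Copy one blk×blk tile at a time: a tile's stores hit at most
        # blk distinct cachelines, which fits within 8-way associativity
        # even when every column maps to the same cache set.
        for j ∈ jb:min(jb + blk - 1, N), i ∈ ib:min(ib + blk - 1, M)
            B[j, i] = A[i, j]
        end
    end
    B
end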
So, to reproduce this in matrix multiplication, we should similarly expect pathological behavior when multiplying column-major A * B without packing into a tile-major format.
This is because (in column-major matmul) on each iteration of the k loop, we load vectors from a column of A, and a bunch of scalars from a row of B. So if B's stride is 512, each of these scalars will come from a different cacheline.
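For illustration, packing would mean copying the panel of B that the k loop walks into a small contiguous buffer first. A rough sketch of the idea, with a hypothetical helper name (this is not what LoopVectorization actually emits):

function pack_panel!(buf, B, krange, nrange)
    # Copy B[krange, nrange] into a dense buffer so later reads walk
    # memory at stride 1 rather than at B's full column stride,
    # sidestepping the cache-set conflicts described above.
    @inbounds for (jj, n) ∈ enumerate(nrange), (ii, k) ∈ enumerate(krange)
        buf[ii, jj] = B[k, n]
    end
    buf
end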
Yet, performance is still fine:
julia> M, K, N = 72, 512, 72;
julia> A = rand(M, K); B = rand(K, N); C1 = Matrix{Float64}(undef, M, N); C2 = A * B;
julia> @benchmark AmulB!($C1, $A, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 43.920 μs (0.00% GC)
median time: 44.285 μs (0.00% GC)
mean time: 44.367 μs (0.00% GC)
maximum time: 120.774 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> 2e-9M*K*N / 43.920e-6
120.86557377049182
That's 120 GFLOPS. I'm happy to see great performance, but it seems like I need to understand this problem better.
@biona001
My matrices are typically 512 by 5~30k. I suppose this is too large for LoopVectorization?
At that size, why not multithread the BLAS and sum? How do ThreadsX.sum(A * A') or a plain sum(A * A') do?
Additionally, do they all really have 512 rows? If so, you can use a preallocated 512 x 512 C. Note that sum(A * A') ≠ sum(A' * A), so you'll want to make the smaller dim the outer dim.
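Concretely, that suggestion might look like this sketch (mul! is from LinearAlgebra; the 512-row shape is taken from the question above, and the function name is mine):

using LinearAlgebra

C = Matrix{Float64}(undef, 512, 512)  # reused across all the products

# gemm into the preallocated buffer (multithreaded BLAS), then reduce
sum_AAᵀ!(C, A) = sum(mul!(C, A, A'))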
Returning to my example, I looked into why the simply written version was slower, and it turns out the problem is with LLVM, so I've filed a bug there.
The simpler version gives LoopVectorization more freedom in choosing the order of the loops, in which case it chooses an order more like:
function sumAᵀA(A::AbstractMatrix{T}) where {T}
    s = zero(T)
    @avx for n ∈ axes(A,1), k ∈ axes(A,2)
        t = zero(T)
        for m ∈ axes(A,1)
            t += A[m,k] * A[n,k]
        end
        s += t
    end
    s
end
And this is indeed faster at many sizes than what I had written last time, and it manages to work around that LLVM performance bug.
I'll consider trying to automatically implement some workarounds that hopefully have less runtime overhead than doing something like the above, which should be slower than the simple
for m ∈ axes(A,1), n ∈ axes(A,1), k ∈ axes(A,2); s += ...
version.
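Spelled out, that simple version would be (the name is mine; same @avx pattern as before):

function sumAᵀA_simple(A::AbstractMatrix{T}) where {T}
    s = zero(T)
    @avx for m ∈ axes(A,1), n ∈ axes(A,1), k ∈ axes(A,2)
        # one flat reduction; this is the form that currently trips the
        # LLVM performance bug mentioned above
        s += A[m,k] * A[n,k]
    end
    s
end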
I suggest giving sumAᵀA a try; it seems to scale well without resorting to extra levels of tiling:
julia> A = rand(512, 1000);
julia> @btime sumAᵀA($A)
4.047 ms (0 allocations: 0 bytes)
6.5659906453390986e7
julia> 2e-9*512^2*1000 / 4.047e-3
129.54978996787744