I'm going to focus on matrix-matrix multiplication instead of matrix-vector.
See the LoopVectorization benchmarks here for results across a range of sizes.
Picking one size (144x144), let's look a little closer at a very simple implementation of matrix multiplication, to guarantee that nothing too fancy is going on:
julia> using LoopVectorization, PaddedMatrices
julia> M = K = N = 144; C = Matrix{Float64}(undef, M, N); A = rand(M, K); B = rand(K, N);
julia> function AmulB!(C, A, B)
           @avx for m ∈ axes(A,1), n ∈ axes(B,2)
               Cmn = zero(eltype(C))
               for k ∈ axes(A,2)
                   Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
       end
AmulB! (generic function with 1 method)
julia> @benchmark AmulB!($C, $A, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 47.615 μs (0.00% GC)
median time: 47.789 μs (0.00% GC)
mean time: 47.895 μs (0.00% GC)
maximum time: 129.122 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark AmulB!($C, $A', $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 69.159 μs (0.00% GC)
median time: 69.557 μs (0.00% GC)
mean time: 69.722 μs (0.00% GC)
maximum time: 129.797 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
The column-major version was about 45% faster!
Why is that? Because keeping A column major is actually a great way to multiply matrices. In fact, even A' * B' is better than A' * B:
julia> @benchmark AmulB!($C, $A', $B')
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 59.039 μs (0.00% GC)
median time: 60.140 μs (0.00% GC)
mean time: 59.920 μs (0.00% GC)
maximum time: 117.935 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Because A' * B' is like (B * A)'.
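A quick numeric check of that identity (A2 and B2 here are just throwaway matrices for illustration):
julia> A2 = rand(4, 4); B2 = rand(4, 4);
julia> A2' * B2' ≈ (B2 * A2)'
true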
Why is this? There are two major problems with A' * B:
- Dot products aren't that efficient with respect to maximizing SIMD: you do a bunch of SIMD operations, but then at the end you need to do a bunch of extra operations to turn your SIMD vectors into a scalar before you can store it in C.
- You're moving rapidly across the memory in the arrays, because you're loading SIMD vectors from both of them. You want to minimize memory bandwidth requirements.
For reference, A * B is basically doing:
C[1:8,1] .= A[1:8,1] * B[1,1] .+ A[1:8,2] * B[2,1] .+ A[1:8,3] * B[3,1] .+ ...
Compared to A' * B, this means that the SIMD vectors from A can be both loaded and stored directly into C without the need for any extra processing, and that we're loading elements from B very slowly; memory bandwidth is much less of an issue for matrix B.
While for A' * B, we have
C[1,1] = sum( A[1:8,1] .* B[1:8,1] .+ A[9:16,1] .* B[9:16,1] .+ A[17:24,1] .* B[17:24,1] .+ ...)
requiring the extra sum at the end, on top of the more rapid memory churn.
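To make the two access patterns concrete, here are plain scalar versions of both loop orderings (illustrative sketches only; mul_axpy! and mul_dot! are hypothetical names, and these are not the kernels @avx actually generates):

# C = A * B with the m loop innermost: unit-stride loads from A and stores to C,
# one scalar from B per pass down a column, and no reduction needed.
function mul_axpy!(C, A, B)
    fill!(C, zero(eltype(C)))
    for n in axes(B, 2), k in axes(A, 2)
        b = B[k, n]
        for m in axes(A, 1)
            C[m, n] += A[m, k] * b
        end
    end
    return C
end

# C = A' * B with the k loop innermost: unit-stride loads from both A and B,
# but when this loop is vectorized, the SIMD accumulator has to be horizontally
# summed down to a scalar before the store into C.
function mul_dot!(C, A, B)
    for n in axes(B, 2), m in axes(A, 2)
        acc = zero(eltype(C))
        for k in axes(A, 1)
            acc += A[k, m] * B[k, n]
        end
        C[m, n] = acc
    end
    return C
end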
If you're on Linux, you can follow this example by using the master branch of LinuxPerf.jl (the @pstats macro isn't in the latest release). For each @pstats below, I ran it twice to make sure nothing was compiling. A * B:
julia> using LinuxPerf
julia> foreachf!(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> f(args...), Base.OneTo(N)) # call f(args...) N times
foreachf! (generic function with 1 method)
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
foreachf!(AmulB!, 10_000, C, A, B)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.97e+09   60.1%  #  4.1 cycles per ns
┌ instructions             6.57e+09   60.2%  #  3.3 insns per cycle
│ branch-instructions      1.40e+08   60.2%  #  2.1% of instructions
└ branch-misses            1.09e+06   60.2%  #  0.8% of branch instructions
┌ task-clock               4.82e+08  100.0%  # 481.9 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    5.72e+08   19.9%  # 27.4% of dcache loads
│ L1-dcache-loads          2.09e+09   19.9%
└ L1-icache-load-misses    2.10e+04   19.9%
┌ dTLB-load-misses         1.51e+01   19.9%  #  0.0% of dTLB loads
└ dTLB-loads               2.09e+09   19.9%
┌ iTLB-load-misses         1.15e+04   40.0%  # 5801.3% of iTLB loads
└ iTLB-loads               1.97e+02   40.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Now A' * B:
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" begin
foreachf!(AmulB!, 10_000, C, A', B)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               2.84e+09   59.9%  #  4.1 cycles per ns
┌ instructions             8.03e+09   60.0%  #  2.8 insns per cycle
│ branch-instructions      1.96e+08   60.0%  #  2.4% of instructions
└ branch-misses            8.68e+06   60.0%  #  4.4% of branch instructions
┌ task-clock               6.98e+08  100.0%  # 697.6 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    7.99e+08   20.1%  # 51.2% of dcache loads
│ L1-dcache-loads          1.56e+09   20.1%
└ L1-icache-load-misses    2.52e+05   20.1%
┌ dTLB-load-misses         2.20e+02   20.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.56e+09   20.0%
┌ iTLB-load-misses         2.97e+03   39.9%  # 736.0% of iTLB loads
└ iTLB-loads               4.04e+02   39.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Note how for A' * B we needed 803_000 instructions per multiplication (8.03e9 to do it 10_000 times), but for A * B we only needed 657_000 instructions. That's because, as outlined above, we can store directly from A into C without any extra work.
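For scale, assuming 8 Float64 lanes per SIMD vector (an assumption about the benchmark machine, e.g. AVX-512, not something reported above), the arithmetic alone accounts for a bit over half of the A * B instruction count:
julia> fmas = 144^3 # one fused multiply-add per (m, k, n) triple
2985984
julia> fmas ÷ 8     # ≈ vector FMA instructions at 8 Float64s per vector
373248
That is roughly 373_000 of the 657_000 instructions per A * B call; the rest are loads, stores, and loop bookkeeping.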
On top of that, we ran at 3.3 instructions per clock cycle for A * B with 27.4% L1-dcache load misses, while A' * B only hit 2.8 instructions per clock cycle (on top of needing many more instructions!), with 51.2% L1-dcache load misses. That's the cost of the rapid memory churn required.
The library PaddedMatrices.jl uses LoopVectorization.jl to define an at least half-serious attempt at matrix multiplication. It can calculate A' * B much faster than the AmulB! we defined above:
julia> @benchmark PaddedMatrices.jmul!($C, $A', $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 55.666 μs (0.00% GC)
median time: 55.923 μs (0.00% GC)
mean time: 56.026 μs (0.00% GC)
maximum time: 125.665 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
How? It's actually faster to transpose A, copying the memory to a new location, and then do a column major multiplication. So that is what it does.
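A minimal sketch of that idea (not PaddedMatrices' actual implementation; buffer is a hypothetical scratch array):

function AtmulB_via_copy!(C, At, B, buffer = similar(C, size(At)...))
    copyto!(buffer, At)   # one strided pass to materialize A' contiguously
    AmulB!(C, buffer, B)  # then reuse the column-major kernel from above
    return C
end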
The transpose is currently slower than it should be; MKL is much faster:
julia> using LinearAlgebra; BLAS.set_num_threads(1); BLAS.vendor()
:mkl
julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 47.968 μs (0.00% GC)
median time: 48.065 μs (0.00% GC)
mean time: 48.155 μs (0.00% GC)
maximum time: 114.288 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark mul!($C, $A', $B)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 51.489 μs (0.00% GC)
median time: 51.596 μs (0.00% GC)
mean time: 51.682 μs (0.00% GC)
maximum time: 117.158 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
but I'm planning on working on this in the not-too-distant future.
The point I would like to make here is: if you don't know how libraries like MKL and OpenBLAS are implemented, don't put too much stock in comparisons using them. They all copy your data into more efficient data structures, so it doesn't normally matter in the end what the layout was initially.
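For a rough idea of what that copying looks like (a schematic only, not MKL's or OpenBLAS's actual code, and the 48×48 tile size is just an example), such libraries pack tiles of the inputs into small contiguous buffers before running the inner kernel, so the kernel never sees the caller's original layout:

# Strided reads from X happen once, here; every later read of buf is unit-stride.
function pack_tile!(buf::AbstractMatrix, X::AbstractMatrix, rows, cols)
    @inbounds for (j, c) in enumerate(cols), (i, r) in enumerate(rows)
        buf[i, j] = X[r, c]
    end
    return buf
end

# e.g. pack the top-left 48×48 tile of A':
# buf = Matrix{Float64}(undef, 48, 48); pack_tile!(buf, A', 1:48, 1:48)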