As an example:
julia> using StaticArrays, BenchmarkTools
julia> @inline function AmulB!(C,A,B)
@fastmath @inbounds for n ∈ axes(C,2), m ∈ axes(C,1)
Cmn = zero(eltype(C))
for k ∈ axes(B,1)
Cmn += A[m,k]*B[k,n]
end
C[m,n] = Cmn
end
end
AmulB! (generic function with 1 method)
julia> @inline function AmulB(A::StaticMatrix{M,K,T1}, B::StaticMatrix{K,N,T2}) where {M,K,N,T1,T2}
           C = MMatrix{M,N,promote_type(T1,T2)}(undef)
           AmulB!(C, A, B)
           return SMatrix(C)
       end
AmulB (generic function with 1 method)
julia> A = @SMatrix rand(7,7); B = @SMatrix rand(7,7);
julia> @btime $(Ref(A))[] * $(Ref(B))[];
58.635 ns (0 allocations: 0 bytes)
julia> @btime AmulB($(Ref(A))[], $(Ref(B))[]);
34.616 ns (0 allocations: 0 bytes)
julia> AmulB(A, B) ≈ A * B
true
julia> @inline function AmulBalloc(A::StaticMatrix{M,K,T1}, B::StaticMatrix{K,N,T2}) where {M,K,N,T1,T2}
           C = MMatrix{M,N,promote_type(T1,T2)}(undef)
           AmulB!(C, A, B)
           return C
       end
AmulBalloc (generic function with 1 method)
julia> @btime AmulBalloc($(Ref(A))[], $(Ref(B))[]);
29.751 ns (1 allocation: 400 bytes)
As you can see here, heap allocating the MArray (30 ns) is actually faster than stack allocating it and then copying it into a stack allocated SMatrix (35 ns), when what the compiler should be doing is just writing the result directly into the eventual stack allocated matrix. Both are faster than the unrolled StaticArrays method (59 ns).
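For illustration, writing the result directly could look something like this by hand (AmulB_direct is a hypothetical name; this is just a minimal sketch that builds the tuple of results up front instead of going through an MMatrix, not a tuned implementation):

@inline function AmulB_direct(A::StaticMatrix{M,K,T1}, B::StaticMatrix{K,N,T2}) where {M,K,N,T1,T2}
    T = promote_type(T1, T2)
    # Build the SMatrix straight from a tuple of the M*N results (column-major),
    # skipping the MMatrix temporary entirely.
    SMatrix{M,N,T}(ntuple(Val(M*N)) do i
        m = (i - 1) % M + 1
        n = (i - 1) ÷ M + 1
        Cmn = zero(T)
        @inbounds @fastmath for k ∈ 1:K
            Cmn += A[m,k]*B[k,n]
        end
        Cmn
    end)
end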
The compiler does a really bad job with more complicated types:
julia> using ForwardDiff
julia> Ad = ForwardDiff.Dual.(A, A, A, A);
julia> @btime AmulB($(Ref(Ad))[], $(Ref(B))[]);
2.366 μs (0 allocations: 0 bytes)
julia> @btime $(Ref(Ad))[] * $(Ref(B))[];
105.213 ns (0 allocations: 0 bytes)
(it’s using gathers and scatters in the loopy version), but that’s a fixable compiler problem.
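One way to check this (assuming the definitions above; the exact IR depends on the Julia and LLVM versions) is to dump the LLVM IR of the loopy kernel for the Dual element type and look for llvm.masked.gather / llvm.masked.scatter calls:

Cd = MMatrix{7,7,eltype(Ad)}(undef);
@code_llvm debuginfo=:none AmulB!(Cd, Ad, B)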
In principle, custom types are where the loop-based version should really shine, because the degree of unrolling in the generated code isn’t going to scale appropriately with the size of the element types.
By unrolling the duals, the loop vectorizer (which SIMDs across loop iterations) is basically forced into the gather/scatter here.
I suspect that if we also re-rolled Dual numbers, so that they are just loops too, we might see better runtime performance. Matmul with just a single dual matrix can be reinterpreted into a regular Float64 matmul, which would vectorize fine without gather/scatter.
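As a rough sketch of that reinterpret idea using plain Arrays rather than SMatrix (just an illustration of the layout argument, not how ForwardDiff or StaticArrays implement anything): a Dual{T,Float64,P} is P+1 contiguous Float64s, and matmul acts row-wise on its left operand, so reinterpreting the dual matrix along its first dimension reduces the problem to an ordinary Float64 matmul.

using ForwardDiff

A  = rand(7, 7); B = rand(7, 7);
Ad = ForwardDiff.Dual.(A, A, A, A);      # 7×7 Matrix of Dual{Nothing,Float64,3}

Cd = Ad * B                              # matmul over the Dual eltype
Cf = reinterpret(Float64, Ad) * B        # 28×7 plain Float64 matmul

# The raw Float64 layout of the Dual result (value + 3 partials per entry)
# matches the Float64 matmul, up to roundoff.
reinterpret(Float64, Cd) ≈ Cf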
The extreme unrolling from combining large (nested) duals with static arrays can lead to 70% of compile time being spent in LLVM, as happens in certain Pumas model fits.
I’m sure it slows down the Julia side of compilation a lot, too.