Performance with small matrices: 5-argument mul! vs BLAS.gemm!

Trying out the new 5-argument mul! introduced in Julia 1.3, (ABα+Cβ -> C), for the calculation of matrix commutator comu(A,B)=AB-BA. From my testing, with small (2x2 or 3x3) matrices, mul! is actually faster than gemm!, and when such functions are used in the inner-most/hot loop, it can lead to very noticeable performance differences:

using BenchmarkTools
using LinearAlgebra

ndim = 3;
m1=rand(ComplexF64,ndim,ndim);
m2=rand(ComplexF64,ndim,ndim);
ou=rand(ComplexF64,ndim,ndim);

function comu!(r, a, b)
    idy = one(eltype(r))
    mul!(r, a, b)
    mul!(r, b, a, -idy, idy)
    return nothing
end

function comu_b!(r, a, b)
    idy = one(eltype(r))
    mul!(r, a, b)
    BLAS.gemm!('N', 'N', -idy, b, a, idy, r)
    return nothing
end
@btime comu!($ou,$m1,$m2)

88.698 ns (0 allocations: 0 bytes)

@btime comu_b!($ou,$m1,$m2)

200.584 ns (0 allocations: 0 bytes)

Is there something special that’s going on here? As another comparison, using just 3-argument mul!

tmp = similar(ou);

function comu_mm!(r,a,b,h)
    mul!(r,a,b)
    mul!(h,b,a)
    r .-= h
    return nothing
end

@btime comu_mm!($ou,$m1,$m2,$tmp)

82.248 ns (0 allocations: 0 bytes)

Thanks!

1 Like

For very small matrices mul! doesn’t call BLAS routines but gemm! explicitly does call BLAS. BLAS is great for large matrices but as you’ve observed has some unnecessary overhead for really small matrices.

There are matmul2x2! and matmul3x3! functions that implement fully unrolled matrix multiplication for these specific sizes in https://github.com/JuliaLang/julia/blob/master/stdlib/LinearAlgebra/src/matmul.jl.

2 Likes