Performance with small matrices: 5-argument mul! vs BLAS.gemm!

Egwene_al_Vere · November 26, 2019, 7:10pm

Trying out the new 5-argument mul! introduced in Julia 1.3, (ABα+Cβ → C), for the calculation of matrix commutator comu(A,B)=AB-BA. From my testing, with small (2x2 or 3x3) matrices, mul! is actually faster than gemm!, and when such functions are used in the inner-most/hot loop, it can lead to very noticeable performance differences:

using BenchmarkTools
using LinearAlgebra

ndim = 3;
m1=rand(ComplexF64,ndim,ndim);
m2=rand(ComplexF64,ndim,ndim);
ou=rand(ComplexF64,ndim,ndim);

function comu!(r, a, b)
    idy = one(eltype(r))
    mul!(r, a, b)
    mul!(r, b, a, -idy, idy)
    return nothing
end

function comu_b!(r, a, b)
    idy = one(eltype(r))
    mul!(r, a, b)
    BLAS.gemm!('N', 'N', -idy, b, a, idy, r)
    return nothing
end

@btime comu!($ou,$m1,$m2)

88.698 ns (0 allocations: 0 bytes)

@btime comu_b!($ou,$m1,$m2)

200.584 ns (0 allocations: 0 bytes)

Is there something special that’s going on here? As another comparison, using just 3-argument mul!

tmp = similar(ou);

function comu_mm!(r,a,b,h)
    mul!(r,a,b)
    mul!(h,b,a)
    r .-= h
    return nothing
end

@btime comu_mm!($ou,$m1,$m2,$tmp)

82.248 ns (0 allocations: 0 bytes)

Thanks!

bcmichael · November 26, 2019, 7:26pm

For very small matrices mul! doesn’t call BLAS routines but gemm! explicitly does call BLAS. BLAS is great for large matrices but as you’ve observed has some unnecessary overhead for really small matrices.

There are matmul2x2! and matmul3x3! functions that implement fully unrolled matrix multiplication for these specific sizes in https://github.com/JuliaLang/julia/blob/master/stdlib/LinearAlgebra/src/matmul.jl.

Topic		Replies	Views
Are there any 5-argument versions of `mul!` or similar in LinearAlgebra now? Performance	1	920	September 5, 2018
Why mul! is so fast? General Usage question , linearalgebra	7	6871	November 26, 2019
Matrix-Vector multiplication complex/real Performance	0	379	March 14, 2021
Gemm! vs symm! performance Performance	7	513	January 2, 2023
`mul` dispatch to BLAS incomplete? Performance linearalgebra , numerics	18	1066	November 22, 2021

Performance with small matrices: 5-argument mul! vs BLAS.gemm!

Related topics