Trying out the new 5-argument `mul!`

introduced in Julia 1.3, (ABα+Cβ -> C), for the calculation of matrix commutator comu(A,B)=AB-BA. From my testing, with small (2x2 or 3x3) matrices, `mul!`

is actually faster than `gemm!`

, and when such functions are used in the inner-most/hot loop, it can lead to very noticeable performance differences:

```
using BenchmarkTools
using LinearAlgebra
ndim = 3;
m1=rand(ComplexF64,ndim,ndim);
m2=rand(ComplexF64,ndim,ndim);
ou=rand(ComplexF64,ndim,ndim);
function comu!(r, a, b)
idy = one(eltype(r))
mul!(r, a, b)
mul!(r, b, a, -idy, idy)
return nothing
end
function comu_b!(r, a, b)
idy = one(eltype(r))
mul!(r, a, b)
BLAS.gemm!('N', 'N', -idy, b, a, idy, r)
return nothing
end
```

```
@btime comu!($ou,$m1,$m2)
```

88.698 ns (0 allocations: 0 bytes)

```
@btime comu_b!($ou,$m1,$m2)
```

200.584 ns (0 allocations: 0 bytes)

Is there something special that’s going on here? As another comparison, using just 3-argument `mul!`

```
tmp = similar(ou);
function comu_mm!(r,a,b,h)
mul!(r,a,b)
mul!(h,b,a)
r .-= h
return nothing
end
@btime comu_mm!($ou,$m1,$m2,$tmp)
```

82.248 ns (0 allocations: 0 bytes)

Thanks!