Consider the following example:
using LinearAlgebra, BenchmarkTools

function comu_f!(r, a, b, p, s)
    mul!(r, a, b, s, p)
    mul!(r, b, a, -s, true)
    return nothing
end

m1, m2, m3 = [rand(ComplexF64, 10, 10) for _ in 1:3]

function txx(r, a, b)
    p = false
    s = 2.0
    comu_f!(r, a, b, p, s)
end

function txx2(r, a, b)
    p = false
    s = 2.0
    mul!(r, a, b, s, p)
    mul!(r, b, a, -s, true)
    nothing
end
(In real use, p and s would vary, and this code sits in an inner loop, so I'd like to avoid allocations.)
@btime txx($m1, $m2, $m3)
shows
757.120 ns (1 allocation: 32 bytes)
while @btime txx2($m1, $m2, $m3)
shows
745.260 ns (0 allocations: 0 bytes)
Since txx2 is just the body of comu_f! copy-pasted in place of the call, why does txx allocate when the mul! calls themselves should not? Thanks!
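
One experiment that might narrow this down, as a minimal sketch: it assumes (unconfirmed) that the difference comes from whether the constant values of p and s reach the mul! calls, which would depend on comu_f! being inlined into its caller. The names comu_inl! and txx_inl are hypothetical variants introduced only for this test; they are not part of the code above.

```julia
using LinearAlgebra

# Hypothetical variant of comu_f!, marked @inline so that constants defined
# in the caller have a chance to propagate into both mul! calls.
@inline function comu_inl!(r, a, b, p, s)
    mul!(r, a, b, s, p)       # r = s*a*b + p*r
    mul!(r, b, a, -s, true)   # r = r - s*b*a
    return nothing
end

# Same wrapper as txx, but calling the inlined variant.
function txx_inl(r, a, b)
    p = false
    s = 2.0
    comu_inl!(r, a, b, p, s)
end

a = rand(ComplexF64, 10, 10)
b = rand(ComplexF64, 10, 10)
r = zeros(ComplexF64, 10, 10)

txx_inl(r, a, b)                        # warm-up call so compilation is not measured
println(@allocated txx_inl(r, a, b))    # bytes allocated by a single call
println(r ≈ 2.0 * (a * b - b * a))      # sanity check: r is the scaled commutator
```

If the reported allocation disappears with @inline and returns without it, that would point to inlining and constant propagation rather than to mul! itself. The result may differ between Julia versions.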