I’m trying to multiply a matrix of Float64 by a matrix of SVectors, but I’m finding that mul! spends time allocating memory for this, and I’m not sure why. Is there some way to get around this?
using StaticArrays
using BenchmarkTools
using LinearAlgebra
const Dr = rand(Float64, 2, 3)
const A = rand(SVector{3, Float64}, 3, 2)
const B = zeros(SVector{3, Float64}, 2, 2)
@btime mul!($(B), $(Dr), $(A))
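(For reference, the allocation profile below can be gathered along these lines — a sketch; sample_rate=1 makes the profiler record every allocation:)

```julia
using StaticArrays, LinearAlgebra, Profile

Dr = rand(Float64, 2, 3)
A  = rand(SVector{3, Float64}, 3, 2)
B  = zeros(SVector{3, Float64}, 2, 2)

mul!(B, Dr, A)                 # warm up first, so compilation isn't profiled
Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=1 mul!(B, Dr, A)
Profile.Allocs.fetch().allocs  # list of recorded allocations
```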
julia> Profile.Allocs.fetch().allocs
3-element Vector{Profile.Allocs.Alloc}:
Profile.Allocs.Alloc(Profile.Allocs.UnknownType, Base.StackTraces.StackFrame[maybe_record_alloc_to_profile at gc-alloc-profiler.h:42 [inlined], ...], 16)
Profile.Allocs.Alloc(Matrix{Float64}, Base.StackTraces.StackFrame[maybe_record_alloc_to_profile at gc-alloc-profiler.h:42 [inlined], ...], 3592)
Profile.Allocs.Alloc(Matrix{SVector{3, Float64}}, Base.StackTraces.StackFrame[maybe_record_alloc_to_profile at gc-alloc-profiler.h:42 [inlined], ...], 10648)
which doesn’t quite add up (16 + 3592 + 10648 = 14256 bytes = 13.922 KiB) to the 14.141 KiB I see, and suggests much larger arrays than the inputs (assuming a 40-byte array overhead, lengths of 444 and 442 elements).
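(The per-allocation arithmetic, spelled out — this assumes a 40-byte header per Array, which may differ across Julia versions and platforms:)

```julia
using StaticArrays

# Reported allocation sizes from the profile, minus an assumed
# 40-byte Array header, divided by the element size:
(3592  - 40) ÷ sizeof(Float64)              # 444 elements, vs. 2×3 = 6 in Dr
(10648 - 40) ÷ sizeof(SVector{3, Float64})  # 442 elements, vs. 3×2 = 6 in A
```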
For large-ish matrices, I’m finding that this multiplication seems to be a little faster than the Julia 1.11 mul! for some reason. Thanks!
using StaticArrays
using BenchmarkTools
using LinearAlgebra
function mymul!(C, A, B)
    for j = axes(C, 2), i = axes(C, 1)
        C[i, j] = sum(A[i, k] * B[k, j] for k = axes(A, 2))
    end
end
const Dr = rand(Float64, 500, 500)
const A = rand(SVector{3, Float64}, 500, 3000)
const B = zeros(SVector{3, Float64}, 500, 3000)
@btime mul!($(B), $(Dr), $(A))
@btime mymul!($(B), $(Dr), $(A))
827.817 ms (0 allocations: 0 bytes)
650.042 ms (0 allocations: 0 bytes)
fr?! mymul! is a naive implementation; the most you might add is a matching-dimensions check before the loop. So why is the LinearAlgebra implementation so much more complicated?
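For instance, a version with that dimension check might look like this (a sketch with a hypothetical name, not the LinearAlgebra implementation):

```julia
function checked_mymul!(C, A, B)
    # Matching-dimensions check before the loop, as suggested above.
    axes(A, 2) == axes(B, 1) ||
        throw(DimensionMismatch("inner axes of A and B differ"))
    (axes(C, 1), axes(C, 2)) == (axes(A, 1), axes(B, 2)) ||
        throw(DimensionMismatch("C has the wrong shape"))
    for j = axes(C, 2), i = axes(C, 1)
        C[i, j] = sum(A[i, k] * B[k, j] for k = axes(A, 2))
    end
    return C
end
```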
On my machine, mul! on 1.10 is actually a bit faster in Vincent’s last example: 552 ms for mul! (4 allocations), and 724 ms for mymul! (0 allocations). The mul! timing stays constant no matter the number of BLAS threads — which makes sense, since BLAS only handles the standard dense scalar types, so mul! on matrices of SVectors presumably falls back to Julia’s generic matmul.
Of course, for ordinary matrices mul! handily outperforms mymul!:
using StaticArrays, BenchmarkTools, LinearAlgebra
X = rand(100, 200)
Y = rand(200, 300)
Z = zeros(100, 300)
@btime mymul!($Z, $X, $Y);
BLAS.set_num_threads(1)
@btime mul!($Z, $X, $Y);
BLAS.set_num_threads(4)
@btime mul!($Z, $X, $Y);
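Since scalar-times-SVector multiplication acts componentwise, one way to route the SVector case through BLAS is to split the SVector matrix into plain Float64 component matrices, GEMM each, and reassemble. A sketch with a hypothetical helper (it trades scratch allocations for BLAS throughput, so whether it wins depends on the sizes):

```julia
using StaticArrays, LinearAlgebra

# Compute B = Dr * A, where A and B are matrices of SVector{3, Float64}
# and Dr is a plain Float64 matrix, via three ordinary Float64 GEMMs.
function mul_via_blas!(B, Dr, A)
    comps = ntuple(c -> [a[c] for a in A], 3)        # split A into 3 Float64 matrices
    outs  = ntuple(_ -> Matrix{Float64}(undef, size(B)), 3)
    for c in 1:3
        mul!(outs[c], Dr, comps[c])                  # plain-Float64 GEMM (BLAS)
    end
    for j = axes(B, 2), i = axes(B, 1)               # reassemble the SVectors
        B[i, j] = SVector(outs[1][i, j], outs[2][i, j], outs[3][i, j])
    end
    return B
end
```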