Matrix of SVectors allocating

Hi,

I’m trying to multiply a matrix of Float64 by a matrix of SVectors, but I’m finding that mul! spends time allocating memory for this, and I’m not sure why. Is there some way to get around this?

using StaticArrays
using BenchmarkTools
using LinearAlgebra

const Dr = rand(Float64, 2, 3)
const A = rand(SVector{3, Float64}, 3, 2)
const B = zeros(SVector{3, Float64}, 2, 2)

@btime mul!($(B), $(Dr), $(A))
  332.328 ns (3 allocations: 14.39 KiB)
2×2 Matrix{SVector{3, Float64}}:
 [0.74142, 0.650898, 1.27879]    [0.80855, 0.960478, 0.59237]
 [0.344308, 0.457656, 0.878124]  [0.629784, 0.663227, 0.445165]

Not sure what’s going on, but this is better on 1.11:

1.10.4> @btime mul!($B, $Dr, $A);
  400.450 ns (3 allocations: 14.14 KiB)

1.11.0-rc2> @btime mul!($B, $Dr, $A);
  23.594 ns (0 allocations: 0 bytes)

This is probably not what you’re looking for, but an obvious way to get around the problem on 1.10 is to write the matrix multiplication yourself:

function mymul!(C, A, B)
    for j = axes(C, 2), i = axes(C, 1)
        C[i, j] = sum(A[i, k] * B[k, j] for k=axes(A, 2))
    end
end

julia> @btime mymul!($B, $Dr, $A)
  26.432 ns (0 allocations: 0 bytes)

This is likely a bit slower than the 1.11 mul! version, but much faster than the allocating version on 1.10:

julia> @btime mul!($B, $Dr, $A);
  296.651 ns (3 allocations: 14.14 KiB)

I don’t know either. Something is going on with the inference in matmul2x2!, and many possible runtime dispatches follow (omitted for space):

julia> using JET

julia> @report_opt mul!(B, Dr, A)
═════ 189 possible errors found ═════
┌ mul!(C::Matrix{SVector{3, Float64}}, A::Matrix{Float64}, B::Matrix{SVector{3, Float64}}) @ LinearAlgebra C:\workdir\usr\share\julia\stdlib\v1.10\LinearAlgebra\src\matmul.jl:237
│┌ mul!(C::Matrix{SVector{3, Float64}}, A::Matrix{Float64}, B::Matrix{SVector{3, Float64}}, α::Bool, β::Bool) @ LinearAlgebra C:\workdir\usr\share\julia\stdlib\v1.10\LinearAlgebra\src\matmul.jl:263
││┌ generic_matmatmul!(C::Matrix{…}, tA::Char, tB::Char, A::Matrix{…}, B::Matrix{…}, _add::LinearAlgebra.MulAddMul{…}) @ LinearAlgebra C:\workdir\usr\share\julia\stdlib\v1.10\LinearAlgebra\src\matmul.jl:778
│││┌ matmul2x2!(C::Matrix{…}, tA::Char, tB::Char, A::Matrix{…}, B::Matrix{…}, _add::LinearAlgebra.MulAddMul{…}) @ LinearAlgebra C:\workdir\usr\share\julia\stdlib\v1.10\LinearAlgebra\src\matmul.jl:1025
││││┌ _modify!(p::LinearAlgebra.MulAddMul{true, true, Bool, Bool}, x::Any, C::Matrix{SVector{…}}, idx′::Tuple{Int64, Int64}) @ LinearAlgebra C:\workdir\usr\share\julia\stdlib\v1.10\LinearAlgebra\src\generic.jl:91
│││││┌ setindex!(::Matrix{SVector{3, Float64}}, ::Any, ::CartesianIndex{2}) @ Base ./multidimensional.jl:698
││││││┌ setindex!(::Matrix{SVector{3, Float64}}, ::Any, ::Int64, ::Int64) @ Base ./array.jl:1024
│││││││ runtime dispatch detected: convert(::SVector{3, Float64}, x::Any)::Any
││││││└────────────────────
...
julia> code_warntype(LinearAlgebra.matmul2x2!, (Matrix{SVector{3, Float64}}, Char, Char, Matrix{Float64}, Matrix{SVector{3, Float64}}, LinearAlgebra.MulAddMul{true, true, Bool, Bool}))
...
35 ┄ %191 = (A11 * B11)::Union{Adjoint{Float64, MVector{3, Float64}}, Transpose{Float64, MVector{3, Float64}}, SVector{3, Float64}}
│    %192 = (A12 * B21)::Union{Adjoint{Float64, MVector{3, Float64}}, Transpose{Float64, MVector{3, Float64}}, SVector{3, Float64}}
│    %193 = (%191 + %192)::Any
│    %194 = Core.tuple(1, 1)::Core.Const((1, 1))
│           LinearAlgebra._modify!(_add, %193, C, %194)
...repeats 3 times
...
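To see the mechanism outside LinearAlgebra (a minimal sketch; wrap is a hypothetical stand-in for the tA/tB branching in matmul2x2!): since tA and tB are runtime Chars, each element may or may not end up wrapped, so the element products infer as small Unions, and it’s the subsequent + over those Unions that widens to Any above.

julia> using StaticArrays, LinearAlgebra

julia> wrap(t::Char, x) = t == 'N' ? x : transpose(x);  # hypothetical stand-in for the runtime tA/tB branch

julia> Base.return_types((t, a, b) -> a * wrap(t, b), (Char, Float64, SVector{3, Float64}))

This comes back as a small Union, matching the Union-typed %191 and %192 above.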

After running the allocation profiler on a single call, I find:

julia> Profile.Allocs.fetch().allocs
3-element Vector{Profile.Allocs.Alloc}:
 Profile.Allocs.Alloc(Profile.Allocs.UnknownType, Base.StackTraces.StackFrame[maybe_record_alloc_to_profile at gc-alloc-profiler.h:42 [inlined], ...], 16)
 Profile.Allocs.Alloc(Matrix{Float64}, Base.StackTraces.StackFrame[maybe_record_alloc_to_profile at gc-alloc-profiler.h:42 [inlined], ...], 3592)
 Profile.Allocs.Alloc(Matrix{SVector{3, Float64}}, Base.StackTraces.StackFrame[maybe_record_alloc_to_profile at gc-alloc-profiler.h:42 [inlined], ...], 10648)

which doesn’t quite add up (16 + 3592 + 10648 = 14256 bytes = 13.922 KiB versus the 14.141 KiB I see) and suggests much larger matrices than the inputs (assuming 40 bytes of array overhead, lengths of 444 and 442).
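The implied element counts at least are easy to sanity-check (assuming the 40-byte array header, with StaticArrays loaded):

julia> (3592 - 40) ÷ sizeof(Float64)
444

julia> (10648 - 40) ÷ sizeof(SVector{3, Float64})
442

Both inputs only hold a handful of elements, so these allocations are far larger than the matrices being multiplied.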


I actually did not consider this…

For large-ish matrices, I’m finding that this multiplication seems to be a little faster than the 1.11 mul! for some reason. Thanks!

using StaticArrays
using BenchmarkTools
using LinearAlgebra

function mymul!(C, A, B)
    for j = axes(C, 2), i = axes(C, 1)
        C[i, j] = sum(A[i, k] * B[k, j] for k=axes(A, 2))
    end
end

const Dr = rand(Float64, 500, 500)
const A = rand(SVector{3, Float64}, 500, 3000)
const B = zeros(SVector{3, Float64}, 500, 3000)

@btime mul!($(B), $(Dr), $(A))
@btime mymul!($(B), $(Dr), $(A))
  827.817 ms (0 allocations: 0 bytes)
  650.042 ms (0 allocations: 0 bytes)

fr?! mymul! is a naive implementation; the most you might add is a matching-dimensions check before the loop (something like the sketch below). Why is the LinearAlgebra implementation so much more complicated, then?
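Something like this, say (a sketch; mymul_checked! is just a hypothetical name for the variant):

function mymul_checked!(C, A, B)
    # matching-dimensions check before the naive loop
    axes(A, 1) == axes(C, 1) && axes(B, 2) == axes(C, 2) && axes(A, 2) == axes(B, 1) ||
        throw(DimensionMismatch("C, A, and B have incompatible sizes"))
    for j = axes(C, 2), i = axes(C, 1)
        C[i, j] = sum(A[i, k] * B[k, j] for k = axes(A, 2))
    end
end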

On my machine, mul! on 1.10 is actually a bit faster in Vincent’s last example: 552 ms for mul! (4 allocations), and 724 ms for mymul! (0 allocations). The mul! timing stays constant, no matter the number of BLAS threads.
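One way to confirm BLAS cannot even be involved in the SVector case (which would explain the insensitivity to the thread count): the BLAS wrappers only accept the four LinearAlgebra.BlasFloat element types, so a Matrix{SVector} always takes the generic fallback. A quick check along those lines:

julia> using LinearAlgebra, StaticArrays

julia> zeros(100, 300) isa StridedMatrix{<:LinearAlgebra.BlasFloat}
true

julia> zeros(SVector{3, Float64}, 100, 300) isa StridedMatrix{<:LinearAlgebra.BlasFloat}
false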

Of course, for ordinary matrices mul! handily outperforms mymul!:

using StaticArrays, BenchmarkTools, LinearAlgebra

X = rand(100, 200)
Y = rand(200, 300)
Z = zeros(100, 300)

@btime mymul!(Z, X, Y);

BLAS.set_num_threads(1)
@btime mul!(Z, X, Y);

BLAS.set_num_threads(4)
@btime mul!(Z, X, Y);

gives the following timings, respectively:

  5.072 ms (0 allocations: 0 bytes)
  218.600 μs (0 allocations: 0 bytes)
  122.800 μs (0 allocations: 0 bytes)

For comparison, the SVector code, with Y replaced by rand(SVector{3, Float64}, 200, 300) and Z by zeros(SVector{3, Float64}, 100, 300), gives

  5.192 ms (0 allocations: 0 bytes)
  4.335 ms (4 allocations: 24.64 KiB)
  4.331 ms (4 allocations: 24.64 KiB)

so mul! is disproportionately affected by using SVector, presumably because a Matrix{SVector} cannot be handled by BLAS and has to take the generic fallback, which would also explain why the thread count makes no difference.


The generic fallback in LinearAlgebra most likely accumulates the product with a naive left-to-right summation, so using sum might lead to better/faster algorithms.
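For a single entry of the product, the two accumulation styles look like this (a sketch; both function names are hypothetical, and whether sum actually wins depends on how the generator reduction is lowered):

# Naive left-to-right accumulation, roughly what a generic fallback does:
function naive_inner(A, B, i, j)
    s = A[i, 1] * B[1, j]
    for k in 2:size(A, 2)
        s += A[i, k] * B[k, j]
    end
    return s
end

# Reduction via sum, as in mymul! above; the associativity of reduce/sum is
# implementation-defined, so Base is free to reassociate or unroll:
sum_inner(A, B, i, j) = sum(A[i, k] * B[k, j] for k in axes(A, 2))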