Matrix of SVectors allocating

Hi,

I’m trying to multiply a matrix of Float64 by a matrix of SVectors, but I’m finding that mul! spends time allocating memory for this, and I’m not sure why. Is there some way to get around this?

using StaticArrays
using BenchmarkTools
using LinearAlgebra

const Dr = rand(Float64, 2, 3)
const A = rand(SVector{3, Float64}, 3, 2)
const B = zeros(SVector{3, Float64}, 2, 2)

@btime mul!($(B), $(Dr), $(A))
  332.328 ns (3 allocations: 14.39 KiB)
2×2 Matrix{SVector{3, Float64}}:
 [0.74142, 0.650898, 1.27879]    [0.80855, 0.960478, 0.59237]
 [0.344308, 0.457656, 0.878124]  [0.629784, 0.663227, 0.445165]

Not sure what’s going on, but this is better on 1.11:

1.10.4> @btime mul!($B, $Dr, $A);
  400.450 ns (3 allocations: 14.14 KiB)

1.11.0-rc2> @btime mul!($B, $Dr, $A);
  23.594 ns (0 allocations: 0 bytes)

This is probably not what you’re looking for, but an obvious way to get around the problem on 1.10 is to write the matrix multiplication yourself:

function mymul!(C, A, B)
    for j = axes(C, 2), i = axes(C, 1)
        C[i, j] = sum(A[i, k] * B[k, j] for k=axes(A, 2))
    end
end

julia> @btime mymul!($B, $Dr, $A)
  26.432 ns (0 allocations: 0 bytes)

This is likely a bit slower than the 1.11 mul! version, but much faster than the allocating version on 1.10:

julia> @btime mul!($B, $Dr, $A);
  296.651 ns (3 allocations: 14.14 KiB)

I don’t know either. Something is going on with the inference in matmul2x2!, and many possible runtime dispatches follow (omitted for space):

julia> using JET

julia> @report_opt mul!(B, Dr, A)
═════ 189 possible errors found ═════
┌ mul!(C::Matrix{SVector{3, Float64}}, A::Matrix{Float64}, B::Matrix{SVector{3, Float64}}) @ LinearAlgebra C:\workdir\usr\share\julia\stdlib\v1.10\LinearAlgebra\src\matmul.jl:237
│┌ mul!(C::Matrix{SVector{3, Float64}}, A::Matrix{Float64}, B::Matrix{SVector{3, Float64}}, α::Bool, β::Bool) @ LinearAlgebra C:\workdir\usr\share\julia\stdlib\v1.10\LinearAlgebra\src\matmul.jl:263
││┌ generic_matmatmul!(C::Matrix{…}, tA::Char, tB::Char, A::Matrix{…}, B::Matrix{…}, _add::LinearAlgebra.MulAddMul{…}) @ LinearAlgebra C:\workdir\usr\share\julia\stdlib\v1.10\LinearAlgebra\src\matmul.jl:778
│││┌ matmul2x2!(C::Matrix{…}, tA::Char, tB::Char, A::Matrix{…}, B::Matrix{…}, _add::LinearAlgebra.MulAddMul{…}) @ LinearAlgebra C:\workdir\usr\share\julia\stdlib\v1.10\LinearAlgebra\src\matmul.jl:1025
││││┌ _modify!(p::LinearAlgebra.MulAddMul{true, true, Bool, Bool}, x::Any, C::Matrix{SVector{…}}, idx′::Tuple{Int64, Int64}) @ LinearAlgebra C:\workdir\usr\share\julia\stdlib\v1.10\LinearAlgebra\src\generic.jl:91
│││││┌ setindex!(::Matrix{SVector{3, Float64}}, ::Any, ::CartesianIndex{2}) @ Base ./multidimensional.jl:698
││││││┌ setindex!(::Matrix{SVector{3, Float64}}, ::Any, ::Int64, ::Int64) @ Base ./array.jl:1024
│││││││ runtime dispatch detected: convert(::SVector{3, Float64}, x::Any)::Any
││││││└────────────────────
...
julia> code_warntype(LinearAlgebra.matmul2x2!, (Matrix{SVector{3, Float64}}, Char, Char, Matrix{Float64}, Matrix{SVector{3, Float64}}, LinearAlgebra.MulAddMul{true, true, Bool, Bool}))
...
35 ┄ %191 = (A11 * B11)::Union{Adjoint{Float64, MVector{3, Float64}}, Transpose{Float64, MVector{3, Float64}}, SVector{3, Float64}}
│    %192 = (A12 * B21)::Union{Adjoint{Float64, MVector{3, Float64}}, Transpose{Float64, MVector{3, Float64}}, SVector{3, Float64}}
│    %193 = (%191 + %192)::Any
│    %194 = Core.tuple(1, 1)::Core.Const((1, 1))
│           LinearAlgebra._modify!(_add, %193, C, %194)
...repeats 3 times
...
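To see the mechanism outside LinearAlgebra (a minimal sketch; wrap is a hypothetical stand-in for the tA/tB branching in matmul2x2!): since tA and tB are runtime Chars, each element may or may not end up wrapped, so the element products infer as small Unions, and it’s the subsequent + over those Unions that widens to Any above.

julia> using StaticArrays, LinearAlgebra

julia> wrap(t::Char, x) = t == 'N' ? x : transpose(x);  # hypothetical stand-in for the runtime tA/tB branch

julia> Base.return_types((t, a, b) -> a * wrap(t, b), (Char, Float64, SVector{3, Float64}))

This comes back as a small Union, matching the Union-typed %191 and %192 above.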

After running the allocation profiler on a single call, I find:

julia> Profile.Allocs.fetch().allocs
3-element Vector{Profile.Allocs.Alloc}:
 Profile.Allocs.Alloc(Profile.Allocs.UnknownType, Base.StackTraces.StackFrame[maybe_record_alloc_to_profile at gc-alloc-profiler.h:42 [inlined], ...], 16)
 Profile.Allocs.Alloc(Matrix{Float64}, Base.StackTraces.StackFrame[maybe_record_alloc_to_profile at gc-alloc-profiler.h:42 [inlined], ...], 3592)
 Profile.Allocs.Alloc(Matrix{SVector{3, Float64}}, Base.StackTraces.StackFrame[maybe_record_alloc_to_profile at gc-alloc-profiler.h:42 [inlined], ...], 10648)

which doesn’t quite add up (16 + 3592 + 10648 = 14256 bytes = 13.922 KiB versus the 14.141 KiB I see) and suggests much larger matrices than the inputs (assuming 40 bytes of array overhead, lengths of 444 and 442).
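The implied element counts at least are easy to sanity-check (assuming the 40-byte array header, with StaticArrays loaded):

julia> (3592 - 40) ÷ sizeof(Float64)
444

julia> (10648 - 40) ÷ sizeof(SVector{3, Float64})
442

Both inputs only hold a handful of elements, so these allocations are far larger than the matrices being multiplied.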


I actually did not consider this…

For large-ish matrices, I’m finding that this multiplication seems to be a little faster than the 1.11 mul! for some reason. Thanks!

using StaticArrays
using BenchmarkTools
using LinearAlgebra

function mymul!(C, A, B)
    for j = axes(C, 2), i = axes(C, 1)
        C[i, j] = sum(A[i, k] * B[k, j] for k=axes(A, 2))
    end
end

const Dr = rand(Float64, 500, 500)
const A = rand(SVector{3, Float64}, 500, 3000)
const B = zeros(SVector{3, Float64}, 500, 3000)

@btime mul!($(B), $(Dr), $(A))
@btime mymul!($(B), $(Dr), $(A))
  827.817 ms (0 allocations: 0 bytes)
  650.042 ms (0 allocations: 0 bytes)

fr?! mymul! is a naive implementation; the most you might add is a matching-dimensions check before the loop (something like the sketch below). Why is the LinearAlgebra implementation so much more complicated, then?
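Something like this, say (a sketch; mymul_checked! is just a hypothetical name for the variant):

function mymul_checked!(C, A, B)
    # matching-dimensions check before the naive loop
    axes(A, 1) == axes(C, 1) && axes(B, 2) == axes(C, 2) && axes(A, 2) == axes(B, 1) ||
        throw(DimensionMismatch("C, A, and B have incompatible sizes"))
    for j = axes(C, 2), i = axes(C, 1)
        C[i, j] = sum(A[i, k] * B[k, j] for k = axes(A, 2))
    end
end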

On my machine, mul! on 1.10 is actually a bit faster in Vincent’s last example: 552 ms for mul! (4 allocations), and 724 ms for mymul! (0 allocations). The mul! timing stays constant, no matter the number of BLAS threads.
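One way to confirm BLAS cannot even be involved in the SVector case (which would explain the insensitivity to the thread count): the BLAS wrappers only accept the four LinearAlgebra.BlasFloat element types, so a Matrix{SVector} always takes the generic fallback. A quick check along those lines:

julia> using LinearAlgebra, StaticArrays

julia> zeros(100, 300) isa StridedMatrix{<:LinearAlgebra.BlasFloat}
true

julia> zeros(SVector{3, Float64}, 100, 300) isa StridedMatrix{<:LinearAlgebra.BlasFloat}
false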

Of course, for ordinary matrices mul! handily outperforms mymul!:

using StaticArrays, BenchmarkTools, LinearAlgebra

X = rand(100, 200)
Y = rand(200, 300)
Z = zeros(100, 300)

@btime mymul!(Z, X, Y);

BLAS.set_num_threads(1)
@btime mul!(Z, X, Y);

BLAS.set_num_threads(4)
@btime mul!(Z, X, Y);

gives the following timings, respectively:

  5.072 ms (0 allocations: 0 bytes)
  218.600 μs (0 allocations: 0 bytes)
  122.800 μs (0 allocations: 0 bytes)

For comparison, the SVector code, with Y replaced by rand(SVector{3, Float64}, 200, 300) and Z by zeros(SVector{3, Float64}, 100, 300), gives

  5.192 ms (0 allocations: 0 bytes)
  4.335 ms (4 allocations: 24.64 KiB)
  4.331 ms (4 allocations: 24.64 KiB)

so mul! is disproportionately affected by using SVector, presumably because a Matrix{SVector} cannot be handled by BLAS and has to take the generic fallback, which would also explain why the thread count makes no difference.


The generic fallback in LinearAlgebra most likely accumulates the product with a naive left-to-right summation, so using sum might lead to better/faster algorithms.
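For a single entry of the product, the two accumulation styles look like this (a sketch; both function names are hypothetical, and whether sum actually wins depends on how the generator reduction is lowered):

# Naive left-to-right accumulation, roughly what a generic fallback does:
function naive_inner(A, B, i, j)
    s = A[i, 1] * B[1, j]
    for k in 2:size(A, 2)
        s += A[i, k] * B[k, j]
    end
    return s
end

# Reduction via sum, as in mymul! above; the associativity of reduce/sum is
# implementation-defined, so Base is free to reassociate or unroll:
sum_inner(A, B, i, j) = sum(A[i, k] * B[k, j] for k in axes(A, 2))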