Inexplicable allocations when summing `StaticArrays`


This is a cross-post from an issue I opened in the StaticArrays package:

I have been working on a pull-request to fix an issue I discovered with StaticArrays. I discovered that when one entity that happens to be a column vector (single column of matrix), and the other a row vector, and you compute the outer product of these two vectors, the returned array is not a StaticArray but a regular allocated array. I stumbled upon this issue whilst implementing a particular cost function. I have stripped the cost function and given it as an example below.

I came up with a proposed fix by adding the following two definitions to matrix_multiply.jl in the package:

@inline *(A::StaticMatrix, B::Adjoint{<:Any, <:StaticVector}) = *(reshape(A, Size(Size(A)[1],)), B) 
@inline mul!(dest::StaticVecOrMat, A::StaticMatrix, B::Adjoint{<:Any, <:StaticVector}) = mul!(dest, reshape(A, Size(Size(A)[1],)), B) 

I have verified that adding these additional dispatch rules fixes the original problem (the returned array is now a StaticArray). However, when I am summing the result of these StaticArray types I am suddenly getting a lot of allocations.

You can reproduce the problem by running the following code snippet:

using StaticArrays, BenchmarkTools, LinearAlgebra

function hom(v::SVector)

function T(๐›‰::AbstractArray, ๐’ž::Tuple{AbstractArray, Vararg{AbstractArray}}, ๐’Ÿ::Tuple{AbstractArray, Vararg{AbstractArray}})
    โŠ— = kron
    l = 9
    ๐ˆโ‚— = SMatrix{9,9}(1.0I)
    ๐ˆโ‚˜ =  SMatrix{1,1}(1.0I)
    ๐“ = @SMatrix zeros(9,9)
    N = length(๐’Ÿ[1])
    โ„ณ, โ„ณสน = ๐’Ÿ
    ฮ›โ‚, ฮ›โ‚‚ = ๐’ž
    ๐šฒโ‚™ = @MMatrix zeros(4,4)
    ๐žโ‚ = @SMatrix [1.0; 0.0; 0.0]
    ๐žโ‚‚ = @SMatrix [0.0; 1.0; 0.0]
    for n = 1: N
        index = SVector(1,2)
        ๐šฒโ‚™[1:2,1:2] .=  ฮ›โ‚[n][index,index]
        ๐šฒโ‚™[3:4,3:4] .=  ฮ›โ‚‚[n][index,index]
        ๐ฆ = hom(โ„ณ[n])
        ๐ฆสน= hom(โ„ณสน[n])
        ๐”โ‚™ = (๐ฆ โŠ— ๐ฆสน)
        โˆ‚โ‚“๐ฎโ‚™ =  [(๐žโ‚ โŠ— ๐ฆสน) (๐žโ‚‚ โŠ— ๐ฆสน) (๐ฆ โŠ— ๐žโ‚) (๐ฆ โŠ— ๐žโ‚‚)]
        ๐โ‚™ =  โˆ‚โ‚“๐ฎโ‚™ * ๐šฒโ‚™ * โˆ‚โ‚“๐ฎโ‚™'
        ๐šบโ‚™ = ๐›‰' * ๐โ‚™ * ๐›‰
        ๐šบโ‚™โปยน = inv(๐šบโ‚™)
        ๐“โ‚ = @SMatrix zeros(Float64,9,9)
        for k = 1:l
            ๐žโ‚– = ๐ˆโ‚—[:,k]
            โˆ‚๐žโ‚–๐šบโ‚™ = (๐ˆโ‚˜ โŠ— ๐žโ‚–') * ๐โ‚™ * (๐ˆโ‚˜ โŠ— ๐›‰) + (๐ˆโ‚˜ โŠ— ๐›‰') * ๐โ‚™ * (๐ˆโ‚˜ โŠ— ๐žโ‚–)
            # Accumulating the result in ๐“โ‚ allocates memory, even though
            # the two terms in the summation are both SArrays.
            ๐“โ‚ = ๐“โ‚ + ๐”โ‚™ * ๐šบโ‚™โปยน * (โˆ‚๐žโ‚–๐šบโ‚™) * ๐šบโ‚™โปยน * ๐”โ‚™' * ๐›‰ * ๐žโ‚–'
        ๐“ = ๐“ + ๐“โ‚

# Some sample data
N = 300
โ„ณ = [@SVector rand(2) for i = 1:N]
โ„ณสน = [@SVector rand(2) for i = 1:N]
ฮ›โ‚ =  [SMatrix{3,3}(Matrix(Diagonal([1.0,1.0,0.0]))) for i = 1:length(โ„ณ)]
ฮ›โ‚‚ =  [SMatrix{3,3}(Matrix(Diagonal([1.0,1.0,0.0]))) for i = 1:length(โ„ณ)]
F = @SMatrix rand(3,3)
๐’ž = (ฮ›โ‚,ฮ›โ‚‚)
๐’Ÿ = (โ„ณ, โ„ณสน)

@btime T(vec($F),$๐’ž,$๐’Ÿ)  # 682.152 ฮผs (6002 allocations: 3.85 MiB

I tried @code_warntype T(vec(F),๐’ž,๐’Ÿ) which produced an output too long to list here. However, a particular line jumped out with Any :

1230 โ”€ %5524 = invoke Base.afoldl(%5506::typeof(*), %5517::SArray{Tuple{9,1},Float64,2,9}, %5475::Adjoint{Float64,SArray{Tuple{9},Float64,1,9}}, _2::SArray{Tuple{9},Float64,1,9}, %5476::Adjoint{Float64,SArray{Tuple{9},Float64,1,9}})::Any

I donโ€™t know how to proceed from here and would appreciate any advice. In particular, I am not sure whether my โ€œfixโ€ is missing something, or whether I have stumbled upon a different โ€œbugโ€.


The main thing Iโ€™d suggest is to try to strip it down to bare essentials. While the example is lovely, thereโ€™s a lot of computation here that is presumably irrelevant to the inference problem. If you can digest it down to a couple of key lines it will be much easier to make progress.


Rewriting โˆ‚โ‚“๐ฎโ‚™ * ๐šฒโ‚™ * โˆ‚โ‚“๐ฎโ‚™' as something like โˆ‚โ‚“๐ฎโ‚™ * (๐šฒโ‚™ * โˆ‚โ‚“๐ฎโ‚™') might work around it.


Iโ€™ve managed to construct a minimal example which reproduces the allocation problem:

using BenchmarkTools, StaticArrays

function test_mem_ok(P,Q)
    return P[1]*Q[1] * P[2]*Q[2] * P[3]*Q[3] * P[4]*Q[4] * P[5]*Q[5] *   P[6]*Q[6]  *  P[7]*Q[7] *  P[8]*Q[8]  *  P[9]*Q[9]

function test_mem_bad(P,Q)
    return P[1]*Q[1] * P[2]*Q[2] * P[3]*Q[3] * P[4]*Q[4] * P[5]*Q[5] *   P[6]*Q[6]  *  P[7]*Q[7] *  P[8]*Q[8]  *  P[9]*Q[9] *  P[10]*Q[10]

function test_mem_fix(P,Q)
    return (P[1]*Q[1] * P[2]*Q[2] * P[3]*Q[3] * P[4]*Q[4] * P[5]*Q[5])  *   P[6]*Q[6]  *  P[7]*Q[7] *  P[8]*Q[8]  *  P[9]*Q[9] *  P[10]*Q[10]

function test_ok()
    A = @SVector [@SMatrix [1.0] for i = 1:10]
    C =  @SVector [adjoint(SVector(1.0)) for i = 1:10]

function test_bad()
    A = @SVector [@SMatrix [1.0] for i = 1:10]
    C =  @SVector [adjoint(SVector(1.0)) for i = 1:10]

function test_fix()
    A = @SVector [@SMatrix [1.0] for i = 1:10]
    C =  @SVector [adjoint(SVector(1.0)) for i = 1:10]

@time test_ok() #  0.000002 seconds (5 allocations: 176 bytes)
@code_warntype test_ok()

@time test_bad() #   0.000003 seconds (26 allocations: 512 bytes)
@code_warntype test_bad()

@time test_fix() #     0.000001 seconds (5 allocations: 176 bytes)
@code_warntype test_fix()

This seems to be a case of afoldl causes many memory allocations.

There was a performance tip warning in the documentation about this issue which was subsequently removed.

Evizero had some further insights that he made in the issue that I originally opened.

Is this something I should report as an issue on Julialang?


Out of curiousity, what is @time showing allocation but @btime not

using BenchmarkTools
const k = zeros(20)
function test_mem()
    for i in 1: 10
        c += (k[i] *2)  

function test(n::Int64)
    ret = 0
    for i = 1:n
        ret += test_mem()
@btime test(100000000)
@time   test(100000000)

926.639 ms (0 allocations: 0 bytes)
 0.941509 seconds (238 allocations: 14.453 KiB)


Run it twice

julia> @time   test(100000000)
  1.350088 seconds (238 allocations: 14.453 KiB)

julia> @time   test(100000000)
  1.309178 seconds (5 allocations: 176 bytes)

The 5 allocations are from @time itself.


Thanks. Any specific reason we need to run @time twice but @btime just once to get the correct resultant memory allocation? I understand the function got compiled when I first ran @btime in sequence.

Sorry if this appears to be a basic level question.


@btime runs the function many time and returns the result with the shortest run time.


Thanks for clarification.