Inexplicable allocations when summing `StaticArrays`

zygmuntszpak · November 26, 2018, 11:58am

This is a cross-post from an issue I opened in the StaticArrays package: https://github.com/JuliaArrays/StaticArrays.jl/issues/537

I have been working on a pull request to fix an issue I discovered in StaticArrays. When one operand is a column vector stored as a single-column matrix and the other is a row vector, their outer product is returned as a regular heap-allocated Array rather than a StaticArray. I stumbled upon this while implementing a particular cost function; a stripped-down version of it is given as an example below.

I came up with a proposed fix by adding the following two definitions to matrix_multiply.jl in the https://github.com/JuliaArrays/StaticArrays.jl package:

@inline *(A::StaticMatrix, B::Adjoint{<:Any, <:StaticVector}) = *(reshape(A, Size(Size(A)[1],)), B) 
@inline mul!(dest::StaticVecOrMat, A::StaticMatrix, B::Adjoint{<:Any, <:StaticVector}) = mul!(dest, reshape(A, Size(Size(A)[1],)), B)

I have verified that adding these additional dispatch rules fixes the original problem (the returned array is now a StaticArray). However, when I am summing the result of these StaticArray types I am suddenly getting a lot of allocations.

You can reproduce the problem by running the following code snippet:

using StaticArrays, BenchmarkTools, LinearAlgebra

function hom(v::SVector)
    push(v,1)
end

function T(𝛉::AbstractArray, 𝒞::Tuple{AbstractArray, Vararg{AbstractArray}}, 𝒟::Tuple{AbstractArray, Vararg{AbstractArray}})
    ⊗ = kron
    l = 9
    𝐈ₗ = SMatrix{9,9}(1.0I)
    𝐈ₘ =  SMatrix{1,1}(1.0I)
    𝐓 = @SMatrix zeros(9,9)
    N = length(𝒟[1])
    ℳ, ℳʹ = 𝒟
    Λ₁, Λ₂ = 𝒞
    𝚲ₙ = @MMatrix zeros(4,4)
    𝐞₁ = @SMatrix [1.0; 0.0; 0.0]
    𝐞₂ = @SMatrix [0.0; 1.0; 0.0]
    for n = 1: N
        index = SVector(1,2)
        𝚲ₙ[1:2,1:2] .=  Λ₁[n][index,index]
        𝚲ₙ[3:4,3:4] .=  Λ₂[n][index,index]
        𝐦 = hom(ℳ[n])
        𝐦ʹ= hom(ℳʹ[n])
        𝐔ₙ = (𝐦 ⊗ 𝐦ʹ)
        ∂ₓ𝐮ₙ =  [(𝐞₁ ⊗ 𝐦ʹ) (𝐞₂ ⊗ 𝐦ʹ) (𝐦 ⊗ 𝐞₁) (𝐦 ⊗ 𝐞₂)]
        𝐁ₙ =  ∂ₓ𝐮ₙ * 𝚲ₙ * ∂ₓ𝐮ₙ'
        𝚺ₙ = 𝛉' * 𝐁ₙ * 𝛉
        𝚺ₙ⁻¹ = inv(𝚺ₙ)
        𝐓₁ = @SMatrix zeros(Float64,9,9)
        for k = 1:l
            𝐞ₖ = 𝐈ₗ[:,k]
            ∂𝐞ₖ𝚺ₙ = (𝐈ₘ ⊗ 𝐞ₖ') * 𝐁ₙ * (𝐈ₘ ⊗ 𝛉) + (𝐈ₘ ⊗ 𝛉') * 𝐁ₙ * (𝐈ₘ ⊗ 𝐞ₖ)
            # Accumulating the result in 𝐓₁ allocates memory, even though
            # the two terms in the summation are both SArrays.
            𝐓₁ = 𝐓₁ + 𝐔ₙ * 𝚺ₙ⁻¹ * (∂𝐞ₖ𝚺ₙ) * 𝚺ₙ⁻¹ * 𝐔ₙ' * 𝛉 * 𝐞ₖ'
        end
        𝐓 = 𝐓 + 𝐓₁
    end
    𝐓
end


# Some sample data
N = 300
ℳ = [@SVector rand(2) for i = 1:N]
ℳʹ = [@SVector rand(2) for i = 1:N]
Λ₁ =  [SMatrix{3,3}(Matrix(Diagonal([1.0,1.0,0.0]))) for i = 1:length(ℳ)]
Λ₂ =  [SMatrix{3,3}(Matrix(Diagonal([1.0,1.0,0.0]))) for i = 1:length(ℳ)]
F = @SMatrix rand(3,3)
𝒞 = (Λ₁,Λ₂)
𝒟 = (ℳ, ℳʹ)

T(vec(F),𝒞,𝒟)
@btime T(vec($F),$𝒞,$𝒟)  # 682.152 μs (6002 allocations: 3.85 MiB)

I tried @code_warntype T(vec(F),𝒞,𝒟), which produced output too long to list here. However, one line stood out because its return type was inferred as Any:

1230 ─ %5524 = invoke Base.afoldl(%5506::typeof(*), %5517::SArray{Tuple{9,1},Float64,2,9}, %5475::Adjoint{Float64,SArray{Tuple{9},Float64,1,9}}, _2::SArray{Tuple{9},Float64,1,9}, %5476::Adjoint{Float64,SArray{Tuple{9},Float64,1,9}})::Any

I don’t know how to proceed from here and would appreciate any advice. In particular, I am not sure whether my “fix” is missing something, or whether I have stumbled upon a different “bug”.

tim.holy · November 27, 2018, 9:40am

The main thing I’d suggest is to try to strip it down to bare essentials. While the example is lovely, there’s a lot of computation here that is presumably irrelevant to the inference problem. If you can digest it down to a couple of key lines it will be much easier to make progress.

kristoffer.carlsson · November 27, 2018, 1:52pm

Rewriting ∂ₓ𝐮ₙ * 𝚲ₙ * ∂ₓ𝐮ₙ' as something like ∂ₓ𝐮ₙ * (𝚲ₙ * ∂ₓ𝐮ₙ') might work around it.
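
The regrouping suggested here should not change the result, because matrix multiplication is associative and the two forms agree up to floating-point rounding. A minimal sketch with ordinary `Matrix` values standing in for the static arrays (the names `J` and `Λ` and their contents are placeholders, sized 9×4 and 4×4 to mirror `∂ₓ𝐮ₙ` and `𝚲ₙ`):

```julia
using LinearAlgebra

# Placeholder data, not taken from the original post.
J = reshape(collect(1.0:36.0), 9, 4)          # stand-in for ∂ₓ𝐮ₙ (9×4)
Λ = Matrix(Diagonal([1.0, 1.0, 2.0, 2.0]))    # stand-in for 𝚲ₙ (4×4)

chained = J * Λ * J'      # parsed as a single n-ary call *(J, Λ, J')
grouped = J * (Λ * J')    # two explicit binary products

# Both are 9×9 and should agree up to rounding.
same = chained ≈ grouped
println(same)
```

Whether the grouped form actually removes the allocations in the static-array case would still need to be checked with a benchmark.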

zygmuntszpak · November 29, 2018, 6:52am

I’ve managed to construct a minimal example which reproduces the allocation problem:

using BenchmarkTools, StaticArrays

function test_mem_ok(P,Q)
    return P[1]*Q[1] * P[2]*Q[2] * P[3]*Q[3] * P[4]*Q[4] * P[5]*Q[5] *   P[6]*Q[6]  *  P[7]*Q[7] *  P[8]*Q[8]  *  P[9]*Q[9]
end

function test_mem_bad(P,Q)
    return P[1]*Q[1] * P[2]*Q[2] * P[3]*Q[3] * P[4]*Q[4] * P[5]*Q[5] *   P[6]*Q[6]  *  P[7]*Q[7] *  P[8]*Q[8]  *  P[9]*Q[9] *  P[10]*Q[10]
end

function test_mem_fix(P,Q)
    return (P[1]*Q[1] * P[2]*Q[2] * P[3]*Q[3] * P[4]*Q[4] * P[5]*Q[5])  *   P[6]*Q[6]  *  P[7]*Q[7] *  P[8]*Q[8]  *  P[9]*Q[9] *  P[10]*Q[10]
end

function test_ok()
    A = @SVector [@SMatrix [1.0] for i = 1:10]
    C =  @SVector [adjoint(SVector(1.0)) for i = 1:10]
    test_mem_ok(A,C)
end

function test_bad()
    A = @SVector [@SMatrix [1.0] for i = 1:10]
    C =  @SVector [adjoint(SVector(1.0)) for i = 1:10]
    test_mem_bad(A,C)
end

function test_fix()
    A = @SVector [@SMatrix [1.0] for i = 1:10]
    C =  @SVector [adjoint(SVector(1.0)) for i = 1:10]
    test_mem_fix(A,C)
end

test_ok()
@time test_ok() #  0.000002 seconds (5 allocations: 176 bytes)
@code_warntype test_ok()

test_bad()
@time test_bad() #   0.000003 seconds (26 allocations: 512 bytes)
@code_warntype test_bad()

test_fix()
@time test_fix() #     0.000001 seconds (5 allocations: 176 bytes)
@code_warntype test_fix()

This seems to be an instance of the issue "afoldl causes many memory allocations".

There was a performance tip warning in the documentation about this issue which was subsequently removed.

Evizero shared some further insights in the issue that I originally opened.

Is this something I should report as an issue on JuliaLang?
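
For reference, the three test functions differ in how many operands reach a single `*` call. Julia parses an unparenthesized chain `a * b * c * ...` as one n-ary call, so the operand count can be read off the parsed expression. A minimal sketch using only Base; `chaintext` and `noperands` are hypothetical helpers, and `x1, x2, ...` are placeholder symbols standing in for the `P[i]` and `Q[i]` terms:

```julia
# Hypothetical helpers, not part of the original post.

# Text of a product chain over placeholder operands x<lo> * ... * x<hi>.
chaintext(lo, hi) = join(("x$i" for i in lo:hi), " * ")

# Number of operands passed to the outermost call of a parsed expression.
noperands(ex::Expr) = length(ex.args) - 1

ok  = Meta.parse(chaintext(1, 18))                                     # shape of test_mem_ok
bad = Meta.parse(chaintext(1, 20))                                     # shape of test_mem_bad
fix = Meta.parse("(" * chaintext(1, 10) * ") * " * chaintext(11, 20))  # shape of test_mem_fix

println((noperands(ok), noperands(bad), noperands(fix)))  # prints (18, 20, 11)
```

The parenthesized version hands the outermost `*` 11 operands instead of 20, which is consistent with the allocation counts reported above. This only shows the call shapes; it does not by itself explain why the longer call allocates.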

Ajaychat3 · November 29, 2018, 7:18am

Out of curiosity, why does @time show allocations when @btime does not?

using BenchmarkTools
const k = zeros(20)
function test_mem()
    c=0
    for i in 1: 10
        c += (k[i] *2)  
    end
    c
end

function test(n::Int64)
    ret = 0
    for i = 1:n
        ret += test_mem()
    end
    ret
end
@btime test(100000000)
@time   test(100000000)

926.639 ms (0 allocations: 0 bytes)
 0.941509 seconds (238 allocations: 14.453 KiB)

kristoffer.carlsson · December 2, 2018, 6:48pm

Run it twice

julia> @time   test(100000000)
  1.350088 seconds (238 allocations: 14.453 KiB)
0.0

julia> @time   test(100000000)
  1.309178 seconds (5 allocations: 176 bytes)
0.0

The 5 allocations are from @time itself.

Ajaychat3 · December 3, 2018, 1:19am

Thanks. Any specific reason we need to run @time twice but @btime just once to get the correct resultant memory allocation? I understand the function got compiled when I first ran @btime in sequence.

Sorry if this appears to be a basic level question.

kristoffer.carlsson · December 3, 2018, 1:23am

@btime runs the function many times and reports the result with the shortest run time.
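
With plain `@time` or `@elapsed`, the usual approach is to call the function once beforehand so that the measured call does not include compilation. A minimal sketch using only Base (the function `sum_doubled` is a hypothetical example, not code from this thread):

```julia
# Hypothetical example function; any freshly defined function behaves similarly.
function sum_doubled(xs)
    total = 0.0
    for x in xs
        total += 2x
    end
    return total
end

xs = zeros(20)

first_call  = @elapsed sum_doubled(xs)   # first call: includes compiling sum_doubled
second_call = @elapsed sum_doubled(xs)   # second call: runs already-compiled code

println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```

The first measurement is typically much larger than the second. Exact timings vary by machine and Julia version, so none are listed here.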

Ajaychat3 · December 3, 2018, 1:25am

Thanks for the clarification.
