Excessive Memory Allocation?

I have a program that reads in two matrices, performs a bit of linear algebra in a pair of nested loops, then outputs a new matrix. Because it eventually needs to operate on very large input matrices, I’ve been testing its performance and found that it is quite a bit slower than I expected, likely due to a very large number of memory allocations that I do not understand.

This is a simplified version of my code:

using BenchmarkTools: @benchmark 
using LinearAlgebra: norm, dot, cross
A = rand(3, 1000) 
B = rand(7, 1000) 

function f(A::AbstractArray{Float64}, B::AbstractArray{Float64})

    # Preallocate variables outside of loops; for this example, this is a total 
    # of (3000 + 4*3 Float64) * 8 bytes/Float64 ~= 24kB (?)
    C = zeros(size(A)) 
    x = zeros(3) 
    y = zeros(3) 
    z = zeros(3)
    zcrossx = zeros(3) 

    for j in axes(B)[2] 
        for i in axes(A)[2] 

            x = B[4:6,j] - B[1:3,j]
            y = B[1:3,j] - A[:,i] 
            z = B[4:6,j] - A[:,i]
            zcrossx = cross(z,x)

            C[:,i] += zcrossx * B[7,j] *(1/(norm(zcrossx)^2)) * (dot(x,z)/norm(z) - dot(x,y)/norm(y))

        end
    end

    return C
end

@benchmark f($A, $B)

When running it, I get this output:

BenchmarkTools.Trial: 12 samples with 1 evaluation.
 Range (min … max):  423.763 ms … 436.830 ms  ┊ GC (min … max): 11.80% … 12.91%
 Time  (median):     426.853 ms               ┊ GC (median):    12.24%
 Time  (mean ± σ):   427.625 ms ±   3.538 ms  ┊ GC (mean ± σ):  12.37% ±  0.43%

  ▁   ▁ █    ▁ ▁ ▁▁      ▁ ▁      ▁                           ▁  
  █▁▁▁█▁█▁▁▁▁█▁█▁██▁▁▁▁▁▁█▁█▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  424 ms           Histogram: frequency by time          437 ms <

 Memory estimate: 1.12 GiB, allocs estimate: 15000006.

Per the quick math in the comments of my script, I assume the heap memory allocation would be on the order of 24 kB, not over 1 GB!

Can anyone explain what’s causing this behavior? Is there a simple fix to dramatically reduce this memory allocation?

Slices like B[4:6,j] and A[:,i] all create a copy, since indexing with ranges in Julia copies by default; use @views to turn them into lightweight views into the existing memory.
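A quick sketch of the difference (the variable names here are just for illustration):

```julia
B = rand(7, 1000)

s = B[1:3, 1]        # range indexing allocates a new Vector: a copy
v = @view B[1:3, 1]  # a SubArray that shares B's memory, no copy

s[1] = -1.0          # modifies only the copy
v[1] = -2.0          # writes through to B itself

@assert B[1, 1] == -2.0   # the view aliased B
@assert s[1] == -1.0      # the copy did not touch B
```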


Rewrite the function as

@views function f(A::AbstractArray{Float64}, B::AbstractArray{Float64})

so that every slice inside it becomes a view rather than a copy.
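As a minimal sketch of what the macro does (col_first is a made-up helper, just for illustration):

```julia
# Applying @views to a definition turns every slice in its body into a view:
@views col_first(M, j) = M[1:2, j]

M = reshape(collect(1.0:8.0), 2, 4)
v = col_first(M, 2)

@assert v isa SubArray      # a view into M, not a copy
@assert v == [3.0, 4.0]     # column-major: column 2 holds 3.0 and 4.0
```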

I’d also recommend allocating the temporaries without zero-filling them, since the loop overwrites their contents anyway.
You could use

 y = similar(x)
 z = similar(x)
 zcrossx = similar(x)

You can use the same logic with undef for x = zeros(3). (C is the exception: it must stay zero-initialized, because the loop accumulates into it with +=.)
For instance,
x = Vector{Float64}(undef,3)

In fact, given that x is small and of fixed size, you could use a tuple or an SVector (from the package StaticArrays), which do not heap-allocate.
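For instance, a tuple-based cross product never touches the heap (cross3 is a hypothetical helper, not part of LinearAlgebra):

```julia
# NTuple{3,Float64} is immutable and of fixed size, so it can live on the
# stack or in registers; no heap allocation per call.
cross3(a, b) = (a[2]*b[3] - a[3]*b[2],
                a[3]*b[1] - a[1]*b[3],
                a[1]*b[2] - a[2]*b[1])

x = (1.0, 0.0, 0.0)
y = (0.0, 1.0, 0.0)

@assert cross3(x, y) == (0.0, 0.0, 1.0)
```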


As @Sukera and @alfaromartino explained:

  • expressions like B[4:6, j] create a copy of a subarray of B; you can use views instead
  • expressions like x = something rebind the name x to a new object, instead of reusing the memory already allocated for x. You can use broadcast assignment (.=) to mutate the destination array in place
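The second point can be seen in a few lines (a sketch with throwaway variable names):

```julia
x = zeros(3)
old = x

x = [1.0, 2.0, 3.0]    # rebinds the name x to a freshly allocated array
@assert x !== old      # the original buffer was abandoned, not reused

x = old
x .= (1.0, 2.0, 3.0)   # broadcast assignment fills the existing buffer
@assert x === old && x == [1.0, 2.0, 3.0]
```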

Taking this into account, you can rewrite your algorithm like this:

# An in-place version of `cross`. I'm surprised it does not already exist?
function cross!(c, a, b)
    if !(length(a) == length(b) == 3)
        throw(DimensionMismatch("cross product is only defined for vectors of length 3"))
    end
    a1, a2, a3 = a
    b1, b2, b3 = b
    c .= (a2*b3-a3*b2, a3*b1-a1*b3, a1*b2-a2*b1)
end

function f2(A::AbstractArray{Float64}, B::AbstractArray{Float64})

    # Preallocate variables outside of loops; for this example, this is a total
    # of (3000 + 4*3 Float64) * 8 bytes/Float64 ~= 24kB (?)
    C = zeros(size(A))
    x = zeros(3)
    y = zeros(3)
    z = zeros(3)
    zcrossx = zeros(3)

    for j in axes(B)[2]
        for i in axes(A)[2]

            @views x .= B[4:6,j] .- B[1:3,j]
            @views y .= B[1:3,j] .- A[:,i]
            @views z .= B[4:6,j] .- A[:,i]
            cross!(zcrossx, z, x)

            @views C[:,i] .+= zcrossx .* B[7,j] .* (1/(norm(zcrossx)^2)) .* (dot(x,z)/norm(z) .- dot(x,y)/norm(y))

        end
    end

    return C
end

I get:

julia> @benchmark f2($A, $B)
BenchmarkTools.Trial: 108 samples with 1 evaluation.
 Range (min … max):  45.657 ms …  49.530 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     46.269 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   46.589 ms ± 857.378 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▃██▁▄ ▁                                                  
  ▄▃▃▇▆█████▇█▆▇▆▁▆▇▄▁▃▄▁▁▃▃▃▃▁▃▁▁▃▁▁▁▁▃▃▁▃▃▁▁▁▁▃▁▃▁▄▁▃▃▁▁▁▁▁▃ ▃
  45.7 ms         Histogram: frequency by time         49.4 ms <

 Memory estimate: 23.80 KiB, allocs estimate: 6.

This confirms your estimate of the size of the preallocated arrays.


Thank you! This worked perfectly!

I did not realize that broadcast assignment with .= mutates the destination array in place instead of rebinding the variable.

Setting aside cross!() being an in-place function: is the reason that the built-in LinearAlgebra.cross() requires a memory allocation that it returns a Vector, whereas norm() and dot() return scalars?

Thank you all for the help!

Yes exactly. More technically, cross returns a vector, which will need heap-allocated memory to store the coefficients. Whereas norm or dot return scalars, which can typically be stack allocated (or not even require any memory at all if they are only stored in registers).

cross! here avoids heap allocations by re-using the preallocated memory in zcrossx. Another option (already mentioned above) would be to return a StaticArrays.SVector (or a plain NTuple) instead of standard arrays: because such data structures are immutable and of fixed size (3 in this case) and the compiler knows about it, they can be stack-allocated.
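The difference is easy to observe with @allocated (a sketch; exact byte counts vary across Julia versions, so only the zero vs. non-zero distinction matters here):

```julia
using LinearAlgebra: cross, norm

# Measure inside a function so the arguments have concrete types.
measure(a, b) = (@allocated(cross(a, b)), @allocated(norm(a)))

a = rand(3); b = rand(3)
measure(a, b)                           # first call triggers compilation
alloc_cross, alloc_norm = measure(a, b)

@assert alloc_cross > 0    # cross returns a fresh heap-allocated Vector
@assert alloc_norm == 0    # norm returns a scalar; nothing to allocate
```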


[Theoretical speculation]
As you say, scalars are “typically” (almost always) stack-allocated, since they always have the same size; but that can apply to vectors too, at least in many cases.

The problem is that the Vector type can in general have any size, whereas what you get from cross is a fixed-size 2D or 3D vector. We do not have StaticArrays (or StaticArraysCore) in Base; maybe we should, and it might help here. I also think Bumper.jl could turn those heap allocations into stack-like allocations, even using the same Vector type rather than StaticArrays. But while it is easy to use, it is not automatic; ideally the compiler would do this without code changes or new syntax. I asked whether Bumper should be added to Base, and the conclusion was that it is best not to, since it is dangerous if used incorrectly. If its use were automatic, though, it would not be: when your code fits a pattern the compiler can handle you get stack allocation, and when it does not you still get heap allocation. So it turns into an optimization problem for the programmer, trying to satisfy the compiler; I think this could also eliminate the need for @view in some cases.

Small 2D and 3D vectors and cross are common enough that it would be good if this were possible (yes, in general vectors can be of arbitrary size n, but usually n is rather low?). Until then, Julia code CAN be tuned with more manual effort.