Huge memory allocation

Hi,

I am trying to do some calculation which involves large array (float64 arrays with length 2^30). The array is large but I think still acceptable. However, when I tried to run the codes within some loops, the memory allocation increases quickly and it easily leads to out of memory error. I wonder if there is any way to reduce the memory allocation. A sample code is below

using LinearAlgebra
using StaticArrays
using StatsBase
using Statistics
using Random
using JLD
using Tullio
using Plots
using Printf


function recursion_node(sigmas,taus,theta1,theta2)

    M_SS_S =@SVector[(5/32+3*cos(2*theta1+2*theta2)/32),0,0,9*(1-cos(2*theta1+2*theta2))/32]

    M_TT_S =@SVector[15/32+9*cos(2*theta1+2*theta2)/32,0,0,3*(1-cos(2*theta1+2*theta2))/32]

    M_ST_T=@SVector[0,3/32+cos(2*theta1)/16+cos(2*theta2)/16+cos(2*theta1-2*theta2)/32,3/32-cos(2*theta1)/16-cos(2*theta2)/16+cos(2*theta1-2*theta2)/32,(1-cos(2*theta1-2*theta2))/8]

    M_TS_T=@SVector[0,3/32+cos(2*theta1)/16+cos(2*theta2)/16+cos(2*theta1-2*theta2)/32,3/32-cos(2*theta1)/16-cos(2*theta2)/16+cos(2*theta1-2*theta2)/32,(1-cos(2*theta1-2*theta2))/8]

    M_TT_T=@SVector[0,3/16+cos(2*theta1)/8+cos(2*theta2)/8+cos(2*theta1-2*theta2)/16,3/16-cos(2*theta1)/8-cos(2*theta2)/8+cos(2*theta1-2*theta2)/16,0]

    sigmas_new = [ M_SS_S[i] * kron(sigmas, sigmas)+ M_TT_S[i] * kron(taus, taus) for i in 1:4 ]

    sigmas_new = vcat(sigmas_new...)

    taus_new = [ M_ST_T[i] * kron(sigmas, taus) + M_TS_T[i] * kron(taus, sigmas) + M_TT_T[i] * kron(taus, taus) for i in 1:4 ]
  
    taus_new = vcat(taus_new...)

    return sigmas_new,taus_new


end

function dynamics(theta1,theta2)

    sigmas = [1/4]

    taus =[3/4]

    for i in 1:4

        sigmas, taus = recursion_node(sigmas, taus,theta1,theta2)

        ind = sigmas+taus .> 1e-16

        sigmas = sigmas[ind]

        taus = taus[ind]

        sigmas = sigmas./(sigmas.+taus)

        taus = taus./(sigmas.+taus)
    end

    return sigmas,taus
end

function main()

    pool = zeros(99*2,2)

    t = 1

    for i in 0:1
        for j in 1:1

            theta1 = i>j ? i*pi/(2*99) : pi-j*pi/(2*99)

            theta2 = i>j ? j*pi/(2*99) : pi*i/(2*99)

            sigma,tau = dynamics(theta1,theta2)

            order = sum(sigma.^2+tau.^2)/sum(sigma+tau)

            purity = sum(sigma.^2+tau.^2/3)/sum(sigma+tau)

            pool[t,1] = order
            pool[t,2] = purity

            t= t+1

        end
    end

    save("/xxx.xx","pool",pool) #my save path

    return #pool
end

@time main();

For instance, when I try to run the code above with i goes from 0 to 1 and j is just 1, the memory allocation is

94.263461 seconds (6.74 k allocations: 134.412 GiB, 8.81% gc time, 0.03% compilation time)

Is there a way to reduce the memory allocation? Thanks

Just some quick impressions looking over your comments and code:

A single Vector{Float64} of length 2^{30} should occupy around 8.5 GB in memory. However, your code makes tons of new allocations that you may not even be aware of. Julia uses a Garbage Collector (GC) for memory management that will recycle memory no longer in use but the recycling isn’t instantaneous.

For example, just looking at this block of code (comments added):

    for i in 1:4
        sigmas, taus = recursion_node(sigmas, taus,theta1,theta2)
        # allocates two vectors of some length N

        ind = sigmas+taus .> 1e-16
        # allocates another vector of length N

        sigmas = sigmas[ind]
        # allocates another vector of some length M < N

        taus = taus[ind]
        # allocates another vector of length M

        sigmas = sigmas./(sigmas.+taus)
        # allocates another vector of length M

        taus = taus./(sigmas.+taus)
        # allocates another vector length M
    end

Just within this single for i in 1:4 loop, it’s possible that you’re actually asking Julia to allocate 28x the memory needed to store a single vector.

The way that Julia stores arrays, you should think about the variable name as a label pointing to a block of memory containing the data. A statement like sigmas = sigmas./(sigmas.+taus) tells Julia to build the calculation on the right in a new block of memory and then re-assign the label sigmas to point to that new block. In this case, you probably want to also use the “in-place” assignment symbol .= to indicate the result should overwrite the original vector.

Another way to address memory usage here would be to pre-allocate all of the vectors you plan to use and then modify them in-place. However, if the length of sigmas and taus changes during each of the four loop cycles, then that will be trickier to implement.

Similarly, for code like order = sum(sigma.^2+tau.^2)/sum(sigma+tau), you should be aware that multiple allocations may be necessary to compute the intermediate steps (e.g. sigma.^2).

4 Likes

Hi Mike:

Thanks for the reply. I agree with your opinion. Is it possible to calculate sum(sigma.^2+tau.^2)/sum(sigma+tau) without new memory allocation? A for loop may be the candidate, but it may make the speed decreases a lot. Thanks

Why would you think this? sum itself has an underlying for-loop too. A for-loop doesn’t inherently have allocations, it’s your operations like sigma.^2 that allocate memory for the result. The compiler evidently doesn’t elide those allocations for you, and no compiler can do this automatically in general. If you know some constraints, like the vectors having the same lengths or your algorithm not needing to store initial and intermediate results, then you can exploit those constraints to reuse allocated memory, like mike.ingold’s great suggestions. You do have to learn to recognize where you explicitly allocate and where you do in-place mutation.

I’m not a broadcasting expert but you might get some improvement here by simply using .+ additions within the sums.

I don’t have a terminal handy but you could also try replacing sum(sigma.^2 + tau.^2) with a something like mapreduce(+, (s,t) -> s^2 + t^2, sigma, tau) where you perform the interior function element-wise along the vectors and sum the results.

2 Likes

For reference, you might be interested in this excellent article by @stevengj: More Dots: Syntactic Loop Fusion in Julia (julialang.org).

1 Like

In some languages (Python and MATLAB, for example), for loops are have high overhead so are slow for “simple” operations. In such languages, the use of vectorized builtins is necessary for performance. This is not the case in Julia. Builtin functions in Julia are mostly written by experts, so sometimes use “advanced” tools for performance (such as @simd in sum), but these tools are available to you in your own code as well.

But the above recommendation with mapreduce would be my preferred solution here.

3 Likes

(s,t) -> s^2 + t^2 is exactly the lowered elementwise function in broadcasting sigma.^2 .+ tau.^2, only mapreduce does a reduction operation instead of storing elementwise outputs in an output array.

2 Likes

I wasn’t totally aware of the underlying mechanics here, so thanks for the clarification.

This was more my thinking here, that a mapreduce shouldn’t need to allocate the entire vector length since it should be able to reduce as it goes.

I believe that the first and second arguments to mapreduce should be interchanged above, as in mapreduce((s,t) -> s^2 + t^2, +, sigma, tau).

Testing this shows that it does indeed allocate:

using BenchmarkTools
n = 1000
sigma = rand(n)
tau = rand(n)

f1(sigma, tau) = mapreduce((s, t) -> s^2 + t^2, +, sigma, tau) 
julia> @btime f1($sigma, $tau)
  1.260 ÎĽs (5 allocations: 8.05 KiB)
656.7520684102655

I don’t understand why this allocates, but here’s a simple alternative that’s actually faster and doesn’t allocate:

f2(sigma, tau) = sum(s^2 + t^2 for (s,t) in zip(sigma, tau))
julia> @btime f2($sigma, $tau)
  886.000 ns (0 allocations: 0 bytes)
656.7520684102652

I’d appreciate help in understanding why the first example allocates.

1 Like

Good catch @PeterSimon.

At least according to the docstring for mapreduce,

mapreduce is functionally equivalent to calling reduce(op, map(f, itr); init=init) , but will in general execute faster since no intermediate collection needs to be created.

so that allocation is interesting.

Answering my own question, using @edit mapreduce((s, t) -> s^2 + t^2, +, sigma, tau) leads to

mapreduce(f, op, A::AbstractArrayOrBroadcasted...; kw...) =
    reduce(op, map(f, A...); kw...)

on line 359 of reducedim.jl. So it’s just using reduce with map and therefore allocating!

1 Like

Following @PeterSimon’s observation and avoiding mapreduce, the most straightforward way is the following:

function foo1(sigma, tau)
    numerator = sum(s^2 + t^2 for (s, t) in zip(sigma, tau))
    denominator = sum(s + t for (s, t) in zip(sigma, tau))
    return numerator / denominator
end

However, if you want maximum performance you have to combine the two passes over the arrays into one. For this, I’d just write out the loop manually; as others have pointed out, for loops are fast in julia and are almost always preferred over coercing your computation into a sequence of vectorized operations like you may be used to from matlab and numpy.

However, the risk when manually writing out reductions similar to sum is that you screw up performance by initializing with a zero of the wrong type. Avoiding this isn’t difficult, but requires some awareness, especially if your code may be used with unconventional number types. Here’s how I would write it:

function foo2(sigma, tau)
    zsigma, ztau = zero(eltype(sigma)), zero(eltype(tau))
    numerator = zsigma^2 + ztau^2
    denominator = zsigma + ztau
    for (s, t) in zip(sigma, tau)
        numerator += s^2 + t^2
        denominator += s + t
    end
    return numerator / denominator
end

However, if you like hacks, there’s also the following:

function foo3(sigma, tau)
    r = sum(complex(s^2 + t^2, s + t) for (s, t) in zip(sigma, tau))
    return real(r) / imag(r)
end

Benchmarks:

julia> using BenchmarkTools

julia> n = 1000; sigma = rand(n); tau = rand(n);

julia> @btime foo1($sigma, $tau)  # two `sum` calls
  2.144 ÎĽs (0 allocations: 0 bytes)
0.6656392400466352

julia> @btime foo2($sigma, $tau)  # manual loop
  1.057 ÎĽs (0 allocations: 0 bytes)
0.6656392400466352

julia> @btime foo3($sigma, $tau)  # `complex` hack
  1.077 ÎĽs (0 allocations: 0 bytes)
0.6656392400466352
4 Likes

This conversation is descending into some nuance and more complicated optimizations, so I’d like to reiterate and clarify for @jisutich just in case:

Your first step should simply be to try to make more use of broadcasting and in-place assignments. Depending on how memory-constrained you are, that might be enough to get you in under the wire.

Also, it’s not a best practice, but you can manually trigger Julia’s garbage collector via GC.gc() (documentation here). Unused vector data still held in memory should then be flushed, so you might try inserting this, e.g., at the start or end of the for i in 1:4 loop. It’ll probably slow things down a bit, but can be a brute-force way to shed some out-of-use memory.

4 Likes

What?! Is that the generic method for mapreduce or some particular specialization? I don’t see how that wouldn’t allocate (in contrast to the docstring assertion) barring maybe some compiler wizardry that maybe isn’t getting triggered here.

It’s the method dispatched to in this case, out of the six methods listed:

julia> methods(mapreduce)
# 6 methods for generic function "mapreduce" from Base:
 [1] mapreduce(f, op, A::Union{Base.AbstractBroadcasted, AbstractArray}; dims, init)
     @ reducedim.jl:357
 [2] mapreduce(f, op, A::Union{Base.AbstractBroadcasted, AbstractArray}...; kw...)
     @ reducedim.jl:359
 [3] mapreduce(f, op, itr::Base.SkipMissing{<:AbstractArray})
     @ missing.jl:283
 [4] mapreduce(f, op, a::Number)
     @ reduce.jl:451
 [5] mapreduce(f, op, itr; kw...)
     @ reduce.jl:307
 [6] mapreduce(f, op, itrs...; kw...)
     @ reduce.jl:308

Based on a very quick look over that source file, it seems like mapreduce over multiple input arguments just falls back to an allocating version, but there other methods look as expected.

There’s some nuance to this. The example code could surely benefit from some preallocation and in-place operations. Still, algorithmic improvements to avoid temporary storage altogether are usually preferable when possible, especially when the arrays are this large—RAM is finite, and virtual memory is slow. The sum(...) / sum(...) calculation is a prime example of something that’s best done without temporaries. So the first step is perhaps to figure out which temporary arrays are needed (so should be preallocated) and which ones can be avoided.

2 Likes