# Huge memory allocation

Hi,

I am trying to do some calculation that involves large arrays (`Float64` arrays of length 2^30). The arrays are large but I think still acceptable. However, when I run the code within some loops, the memory allocation grows quickly and easily leads to an out-of-memory error. I wonder if there is any way to reduce the memory allocation. A sample code is below:

```julia
using LinearAlgebra
using StaticArrays
using StatsBase
using Statistics
using Random
using JLD
using Tullio
using Plots
using Printf

function recursion_node(sigmas, taus, theta1, theta2)
    M_SS_S = @SVector [5/32 + 3*cos(2*theta1 + 2*theta2)/32, 0, 0, 9*(1 - cos(2*theta1 + 2*theta2))/32]
    M_TT_S = @SVector [15/32 + 9*cos(2*theta1 + 2*theta2)/32, 0, 0, 3*(1 - cos(2*theta1 + 2*theta2))/32]
    M_ST_T = @SVector [0, 3/32 + cos(2*theta1)/16 + cos(2*theta2)/16 + cos(2*theta1 - 2*theta2)/32, 3/32 - cos(2*theta1)/16 - cos(2*theta2)/16 + cos(2*theta1 - 2*theta2)/32, (1 - cos(2*theta1 - 2*theta2))/8]
    M_TS_T = @SVector [0, 3/32 + cos(2*theta1)/16 + cos(2*theta2)/16 + cos(2*theta1 - 2*theta2)/32, 3/32 - cos(2*theta1)/16 - cos(2*theta2)/16 + cos(2*theta1 - 2*theta2)/32, (1 - cos(2*theta1 - 2*theta2))/8]
    M_TT_T = @SVector [0, 3/16 + cos(2*theta1)/8 + cos(2*theta2)/8 + cos(2*theta1 - 2*theta2)/16, 3/16 - cos(2*theta1)/8 - cos(2*theta2)/8 + cos(2*theta1 - 2*theta2)/16, 0]

    sigmas_new = [M_SS_S[i] * kron(sigmas, sigmas) + M_TT_S[i] * kron(taus, taus) for i in 1:4]
    sigmas_new = vcat(sigmas_new...)

    taus_new = [M_ST_T[i] * kron(sigmas, taus) + M_TS_T[i] * kron(taus, sigmas) + M_TT_T[i] * kron(taus, taus) for i in 1:4]
    taus_new = vcat(taus_new...)

    return sigmas_new, taus_new
end

function dynamics(theta1, theta2)
    sigmas = [1/4]
    taus = [3/4]

    for i in 1:4
        sigmas, taus = recursion_node(sigmas, taus, theta1, theta2)
        ind = sigmas + taus .> 1e-16
        sigmas = sigmas[ind]
        taus = taus[ind]
        sigmas = sigmas ./ (sigmas .+ taus)
        taus = taus ./ (sigmas .+ taus)
    end

    return sigmas, taus
end

function main()
    pool = zeros(99*2, 2)
    t = 1

    for i in 0:1
        for j in 1:1
            theta1 = i > j ? i*pi/(2*99) : pi - j*pi/(2*99)
            theta2 = i > j ? j*pi/(2*99) : pi*i/(2*99)

            sigma, tau = dynamics(theta1, theta2)

            order = sum(sigma.^2 + tau.^2) / sum(sigma + tau)
            purity = sum(sigma.^2 + tau.^2/3) / sum(sigma + tau)

            pool[t, 1] = order
            pool[t, 2] = purity

            t = t + 1
        end
    end

    save("/xxx.xx", "pool", pool) # my save path

    return #pool
end

@time main();
```

For instance, when I run the code above with `i` going from 0 to 1 and `j` fixed at 1, `@time` reports

```
94.263461 seconds (6.74 k allocations: 134.412 GiB, 8.81% gc time, 0.03% compilation time)
```

Is there a way to reduce the memory allocation? Thanks

A single `Vector{Float64}` of length `2^30` occupies 8 GiB (about 8.6 GB) of memory. However, your code makes tons of new allocations that you may not even be aware of. Julia uses a garbage collector (GC) for memory management that will recycle memory no longer in use, but the recycling isn't instantaneous.
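As a quick sanity check on that size:

```julia
bytes = sizeof(Float64) * 2^30  # 8 bytes per element × 2^30 elements
bytes / 2^30                    # 8.0 (GiB)
```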

```julia
for i in 1:4
    sigmas, taus = recursion_node(sigmas, taus, theta1, theta2)
    # allocates two vectors of some length N

    ind = sigmas + taus .> 1e-16
    # allocates another vector of length N

    sigmas = sigmas[ind]
    # allocates another vector of some length M < N

    taus = taus[ind]
    # allocates another vector of length M

    sigmas = sigmas ./ (sigmas .+ taus)
    # allocates another vector of length M

    taus = taus ./ (sigmas .+ taus)
    # allocates another vector of length M
end
```

Just within this single `for i in 1:4` loop, it's possible that you're actually asking Julia to allocate 28x the memory needed to store a single vector.

The way that Julia stores arrays, you should think of a variable name as a label pointing to a block of memory containing the data. A statement like `sigmas = sigmas./(sigmas.+taus)` tells Julia to build the result of the right-hand side in a new block of memory and then re-bind the label `sigmas` to point to that new block. In this case, you probably want to use the "in-place" assignment operator `.=` instead, to indicate that the result should overwrite the original vector.
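As a sketch of the difference (toy vectors, not your actual data):

```julia
x = rand(4)
y = rand(4)

x = x ./ (x .+ y)   # builds the result in a NEW block of memory, then re-binds `x` to it
x .= x ./ (x .+ y)  # fused broadcast: writes each element directly into the existing vector
```

With `.=` the whole right-hand side fuses into a single elementwise loop, so each `x[i]` is read before it is overwritten and no temporary copy of the result is needed.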

Another way to address memory usage here would be to pre-allocate all of the vectors you plan to use and then modify them in-place. However, if the length of `sigmas` and `taus` changes during each of the four loop cycles, then that will be trickier to implement.
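For the normalization step specifically, here is a hedged sketch of the in-place idea (the helper name `normalize_pair!` is hypothetical, just for illustration):

```julia
function normalize_pair!(sigmas::AbstractVector, taus::AbstractVector)
    # Overwrite both vectors element by element; no new arrays are allocated.
    for k in eachindex(sigmas, taus)
        s = sigmas[k] + taus[k]  # compute the sum once, before either entry is modified
        sigmas[k] /= s
        taus[k] /= s
    end
    return sigmas, taus
end
```

As a side benefit, computing the sum once per element also sidesteps the subtlety in the original code, where `taus` is divided by a sum involving the already-normalized `sigmas`.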

Similarly, for code like `order = sum(sigma.^2+tau.^2)/sum(sigma+tau)`, you should be aware that multiple allocations may be necessary to compute the intermediate steps (e.g. `sigma.^2`).

4 Likes

Hi Mike:

Thanks for the reply. I agree with your points. Is it possible to calculate `sum(sigma.^2+tau.^2)/sum(sigma+tau)` without new memory allocation? A for loop might work, but I worry it would slow things down a lot. Thanks

Why would you think this? `sum` itself has an underlying for-loop too. A for-loop doesn't inherently have allocations; it's your operations like `sigma.^2` that allocate memory for the result. The compiler evidently doesn't elide those allocations for you, and no compiler can do this automatically in general. If you know some constraints, like the vectors having the same lengths or your algorithm not needing to store initial and intermediate results, then you can exploit those constraints to reuse allocated memory, like mike.ingold's great suggestions. You do have to learn to recognize where you explicitly allocate and where you do in-place mutation.

I'm not a broadcasting expert but you might get some improvement here by simply using `.+` additions within the `sum`s.
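Concretely, the point is about temporaries, not the result (a sketch):

```julia
sigma = rand(1000)
tau   = rand(1000)

a = sum(sigma.^2 + tau.^2)   # three temporaries: sigma.^2, tau.^2, and their elementwise sum
b = sum(sigma.^2 .+ tau.^2)  # dots fuse into ONE elementwise loop producing a single temporary
# a == b
```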

I don't have a terminal handy but you could also try replacing `sum(sigma.^2 + tau.^2)` with something like `mapreduce(+, (s,t) -> s^2 + t^2, sigma, tau)`, where you apply the interior function element-wise along the vectors and sum the results.

2 Likes

For reference, you might be interested in this excellent article by @stevengj: More Dots: Syntactic Loop Fusion in Julia (julialang.org).

1 Like

In some languages (Python and MATLAB, for example), `for` loops have high overhead, so they are slow for "simple" operations. In such languages, the use of vectorized builtins is necessary for performance. This is not the case in Julia. Builtin functions in Julia are mostly written by experts, so they sometimes use "advanced" tools for performance (such as `@simd` in `sum`), but these tools are available to you in your own code as well.

But the above recommendation with `mapreduce` would be my preferred solution here.

3 Likes

`(s,t) -> s^2 + t^2` is exactly the lowered elementwise function in broadcasting `sigma.^2 .+ tau.^2`, only `mapreduce` does a reduction operation instead of storing elementwise outputs in an output array.
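As a small illustration of that equivalence (a sketch of the lowering, not the literal generated code):

```julia
sigma = [1.0, 2.0]
tau   = [3.0, 4.0]

v1 = sigma.^2 .+ tau.^2                            # fused broadcast, stores an output array
v2 = broadcast((s, t) -> s^2 + t^2, sigma, tau)    # essentially what the dot syntax lowers to
r  = mapreduce((s, t) -> s^2 + t^2, +, sigma, tau) # same elementwise function, folded with +
# v1 == v2 == [10.0, 20.0]; r == 30.0
```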

2 Likes

I wasn't totally aware of the underlying mechanics here, so thanks for the clarification.

This was more my thinking here, that a `mapreduce` shouldn't need to allocate the entire vector length since it should be able to reduce as it goes.

I believe that the first and second arguments to `mapreduce` should be interchanged above, as in `mapreduce((s,t) -> s^2 + t^2, +, sigma, tau)`.

Testing this shows that it does indeed allocate:

```julia
using BenchmarkTools
n = 1000
sigma = rand(n)
tau = rand(n)

f1(sigma, tau) = mapreduce((s, t) -> s^2 + t^2, +, sigma, tau)
```

```julia
julia> @btime f1($sigma, $tau)
  1.260 μs (5 allocations: 8.05 KiB)
656.7520684102655
```

I don't understand why this allocates, but here's a simple alternative that's actually faster and doesn't allocate:

```julia
f2(sigma, tau) = sum(s^2 + t^2 for (s, t) in zip(sigma, tau))
```

```julia
julia> @btime f2($sigma, $tau)
  886.000 ns (0 allocations: 0 bytes)
656.7520684102652
```

I'd appreciate help in understanding why the first example allocates.

1 Like

Good catch @PeterSimon.

At least according to the docstring for `mapreduce`:

> `mapreduce` is functionally equivalent to calling `reduce(op, map(f, itr); init=init)`, but will in general execute faster since no intermediate collection needs to be created.

so that allocation is interesting.

Answering my own question, using `@edit mapreduce((s, t) -> s^2 + t^2, +, sigma, tau)` leads to

```julia
mapreduce(f, op, A::AbstractArrayOrBroadcasted...; kw...) =
    reduce(op, map(f, A...); kw...)
```

on line 359 of reducedim.jl. So it's just using `reduce` with `map` and therefore allocating!
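A hedged way to see the difference from the REPL (exact byte counts vary by Julia version, so none are claimed here):

```julia
sigma = rand(1000)
tau   = rand(1000)

g1(v)    = mapreduce(x -> x^2, +, v)                # single array: reduces as it goes
g2(s, t) = mapreduce((a, b) -> a^2 + b^2, +, s, t)  # varargs: falls back to reduce(op, map(f, ...))

g1(sigma); g2(sigma, tau)   # warm up so @allocated doesn't measure compilation
@allocated g1(sigma)        # compare this...
@allocated g2(sigma, tau)   # ...with this: the `map` creates an intermediate vector
```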

1 Like

Following @PeterSimon's observation and avoiding `mapreduce`, the most straightforward way is the following:

```julia
function foo1(sigma, tau)
    numerator = sum(s^2 + t^2 for (s, t) in zip(sigma, tau))
    denominator = sum(s + t for (s, t) in zip(sigma, tau))
    return numerator / denominator
end
```

However, if you want maximum performance you have to combine the two passes over the arrays into one. For this, I'd just write out the loop manually; as others have pointed out, for loops are fast in Julia and are almost always preferred over coercing your computation into a sequence of vectorized operations like you may be used to from MATLAB and NumPy.

However, the risk when manually writing out reductions similar to `sum` is that you screw up performance by initializing with a zero of the wrong type. Avoiding this isn't difficult, but requires some awareness, especially if your code may be used with unconventional number types. Here's how I would write it:

```julia
function foo2(sigma, tau)
    zsigma, ztau = zero(eltype(sigma)), zero(eltype(tau))
    numerator = zsigma^2 + ztau^2
    denominator = zsigma + ztau
    for (s, t) in zip(sigma, tau)
        numerator += s^2 + t^2
        denominator += s + t
    end
    return numerator / denominator
end
```

However, if you like hacks, there's also the following:

```julia
function foo3(sigma, tau)
    r = sum(complex(s^2 + t^2, s + t) for (s, t) in zip(sigma, tau))
    return real(r) / imag(r)
end
```

Benchmarks:

```julia
julia> using BenchmarkTools

julia> n = 1000; sigma = rand(n); tau = rand(n);

julia> @btime foo1($sigma, $tau)  # two `sum` calls
  2.144 μs (0 allocations: 0 bytes)
0.6656392400466352

julia> @btime foo2($sigma, $tau)  # manual loop
  1.057 μs (0 allocations: 0 bytes)
0.6656392400466352

julia> @btime foo3($sigma, $tau)  # `complex` hack
  1.077 μs (0 allocations: 0 bytes)
0.6656392400466352
```
4 Likes

This conversation is descending into some nuance and more complicated optimizations, so I'd like to reiterate and clarify for @jisutich just in case:

Your first step should simply be to try to make more use of broadcasting and in-place assignments. Depending on how memory-constrained you are, that might be enough to get you in under the wire.

Also, it's not a best practice, but you can manually trigger Julia's garbage collector via `GC.gc()` (documentation here). Unused vector data still held in memory should then be flushed, so you might try inserting this, e.g., at the start or end of the `for i in 1:4` loop. It'll probably slow things down a bit, but can be a brute-force way to shed some out-of-use memory.
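A toy sketch of the brute-force idea (the function name and sizes are arbitrary, just for illustration):

```julia
function churn(n)
    total = 0.0
    for i in 1:4
        v = rand(n)   # a large temporary is created on each iteration
        total += sum(v)
        GC.gc()       # force a full collection so vectors from earlier iterations are freed now
    end
    return total
end

churn(10^6)
```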

4 Likes

What?! Is that the generic method for `mapreduce` or some particular specialization? I don't see how that wouldn't allocate (in contrast to the docstring assertion) barring maybe some compiler wizardry that maybe isn't getting triggered here.

It's the method dispatched to in this case, out of the six methods listed:

```julia
julia> methods(mapreduce)
# 6 methods for generic function "mapreduce" from Base:
[1] mapreduce(f, op, A::Union{Base.AbstractBroadcasted, AbstractArray}; dims, init)
    @ reducedim.jl:357
[2] mapreduce(f, op, A::Union{Base.AbstractBroadcasted, AbstractArray}...; kw...)
    @ reducedim.jl:359
[3] mapreduce(f, op, itr::Base.SkipMissing{<:AbstractArray})
    @ missing.jl:283
[4] mapreduce(f, op, a::Number)
    @ reduce.jl:451
[5] mapreduce(f, op, itr; kw...)
    @ reduce.jl:307
[6] mapreduce(f, op, itrs...; kw...)
    @ reduce.jl:308
```

Based on a very quick look over that source file, it seems like `mapreduce` over multiple input arguments just falls back to an allocating version, but the other methods look as expected.

There's some nuance to this. The example code could surely benefit from some preallocation and in-place operations. Still, algorithmic improvements that avoid temporary storage altogether are usually preferable when possible, especially when the arrays are this large: RAM is finite, and virtual memory is slow. The `sum(...) / sum(...)` calculation is a prime example of something that's best done without temporaries. So the first step is perhaps to figure out which temporary arrays are needed (and so should be preallocated) and which ones can be avoided entirely.

2 Likes