Understanding memory usage in Julia

Andrea_Vigliotti · July 15, 2023, 8:20am

I am working on a finite element program that solves a non linear problem by incrementing the boundary conditions, that means that from time to time I need to assemble some quite large stiffness matrices (\approx 350k \times 350k) and solve linear algebra problems.

I am surprised of how much memory the program actually uses compared to what I expected and I am trying to understand why that is happening.

Before starting the main loop of the program, after the mesh has been read from the input file and all necessary data structure have been created I check the memory occupancy of all my variables with the @show_local macro by @cgeoga from this post and I get a total usage of about 1.5gb, which is what I expected for a 113k elements model, but if I check the output from pmap on that process it says 30gb!

But that’s not all, as the calculations proceed the memory usage from pmap grows seemingly out of control reaching in excess of 100gb after two hundred increments, when it gets killed by OS

During the main loop the program needs to update a few stiffness matrices and solve linear problems, but everything happens inside functions, so temporary variables should be taken care of by the garbage collection, as no additional matrix or vectors are created but the existing updated.

I tried to force garbage collection invoking GC.gc() at the end of every load increment, but things didn’t change.

I am using SparseArrays to store the stiffness matrix and LienearAlgebra.cholesky factorization to solve the linear algebra problems, I also store the cholesky factorization for re-use.

Everything runs on a node of a cluster, and I asked for two cores and 96gb,

can anyone help to understand what is happening and is it possible to contain the memory usage of a program of this type?

lmiq · July 15, 2023, 11:49am

Are you using that function or the in-place version? This one

Andrea_Vigliotti · July 15, 2023, 3:21pm

I think what I am doing is equivalent to the one you pointed out, this is what I am doing:

cholB1KtB1ff = let 
  Ktu      = makeKt(Φ, elems, u, d)
  B1KtB1ff = Symmetric(transpose(B1[:,ifree])*(λT/Δt^2*MM+λV*Ktu)*B1[:,ifree])
  cholesky(B1KtB1ff)
end

this should mean that neither Ktu nor B1KtB1ff are kept, and memory should be freed after the end, is that correct ?

would this be any different?

cholB1KtB1ff = let 
  Ktu      = makeKt(Φ, elems, u, d)
  cholesky!(Symmetric(transpose(B1[:,ifree])*(λT/Δt^2*MM+λV*Ktu)*B1[:,ifree]))
end

where MM has the same dimensions of Ktu, is assembled once and never changes, and the other are scalars.

nsajko · July 15, 2023, 3:54pm

If you want specific help it would probably be best if you included some self-contained code. As far as general tips go, check out the Performance tips section in the manual. In particular, try doing allocation profiling (very convenient in the (VS)Code editor, just do @profview_allocs some_call()).

The latest Julia release also has a feature, seemingly still not documented, to trigger frequent garbage collection after a certain threshold. For example: julia --heap-size-hint=10G.

If I remember correctly, sometimes it was necessary to invoke this call multiple times in a row.

No, Julia isn’t C++. The memory gets freed when the garbage collector decides so.

Andrea_Vigliotti · July 15, 2023, 4:09pm

I think I am following all of the guidelines of Performance tips, especially when it comes to enclosing code in functions, reusing vectors, and not having globals, I will look at @profview_allocs some_call()) to see if I can get a better understanding of what is happening

About invoking GC.gc() multiple times, at the moment it is done at every increment of the boundary conditions, which is more often than the stiffness matrix factorization gets updated. But still the memory footprint of the program is from 50 to 100 times the expected

nsajko · July 15, 2023, 4:11pm

Just remember to run with @profview_allocs an additional time before doing it for real, to get everything to compile.

Ralph_Smith · July 15, 2023, 4:22pm

The factorization objects (FO) returned by cholesky contain large arrays managed by the solver library (so not directly visible to Julia tooling). When an FO goes out of scope and is garbage-collected, the GC should call a finalizer which frees the underlying storage. Are you sure that your obsolete FO’s are going out of scope?

Andrea_Vigliotti · July 15, 2023, 4:35pm

I was thinking something like that was happening, but I was not aware of this peculiarity with the garbage collection of the FO objects,

in fact the memory occupancy of cholB1KtB1ff from Base.summarysize is 8 bytes, while the size in memory of the matrix is \approx 150 mb, that suggested that the decomposition was handled and kept somewhere out of the reach of julia, but thought it was still handled by GC.gc() …

can you suggest how to do it, or where could I find information about it?

Andrea_Vigliotti · July 15, 2023, 4:43pm

do you think it would make any difference if I switched to MKL instead of OpenBLAS that comes with the standard Julia installation?

Ralph_Smith · July 15, 2023, 4:56pm

Sorry if I wasn’t clear; the gc call should automatically free the memory by calling the finalizer without extra effort from you, as long as you don’t keep the cholB1KtB1ff variable (which is just a wrapped pointer) in scope. You might have been misled about the need to drop references to the FO’s by the small size visible to the Julia tools.

Finalizers are described in the Julia manual, and they are set up for the sparse array factorizations in the SharedArrays interface to the SuiteSparse library. Which BLAS library you use should not matter here.

Andrea_Vigliotti · July 15, 2023, 5:11pm

I am keeping cholB1KtB1ff in the main function because I need it multiple times, and I overwrite it when I update it, what should I do to make sure that gc() takes care of disposing the old one when a new one is made?

I thought cholB1KtB1ff was a pointer to something living elsewhere, but I thought gc could reach it even if it was not visible through Base.summarysize

lmiq · July 15, 2023, 5:12pm

I meant reusing the input buffer, with:

cholesky!(A::AbstractMatrix, NoPivot(); check = true) -> Cholesky

The same as [`cholesky`](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.cholesky), but saves space by overwriting the input `A`, instead of creating a copy. An [`InexactError`](https://docs.julialang.org/en/v1/base/base/#Core.InexactError) exception is thrown if the factorization produces a number not representable by the element type of `A`, e.g. for integer types.

Also note that A + B will create a new Matrix as a result. If you are doing this iteratively, you probably should compute C .= A .+ B and reuse C. Matrix multiplications can be done in place with mul!.

And B[:,ifree] allocates a new vector. Use, perhaps, @view(B1[:,ifree]).

I don’t know how much each of these things might make a difference there for you.

Andrea_Vigliotti · July 15, 2023, 5:23pm

I was aware of that, but the old copies should be discharged by the gc(), right? I was not too much worried about that because it takes place only a limited number of times, and it is not where the bulk of the computations lie

Ralph_Smith · July 15, 2023, 5:43pm

Although what you describe seems reasonable to me, there may be some subtlety about scope and object references in play here. Instead of doing so much in your main function, perhaps you could have an intermediate function which does the factorization of a given matrix, applies it to all the vectors associated with that problem, and then returns, thus isolating each cholB1KtB1ff. Such function barriers sometimes help the memory management.

Andrea_Vigliotti · July 15, 2023, 5:46pm

thanks, I’ll try that and I’ll post the results

danielwe · July 15, 2023, 7:43pm

Note that OP is using sparse matrices, for which different conventions apply. You can do this:

F = cholesky(A)
# use F
cholesky!(F, B)
# use F

From the original post, it sounds like this is what OP is doing (“I also store the cholesky factorization for re-use”), but this is not what’s happening in the example code provided later.

I’m not an expert here, but in general, I think such reuse is only advisable if A and B have the same sparsity pattern. However, if that’s the case, it’s highly recommended, as it not only saves allocations but also saves you from recomputing the symbolic factorization.

BLAS libraries don’t matter here. Sparse linear algebra is handled by different libraries altogether. Cholesky factorization and solving linear systems using the factorization object are both done by SuiteSparse.CHOLMOD, and matrix-vector multiplication is handled by a Julia native function within the SparseArray stdlib.

I can’t really help with the memory issue. I’ve also observed that memory will sometimes grow indefinitely in long-running code with lots of sparse arrays, so I’ve been suspecting that the wrapped memory underlying SparseMatrixCSC isn’t always properly finalized, but I haven’t made any attempt to find the bug. This issue might be relevant: Memory leak in solving ODE many times · Issue #1946 · SciML/OrdinaryDiffEq.jl · GitHub.

sgaure · July 15, 2023, 8:51pm

Is your Julia program parallel, i.e. with two tasks (via @spawn or @threads)? Julia’s gc is not very good at managing memory in such cases. Actually, the only times I’ve had problems with memory usage growing without bounds have been with allocation in multiple threads.

Andrea_Vigliotti · July 16, 2023, 7:12am

yes, the code is parallel and uses @threads in a for loop for assembling the global stiffness matrix from the elemental stiffness matrices, but no memory is allocated within the for loop, all the required arrays are allocated in advance. I will try to remove that and report what happens.

what I meant by re-using is that the same factorization is used for a few different problems before a new stiffness matrix is assembled and factorized. That is because the cost for factorization is the highest and it makes sense to keep using the same factorized matrix at the cost of taking more iterations rather than assembling and factorizing at each step (that would take one iteration only)

reusing the factorization in the sense you showed in your post proved not to be effective in my case

pjuergens · July 24, 2023, 9:47am

I’m facing the same problem, i.e. memory usage growing without bounds and I need to use multiple cores with allocations. How did you solve the problem in that case? @Andrea_Vigliotti did you test what happened when you removed parallelization?

In my case, parallelization is done via @everywhere and pmap. (and no SparseArrays)

ufechner7 · July 24, 2023, 4:21pm

I added:

if Sys.free_memory()/2^30 < 6.0
    GC.gc()
end

to large functions that are called by pmap…

I think there is a bug in the garbage collector, this should not be needed…

Created a bug report: Garbage collector not working · Issue #50658 · JuliaLang/julia · GitHub

Topic		Replies	Views
Accumulating memory in nested loops General Usage performance	10	2247	April 17, 2017
Paper for discussion: The Linear Algebra Mapping Problem Internals & Design	13	1473	November 25, 2019
Help to reduce number of allocations General Usage	23	5559	January 6, 2019
Sparse matrix-vector product: much more slow than Matlab Performance matlab , optimization	24	4639	December 20, 2017
Great performance hit on (simple) functions General Usage question	10	1406	November 7, 2016

Understanding memory usage in Julia

Related topics