# Understanding memory usage in Julia

I am working on a finite element program that solves a non linear problem by incrementing the boundary conditions, that means that from time to time I need to assemble some quite large stiffness matrices (\approx 350k \times 350k) and solve linear algebra problems.

I am surprised of how much memory the program actually uses compared to what I expected and I am trying to understand why that is happening.

Before starting the main loop of the program, after the mesh has been read from the input file and all necessary data structure have been created I check the memory occupancy of all my variables with the @show_local macro by @cgeoga from this post and I get a total usage of about 1.5gb, which is what I expected for a 113k elements model, but if I check the output from pmap on that process it says 30gb!

But thatâ€™s not all, as the calculations proceed the memory usage from pmap grows seemingly out of control reaching in excess of 100gb after two hundred increments, when it gets killed by OS

During the main loop the program needs to update a few stiffness matrices and solve linear problems, but everything happens inside functions, so temporary variables should be taken care of by the garbage collection, as no additional matrix or vectors are created but the existing updated.

I tried to force garbage collection invoking GC.gc() at the end of every load increment, but things didnâ€™t change.

I am using SparseArrays to store the stiffness matrix and LienearAlgebra.cholesky factorization to solve the linear algebra problems, I also store the cholesky factorization for re-use.

Everything runs on a node of a cluster, and I asked for two cores and 96gb,

can anyone help to understand what is happening and is it possible to contain the memory usage of a program of this type?

2 Likes

Are you using that function or the in-place version? This one

I think what I am doing is equivalent to the one you pointed out, this is what I am doing:

cholB1KtB1ff = let
Ktu      = makeKt(Î¦, elems, u, d)
B1KtB1ff = Symmetric(transpose(B1[:,ifree])*(Î»T/Î”t^2*MM+Î»V*Ktu)*B1[:,ifree])
cholesky(B1KtB1ff)
end


this should mean that neither Ktu nor B1KtB1ff are kept, and memory should be freed after the end, is that correct ?

would this be any different?

cholB1KtB1ff = let
Ktu      = makeKt(Î¦, elems, u, d)
cholesky!(Symmetric(transpose(B1[:,ifree])*(Î»T/Î”t^2*MM+Î»V*Ktu)*B1[:,ifree]))
end


where MM has the same dimensions of Ktu, is assembled once and never changes, and the other are scalars.

If you want specific help it would probably be best if you included some self-contained code. As far as general tips go, check out the Performance tips section in the manual. In particular, try doing allocation profiling (very convenient in the (VS)Code editor, just do @profview_allocs some_call()).

The latest Julia release also has a feature, seemingly still not documented, to trigger frequent garbage collection after a certain threshold. For example: julia --heap-size-hint=10G.

If I remember correctly, sometimes it was necessary to invoke this call multiple times in a row.

No, Julia isnâ€™t C++. The memory gets freed when the garbage collector decides so.

1 Like

I think I am following all of the guidelines of Performance tips, especially when it comes to enclosing code in functions, reusing vectors, and not having globals, I will look at @profview_allocs some_call()) to see if I can get a better understanding of what is happening

About invoking GC.gc() multiple times, at the moment it is done at every increment of the boundary conditions, which is more often than the stiffness matrix factorization gets updated. But still the memory footprint of the program is from 50 to 100 times the expected

Just remember to run with @profview_allocs an additional time before doing it for real, to get everything to compile.

1 Like

The factorization objects (FO) returned by cholesky contain large arrays managed by the solver library (so not directly visible to Julia tooling). When an FO goes out of scope and is garbage-collected, the GC should call a finalizer which frees the underlying storage. Are you sure that your obsolete FOâ€™s are going out of scope?

1 Like

I was thinking something like that was happening, but I was not aware of this peculiarity with the garbage collection of the FO objects,

in fact the memory occupancy of cholB1KtB1ff  from Base.summarysize is 8 bytes, while the size in memory of the matrix is \approx 150 mb, that suggested that the decomposition was handled and kept somewhere out of the reach of julia, but thought it was still handled by GC.gc() â€¦

can you suggest how to do it, or where could I find information about it?

do you think it would make any difference if I switched to MKL instead of OpenBLAS that comes with the standard Julia installation?

Sorry if I wasnâ€™t clear; the gc call should automatically free the memory by calling the finalizer without extra effort from you, as long as you donâ€™t keep the cholB1KtB1ff variable (which is just a wrapped pointer) in scope. You might have been misled about the need to drop references to the FOâ€™s by the small size visible to the Julia tools.

Finalizers are described in the Julia manual, and they are set up for the sparse array factorizations in the SharedArrays interface to the SuiteSparse library. Which BLAS library you use should not matter here.

1 Like

I am keeping cholB1KtB1ff in the main function because I need it multiple times, and I overwrite it when I update it, what should I do to make sure that gc() takes care of disposing the old one when a new one is made?

I thought cholB1KtB1ff was a pointer to something living elsewhere, but I thought gc could reach it even if it was not visible through Base.summarysize

I meant reusing the input buffer, with:

cholesky!(A::AbstractMatrix, NoPivot(); check = true) -> Cholesky

The same as [cholesky](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.cholesky), but saves space by overwriting the input A, instead of creating a copy. An [InexactError](https://docs.julialang.org/en/v1/base/base/#Core.InexactError) exception is thrown if the factorization produces a number not representable by the element type of A, e.g. for integer types.


Also note that A + B will create a new Matrix as a result. If you are doing this iteratively, you probably should compute C .= A .+ B and reuse C. Matrix multiplications can be done in place with mul!.

And B[:,ifree] allocates a new vector. Use, perhaps, @view(B1[:,ifree]).

I donâ€™t know how much each of these things might make a difference there for you.

1 Like

I was aware of that, but the old copies should be discharged by the gc(), right? I was not too much worried about that because it takes place only a limited number of times, and it is not where the bulk of the computations lie

Although what you describe seems reasonable to me, there may be some subtlety about scope and object references in play here. Instead of doing so much in your main function, perhaps you could have an intermediate function which does the factorization of a given matrix, applies it to all the vectors associated with that problem, and then returns, thus isolating each cholB1KtB1ff. Such function barriers sometimes help the memory management.

2 Likes

thanks, Iâ€™ll try that and Iâ€™ll post the results

Note that OP is using sparse matrices, for which different conventions apply. You can do this:

F = cholesky(A)
# use F
cholesky!(F, B)
# use F


From the original post, it sounds like this is what OP is doing (â€śI also store the cholesky factorization for re-useâ€ť), but this is not whatâ€™s happening in the example code provided later.

Iâ€™m not an expert here, but in general, I think such reuse is only advisable if A and B have the same sparsity pattern. However, if thatâ€™s the case, itâ€™s highly recommended, as it not only saves allocations but also saves you from recomputing the symbolic factorization.

BLAS libraries donâ€™t matter here. Sparse linear algebra is handled by different libraries altogether. Cholesky factorization and solving linear systems using the factorization object are both done by SuiteSparse.CHOLMOD, and matrix-vector multiplication is handled by a Julia native function within the SparseArray stdlib.

I canâ€™t really help with the memory issue. Iâ€™ve also observed that memory will sometimes grow indefinitely in long-running code with lots of sparse arrays, so Iâ€™ve been suspecting that the wrapped memory underlying SparseMatrixCSC isnâ€™t always properly finalized, but I havenâ€™t made any attempt to find the bug. This issue might be relevant: Memory leak in solving ODE many times Â· Issue #1946 Â· SciML/OrdinaryDiffEq.jl Â· GitHub.

2 Likes

Is your Julia program parallel, i.e. with two tasks (via @spawn or @threads)? Juliaâ€™s gc is not very good at managing memory in such cases. Actually, the only times Iâ€™ve had problems with memory usage growing without bounds have been with allocation in multiple threads.

1 Like

yes, the code is parallel and uses @threads in a for loop for assembling the global stiffness matrix from the elemental stiffness matrices, but no memory is allocated within the for loop, all the required arrays are allocated in advance. I will try to remove that and report what happens.

what I meant by re-using is that the same factorization is used for a few different problems before a new stiffness matrix is assembled and factorized. That is because the cost for factorization is the highest and it makes sense to keep using the same factorized matrix at the cost of taking more iterations rather than assembling and factorizing at each step (that would take one iteration only)

reusing the factorization in the sense you showed in your post proved not to be effective in my case

Iâ€™m facing the same problem, i.e. memory usage growing without bounds and I need to use multiple cores with allocations. How did you solve the problem in that case? @Andrea_Vigliotti did you test what happened when you removed parallelization?

In my case, parallelization is done via @everywhere and pmap. (and no SparseArrays)

if Sys.free_memory()/2^30 < 6.0