Garbage collection not triggering on SLURM cluster

Hi,

So we are trying to run parallelised code on our SLURM-based cluster, but we are quickly running into large-scale memory issues. The memory utilisation is much higher than a single loop iteration should generate, which suggests that garbage collection is not running and that allocations are simply piling up. We are currently requesting 16 cores and a total of 128 GB of memory on a single node; after roughly 10,500 iterations we exceeded our memory limit. Each iteration generates a 121-step vector, and these vectors are multiplied together elementwise inside the @distributed for loop.

Clearly we have enough memory for the computation itself, as we easily complete far more than 16 iterations before hitting the limit.

using Distributed

function simulate(cluster::Vector{SpinSim.ClusterSpin.Cluster}, 𝜏::StepRangeLen)
    # Each worker simulates one cluster; the (.*) reducer multiplies the
    # resulting decay vectors together elementwise.
    decay = @distributed (.*) for clust in cluster
        simulate(clust, 𝜏)
    end
    return decay
end

Solutions we have tried:

  1. We tried forcing GC at random intervals.
  2. We tried forcing GC whenever the memory allocation had exceeded a specific value (roughly as sketched below).
  3. We tried 1.9.0-beta4 with the --heap-size-hint option.

None of these seemed to make a difference.
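
For reference, item 2 looked roughly like this (the 64 GB threshold and the helper name are placeholders rather than our exact code):

using Distributed

const GC_THRESHOLD_BYTES = 64 * 2^30   # e.g. half of the 128 GB we request from SLURM

# Run a full collection once the Julia heap has grown past the threshold.
function maybe_gc()
    if Base.gc_live_bytes() > GC_THRESHOLD_BYTES
        GC.gc()
    end
end

maybe_gc() was defined with @everywhere and called at the end of each iteration of the @distributed loop on every worker.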

  • Does GC work normally on HPC and SLURM clusters?
  • Any help would be greatly appreciated, as we have hit a dead end.

We experienced the same issue. Does your ‘simulate’ function use JuMP, by any chance?


No, we don’t use JuMP. The simulate function just implements a density-operator formalism, so it is essentially a bunch of matrix exponentials and matrix multiplications. No special packages are used here.


I’d be really curious to hear about possible solutions as well (I’ve run into a similar problem in the past). As far as I currently understand, the problem is that Julia “sees” the memory of the whole node but not the limit that SLURM enforces (though this might be the wrong interpretation).

The only solutions I found so far are to either specify more memory (e.g. allocate a whole node for the job) or reduce the allocations in the program.

If you just do matrix operations, it might be worth using as many non-allocating (in-place) functions as possible, which might also speed up the simulation in general; see the sketch below.
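
For example, something along these lines (the sizes and the update rule are made up just to show the pattern): write products into preallocated buffers with mul! and do broadcasts in place.

using LinearAlgebra

n = 64
A = randn(ComplexF64, n, n)
B = randn(ComplexF64, n, n)
C = similar(A)                  # output buffer, allocated once and reused

for step in 1:1_000
    mul!(C, A, B)               # in-place matrix product: no new matrix per iteration
    @. A = 0.5 * A + 0.5 * C    # fused in-place broadcast instead of A = 0.5A + 0.5C
end

Matrix exponentials (exp) will still allocate, but keeping the multiplications and broadcasts in place already removes a lot of garbage.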


There are also some related discussions and GitHub issues that might be helpful.


First time posting, but I want to add that I have the same issue, and I have tried many things. I use no additional libraries apart from Distributed.

My code is not that sophisticated: it essentially consists of a function that loops N times with different starting parameters and then returns some values. I use parallel code to scan the parameters, and that is where I ran into this problem.

For the parallel code I have tried both @distributed for loops and pmap() from Distributed. I have also tried running separate parallel Julia instances on one node via SBATCH in SLURM, without using Distributed at all. With every approach the function runs, and I can watch the free memory steadily drop until the processes run out of memory. I have not had this problem on a laptop or a desktop computer.
I can run the code on the cluster with fewer iterations, since the program then finishes before the free memory is completely depleted.

I have tried other approaches as well: when using pmap() I added GC.safepoint() at each iteration and ran GC.gc() every 100,000 iterations. This seemed to help a little, but the code still hit an OOM error after two days. (The idea for GC.safepoint() came from Poor performance of garbage collection in multi-threaded application.)
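
Concretely, the pmap() version looked roughly like this (scan_point and parameter_grid stand in for my actual function and parameter list):

using Distributed

@everywhere function run_one(i, params)
    result = scan_point(params)   # the actual simulation for one parameter set
    GC.safepoint()                # give the GC a chance to run on this worker
    if i % 100_000 == 0
        GC.gc()                   # full collection every 100,000 iterations
    end
    return result
end

results = pmap(i -> run_one(i, parameter_grid[i]), eachindex(parameter_grid))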

I have speculated a little (although I have no evidence) and have landed on basically what @sevi writes: could the problem be that the GC does not recognize the other processes, and therefore does not collect garbage because it assumes it has many times more free memory than it really has?

If I read the GitHub discussion about the command-line flag correctly, a hard limit was discussed, but as it stands --heap-size-hint seems to be the best option. I do find it interesting, then, that the OP mentions trying --heap-size-hint without success. From that discussion it seems that Julia determines the available memory by reading the hardware. Maybe an artificial limit, i.e. changing what Julia sees as the total memory, would be a solution, although I do not know whether that is feasible.
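
If someone wants to experiment with that last idea: when a job requests memory with --mem, SLURM usually exports SLURM_MEM_PER_NODE (in MB), so the driver script could in principle derive a heap-size hint from it and pass it to the workers. This is an untested sketch, and the 80% headroom factor is arbitrary:

using Distributed

mem_mb  = parse(Int, get(ENV, "SLURM_MEM_PER_NODE", "131072"))   # fall back to 128 GB
hint_mb = floor(Int, 0.8 * mem_mb)                               # leave ~20% headroom

# Workers pick up the hint via their command-line flags; the master process
# itself still needs --heap-size-hint on its own julia command line.
addprocs(16; exeflags = "--heap-size-hint=$(hint_mb)M")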

Sorry for the long post :))


Sorry to resurrect an old thread, but we have also run into this issue with JuMP. Did you find any solutions?

I encountered a similar issue with Julia tasks being OOM killed on a SLURM cluster.

It appears that limiting memory usage with the --heap-size-hint flag has resolved the problem (I am still running tests as I write this).

In response to @HKaras’ experience, I’ve noticed that --heap-size-hint must be set lower than the memory requested in the SLURM job to prevent OOM kills. For instance, in my test SLURM job requesting 128 GB of memory, I set --heap-size-hint to 100 GB. After running for an hour the code is still going, whereas previously it would be killed within 15 minutes (either before I used --heap-size-hint at all, or when the --heap-size-hint value was set equal to the full 128 GB).

For reference, I’m using Julia version 1.9.4.
