Garbage collection not triggering on SLURM cluster

Hi,

So we are trying to run parallelised code on our SLURM-based cluster, but we are quickly running into large-scale memory issues. The memory utilisation is far more than a single loop iteration should generate, which suggests that garbage collection is not running and that memory is simply piling up. Currently we are requesting 16 cores and a total of 128 GB of memory on a single node. After roughly 10500 iterations we exceeded our memory limit. Each iteration generates a 121-step vector, and these are multiplied together inside the @distributed for loop.

Clearly we have enough memory for the work itself, since we can easily complete more than 16 iterations.

function simulate(cluster::Vector{SpinSim.ClusterSpin.Cluster}, 𝜏::StepRangeLen)
    decay = @distributed (.*) for clust in cluster
        simulate(clust, 𝜏)
    end
    return decay
end

Solutions we have tried:

  1. We tried forcing GC at random intervals.
  2. We tried forcing GC whenever the memory allocation had exceeded a specific value.
  3. We tried Julia 1.9.0-beta4 with the --heap-size-hint option.

None of these seemed to make a difference.
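One thing worth double-checking when forcing GC: a plain GC.gc() only collects on the process that calls it. If the allocations happen on Distributed workers, the call has to run on every process, e.g. something along these lines (a sketch, not your actual setup):

```julia
using Distributed

# Sketch: trigger a collection on every process, not just the one
# calling GC.gc(). @everywhere runs the expression on the main
# process and on all workers.
function gc_everywhere()
    @everywhere GC.gc()
end
```

If you were calling GC.gc() only from the main process, the workers (where the @distributed loop body allocates) would never have collected.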

  • Does gc work normally on HPC and Slurm clusters?
  • Any help solving would be greatly appreciated as we have hit a dead end.

We experienced the same issue, does your ‘simulate’ function use JuMP by any chance?

No, we don’t use JuMP. The simulate function just implements a density-operator formalism, so it is essentially a bunch of matrix exponentials and matrix multiplications. No special packages are used here.

I’d be really curious to hear about possible solutions as well (I’ve run into a similar problem in the past). To my current understanding, the problem is that Julia “sees” the memory of the whole node, but not the limit that Slurm enforces (might be the wrong interpretation though).

The only solutions I found so far are to either specify more memory (e.g. allocate a whole node for the job) or reduce the allocations in the program.
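On the --heap-size-hint attempt specifically: if the workers are started with addprocs, the flag set on the main process does not propagate to them, so the workers still see the whole node's memory. You can pass the flag to each worker via exeflags, e.g. (a sketch; the numbers are illustrative, chosen so 16 workers stay under a 128 GB allocation):

```julia
using Distributed

# Sketch: give each worker its own heap-size hint so the combined
# hints fit inside the Slurm allocation. The hint given to the main
# process via the command line does not propagate to workers.
addprocs(16; exeflags="--heap-size-hint=7G")
```

With SlurmClusterManager or similar, there is usually an equivalent way to pass extra Julia flags to the spawned workers.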

If you just do matrix operations, it might be worth trying to use as many non-allocating functions as possible, which might also speed up the simulation in general.
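As a concrete illustration of the non-allocating style (with made-up matrices, not the actual simulation code): preallocate buffers once and reuse them with in-place operations like mul! and fused .= broadcasts, instead of allocating a fresh result every iteration.

```julia
using LinearAlgebra

# Sketch: reuse preallocated buffers instead of allocating new
# matrices on every loop iteration.
A = rand(121, 121)
B = rand(121, 121)
C = similar(A)        # preallocated output buffer

mul!(C, A, B)         # in-place multiply: C = A * B, no new allocation
C .= C .* 2.0         # fused in-place broadcast instead of C = C * 2.0
```

In a hot loop this removes most of the per-iteration garbage, which both eases GC pressure and usually speeds things up.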


Here are some related discussions/issues that might be helpful: