Hi,
So we are trying to run parallelised code on our SLURM-based cluster, but we are quickly running into large-scale memory issues. Memory use grows far beyond what a single loop iteration should allocate, which suggests that garbage collection is not running and that allocations are simply piling up. We are currently requesting 16 cores and a total of 128 GB of memory on a single node, and after about 10,500 iterations we exceeded our memory limit. Each iteration generates a 121-step vector, and these vectors are multiplied together element-wise inside the `@distributed` for loop.
Clearly we have enough memory for the computation itself, since we easily get through more than 16 iterations.
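Roughly, the workers are started inside the allocation like this (a simplified sketch, not our exact job script; the `addprocs(16)` call stands in for whatever launcher is used):

```julia
# Simplified sketch only -- the real submission script differs.
using Distributed

addprocs(16)              # assumed: one worker per allocated core on the single node
@everywhere using SpinSim # load the simulation code on every worker
```

The reduction over clusters then looks like this: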
```julia
using Distributed

# Reduce the per-cluster decay curves with an element-wise product across the workers.
function simulate(cluster::Vector{SpinSim.ClusterSpin.Cluster}, 𝜏::StepRangeLen)
    decay = @distributed (.*) for clust in cluster
        simulate(clust, 𝜏)   # per-cluster 121-point decay vector
    end
    return decay
end
```
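For scale, a single top-level call looks something like this (the time span and the `cluster` vector below are placeholders; only the 121-point length is taken from the description above):

```julia
# Hypothetical driver values -- the actual time axis and cluster construction differ.
𝜏 = range(0.0, 1.0e-3; length = 121)   # 121-step time axis
decay = simulate(cluster, 𝜏)            # cluster is the >10,500-element Vector{Cluster}
```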
Solutions we have tried:
- We have tried forcing GC at random intervals.
- We tried forcing GC whenever the memory allocation exceeded a specific value (roughly as in the sketch below).
- We tried Julia 1.9.0-beta4 with the `--heap-size-hint` option.
None of these seemed to make a difference.
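The threshold-based attempt looked roughly like the sketch below; the use of `Base.gc_live_bytes()` and the 6 GiB cutoff are illustrative, not our exact code:

```julia
# Sketch of the "force GC above a memory threshold" attempt (illustrative values).
using Distributed

function simulate(cluster::Vector{SpinSim.ClusterSpin.Cluster}, 𝜏::StepRangeLen)
    decay = @distributed (.*) for clust in cluster
        d = simulate(clust, 𝜏)
        # Force a full collection on this worker once its live heap passes ~6 GiB.
        if Base.gc_live_bytes() > 6 * 2^30
            GC.gc()
        end
        d   # last expression is the value fed into the (.*) reduction
    end
    return decay
end
```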
- Does GC work normally on HPC and SLURM clusters?
- Any help would be greatly appreciated, as we have hit a dead end.