First time posting, but I want to add that I have the same issue, and I have tried many things. I use no additional libraries apart from `Distributed`.
My code is not that sophisticated: it really consists of a function that loops N times with different starting parameters and then returns some values. So I use parallel code to scan the parameters, and that is where I ran into this problem.
As for parallel code using `Distributed`: I have tried `@distributed` for loops and `pmap()` (a simplified sketch of the structure is below).
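Roughly like this, where `simulate`, the dummy work inside it, and the parameter range are placeholders rather than my actual code:

```julia
using Distributed
addprocs(4)  # on the cluster this matches the allocated CPUs

# Hypothetical stand-in for my real function: takes a starting
# parameter, iterates, and returns some values.
@everywhere function simulate(p)
    x = p + 1.0
    for _ in 1:10^5
        x = 0.5 * (x + p / x)  # dummy work
    end
    return (param = p, result = x)
end

params = range(0.1, 10.0; length = 10_000)

# The two flavors I tried:
results = pmap(simulate, params)                 # pmap version

results2 = @distributed (vcat) for p in params   # @distributed version
    [simulate(p)]
end
```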
I have also tried separate parallel Julia instances using SBATCH in SLURM, on one node, without the use of `Distributed` (a sketch of that setup is below).
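The SLURM variant was along these lines; this is a sketch, not my exact script, and the array size, memory, and the `scan_chunk.jl` file name are made up for illustration:

```bash
#!/bin/bash
#SBATCH --job-name=paramscan
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --array=1-32

# Each array task is an independent Julia instance working on its
# own slice of the parameter grid, so Distributed is not involved.
julia scan_chunk.jl $SLURM_ARRAY_TASK_ID
```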
With either approach the function runs, and I can watch the free memory steadily drop until the processes run out of memory. I have not had this problem on a laptop or desktop computer.
I can run this on the cluster with fewer iterations, as the program will end before the free memory depletes completely.
I have tried other approaches as well: when using `pmap()` I was adding `GC.safepoint()` at each iteration and running `GC.gc()` every 100,000 iterations. This seemed to help a little, but the code still threw an OOM after 2 days. (The idea for `GC.safepoint()` came from "Poor performance of garbage collection in multi-threaded application".)
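Concretely, the mitigation looked roughly like this (again a sketch; the inner computation and the chunk sizes are placeholders):

```julia
using Distributed
addprocs(4)

@everywhere function simulate_chunk(params)
    results = zeros(length(params))
    for (i, p) in enumerate(params)
        # placeholder for the real per-parameter computation
        results[i] = sum(sin, range(0.0, p; length = 1_000))
        GC.safepoint()        # give the GC a chance to run at each iteration
        if i % 100_000 == 0
            GC.gc()           # force a full collection every 100,000 iterations
        end
    end
    return results
end

chunks = [rand(200_000) for _ in 1:8]
results = pmap(simulate_chunk, chunks)
```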
I have hypothesized a little (although I have no evidence), and I have landed on basically what @sevi writes. Could the problem be that the GC does not recognize the other processes on the node, and thus it will not collect garbage, as it assumes it has many times more free memory than it really has?
If I read the discussion on GitHub about the command line flag correctly, a hard limit was discussed, but as it stands now `--heap-size-hint` seemed to be the best way. I do find it interesting, then, that OP mentions trying `--heap-size-hint` without success. It seems from the discussion that Julia determines the available memory by reading the hardware. Maybe an artificial limit, i.e. changing what Julia sees as the total memory, would be a solution, although I do not know whether this is feasible.
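One detail that might matter (an assumption on my part, based on how `addprocs` spawns workers): with `Distributed`, passing `--heap-size-hint` to the main process alone may not affect the workers, since they are separate Julia processes; it would have to be forwarded via `exeflags`, e.g. (sizes made up):

```julia
using Distributed

# Main process limited from the CLI:  julia --heap-size-hint=8G script.jl
# Workers are spawned as separate julia processes, so the flag is
# passed explicitly when adding them:
addprocs(16; exeflags = "--heap-size-hint=2G")
```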
Sorry for the long post :))