Unexpected OOM errors in Julia 1.9.0 and 1.9.1 with Distributed

I have a relatively simple script that used to run fine on 1.8.x: it executes lots of relatively small jobs via pmap.
Running on Julia 1.8.5 with -p 32 creates 32 worker processes, each using approximately 2 GB, and after a few hours the calculation finishes successfully. On 1.9.0 and 1.9.1, however, after a few minutes each Julia process starts to allocate a lot of memory, and once usage reaches about 10 GB per process they all get killed by the OOM killer.
Reducing the number of tasks by making each of them process more data via an internal loop does not seem to help.
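For context, a rough sketch of the pattern (the placeholder work below is made up; the real jobs are different but similarly small):

```julia
# Launched as: julia -p 32 script.jl
using Distributed

@everywhere function run_job(i)
    data = rand(10_000, 1_000)   # placeholder for producing/loading one job's data
    return sum(data)             # placeholder for the actual computation
end

results = pmap(run_job, 1:100_000)
```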

Potentially this is related to Garbage collection not aggressive enough on Slurm Cluster, although my jobs seem to run fine on 1.8.x.


We have encountered similar issues during CI testing for Trixi.jl. What helped us in the parallel case was to add --heap-size-hint=1G to each invocation of Julia, so that the garbage collector is more aggressive.

This actually helped to resolve some issues we had with parallel runs in a CI job before with v1.8, thus maybe it will help you as well.
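For reference, a minimal illustration of what such an invocation can look like (script.jl is a placeholder; this is not the exact Trixi.jl CI setup):

```julia
# Relaunch the same entry point with a 1 GiB heap target so the GC collects sooner.
run(`$(Base.julia_cmd()) --heap-size-hint=1G script.jl`)
```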

I tried running my script with --heap-size-hint=2G, but it didn’t change anything. Perhaps this flag doesn’t propagate to the Julia worker processes that are started with the -p flag?
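One thing I could try is starting the workers from within the script via addprocs and passing the flag explicitly through exeflags instead of using -p (untested sketch):

```julia
using Distributed

# exeflags forwards extra command-line options to every worker that addprocs launches,
# so each worker process should get its own heap-size hint.
addprocs(32; exeflags="--heap-size-hint=2G")
```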

Good question. I have never used Distributed before, only MPI-based parallelism.

Explicitly calling GC.gc() at the end of each parallel job seems to have solved the problem; the script is still running after 15 min without an OOM.
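Concretely, the change amounts to forcing a full collection at the end of each job function; a minimal sketch with placeholder work:

```julia
using Distributed

@everywhere function run_job(i)
    data = rand(10_000, 1_000)   # placeholder for the real per-job work
    result = sum(data)
    GC.gc()                      # full collection before the worker picks up the next job
    return result
end

results = pmap(run_job, 1:100_000)
```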

Is this the same as OOM despite `--heap-size-hint` · Issue #50658 · JuliaLang/julia · GitHub?

Looks like it.