Unexpected OOM errors in Julia 1.9.0 and 1.9.1 with Distributed

vfonov · June 9, 2023, 12:53pm

I have a relatively simple script that used to run fine on 1.8.x: it executes many relatively small jobs via pmap.
On Julia 1.8.5 with -p 32, it creates 32 processes, each using approximately 2 GB, and the calculation finishes successfully after a few hours. On 1.9.0 and 1.9.1, however, each Julia process starts allocating a lot of memory after a few minutes, and once each reaches about 10 GB, all of them get killed by the oom-killer.
Reducing the number of tasks by having each one process more data in an internal loop does not seem to help.

This may be related to garbage collection not being aggressive enough on a Slurm cluster, although my jobs run fine on 1.8.x.

sloede · June 9, 2023, 2:26pm

We have encountered similar issues during CI testing for Trixi.jl. What helped us in the parallel case was adding --heap-size-hint=1G to each invocation of Julia, so that the garbage collector is more aggressive:

trixi-framework/Trixi.jl/blob/c47b6f6ae038535d04318c3294ff3f4a4cc41d11/test/runtests.jl#L30

            # To reduce their impact, we do not test MPI with coverage on Windows.
            # This reduces the chance to hit a spurious test failure by one half.
            # In addition, it looks like the Linux GitHub runners run out of memory during the 3D tests
            # with coverage, so we currently do not test MPI with coverage on Linux. For more details,
            # see the discussion at https://github.com/trixi-framework/Trixi.jl/pull/1062#issuecomment-1035901020
            cmd = string(Base.julia_cmd())
            coverage = occursin("--code-coverage", cmd) && !occursin("--code-coverage=none", cmd)
            if !(coverage && Sys.iswindows()) && !(coverage && Sys.islinux())
              # We provide a `--heap-size-hint` to avoid/reduce out-of-memory errors during CI testing
              mpiexec() do cmd
                run(`$cmd -n $TRIXI_MPI_NPROCS $(Base.julia_cmd()) --threads=1 --check-bounds=yes --heap-size-hint=1G $(abspath("test_mpi.jl"))`)
              end
            end
          end
          
          
@time if TRIXI_TEST == "all" || TRIXI_TEST == "threaded" || TRIXI_TEST == "threaded_legacy"
            # Do a dummy `@test true`:
            # If the process errors out the testset would error out as well,
            # cf. https://github.com/JuliaParallel/MPI.jl/pull/391
            @test true

This actually helped to resolve some issues we had with parallel runs in a CI job before with v1.8, thus maybe it will help you as well.

vfonov · June 9, 2023, 2:38pm

I tried running my script with --heap-size-hint=2G, but it didn't change anything. Perhaps this flag doesn't propagate to the Julia processes that are started with the -p flag?
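
One way to rule out the propagation question is to start the workers from inside the script and pass the flag to them explicitly. The thread does not confirm whether -p forwards --heap-size-hint, so the following is only a minimal sketch, assuming Julia 1.9 or later and a script launched without -p; the worker count and the 2G value are illustrative, not recommendations.

```julia
using Distributed

# Assumption: 2 workers and a 2G hint are placeholder values for illustration.
# `exeflags` passes extra command-line flags to every worker that `addprocs` starts.
pids = addprocs(2; exeflags="--heap-size-hint=2G")

# Confirm the workers are up before handing them any work.
println("started workers: ", pids)
println("number of workers: ", nworkers())
```

Because the flag is given to addprocs directly, each worker receives it regardless of how the main process was started.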

sloede · June 9, 2023, 2:45pm

Good question. I have never used Distributed before, only MPI-based parallelism.

vfonov · June 9, 2023, 3:06pm

Explicitly calling GC.gc() at the end of each parallel job seems to have solved the problem: still running after 15 min without OOM.
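
The workaround can be sketched roughly as follows. The original script is not shown in the thread, so `process_item` and `run_job` are hypothetical names standing in for the real per-job work, and the worker count and input sizes are placeholders.

```julia
using Distributed

addprocs(2)  # illustrative worker count; the thread used -p 32

# Hypothetical stand-in for the real per-job computation.
@everywhere function process_item(n)
    data = rand(n, n)        # temporary allocation that becomes garbage afterwards
    return sum(abs2, data)
end

# Wrap each job so that a full collection runs on the worker when the job ends,
# even if the job throws an error.
@everywhere function run_job(n)
    try
        return process_item(n)
    finally
        GC.gc()
    end
end

results = pmap(run_job, fill(200, 8))
println("finished ", length(results), " jobs")
```

A full GC.gc() after every job adds some overhead per job, so with very small jobs it may be worth collecting only every few jobs instead.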

nathan · September 15, 2023, 8:36pm

Is this the same as OOM despite `--heap-size-hint` · Issue #50658 · JuliaLang/julia · GitHub ?

vfonov · September 28, 2023, 2:38am

Looks like it.
