While attempting to train some neural differential equations on an HPC cluster, I’ve been getting unexpected OOM errors from Slurm. Here’s an MWE:
```julia
GC.enable_logging(true)

# Uncomment these lines if running in a Docker container
# using Pkg
# Pkg.instantiate()

using OrdinaryDiffEq, SciMLSensitivity, Zygote

function rhs(u, p, t)
    θ, ω = u
    return [ω, -p * sin(θ)]
end

u0 = ones(2)
tspan = (0.0, 0.1)
p = ones(10000)  # Dummy params
prob = ODEProblem(rhs, u0, tspan, p)

while true
    gradients = Zygote.gradient(Zygote.Params([p])) do
        sol = solve(prob, Tsit5())
        return sol
    end
end
```
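For anyone trying to reproduce this, here’s a rough sketch of the per-iteration instrumentation I find useful alongside the GC logs. It uses only Base (no ODE solve); the `rand` call is just a placeholder for the gradient/solve workload, not part of the actual MWE:

```julia
# Sketch: log live heap vs. process RSS each iteration (Base only).
# The allocation below is a stand-in for the solve/gradient call.
function track_memory(niters)
    for i in 1:niters
        buf = rand(10_000)            # placeholder allocation-heavy work
        live = Base.gc_live_bytes()   # bytes the GC currently believes are live
        rss  = Sys.maxrss()           # peak resident set size of the process
        println("iter $i: live = $(live ÷ 1024) KiB, maxrss = $(rss ÷ 1024) KiB")
    end
end

track_memory(3)
```

On the cluster, `maxrss` climbs steadily even though `gc_live_bytes` stays comparatively flat, which is what makes me suspect the freed memory isn’t being returned to the OS.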
Here’s the same script in a GitHub repo along with a Julia env: GitHub - white-alistair/MemoryLeak.jl
I can observe from the GC logging (see example log file in the Git repo) that the amount of memory collected increases slightly each time the GC is called. At the same time, the memory usage of the process is also increasing steadily. This continues until we run out of memory.
I tried to reproduce the problem locally in a Docker container with the same memory limit as the default on our cluster, but in this case the GC appears to behave differently. Initially, we observe the same gradual increase in memory usage as we see on the cluster. However, just as the process is approaching the memory limit, the GC appears to kick in much more aggressively and avoids the OOM error completely. This is how the stats look when that happens (after which memory usage remains stable):
```
CONTAINER ID   NAME              CPU %     MEM USAGE / LIMIT     MEM %     NET I/O         BLOCK I/O    PIDS
ea1e88659be6   competent_raman   100.15%   3.218GiB / 3.418GiB   94.16%    105MB / 985kB   0B / 578MB   5
```
If my jobs on the cluster were garbage collected in the same way as the jobs in the Docker container, everything would be fine!
Other things I’ve tried:
- Increasing the memory limit on Slurm, up to 8 GB, but we still run out of memory eventually.
- Peppering my code with `GC.gc(false)`, and even `ccall(:malloc_trim, Cvoid, (Cint,), 0)`, but none of it seemed to make a difference.
- Decreasing the number of parameters to 1,000, and then increasing it to 100,000, neither of which appears to reproduce the problematic memory usage…?
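For concreteness, this is roughly the pattern I used when peppering the code with manual collections (a sketch only; the `rand` call is a placeholder for the actual gradient/solve workload):

```julia
# Sketch of the manual-collection pattern I tried in the training loop.
for i in 1:100
    buf = sum(rand(100_000))   # placeholder for the allocation-heavy gradient call
    GC.gc(false)               # incremental collection (young generation only)
    if Sys.islinux()
        # Ask glibc to return freed heap pages to the OS (Linux/glibc only)
        ccall(:malloc_trim, Cvoid, (Cint,), 0)
    end
end
```

The `malloc_trim` call was an attempt to rule out glibc holding onto freed pages, but as noted above, neither it nor the incremental collections changed the memory growth on the cluster.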
Version info on the cluster:
```
Julia Version 1.8.2
Commit 36034abf260 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, haswell)
  Threads: 1 on 16 virtual cores
Environment:
  LD_LIBRARY_PATH = /p/system/packages/julia/1.8.2/lib
  JULIA_ROOT = /p/system/packages/julia/1.8.2
```
Version info on my machine:
```
Julia Version 1.8.2
Commit 36034abf260 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
  Threads: 1 on 4 virtual cores
Environment:
  JULIA_GPG = 3673DF529D9049477F76B37566E3C7DC03D6E495
  JULIA_PATH = /usr/local/julia
  JULIA_VERSION = 1.8.2
```
I would be very grateful for any help understanding and fixing this.