Memory Leak Using OrdinaryDiffEq and Zygote on an HPC Cluster

While attempting to train some neural differential equations on an HPC cluster, I've been getting unexpected OOM errors from Slurm. Here's an MWE:


# Uncomment these lines if running in a Docker container
# using Pkg
# Pkg.instantiate()

using OrdinaryDiffEq, SciMLSensitivity, Zygote

function rhs(u, p, t)
    θ, ω = u
    return [ω, -p[1] * sin(θ)]
end

u0 = ones(2)
tspan = (0.0, 0.1)
p = ones(10000)  # Dummy params

prob = ODEProblem(rhs, u0, tspan, p)

while true
    gradients = Zygote.gradient(Zygote.Params([p])) do
        sol = solve(prob, Tsit5())
        return sol[1][1]
    end
end

Here’s the same script in a GitHub repo along with a Julia env: GitHub - white-alistair/MemoryLeak.jl

I can observe from the GC logging (see example log file in the Git repo) that the amount of memory collected increases slightly each time the GC is called. At the same time, the memory usage of the process is also increasing steadily. This continues until we run out of memory.
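For anyone wanting to reproduce the GC logging mentioned above, it can be turned on with `GC.enable_logging` (available since Julia 1.8), which prints a line for every collection showing how much memory was reclaimed:

```julia
# Print a log line on every garbage collection (Julia >= 1.8).
GC.enable_logging(true)

# ... run the training loop here ...

# Force a collection to see a log line immediately.
GC.gc()

GC.enable_logging(false)
```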

I tried to reproduce the problem locally in a Docker container with the same memory limit as the default on our cluster, but in this case the GC appears to behave differently. Initially, we observe the same gradual increase in memory usage as we see on the cluster. However, just as the process is approaching the memory limit, the GC appears to kick in much more aggressively and avoids the OOM error completely. This is how the stats look when that happens (after which memory usage remains stable):

CONTAINER ID   NAME              CPU %     MEM USAGE / LIMIT     MEM %     NET I/O         BLOCK I/O    PIDS
ea1e88659be6   competent_raman   100.15%   3.218GiB / 3.418GiB   94.16%    105MB / 985kB   0B / 578MB   5

If my jobs on the cluster were garbage collected in the same way as the jobs in the Docker container, everything would be fine!

Other things I've tried:

  1. Increasing the memory limit on Slurm, up to 8 GB, but we still run out of memory eventually.
  2. Peppering my code with GC.gc(true), GC.gc(false), and even ccall(:malloc_trim, Cvoid, (Cint,), 0), but none of it seemed to make a difference.
  3. Decreasing the number of parameters to 1,000, and then increasing it to 100,000, neither of which appears to reproduce the problematic memory usage…?
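For item 2, this is roughly how the loop body looked with the manual GC calls added (reusing `prob` and `p` from the MWE above; the `malloc_trim` ccall assumes Linux/glibc and asks the allocator to return freed heap pages to the OS):

```julia
while true
    gradients = Zygote.gradient(Zygote.Params([p])) do
        sol = solve(prob, Tsit5())
        return sol[1][1]
    end
    GC.gc(true)                             # force a full collection
    GC.gc(false)                            # incremental collection
    ccall(:malloc_trim, Cvoid, (Cint,), 0)  # release freed pages back to the OS (glibc only)
end
```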

Version info on the cluster:

Julia Version 1.8.2
Commit 36034abf260 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, haswell)
  Threads: 1 on 16 virtual cores
  LD_LIBRARY_PATH = /p/system/packages/julia/1.8.2/lib
  JULIA_ROOT = /p/system/packages/julia/1.8.2

Version info on my machine:

Julia Version 1.8.2
Commit 36034abf260 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
  Threads: 1 on 4 virtual cores
  JULIA_GPG = 3673DF529D9049477F76B37566E3C7DC03D6E495
  JULIA_PATH = /usr/local/julia

I would be very grateful for any help understanding and fixing this.
