Memory Leak Using OrdinaryDiffEq and Zygote on an HPC Cluster

While attempting to train some neural differential equations on an HPC cluster, I’ve been getting unexpected OOM errors from Slurm. Here’s an MWE:

GC.enable_logging(true)

# Uncomment these lines if running in a Docker container
# using Pkg
# Pkg.instantiate()

using OrdinaryDiffEq, SciMLSensitivity, Zygote

function rhs(u, p, t)
    θ, ω = u
    return [ω, -p[1] * sin(θ)]
end

u0 = ones(2)
tspan = (0.0, 0.1)
p = ones(10000)  # Dummy params

prob = ODEProblem(rhs, u0, tspan, p)

# Repeatedly differentiate through the solve; memory use grows with each iteration
while true
    gradients = Zygote.gradient(Zygote.Params([p])) do
        sol = solve(prob, Tsit5())
        return sol[1][1]
    end
end

Here’s the same script in a GitHub repo along with a Julia env: GitHub - white-alistair/MemoryLeak.jl

From the GC logging (see the example log file in the GitHub repo), I can see that the amount of memory collected increases slightly each time the GC runs. At the same time, the memory usage of the process itself is increasing steadily. This continues until we run out of memory.
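If it’s easier to reproduce without GC logging, the growth should also be visible by printing a couple of memory statistics from inside the loop. Here’s a rough, untested sketch on top of the MWE above (it assumes prob and p are defined as there), using Sys.maxrss for the resident-set high-water mark and Base.gc_live_bytes for the live heap:

iteration = 0
while true
    global iteration += 1
    gradients = Zygote.gradient(Zygote.Params([p])) do
        sol = solve(prob, Tsit5())
        return sol[1][1]
    end
    # Live heap as tracked by the GC vs. resident-set high-water mark of the process
    println(
        "iter=$iteration  gc_live=$(Base.format_bytes(Base.gc_live_bytes()))  ",
        "maxrss=$(Base.format_bytes(Sys.maxrss()))",
    )
end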

I tried to reproduce the problem locally in a Docker container with the same memory limit as the default on our cluster, but in this case the GC appears to behave differently. Initially, we observe the same gradual increase in memory usage as we see on the cluster. However, just as the process is approaching the memory limit, the GC appears to kick in much more aggressively and avoids the OOM error completely. This is how the stats look when that happens (after which memory usage remains stable):

CONTAINER ID   NAME              CPU %     MEM USAGE / LIMIT     MEM %     NET I/O         BLOCK I/O    PIDS
ea1e88659be6   competent_raman   100.15%   3.218GiB / 3.418GiB   94.16%    105MB / 985kB   0B / 578MB   5

If my jobs on the cluster were garbage collected in the same way as the jobs in the Docker container, everything would be fine!

Other things I’ve tried:

  1. Increasing the memory limit on Slurm, up to 8 GB, but we still run out of memory eventually.
  2. Peppering my code with GC.gc(true), GC.gc(false), and even ccall(:malloc_trim, Cvoid, (Cint,), 0) (roughly as in the sketch after this list), but none of it made any noticeable difference.
  3. Decreasing the number of parameters to 1,000 and then increasing it to 100,000, neither of which appears to reproduce the problematic memory usage…?
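To be concrete about point 2, the manual collection calls went at the end of each loop iteration, roughly like this (the placement shown is just one of the variants I tried):

while true
    gradients = Zygote.gradient(Zygote.Params([p])) do
        sol = solve(prob, Tsit5())
        return sol[1][1]
    end
    GC.gc(true)                               # full collection
    GC.gc(false)                              # quick, incremental collection
    ccall(:malloc_trim, Cvoid, (Cint,), 0)    # ask glibc to return freed pages to the OS
end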

Version info on the cluster:

Julia Version 1.8.2
Commit 36034abf260 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, haswell)
  Threads: 1 on 16 virtual cores
Environment:
  LD_LIBRARY_PATH = /p/system/packages/julia/1.8.2/lib
  JULIA_ROOT = /p/system/packages/julia/1.8.2

Version info on my machine:

Julia Version 1.8.2
Commit 36034abf260 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
  Threads: 1 on 4 virtual cores
Environment:
  JULIA_GPG = 3673DF529D9049477F76B37566E3C7DC03D6E495
  JULIA_PATH = /usr/local/julia
  JULIA_VERSION = 1.8.2

I would be very grateful for any help understanding and fixing this.
