MPI.jl + Zygote OOM on embarrassingly parallel tasks

We have a simple NN:

using MPI
using Zygote
using CUDA
using Flux
using LinearAlgebra

MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)   # note: shadows Base.size (unused below)

# Problem size comes from the command line: nx ny nz nt
nx, ny, nz, nt = parse.(Int, ARGS[1:4])
T = Float32

# Synthetic data: 5 input features and 1 target per sample
x = rand(T, 5, nx*ny*nz*nt)
y = rand(T, 1, nx*ny*nz*nt)

weights = Dict(
    :w1 => rand(T, 20, 5),
    :w2 => rand(T, 128, 20),
    :w3 => rand(T, 1, 128)
)

# Move data and weights to the GPU
x = x |> gpu
y = y |> gpu
weights = Dict(k => gpu(v) for (k, v) in pairs(weights))

# Small MLP loss; note that `y` is captured from global scope
function forward(weights, x)
    w1, w2, w3 = weights[:w1], weights[:w2], weights[:w3]
    return norm(relu.(w3 * relu.(w2 * (w1 * x))) - y)
end

forward(weights, x)
gradient_weights = Zygote.gradient(weights -> forward(weights, x), weights)

MPI.Finalize()

We are using NVIDIA A100 SXM 80 GB GPUs.

Max problem size before OOMing:

nx, ny, nz, nt = 128, 128, 64, 15 works on one or two GPUs.

We can only go up to nx, ny, nz, nt = 128, 128, 64, 9 on 4 GPUs.

Why does this scale so poorly?

Naive question: how do you make sure that each MPI rank uses a different GPU? Or what's the idea here?


How much CPU memory do you have? If you’re not using any sort of resource manager (e.g. slurm), the easiest option would be to use the --heap-size-hint option to limit each process to a fraction of the total memory.
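For illustration, something like the following (a sketch: --heap-size-hint requires Julia 1.9 or newer, and the 16G value here is only a placeholder, not a recommendation):

mpiexecjl --project=./ julia --heap-size-hint=16G scaling.jl 128 128 64 15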


One other thing worth trying is to avoid using CPU memory at all, and use CUDA.rand instead.
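In the script above that would mean generating the data directly on the device instead of the rand + gpu pattern, roughly (a sketch, keeping the shapes from the original script):

x = CUDA.rand(T, 5, nx*ny*nz*nt)
y = CUDA.rand(T, 1, nx*ny*nz*nt)
weights = Dict(
    :w1 => CUDA.rand(T, 20, 5),
    :w2 => CUDA.rand(T, 128, 20),
    :w3 => CUDA.rand(T, 1, 128)
)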


We use slurm, so something like:

salloc --nodes=1 --constraint=gpu --gpus=4 --qos=interactive --time=00:20:00 --ntasks=4 --gpus-per-task=1 --gpu-bind=none

and then:

mpiexecjl --project=./ julia-1.8 scaling.jl 128 128 64 15

Hello,

We do use slurm and allocate a fixed amount of memory per task.

Hey,

I have tried this as well. The error indicates that the issue arises when trying to allocate too much GPU memory:

ERROR: LoadError: Out of GPU memory trying to allocate 7.500 GiB
Effective GPU memory usage: 99.98% (79.136 GiB/79.154 GiB)
Memory pool usage: 9.023 GiB (16.531 GiB reserved)

I would also like to note that binding or not binding the GPUs to the tasks has no effect on the outcome.

It looks like all 4 GPUs are attached to the same node. This means CUDA.jl will use GPU ID 0 by default unless you tell it otherwise. Flux.gpu (which has nothing to do with Zygote, so I don't think Zygote actually matters here) calls CUDA.cu under the hood. So each of your tasks is trying to allocate memory on the same GPU, causing OOMs when they probably shouldn't happen.

When not using MPI, the best way to assign different GPUs per process/task would be to follow Multiple GPUs · CUDA.jl. I’ll let the HPC experts in this thread comment on whether there’s something special which is available when using MPI.
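For reference, the basic pattern from that docs page is to enumerate the visible devices and pick one per process; how you choose the index is up to you (a sketch):

using CUDA

devs = collect(CUDA.devices())  # all GPUs visible to this process
CUDA.device!(devs[1])           # select one; later allocations go to it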


Try

CUDA.device!(MPI.Comm_rank(MPI.COMM_WORLD))
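On a multi-node job the global rank can exceed the number of GPUs in a node, so a more defensive variant (a sketch, not from the thread) is to use the node-local rank modulo the local device count:

comm = MPI.COMM_WORLD
# ranks on the same node share this sub-communicator
local_comm = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, MPI.Comm_rank(comm))
local_rank = MPI.Comm_rank(local_comm)
CUDA.device!(local_rank % length(CUDA.devices()))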

Hello @ToucheSir, thank you, this seems to be the issue. Even when binding one GPU to one task through slurm, I noticed that all tasks were mapped to one GPU.

Thank you!

Hey @simonbyrne, thank you, this seems to be the solution. I will also note that for this to work, I had to unbind the GPUs so they are visible to all tasks on a node (--gpu-bind=none).
