MPI.jl + Zygote OOM on embarrassingly parallel tasks

We have a simple NN:

using MPI
using Zygote
using CUDA
using Flux
using LinearAlgebra

MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)   # note: shadows Base.size (unused below)

# Problem size comes from the command line: nx ny nz nt
nx, ny, nz, nt = parse.(Int, ARGS[1:4])
T = Float32

# Synthetic data: 5 input features and 1 target per sample
x = rand(T, 5, nx*ny*nz*nt)
y = rand(T, 1, nx*ny*nz*nt)

weights = Dict(
    :w1 => rand(T, 20, 5),
    :w2 => rand(T, 128, 20),
    :w3 => rand(T, 1, 128)
)

# Move data and weights to the GPU
x = x |> gpu
y = y |> gpu
weights = Dict(k => gpu(v) for (k, v) in pairs(weights))

# Small MLP loss; note that `y` is captured from global scope
function forward(weights, x)
    w1, w2, w3 = weights[:w1], weights[:w2], weights[:w3]
    return norm(relu.(w3 * relu.(w2 * (w1 * x))) - y)
end

forward(weights, x)
gradient_weights = Zygote.gradient(weights -> forward(weights, x), weights)

MPI.Finalize()

We are using NVIDIA A100 SXM 80 GB GPUs.

Max problem size before OOMing:

nx, ny, nz, nt = 128, 128, 64, 15 works on one or two GPUs.

We can only go up to nx, ny, nz, nt = 128, 128, 64, 9 on 4 GPUs.

Why does this scale so poorly?

Naive question: how do you make sure that each MPI rank uses a different GPU? Or what's the idea here?


How much CPU memory do you have? If you’re not using any sort of resource manager (e.g. slurm), the easiest option would be to use the --heap-size-hint option to limit each process to a fraction of the total memory.
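For illustration, something like the following (a sketch: --heap-size-hint requires Julia 1.9 or newer, and the 16G value here is only a placeholder, not a recommendation):

mpiexecjl --project=./ julia --heap-size-hint=16G scaling.jl 128 128 64 15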


One other thing worth trying is to avoid using CPU memory at all, and use CUDA.rand instead.
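In the script above that would mean generating the data directly on the device instead of the rand + gpu pattern, roughly (a sketch, keeping the shapes from the original script):

x = CUDA.rand(T, 5, nx*ny*nz*nt)
y = CUDA.rand(T, 1, nx*ny*nz*nt)
weights = Dict(
    :w1 => CUDA.rand(T, 20, 5),
    :w2 => CUDA.rand(T, 128, 20),
    :w3 => CUDA.rand(T, 1, 128)
)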


We use slurm, so something like:

salloc --nodes=1 --constraint=gpu --gpus=4 --qos=interactive --time=00:20:00 --ntasks=4 --gpus-per-task=1 --gpu-bind=none

and then:

mpiexecjl --project=./ julia-1.8 scaling.jl 128 128 64 15

Hello,

We do use slurm and allocate a fixed amount of memory per task.

Hey,

I have tried this as well. The error indicates that the issue arises when trying to allocate too much GPU memory:

ERROR: LoadError: Out of GPU memory trying to allocate 7.500 GiB
Effective GPU memory usage: 99.98% (79.136 GiB/79.154 GiB)
Memory pool usage: 9.023 GiB (16.531 GiB reserved)

I would also like to note that binding or not binding the GPUs to the tasks has no effect on the outcome.

It looks like all 4 GPUs are attached to the same node. This means CUDA.jl will use GPU ID 0 by default unless you tell it otherwise. Flux.gpu (which has nothing to do with Zygote, so I don't think Zygote actually matters here) calls CUDA.cu under the hood. So each of your tasks is trying to allocate memory on the same GPU, causing OOMs when they probably shouldn't happen.

When not using MPI, the best way to assign different GPUs per process/task would be to follow Multiple GPUs · CUDA.jl. I’ll let the HPC experts in this thread comment on whether there’s something special which is available when using MPI.
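For reference, the basic pattern from that docs page is to enumerate the visible devices and pick one per process; how you choose the index is up to you (a sketch):

using CUDA

devs = collect(CUDA.devices())  # all GPUs visible to this process
CUDA.device!(devs[1])           # select one; later allocations go to it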


Try

CUDA.device!(MPI.Comm_rank(MPI.COMM_WORLD))
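On a multi-node job the global rank can exceed the number of GPUs in a node, so a more defensive variant (a sketch, not from the thread) is to use the node-local rank modulo the local device count:

comm = MPI.COMM_WORLD
# ranks on the same node share this sub-communicator
local_comm = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, MPI.Comm_rank(comm))
local_rank = MPI.Comm_rank(local_comm)
CUDA.device!(local_rank % length(CUDA.devices()))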

Hello @ToucheSir, thank you, this seems to be the issue. Even when binding one GPU to one task through slurm, I noticed that all tasks were mapped to one GPU.

Thank you!

Hey @simonbyrne, thank you, this seems to be the solution. I will also note that for this to work, I had to unbind the GPUs so they are visible to all tasks on a node (--gpu-bind=none).
