Flux runs out of memory

I’m training a conv net on a 2000x2000 pixel image with Flux/Zygote on an RTX 3070 (mobile) with 8GB of VRAM, and I’m getting an out-of-memory exception when the gradients are calculated:

ERROR: LoadError: Out of GPU memory trying to allocate 641.462 MiB
Effective GPU memory usage: 96.70% (7.735 GiB/8.000 GiB)
Memory pool usage: 5.872 GiB (6.000 GiB reserved)

My MWE:

using CUDA
using Flux

img = rand(Float32, 2000, 2000, 1, 1) |> gpu

model = Chain(
    Conv((5, 5), 1=>16, sigmoid),
    Conv((5, 5), 16=>16, relu),
    Conv((5, 5), 16=>16, relu),
    Conv((5, 5), 16=>16, relu),
    Conv((5, 5), 16=>16, relu),
    Conv((5, 5), 16=>16, relu)
) |> gpu

ps = params(model)
loss = Flux.mse
opt = ADAM(0.001)

test = model(img)

target = rand(Float32, size(test)...) |> gpu

for epoch in 1:1
    gs = Flux.gradient(ps) do
            loss(model(img), target)
        end

    Flux.Optimise.update!(opt, ps, gs)
end

Am I doing something stupid? I would expect this NN architecture to fit into memory without trouble.

The model itself comfortably fits into memory, but I wonder if the activations and intermediate allocations might be pushing things over the edge.

# img = rand(Float32, 25, 25, 1, 1) |> gpu; test = model(img); target = rand(Float32, size(test)...) |> gpu;
julia> CUDA.@time gradient(() -> Flux.mse(model(img), target), ps);
 33.297392 seconds (86.89 M CPU allocations: 4.516 GiB, 4.16% gc time) (60 GPU allocations: 13.179 MiB, 0.00% memmgmt time)

julia> CUDA.@time gradient(() -> Flux.mse(model(img), target), ps);
  0.716580 seconds (68.97 k CPU allocations: 3.692 MiB) (48 GPU allocations: 444.008 KiB, 0.01% memmgmt time)

Switching to the large input still OOMs after this, but I can make things work by forcing a full GC and telling CUDA to clear cached memory.

julia> GC.gc(true); CUDA.reclaim();

julia> CUDA.@time gradient(() -> Flux.mse(model(img), target), ps);
  2.497996 seconds (136.12 k CPU allocations: 7.278 MiB, 31.56% gc time) (71 GPU allocations: 12.529 GiB, 10.02% memmgmt time)

julia> CUDA.@time gradient(() -> Flux.mse(model(img), target), ps);
  1.724317 seconds (80.52 k CPU allocations: 4.334 MiB, 44.30% gc time) (67 GPU allocations: 9.652 GiB, 4.16% memmgmt time)

This is on an RTX 2070, also with 8GB of VRAM.

So the first pass allocates quite a bit more (perhaps because it’s compiling the gradient machinery as it goes). It’s not a great workaround, but warming up with a small input, aggressively collecting memory and then using the large input works.
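
For reference, the warm-up sequence is just the following (a rough sketch using the model and ps from the original post; sizes assume that 6-layer model, where each 5x5 conv trims 4 pixels off each spatial dimension, so 2000 -> 1976):

# compile pass on a tiny input
img_small = CUDA.rand(Float32, 25, 25, 1, 1)
target_small = CUDA.rand(Float32, 1, 1, 16, 1)
gradient(() -> Flux.mse(model(img_small), target_small), ps)

# hand cached GPU memory back before the big run
GC.gc(true); CUDA.reclaim()

# now run the full-size input
img = CUDA.rand(Float32, 2000, 2000, 1, 1)
target = CUDA.rand(Float32, 1976, 1976, 16, 1)
gradient(() -> Flux.mse(model(img), target), ps)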

The bigger question is where all these allocations come from. Assuming optimal memory (re)use, we have (2000, 2000, 1, 1) -> Conv((5, 5), 1=>16) -> (1996, 1996, 16, 1) for the first layer. Each subsequent layer is (W, W, 16, 1) -> Conv((5, 5), 16=>16) -> (W-4, W-4, 16, 1). In total, that’s 6 activation arrays with ~378 million elements. At 4 bytes each (Float32), that’s around 1.4 GiB. Model params and optimizer state are negligible in comparison (<10 MiB, I believe). I can’t account for all of the allocations either (back-of-the-napkin math says 48 for forward, backward and parameters), so CuDNN may be doing quite a bit of it behind the scenes (+18 CuDNN calls for forward and backward would be 66, much closer to the reported 67-71).
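
As a quick sanity check on that arithmetic (rough numbers only; this ignores whatever workspace CuDNN requests on top):

widths = [2000 - 4i for i in 1:6]                 # 1996, 1992, ..., 1976
activations = sum(w^2 * 16 for w in widths)       # ~378 million Float32 elements
activations * 4 / 2^30                            # ~1.41 GiB of activations

nparams = (5*5*1*16 + 16) + 5 * (5*5*16*16 + 16)  # ~32.5k parameters
nparams * 4 / 2^20                                # ~0.12 MiB, i.e. negligible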

If you are feeling adventurous, you could try out NNlib#346 on this, which rearranges how Conv works to save just under half of the memory.

After playing around a bit more, one culprit for poor memory reclamation may be the use of globals. This version does not require any calls to GC or CUDA.reclaim():

using CUDA
using Flux

cpu_model = Chain(
    Conv((5, 5), 1=>16, sigmoid),
    Conv((5, 5), 16=>16, relu),
    Conv((5, 5), 16=>16, relu),
    Conv((5, 5), 16=>16, relu),
    Conv((5, 5), 16=>16, relu),
    Conv((5, 5), 16=>16, relu)
)
model = gpu(cpu_model)
ps = params(model)

grad(model, ps, img, target) = gradient(() -> Flux.mse(model(img), target), ps)

# warm up: compile the gradient code on a tiny input first
let img_small = CUDA.rand(Float32, 25, 25, 1, 1), target_small = CUDA.rand(Float32, 1, 1, 16, 1)
    grad(model, ps, img_small, target_small)
end

let
    img = CUDA.rand(Float32, 2000, 2000, 1, 1)
    target = CUDA.rand(Float32, Flux.outputsize(cpu_model, size(img))...)
    # NVTX.@range labels each iteration as a named span in the Nsight Systems timeline
    CUDA.@profile for i in 1:5
        NVTX.@range "iter $i" begin
            grad(model, ps, img, target)
        end
    end
end

The @profile and @range macros are there because I looked at the results in Nsight Systems, and you can too! As expected, the GC occasionally kicks in and brings memory usage back in line.

Thanks @ToucheSir. I’ve tried both solutions, copying the code verbatim, but both always resulted in an OOM. I have also checked the behaviour on my other machine with an RTX 2080 (desktop), on both Win10 and Linux. Still no luck.

BTW, my Manifest.toml looks like this: Manifest.toml - Pastebin.com
CUDA version is:

julia> CUDA.version()
v"11.6.0"

I’d like to note that, for me, not even a single iteration of the loop containing the grad(...) call will execute. It’s not that I need 5 iterations (as in your example) to fill up the memory and trigger an OOM because the GC kicks in too late.

I guess if I don’t find a better solution I’ll need to try to make the model smaller by increasing the stride and/or adding pooling.
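
Something along these lines, maybe (just a hypothetical variant, untested; the stride and pool sizes are arbitrary):

model = Chain(
    Conv((5, 5), 1=>16, sigmoid; stride=2),  # halves the spatial dimensions up front
    MaxPool((2, 2)),                         # halves them again
    Conv((5, 5), 16=>16, relu),
    Conv((5, 5), 16=>16, relu),
    Conv((5, 5), 16=>16, relu)
) |> gpu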

Can you confirm that the model works on the small input? If not, that’s likely a CUDA issue.

Another thing to keep in mind is that your system is consuming VRAM as well. I ran my tests on a machine that doesn’t usually have any kind of desktop session active. If nvidia-smi reports more than a few hundred MB being used when Julia is not running, then you may just not have enough overall.
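
You can also check this from inside Julia before loading anything (a quick query; the numbers will include whatever the desktop session has already claimed):

using CUDA

CUDA.memory_status()               # prints current pool and device memory usage
CUDA.available_memory() / 2^30     # free device memory, in GiB
CUDA.total_memory() / 2^30         # total device memory, in GiB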