Flux runs out of memory

Also, just to confirm: had you added the following block?

using ChainRulesCore
import ChainRulesCore: rrule

function ChainRulesCore.rrule(cfg::RuleConfig, c::Chain, x::AbstractArray)
    # forward pass, layer by layer, keeping each layer's (output, pullback) pair
    duo = accumulate(c.layers; init=(x, nothing)) do (input, _), layer
        out, back = rrule_via_ad(cfg, layer, input)
    end
    outs = map(first, duo)    # intermediate activations
    backs = map(last, duo)    # pullbacks for the backward pass
    function un_chain(dout)
        # backward pass in reverse order, collecting each (layer gradient, input gradient) pair
        multi = accumulate(reverse(backs); init=(nothing, dout)) do (_, delta), back
            dlayer, din = back(delta)
        end
        layergrads = reverse(map(first, multi))   # per-layer gradients, back in forward order
        xgrad = last(multi[end])                  # gradient with respect to the input x
        foreach(CUDA.unsafe_free!, outs)                        # free the intermediate activations
        foreach(CUDA.unsafe_free!, map(last, multi[1:end-1]))   # free the intermediate deltas
        return (Tangent{Chain}(; layers=layergrads), xgrad)
    end
    outs[end], un_chain
end
# Could restrict this to x::CuArray... For testing, instead write NaN into non-CuArrays, piratically:
CUDA.unsafe_free!(x::Array) = fill!(x, NaN)
CUDA.unsafe_free!(x::Flux.Zygote.Fill) = nothing
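
For reference, once this rrule is defined, an ordinary Zygote gradient call over a Chain picks it up automatically, and each intermediate activation is freed as soon as the backward pass is done with it. A rough usage sketch, with made-up model and data names:

using Flux, CUDA, Zygote

model = Chain(Dense(100 => 100, relu), Dense(100 => 10)) |> gpu   # stand-in model
x = CUDA.rand(Float32, 100, 32)                                   # stand-in batch

# Zygote consults ChainRulesCore rrules, so this gradient call goes through the rule
# above and frees the intermediate outputs during the backward pass.
grads = Zygote.gradient(m -> sum(abs2, m(x)), model)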

This was the major helper that @mcabbott provided; it roughly doubled the batch size I could fit, without having to disable the CUDA memory pool, which had a significant adverse effect on performance.

I sometimes wonder if one could take this even further and provide a callback function which gets the gradient and the thing the gradient is for, instead of returning them.

One could then immediately apply any optimizer update and then free the parameter gradients. There are probably a lot of things that would not work out with it (non-linear optimizers, parameters that occur more than once?), so it might be difficult to make it generic.
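
To make the idea concrete, here is a rough sketch (not a working proposal): after the gradient call, walk the chain layer by layer, apply an Optimisers.jl update to each layer as soon as its gradient is used, and free that gradient's GPU buffers immediately. The helpers free_grads! and update_per_layer! are made up for illustration.

using Flux, Optimisers, CUDA, Functors

# free all CuArray leaves inside one layer's gradient (illustrative helper)
free_grads!(g) = Functors.fmap(g) do x
    x isa CuArray && CUDA.unsafe_free!(x)
    x
end

# apply the optimizer update per layer, then drop that layer's gradient right away
function update_per_layer!(opt_states, model::Chain, layer_grads)
    for (i, (layer, dlayer)) in enumerate(zip(model.layers, layer_grads))
        dlayer === nothing && continue
        opt_states[i], _ = Optimisers.update!(opt_states[i], layer, dlayer)
        free_grads!(dlayer)
    end
    return opt_states
end

# usage sketch, assuming `model::Chain` on the GPU and a batch (x, y):
# opt_states = [Optimisers.setup(Optimisers.Nesterov(1e-3), l) for l in model.layers]
# grads = Flux.gradient(m -> Flux.mse(m(x), y), model)
# update_per_layer!(opt_states, model, grads[1].layers)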

Now I tried including this block. Same result.

Hey everyone, I am running into the same problem.

I am training a residual U-Net for 3D image segmentation with FastAI.jl on GCloud with a 16GB T4 GPU, but I keep getting out-of-memory errors on the GPU. After searching online I made sure to set JULIA_CUDA_MEMORY_POOL to “none” and added a callback after every epoch that runs GC.gc(true) and CUDA.reclaim(). I think the problem is a memory leak, as it only occurs after ~30 epochs. I can also see the GPU utilization dropping before the crash (see the dashboard screenshot below). When I decrease the input image size it happens later; when I decrease the model size it happens earlier (this I really do not understand). I posted the question here on StackOverflow. Does anyone have an idea what the problem could be, or what I can try in order to figure it out?
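
For concreteness, this is roughly what the cleanup amounts to in a plain Flux training loop (my actual run goes through FastAI.jl's Learner and a callback, but the effect is the same; the names below are placeholders):

ENV["JULIA_CUDA_MEMORY_POOL"] = "none"   # must be set before CUDA is initialized
using Flux, CUDA

# `model`, `opt_state` (from Flux.setup) and `train_loader` stand in for the real U-Net,
# its optimiser state, and the FastAI.jl data iterator.
for epoch in 1:100
    for (x, y) in train_loader
        grads = Flux.gradient(m -> Flux.logitcrossentropy(m(x), y), model)
        Flux.update!(opt_state, model, grads[1])
    end
    GC.gc(true)       # force a full garbage collection at the end of each epoch
    CUDA.reclaim()    # return freed GPU memory to the driver
end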

Some context info:

  • I am using Julia 1.9.0 with FastAI.jl 0.5.1, Flux.jl 0.13.16, CUDA.jl 4.2.0
  • The VM is Ubuntu 22.04 x86_64 with CUDA toolkit 12.1, NVIDIA driver 530.30.02, and an NVIDIA Tesla T4 GPU with 16GB RAM
  • The model is a residual U-Net with approximately 9.5 million parameters; the input data are 3D Float32 images of size (96, 96, 96), and I am using a batch size of 2.

Some things I’ve tried:

  • I can reproduce the behaviour reliably; it happens every time after the same number of epochs
  • If I decrease the input image size, it still happens, but later (epoch 60)
  • If I decrease the model size, it happens earlier (this I especially don’t understand)
  • I’ve set JULIA_CUDA_MEMORY_POOL to none and added a callback after each epoch that executes GC.gc(true) and CUDA.reclaim()
  • I’ve upgraded the GPU to an NVIDIA L4 with 24GB RAM. With this GPU I can train until epoch 125, but then the same thing happens (a gradual decrease in GPU utilization and finally an OOM error). I can work with this for now, simply restarting the training after every crash, but it is not ideal :confused:

This is a screenshot of the dashboard with VM and GPU metrics; you can see the GPU utilization dropping in several steps just before the crash:

Here is one possible cause that matches your symptoms quite well: CUDA memory leak for Flux.Optimizer · Issue #148 · FluxML/FluxTraining.jl · GitHub

Thank you very much, that was it! I replaced my Flux.Nesterov optimiser with Optimisers.Nesterov from Optimisers.jl. The new training run has already lasted longer than any run before, and I no longer see the symptoms described above.
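
For anyone who hits this later, the change was roughly the following (stand-in model and loss for illustration; in FastAI.jl you pass the Optimisers.jl rule wherever the optimiser currently goes):

using Flux, Optimisers

model = Chain(Dense(10 => 10, relu), Dense(10 => 1))   # stand-in for the real U-Net
x, y  = rand(Float32, 10, 2), rand(Float32, 1, 2)

# before: the Flux.jl rule I was using, which triggered the FluxTraining.jl leak (issue #148)
# opt = Flux.Nesterov(0.001)

# after: an explicit Optimisers.jl rule with setup/update!
opt_state = Optimisers.setup(Optimisers.Nesterov(0.001), model)
grads = Flux.gradient(m -> Flux.mse(m(x), y), model)
opt_state, model = Optimisers.update!(opt_state, model, grads[1])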
