GPU memory usage increasing on each epoch (Flux)

I’m catching up on the last few years of Flux progress (thanks, Flux devs!) by training some toy models, and am surprised to find that GPU memory usage is increasing on each training epoch, even when all training data is pre-copied to the GPU once. Here’s a minimal example that just generates some random data and trains a linear model:

using Flux
using MLUtils: DataLoader
using CUDA

function gpu_free()
    # Run a full Julia GC, then return freed memory from CUDA.jl's pool to the driver.
    GC.gc()
    CUDA.reclaim()
end

function increasing_gpu_memory_usage()
    n_obs = 300_000
    n_feature = 1000
    X = rand(n_feature, n_obs)
    y = rand(1, n_obs)
    train_data = DataLoader((X, y) |> gpu; batchsize = 2048, shuffle = false)

    model = Dense(n_feature, 1) |> gpu
    loss(m, _x, _y) = Flux.Losses.mse(m(_x), _y)
    opt_state = Flux.setup(Flux.Adam(), model)
    for epoch in 1:8
        @info "Start of epoch $(epoch)"
        CUDA.memory_status()
        train_time = @elapsed Flux.train!(loss, model, train_data, opt_state)
        @info "Epoch $(epoch) train time $(round(train_time, digits=3))"
    end
    return
end

And example output:

julia> gpu_free(); increasing_gpu_memory_usage()
[ Info: Start of epoch 1
Effective GPU memory usage: 31.51% (2.521 GiB/7.999 GiB)
Memory pool usage: 1.119 GiB (1.125 GiB reserved)
[ Info: Epoch 1 train time 0.574
[ Info: Start of epoch 2
Effective GPU memory usage: 59.64% (4.771 GiB/7.999 GiB)
Memory pool usage: 3.363 GiB (3.375 GiB reserved)
[ Info: Epoch 2 train time 0.259
[ Info: Start of epoch 3
Effective GPU memory usage: 90.55% (7.243 GiB/7.999 GiB)
Memory pool usage: 5.608 GiB (5.625 GiB reserved)
[ Info: Epoch 3 train time 0.719
[ Info: Start of epoch 4
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 7.853 GiB (7.875 GiB reserved)
[ Info: Epoch 4 train time 0.952
[ Info: Start of epoch 5
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 10.097 GiB (10.125 GiB reserved)
[ Info: Epoch 5 train time 0.972
[ Info: Start of epoch 6
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 12.342 GiB (12.344 GiB reserved)
[ Info: Epoch 6 train time 22.584
[ Info: Start of epoch 7
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 1.410 GiB (14.344 GiB reserved)
[ Info: Epoch 7 train time 0.282
[ Info: Start of epoch 8
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 3.654 GiB (14.344 GiB reserved)
[ Info: Epoch 8 train time 0.112

Notice how both GPU memory usage and training time increase each epoch up to epoch 6, which is very slow (22.6 s) and after which memory pool usage drops back down, so presumably that time was spent in garbage collection?

Naively I would have thought that, since the full training data is type-converted and copied to the GPU once ((X, y) |> gpu) and fits comfortably in GPU memory, usage shouldn’t grow by a couple of GiB each epoch, which of course slows training down drastically. Is there something simple I’m doing wrong here, or is this an inevitable consequence of gradient computation requiring GPU-side allocations?

(versions: Flux 0.14.15, CUDA 5.2.0, Julia 1.10.2)

Thanks in advance for the help!


Unfortunately, the memory-management heuristic in CUDA.jl performs poorly under some workloads. Forcing garbage collection by inserting GC.gc(false); CUDA.reclaim() at the end of each epoch helps a lot.
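Applied to the loop from the question, that might look something like the following sketch (reusing the model, loss, train_data and opt_state names from above; the reclaim() call is optional, see the discussion further down):

for epoch in 1:8
    train_time = @elapsed Flux.train!(loss, model, train_data, opt_state)
    @info "Epoch $(epoch) train time $(round(train_time, digits=3))"
    # Collect dead GPU arrays from this epoch before the next one starts allocating.
    GC.gc(false)
    CUDA.reclaim()
end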
You may also want to run the GC every few mini-batch iterations, in which case you have to roll your own training loop:

for (i, (x, y)) in enumerate(train_data)
  # Explicit-style gradient with respect to the model itself.
  g = Flux.gradient(m -> loss(m, x, y), model)[1]
  Flux.update!(opt_state, model, g)
  if i % 10 == 0
    # @info "Batch $(i)"
    # CUDA.memory_status()
    GC.gc(false)
  end
end

A related issue is

Hopefully, an improvement will come from


This is very helpful, thanks. One question: empirically, adding the GC.gc(false) call each epoch helps a lot with the exploding memory and runtime, but adding a per-epoch CUDA.reclaim() on top of that is actually detrimental. How should I think about why that happens?

My guess is that reclaim is fast when there isn’t anything to reclaim, but that there’s some economy of scale: it’s cheaper to reclaim a large amount of memory all at once, so calling it every epoch adds excessive overhead, depending on hardware and data size of course?

Which is expected; CUDA.reclaim() should not be called without good reason. It wipes the memory pool, which is a cache of allocations, making subsequent allocations much more expensive. It should only be used if you run into an otherwise unsolvable OOM, which typically happens because of libraries allocating outside of the memory pool.
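As a rough illustration of that “last resort” usage, not an official recommendation, something like the following would only fall back to a full GC plus reclaim() when an allocation actually fails (assuming CUDA.jl’s OutOfGPUMemoryError exception type):

try
    Flux.train!(loss, model, train_data, opt_state)
catch err
    err isa CUDA.OutOfGPUMemoryError || rethrow()
    # Last resort: full GC, then release the pool's cached memory back to the driver.
    GC.gc(true)
    CUDA.reclaim()
    # Retry once now that memory has been returned.
    Flux.train!(loss, model, train_data, opt_state)
end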

One possible solution for this is Consider running GC when allocating and synchronizing by maleadt · Pull Request #2304 · JuliaGPU/CUDA.jl · GitHub, which would run the GC more eagerly when performing certain CUDA.jl allocations instead of waiting for the GC to come along on its own. Testing that out in order to fine-tune the heuristics, or posting an MWE there, would be useful.

That said, I am slightly confused by the memory_status() output:

The memory pool normally shouldn’t exceed the size of the device. Are you perhaps using unified memory somewhere?

Yeah, I’m using an RTX 3060 Ti on Windows (well, typically Windows Subsystem for Linux), and Windows shows 16 GB of graphics memory, of which 8 GB is shared and 8 GB dedicated. Is it recommended to use environment variables to try to limit usage to under 8 GB, as in the docs here?
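For concreteness, I mean something like this at the top of the script, before CUDA.jl is first loaded (going by my reading of the memory-management docs; the 7GiB value is just a guess at a sensible cap below the dedicated VRAM):

# Must be set before CUDA.jl initializes its memory pool.
ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "7GiB"  # hypothetical cap below the 8 GiB dedicated VRAM

using CUDA
using Flux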

Not necessarily, I was just confused by the output. memory_status() should probably show 16GiB too, but I’m not sure if the CUDA API exposes a way to query the full available memory range.