GPU memory usage increasing on each epoch (Flux)

I’m catching up on the last few years of Flux progress (thanks, Flux devs!) by training some toy models, and am surprised to find that GPU memory usage is increasing on each training epoch, even when all training data is pre-copied to the GPU once. Here’s a minimal example that just generates some random data and trains a linear model:

using Flux
using MLUtils: DataLoader
using CUDA

function gpu_free()
    # Run a full Julia GC, then return freed memory from CUDA.jl's pool to the driver.
    GC.gc()
    CUDA.reclaim()
end

function increasing_gpu_memory_usage()
    n_obs = 300_000
    n_feature = 1000
    X = rand(n_feature, n_obs)
    y = rand(1, n_obs)
    train_data = DataLoader((X, y) |> gpu; batchsize = 2048, shuffle = false)

    model = Dense(n_feature, 1) |> gpu
    loss(m, _x, _y) = Flux.Losses.mse(m(_x), _y)
    opt_state = Flux.setup(Flux.Adam(), model)
    for epoch in 1:8
        @info "Start of epoch $(epoch)"
        CUDA.memory_status()
        train_time = @elapsed Flux.train!(loss, model, train_data, opt_state)
        @info "Epoch $(epoch) train time $(round(train_time, digits=3))"
    end
    return
end

And example output:

julia> gpu_free(); increasing_gpu_memory_usage()
[ Info: Start of epoch 1
Effective GPU memory usage: 31.51% (2.521 GiB/7.999 GiB)
Memory pool usage: 1.119 GiB (1.125 GiB reserved)
[ Info: Epoch 1 train time 0.574
[ Info: Start of epoch 2
Effective GPU memory usage: 59.64% (4.771 GiB/7.999 GiB)
Memory pool usage: 3.363 GiB (3.375 GiB reserved)
[ Info: Epoch 2 train time 0.259
[ Info: Start of epoch 3
Effective GPU memory usage: 90.55% (7.243 GiB/7.999 GiB)
Memory pool usage: 5.608 GiB (5.625 GiB reserved)
[ Info: Epoch 3 train time 0.719
[ Info: Start of epoch 4
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 7.853 GiB (7.875 GiB reserved)
[ Info: Epoch 4 train time 0.952
[ Info: Start of epoch 5
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 10.097 GiB (10.125 GiB reserved)
[ Info: Epoch 5 train time 0.972
[ Info: Start of epoch 6
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 12.342 GiB (12.344 GiB reserved)
[ Info: Epoch 6 train time 22.584
[ Info: Start of epoch 7
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 1.410 GiB (14.344 GiB reserved)
[ Info: Epoch 7 train time 0.282
[ Info: Start of epoch 8
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 3.654 GiB (14.344 GiB reserved)
[ Info: Epoch 8 train time 0.112

Notice how both GPU memory usage and training time increase each epoch up to epoch 6, which is very slow (22.6 s) and after which memory pool usage drops back down, so presumably that time was spent in garbage collection?

Naively I would have thought that, since the full training data is type-converted and copied to the GPU once ((X, y) |> gpu) and fits comfortably in GPU memory, usage shouldn’t grow by a couple of GiB each epoch, which of course slows training down drastically. Is there something simple I’m doing wrong here, or is this an inevitable consequence of gradient computation requiring GPU-side allocations?

(versions: Flux 0.14.15, CUDA 5.2.0, Julia 1.10.2)

Thanks in advance for the help!


Unfortunately, the memory-management heuristic in CUDA.jl performs poorly under some workloads. Forcing garbage collection by inserting GC.gc(false); CUDA.reclaim() at the end of each epoch helps a lot.
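Applied to the loop from the question, that might look something like the following sketch (reusing the model, loss, train_data and opt_state names from above; the reclaim() call is optional, see the discussion further down):

for epoch in 1:8
    train_time = @elapsed Flux.train!(loss, model, train_data, opt_state)
    @info "Epoch $(epoch) train time $(round(train_time, digits=3))"
    # Collect dead GPU arrays from this epoch before the next one starts allocating.
    GC.gc(false)
    CUDA.reclaim()
end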
You may also want to run the GC every few mini-batch iterations, in which case you have to roll your own training loop:

for (i, (x, y)) in enumerate(train_data)
  # Explicit-style gradient with respect to the model itself.
  g = Flux.gradient(m -> loss(m, x, y), model)[1]
  Flux.update!(opt_state, model, g)
  if i % 10 == 0
    # @info "Batch $(i)"
    # CUDA.memory_status()
    GC.gc(false)
  end
end

A related issue is

Hopefully, an improvement will come from


This is very helpful, thanks. One question: empirically, adding the GC.gc(false) call each epoch helps a lot with the exploding memory and runtime, but adding a per-epoch CUDA.reclaim() on top of that is actually detrimental. How should I think about why that happens?

My guess is that reclaim is fast when there isn’t anything to reclaim, but that there’s some economy of scale: it’s cheaper to reclaim a large amount of memory all at once, so calling it every epoch adds excessive overhead, depending on hardware and data size of course?

Which is expected; CUDA.reclaim() should not be called without good reason. It wipes the memory pool, which is a cache of allocations, making subsequent allocations much more expensive. It should only be used if you run into an otherwise unsolvable OOM, which typically happens because of libraries allocating outside of the memory pool.
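As a rough illustration of that “last resort” usage, not an official recommendation, something like the following would only fall back to a full GC plus reclaim() when an allocation actually fails (assuming CUDA.jl’s OutOfGPUMemoryError exception type):

try
    Flux.train!(loss, model, train_data, opt_state)
catch err
    err isa CUDA.OutOfGPUMemoryError || rethrow()
    # Last resort: full GC, then release the pool's cached memory back to the driver.
    GC.gc(true)
    CUDA.reclaim()
    # Retry once now that memory has been returned.
    Flux.train!(loss, model, train_data, opt_state)
end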

One possible solution for this is Consider running GC when allocating and synchronizing by maleadt · Pull Request #2304 · JuliaGPU/CUDA.jl · GitHub, which would run the GC more eagerly when performing certain CUDA.jl allocations instead of waiting for the GC to come along on its own. Testing that out in order to fine-tune the heuristics, or posting an MWE there, would be useful.

That said, I am slightly confused by the memory_status() output:

The memory pool normally shouldn’t exceed the size of the device. Are you perhaps using unified memory somewhere?

Yeah, I’m using an RTX 3060 Ti on Windows (well, typically Windows Subsystem for Linux), and Windows shows 16 GB of graphics memory, of which 8 GB is shared and 8 GB dedicated. Is it recommended to use environment variables to try to limit usage to under 8 GB, as in the docs here?
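For concreteness, I mean something like this at the top of the script, before CUDA.jl is first loaded (going by my reading of the memory-management docs; the 7GiB value is just a guess at a sensible cap below the dedicated VRAM):

# Must be set before CUDA.jl initializes its memory pool.
ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "7GiB"  # hypothetical cap below the 8 GiB dedicated VRAM

using CUDA
using Flux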

Not necessarily, I was just confused by the output. memory_status() should probably show 16GiB too, but I’m not sure if the CUDA API exposes a way to query the full available memory range.