I’m catching up on the last few years of Flux progress (thanks, Flux devs!) by training some toy models, and am surprised to find that GPU memory usage is increasing on each training epoch, even when all training data is pre-copied to the GPU once. Here’s a minimal example that just generates some random data and trains a linear model:
using Flux
using MLUtils: DataLoader
using CUDA
# Force a host-side GC pass, then release cached GPU memory back to the driver
function gpu_free()
    GC.gc()
    CUDA.reclaim()
end
function increasing_gpu_memory_usage()
    n_obs = 300_000
    n_feature = 1000
    X = rand(n_feature, n_obs)
    y = rand(1, n_obs)
    # Type-convert and copy the full dataset to the GPU once, up front
    train_data = DataLoader((X, y) |> gpu; batchsize = 2048, shuffle = false)
    model = Dense(n_feature, 1) |> gpu
    loss(m, _x, _y) = Flux.Losses.mse(m(_x), _y)
    opt_state = Flux.setup(Flux.Adam(), model)
    for epoch in 1:8
        @info "Start of epoch $(epoch)"
        CUDA.memory_status()
        train_time = @elapsed Flux.train!(loss, model, train_data, opt_state)
        @info "Epoch $(epoch) train time $(round(train_time, digits=3))"
    end
    return
end
And example output:
julia> gpu_free(); increasing_gpu_memory_usage()
[ Info: Start of epoch 1
Effective GPU memory usage: 31.51% (2.521 GiB/7.999 GiB)
Memory pool usage: 1.119 GiB (1.125 GiB reserved)
[ Info: Epoch 1 train time 0.574
[ Info: Start of epoch 2
Effective GPU memory usage: 59.64% (4.771 GiB/7.999 GiB)
Memory pool usage: 3.363 GiB (3.375 GiB reserved)
[ Info: Epoch 2 train time 0.259
[ Info: Start of epoch 3
Effective GPU memory usage: 90.55% (7.243 GiB/7.999 GiB)
Memory pool usage: 5.608 GiB (5.625 GiB reserved)
[ Info: Epoch 3 train time 0.719
[ Info: Start of epoch 4
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 7.853 GiB (7.875 GiB reserved)
[ Info: Epoch 4 train time 0.952
[ Info: Start of epoch 5
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 10.097 GiB (10.125 GiB reserved)
[ Info: Epoch 5 train time 0.972
[ Info: Start of epoch 6
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 12.342 GiB (12.344 GiB reserved)
[ Info: Epoch 6 train time 22.584
[ Info: Start of epoch 7
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 1.410 GiB (14.344 GiB reserved)
[ Info: Epoch 7 train time 0.282
[ Info: Start of epoch 8
Effective GPU memory usage: 100.00% (7.999 GiB/7.999 GiB)
Memory pool usage: 3.654 GiB (14.344 GiB reserved)
[ Info: Epoch 8 train time 0.112
Notice how both memory pool usage and training time increase each epoch until epoch 6, which is very slow (roughly 22 s versus well under a second for the others), and after which memory pool usage drops back down. Presumably most of that epoch is spent in garbage collection?
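If most of that epoch-6 time really is GC, I'd naively expect that forcing a collection between epochs would smooth things out. Something like this sketch (untested, reusing the gpu_free helper from above) is what I have in mind:

# Hypothetical variant of the epoch loop: force a host-side GC and reclaim
# cached GPU buffers after every epoch, before the pool grows further.
for epoch in 1:8
    @info "Start of epoch $(epoch)"
    CUDA.memory_status()
    train_time = @elapsed Flux.train!(loss, model, train_data, opt_state)
    @info "Epoch $(epoch) train time $(round(train_time, digits=3))"
    gpu_free()   # GC.gc() + CUDA.reclaim(), defined at the top
end

But even if that works, I'd like to understand why it's needed.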
Naively I would think that, since the full training data is type-converted and copied to the GPU once ((X, y) |> gpu) and fits comfortably in GPU memory, memory usage shouldn't grow by a couple of GiB every epoch, which of course slows training down badly. Is there something simple I'm doing wrong here, or is this an inevitable consequence of gradient computation requiring GPU-side allocation?
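For reference, the kind of explicit loop I'd try next, to narrow down whether the allocations come from Flux.train! itself or from the gradients, is something like this (again just a sketch, reusing the names from the example above):

# Hypothetical manual loop in place of Flux.train!, to see whether the
# per-epoch allocations come from train! or from the gradient computation.
for epoch in 1:8
    for (xb, yb) in train_data
        grads = Flux.gradient(m -> Flux.Losses.mse(m(xb), yb), model)
        Flux.update!(opt_state, model, grads[1])
    end
    CUDA.memory_status()
end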
(versions: Flux 0.14.15, CUDA 5.2.0, Julia 1.10.2)
Thanks in advance for the help!