So I was able to train my first network outside of a tutorial. It was very satisfying to get a 100-1000x improvement over the CPU.
I tried several Linux tools to monitor VRAM (which apparently all get their info from the same place): rocm-smi, amdgpu_top, and nvtop. I have settled on nvtop for now since it charts the history. My first question: what do others use for this?
However, my data is only 780 × 598712 × 32 bits ≈ 1.74 GiB. Indeed, varinfo() shows:
X 1.740 GiB 780×598712 Matrix{Float32}
y 2.284 MiB 1×598712 Matrix{Float32}
And (IIUC) my model is only 216_449 * 32 / 8 / 2^10 ≈ 845 KiB. I guess double that if we include the gradients, so maybe 2 MiB? Although the REPL reports only 1.521 KiB:
julia> model = Flux.Chain(
Dense(size(X)[1], 64*4, relu),
Dense(64*4, 16*4, relu),
Dense(16*4, 1, relu)) |> gpu
Chain(
Dense(780 => 256, relu), # 199_936 parameters
Dense(256 => 64, relu), # 16_448 parameters
Dense(64 => 1, relu), # 65 parameters
) # Total: 6 arrays, 216_449 parameters, 1.521 KiB.
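As a sanity check on the 845 KiB figure, the parameter count can be reproduced from the layer shapes in the Chain printout (a back-of-envelope sketch, nothing Flux-specific):

```julia
# Parameter memory for Dense(780=>256), Dense(256=>64), Dense(64=>1),
# stored as Float32 (4 bytes each).
layers = [(780, 256), (256, 64), (64, 1)]
nparams = sum(nin * nout + nout for (nin, nout) in layers)  # weights + biases
println(nparams)             # 216449, matching the REPL
println(nparams * 4 / 2^10)  # ≈ 845.5 KiB of parameters
```

So the "1.521 KiB" in the Chain summary presumably counts only the host-side wrappers, not the device buffers.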
Yet when I load the entire dataset into GPU memory, I see (in all the tools above) that VRAM jumps around between ~5 GiB and 16 GiB (the maximum).
Here is the output of the REPL:
julia> gpu_train_loader = Flux.DataLoader((X, y) |> gpu, batchsize = batchsize, shuffle = true)
1-element DataLoader(::Tuple{ROCArray{Float32, 2, AMDGPU.Runtime.Mem.HIPBuffer}, ROCArray{Float32, 2, AMDGPU.Runtime.Mem.HIPBuffer}}, shuffle=true, batchsize=598712)
with first element:
(780×598712 ROCArray{Float32, 2, AMDGPU.Runtime.Mem.HIPBuffer}, 1×598712 ROCArray{Float32, 2, AMDGPU.Runtime.Mem.HIPBuffer},)
julia> @time for epoch in 1:epochs
           for (x, y) in gpu_train_loader
               grads = gradient(loss, model, x, y)
               Flux.update!(opt_state, model, grads[1])
           end
       end
64.180294 seconds (19.97 M allocations: 2.522 GiB, 17.85% gc time, 26.96% compilation time)
A second run (no compilation) gives:
56.122555 seconds (372.78 k allocations: 1.360 GiB, 13.00% gc time)
My second question is whether this is expected behavior. The constant reallocation shown by nvtop et al. seems inefficient, but really I have no idea.
PS: I guess the Adam optimizer also stores a couple of Float32 buffers per parameter array, so maybe add another few MiB. But still, how do you get from ~2 GiB to ~16 GiB? The optimizer state, for reference:
Summary
julia> opt_state = Flux.setup(opt, model) |> gpu
(layers = ((weight = Leaf(Adam(0.01, (0.9, 0.8), 1.0e-8),
(Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0],
Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], (0.9, 0.8))),
bias = Leaf(Adam(0.01, (0.9, 0.8), 1.0e-8),
(Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], (0.9, 0.8))), σ = ()),
(weight = Leaf(Adam(0.01, (0.9, 0.8), 1.0e-8),
(Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0],
Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], (0.9, 0.8))),
bias = Leaf(Adam(0.01, (0.9, 0.8), 1.0e-8),
(Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], (0.9, 0.8))), σ = ()),
(weight = Leaf(Adam(0.01, (0.9, 0.8), 1.0e-8),
(Float32[0.0 0.0 … 0.0 0.0],
Float32[0.0 0.0 … 0.0 0.0], (0.9, 0.8))),
bias = Leaf(Adam(0.01, (0.9, 0.8), 1.0e-8),
(Float32[0.0], Float32[0.0], (0.9, 0.8))), σ = ())),)
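To make the gap concrete, here is my full back-of-envelope accounting (my own arithmetic, all Float32; the activation line is my guess at the per-layer intermediates for a full-batch forward pass, which may well be wrong):

```julia
# Rough VRAM accounting in GiB; Float32 = 4 bytes, n = number of samples.
n = 598_712
data   = (780n + n) * 4 / 2^30        # X and y on the GPU     ≈ 1.74
params = 216_449 * 4 / 2^30           # model weights/biases   ≈ 0.0008
grads  = params                       # one gradient per param
adam   = 2 * params                   # Adam's m and v buffers
acts   = (256n + 64n + n) * 4 / 2^30  # full-batch activations ≈ 0.72
println(data + params + grads + adam + acts)  # ≈ 2.46, nowhere near 16
```

Even with activations included I only get to ~2.5 GiB, so whatever accounts for the 5-16 GiB readings is not in this list.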