AMDGPU: Very aggressive allocation? And what tools to monitor VRAM usage?

So I was able to train my first network outside of a tutorial. It was very satisfying to get the 100-1000x improvement over the CPU.

I tried using multiple Linux tools to monitor VRAM (which apparently all get their info from the same place): rocm-smi, amdgpu_top, and nvtop. I settled on nvtop for now, since it charts the history. First of all, I would be interested to know what others use.

However, my data is only 780 × ~600k × 4 bytes (Float32) ≈ 1.74 GiB. Indeed, varinfo() shows:

X   1.740 GiB 780×598712 Matrix{Float32}
y   2.284 MiB 1×598712 Matrix{Float32}

And (IIUC) my model is only 216_449 * 32 / 8 / 2^10 ≈ 845 KiB. I guess double that if we include the gradients, so maybe 2 MiB? Although the REPL reports only 1.5 KiB:

julia> model = Flux.Chain(
        Dense(size(X)[1], 64*4, relu),
        Dense(64*4, 16*4, relu),
        Dense(16*4, 1, relu)) |> gpu
Chain(
  Dense(780 => 256, relu),              # 199_936 parameters
  Dense(256 => 64, relu),               # 16_448 parameters
  Dense(64 => 1, relu),                 # 65 parameters
)                   # Total: 6 arrays, 216_449 parameters, 1.521 KiB.
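For reference, the same back-of-envelope math done in the REPL, counting only Float32 parameters and gradients (so this is a lower bound on what the model itself needs):

julia> n_params = 216_449;

julia> Base.format_bytes(n_params * sizeof(Float32))       # parameters alone
"845.504 KiB"

julia> Base.format_bytes(2 * n_params * sizeof(Float32))   # parameters + gradients
"1.651 MiB"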

Yet when I load the entire dataset into GPU memory and train, I see (in all the tools above) that VRAM jumps around from ~5 GiB to 16 GiB (the max).

Here is the output of the REPL:

julia> gpu_train_loader = Flux.DataLoader((X, y) |> gpu, batchsize = batchsize, shuffle = true)
1-element DataLoader(::Tuple{ROCArray{Float32, 2, AMDGPU.Runtime.Mem.HIPBuffer}, ROCArray{Float32, 2, AMDGPU.Runtime.Mem.HIPBuffer}}, shuffle=true, batchsize=598712)
  with first element:
  (780×598712 ROCArray{Float32, 2, AMDGPU.Runtime.Mem.HIPBuffer}, 1×598712 ROCArray{Float32, 2, AMDGPU.Runtime.Mem.HIPBuffer},)

julia> @time for epoch in 1:epochs
           for (x, y) in gpu_train_loader
               grads = gradient(loss, model, x, y)
               Flux.update!(opt_state, model, grads[1])
           end
       end
 64.180294 seconds (19.97 M allocations: 2.522 GiB, 17.85% gc time, 26.96% compilation time)

Running it a second time (no compilation) gives:

56.122555 seconds (372.78 k allocations: 1.360 GiB, 13.00% gc time)

My second question is whether this is the expected behavior. It seems inefficient to be reallocating the memory as shown by nvtop etc., but really I have no idea.

PS, I guess the Adam optimizer also stores a bunch of Float32s, so maybe add another few MiB. But still, how do you get from ~2 GiB to ~16 GiB:

julia> opt_state = Flux.setup(opt, model) |> gpu
(layers = ((weight = Leaf(Adam(0.01, (0.9, 0.8), 1.0e-8), 
(Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], 
Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], (0.9, 0.8))), 
bias = Leaf(Adam(0.01, (0.9, 0.8), 1.0e-8), 
(Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 
Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], (0.9, 0.8))), σ = ()), 
(weight = Leaf(Adam(0.01, (0.9, 0.8), 1.0e-8), 
(Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], 
Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], (0.9, 0.8))), 
bias = Leaf(Adam(0.01, (0.9, 0.8), 1.0e-8), 
(Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 
Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], (0.9, 0.8))), σ = ()), 
(weight = Leaf(Adam(0.01, (0.9, 0.8), 1.0e-8), 
(Float32[0.0 0.0 … 0.0 0.0], 
Float32[0.0 0.0 … 0.0 0.0], (0.9, 0.8))), 
bias = Leaf(Adam(0.01, (0.9, 0.8), 1.0e-8), 
(Float32[0.0], Float32[0.0], (0.9, 0.8))), σ = ())),)

So when an allocation happens, it causes the GPU memory pool to grow.
But because we do not immediately free arrays when they are no longer used, the next allocation also causes the memory pool to grow, and so on.

But when we do actually free GPU arrays, we do not reclaim that memory back to the OS.
So the grown pool size stays the same, and the monitoring tools will show that all 16 GiB of VRAM are still in use, when in reality they might not be.
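A rough way to watch this from the REPL, using only the query functions that come up later in this thread (a sketch; exact numbers will vary with your setup):

julia> pool = AMDGPU.HIP.memory_pool(AMDGPU.device());

julia> x = ROCArray{Float32}(undef, 256 * 1024^2);      # allocate 1 GiB

julia> Base.format_bytes(AMDGPU.HIP.used_memory(pool))
"1.000 GiB"

julia> finalize(x)                                      # freed back to the pool...

julia> Base.format_bytes(AMDGPU.HIP.used_memory(pool))  # ...so the pool shows 0 bytes used,
"0 bytes"                                               # yet nvtop/rocm-smi can still count the
                                                        # pool's retained pages as occupied VRAM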


Also, @time is likely only showing CPU allocations. With CUDA.jl I know you can use CUDA.@time (not the same macro!) to count GPU allocations too, but I don’t know what the equivalent for AMDGPU.jl is.


For AMDGPU.jl there’s AMDGPU.@elapsed
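Presumably used something like this (a sketch; I'm assuming it behaves like CUDA.@elapsed and returns the elapsed time in seconds once the GPU work has finished):

julia> x = ROCArray{Float32}(undef, 32 * 1024^2);

julia> AMDGPU.@elapsed sum(x .^ 2)   # time for the GPU computation, in seconds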


Depending on specifics (e.g., step size, for some reason), I will see it spike to 16 GiB and then back down, repeatedly. That is what seemed like a possible inefficiency to me. Is it possible to log the allocations etc. from AMDGPU.jl's side to estimate how much memory is actually available? From looking around it seems the standard monitoring tools can't do this, but from within the software it may be possible.

For the current pool you can get used memory with:

julia> pool = AMDGPU.HIP.memory_pool(dev);

julia> used_memory = HIP.used_memory(pool);

Or free memory with:

julia> free = AMDGPU.Mem.free();
julia> total = AMDGPU.Mem.total();

But this one might give you a delayed result, because this info is only updated in snapshots.
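If it helps, those queries can be rolled into one small helper (just a sketch built from the calls above; the device-wide numbers can lag for the same snapshot reason):

julia> function vram_report(dev = AMDGPU.device())
           pool  = AMDGPU.HIP.memory_pool(dev)
           used  = AMDGPU.HIP.used_memory(pool)   # this process's pool only
           free  = AMDGPU.Mem.free()              # device-wide, all processes
           total = AMDGPU.Mem.total()
           println("pool used: ", Base.format_bytes(used),
                   " | device: ", Base.format_bytes(total - free),
                   " of ", Base.format_bytes(total), " in use")
       end;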


And the spike you are seeing happens because once we hit the VRAM limit, we try to manually invoke GC in the hope of freeing some memory.

We do this in several rounds, from least expensive to most expensive,
so the last round does a full device synchronization and reclaims memory back to the OS.
This is indeed inefficient, but until the GC learns about GPU memory there's no easy solution around this.
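To make "several rounds" concrete, here is a rough sketch of the idea. This is not AMDGPU.jl's actual code, and it assumes the failed allocation surfaces as an OutOfMemoryError:

# Illustrative sketch only -- the real logic lives inside AMDGPU.jl's allocator.
function alloc_with_gc_retries(alloc_fn)
    for round in 1:3
        try
            return alloc_fn()
        catch err
            err isa OutOfMemoryError || rethrow()
            if round == 1
                GC.gc(false)                   # cheapest: incremental collection
            elseif round == 2
                GC.gc(true)                    # more expensive: full collection
            else
                AMDGPU.device_synchronize()    # most expensive: wait for the GPU,
                GC.gc(true)                    # then do a full collection
            end
        end
    end
    return alloc_fn()                          # final attempt; OOM propagates if it still fails
end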


Where do I get dev from? I tried dev = AMDGPU.device() and dev = Flux.get_device(; verbose=true)

For the Flux one:

julia> dev = Flux.get_device(; verbose=true)
[ Info: Using backend set in preferences: AMDGPU.
(::Flux.FluxAMDGPUDevice) (generic function with 1 method)

julia> pool = AMDGPU.HIP.memory_pool(dev.deviceID)
AMDGPU.HIP.HIPMemoryPool(Ptr{Nothing} @0x0000000001e7fe50)

julia> HIP.used_memory(pool)
ERROR: UndefVarError: `HIP` not defined
Stacktrace:
 [1] top-level scope
   @ REPL[61]:1
 [2] top-level scope
   @ ~/.julia/packages/AMDGPU/goZLq/src/tls.jl:200

Thanks, this has all been very helpful.

My bad, you get it like:

julia> dev = AMDGPU.device()

And for the nice formatting of the memory you can use:

julia> Base.format_bytes(used)

That’s what I thought at first but it didn’t work:

julia> dev = AMDGPU.device()
HIPDevice(name="AMD Radeon VII", id=1, gcn_arch=gfx906:sramecc+:xnack-)

julia> pool = AMDGPU.HIP.memory_pool(dev)
AMDGPU.HIP.HIPMemoryPool(Ptr{Nothing} @0x0000000001e7fe50)

julia> HIP.used_memory(pool)
ERROR: UndefVarError: `HIP` not defined
Stacktrace:
 [1] top-level scope
   @ REPL[92]:1
 [2] top-level scope
   @ ~/.julia/packages/AMDGPU/goZLq/src/tls.jl:200

This is AMDGPU v0.8.2.

Use AMDGPU.HIP.used_memory(pool), prepend AMDGPU.


Actually, that fixed the error but it is always returning zero:

julia> used = AMDGPU.HIP.used_memory(pool)
0x0000000000000000

julia> Base.format_bytes(used)
"0 bytes"

This is while nvtop says 15 GiB is in use. The other two work well for me, though.

Do you have other Julia processes running at the same time?
HIP is able to track memory usage only from the current process; it does not know about memory usage from other processes (unless you use the same pool across them, maybe).

Memory usage works for me:

julia> AMDGPU.HIP.used_memory(AMDGPU.HIP.memory_pool(AMDGPU.device()))
0x0000000000000000

julia> x = ROCArray{Float32}(undef, 32 * 1024^2);

julia> AMDGPU.HIP.used_memory(AMDGPU.HIP.memory_pool(AMDGPU.device()))
0x0000000008000000

julia> Base.format_bytes(AMDGPU.HIP.used_memory(AMDGPU.HIP.memory_pool(AMDGPU.device())))
"128.000 MiB"

julia> finalize(x)

julia> Base.format_bytes(AMDGPU.HIP.used_memory(AMDGPU.HIP.memory_pool(AMDGPU.device())))
"0 bytes"

julia> AMDGPU.HIP.used_memory(AMDGPU.HIP.memory_pool(AMDGPU.device()))
0x0000000000000000

julia> x = ROCArray{Float32}(undef, 32 * 1024^2);

julia> AMDGPU.HIP.used_memory(AMDGPU.HIP.memory_pool(AMDGPU.device()))
0x0000000000000000

julia> Base.format_bytes(AMDGPU.HIP.used_memory(AMDGPU.HIP.memory_pool(AMDGPU.device())))
"0 bytes"

I do see two processes, but wouldn’t think it should matter. The tree looks like this:

- [...]
   - bash
      - julia --threads=32 --project=.
         - ~/.julia/juliaup/julia-1.10.0-rc2+0.x64.linux.gnu/bin/julia --threads=32 --project=.

I see nvtop shows the lower-level process as associated with the GPU. Maybe it is due to using juliaup?

Unlikely to be due to juliaup, because you are allocating within the same process.
Can you try doing device synchronization after the allocation and then checking the memory?

julia> AMDGPU.device_synchronize()

Nope, same result.