So I was able to train my first network outside of a tutorial. It was very satisfying to get a 100-1000x improvement over the CPU.
I tried several Linux tools to monitor VRAM (which apparently all get their info from the same place): rocm-smi, amdgpu_top, and nvtop. I have settled on nvtop for now since it charts the history. My first question: what do others use for this?
However, my data is only 780 × 598712 × 32 bits ≈ 1.74 GiB. Indeed, varinfo() shows:
X 1.740 GiB 780×598712 Matrix{Float32}
y 2.284 MiB 1×598712 Matrix{Float32}
And (IIUC) my model is only 216_449 * 32 / 8 / 2^10 ≈ 845 KiB. I guess double that if we include the gradients, so maybe 2 MiB? Although the REPL reports only 1.521 KiB:
julia> model = Flux.Chain(
Dense(size(X)[1], 64*4, relu),
Dense(64*4, 16*4, relu),
Dense(16*4, 1, relu)) |> gpu
Chain(
Dense(780 => 256, relu), # 199_936 parameters
Dense(256 => 64, relu), # 16_448 parameters
Dense(64 => 1, relu), # 65 parameters
) # Total: 6 arrays, 216_449 parameters, 1.521 KiB.
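As a sanity check on the 845 KiB figure, the parameter count can be reproduced from the layer shapes in the Chain printout (a back-of-envelope sketch, nothing Flux-specific):

```julia
# Parameter memory for Dense(780=>256), Dense(256=>64), Dense(64=>1),
# stored as Float32 (4 bytes each).
layers = [(780, 256), (256, 64), (64, 1)]
nparams = sum(nin * nout + nout for (nin, nout) in layers)  # weights + biases
println(nparams)             # 216449, matching the REPL
println(nparams * 4 / 2^10)  # ≈ 845.5 KiB of parameters
```

So the "1.521 KiB" in the Chain summary presumably counts only the host-side wrappers, not the device buffers.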
Yet when I load the entire dataset into GPU memory, I see (in all the tools above) that VRAM jumps around between ~5 GiB and 16 GiB (the maximum).
Here is the output of the REPL:
julia> gpu_train_loader = Flux.DataLoader((X, y) |> gpu, batchsize = batchsize, shuffle = true)
1-element DataLoader(::Tuple{ROCArray{Float32, 2, AMDGPU.Runtime.Mem.HIPBuffer}, ROCArray{Float32, 2, AMDGPU.Runtime.Mem.HIPBuffer}}, shuffle=true, batchsize=598712)
with first element:
(780×598712 ROCArray{Float32, 2, AMDGPU.Runtime.Mem.HIPBuffer}, 1×598712 ROCArray{Float32, 2, AMDGPU.Runtime.Mem.HIPBuffer},)
julia> @time for epoch in 1:epochs
           for (x, y) in gpu_train_loader
               grads = gradient(loss, model, x, y)
               Flux.update!(opt_state, model, grads[1])
           end
       end
64.180294 seconds (19.97 M allocations: 2.522 GiB, 17.85% gc time, 26.96% compilation time)
A second run (no compilation) gives:
56.122555 seconds (372.78 k allocations: 1.360 GiB, 13.00% gc time)
My second question is whether this is expected behavior. The constant reallocation shown by nvtop et al. seems inefficient, but really I have no idea.
PS: I guess the Adam optimizer also stores a couple of Float32 buffers per parameter array, so maybe add another few MiB. But still, how do you get from ~2 GiB to ~16 GiB? The optimizer state, for reference:
Summary
julia> opt_state = Flux.setup(opt, model) |> gpu
(layers = ((weight = Leaf(Adam(0.01, (0.9, 0.8), 1.0e-8),
(Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0],
Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], (0.9, 0.8))),
bias = Leaf(Adam(0.01, (0.9, 0.8), 1.0e-8),
(Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], (0.9, 0.8))), σ = ()),
(weight = Leaf(Adam(0.01, (0.9, 0.8), 1.0e-8),
(Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0],
Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], (0.9, 0.8))),
bias = Leaf(Adam(0.01, (0.9, 0.8), 1.0e-8),
(Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], (0.9, 0.8))), σ = ()),
(weight = Leaf(Adam(0.01, (0.9, 0.8), 1.0e-8),
(Float32[0.0 0.0 … 0.0 0.0],
Float32[0.0 0.0 … 0.0 0.0], (0.9, 0.8))),
bias = Leaf(Adam(0.01, (0.9, 0.8), 1.0e-8),
(Float32[0.0], Float32[0.0], (0.9, 0.8))), σ = ())),)
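To make the gap concrete, here is my full back-of-envelope accounting (my own arithmetic, all Float32; the activation line is my guess at the per-layer intermediates for a full-batch forward pass, which may well be wrong):

```julia
# Rough VRAM accounting in GiB; Float32 = 4 bytes, n = number of samples.
n = 598_712
data   = (780n + n) * 4 / 2^30        # X and y on the GPU     ≈ 1.74
params = 216_449 * 4 / 2^30           # model weights/biases   ≈ 0.0008
grads  = params                       # one gradient per param
adam   = 2 * params                   # Adam's m and v buffers
acts   = (256n + 64n + n) * 4 / 2^30  # full-batch activations ≈ 0.72
println(data + params + grads + adam + acts)  # ≈ 2.46, nowhere near 16
```

Even with activations included I only get to ~2.5 GiB, so whatever accounts for the 5-16 GiB readings is not in this list.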