Unreasonable memory usage with M4 GPU

Andrea_Pagnani · December 16, 2024, 2:42pm

Dear all,

My fight to make reasonable use of my M4 GPU continues.

Metal.versioninfo()

macOS 15.1.1, Darwin 24.1.0

Toolchain:

Julia: 1.11.2
LLVM: 16.0.6

Julia packages:

Metal.jl: 1.4.2
GPUArrays: 10.3.1
GPUCompiler: 0.27.8
KernelAbstractions: 0.9.31
ObjectiveC: 3.1.0
LLVM: 9.1.3
LLVMDowngrader_jll: 0.3.0+2

1 device:

Apple M4 Pro (48.953 MiB allocated)

I wrote a simple optimization problem (more an MWE than what I actually need to do), and I observe an explosion in memory usage. Before giving you the not-so-minimal example, let me explain what I see.

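function trainmodel!(model::Model; nepochs=100, verbose=true)
    # Optimiser state for Adam with learning rate 0.1
    opt = Flux.setup(Flux.Optimisers.Adam(0.1), model)
    for it in 1:nepochs
        # Gradient of the loss with respect to the model fields
        grads = Flux.gradient(model) do m
            TestMetal.losslinearalgebra(m)
        end
        verbose && println("it = $it |grad| =  $(norm(grads[1].msa))")
        Flux.Optimise.update!(opt, model, grads[1])
        GC.gc()  # without this call, memory explodes when the model holds MtlArrays
    end
end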

The crux of the problem is that without the GC.gc() call after update!, memory usage explodes when the model holds MtlArrays (by "explodes" I mean that the computer becomes unresponsive because of the memory consumption). With ordinary Arrays there is no problem.
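A less drastic variant of this workaround would be to collect only every few iterations rather than on every one. The sketch below shows the pattern in plain Julia, using only Base; `run_with_periodic_gc`, `gc_every` and the dummy training step are hypothetical names introduced for illustration, not part of Flux or Metal.jl. Whether an incremental collection (`full=false`) is enough to keep MtlArray memory bounded has not been checked here.

```julia
# Hypothetical helper (not part of Flux or Metal.jl): run `step!` for `nsteps`
# iterations and force a garbage collection every `gc_every` iterations.
function run_with_periodic_gc(step!, nsteps::Integer; gc_every::Integer=10, full::Bool=false)
    ncollections = 0
    for it in 1:nsteps
        step!(it)
        if it % gc_every == 0
            GC.gc(full)   # full=false requests an incremental (young-generation) collection
            ncollections += 1
        end
    end
    return ncollections
end

# Stand-in for one training step: it allocates a temporary array, as a
# gradient computation would.
acc = Ref(0.0)
ncollections = run_with_periodic_gc(100; gc_every=10) do it
    tmp = rand(Float32, 1_000)
    acc[] += sum(tmp)
end
```

In trainmodel! the body of the `do` block would be the gradient and update! calls, and `gc_every` trades peak memory against time spent in the collector.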

Running the full thing is a bit complicated, but doable. I created the gist below:

gist.github.com

https://gist.github.com/pagnani/b9168ba36c0a2a5ac897faeb6b965056

testmetal.jl

module TestMetal
using Metal, KernelAbstractions,Flux, Zygote, Tullio, ExtractMacro, LinearAlgebra
export Model
struct Model{T1,T2,T3}
    msa::T3
    normW::T1
    cij::T2
    fi::T1
end

This file has been truncated; the full source is in the gist linked above.

To use it, run:

julia> include("testmetal.jl"); using .TestMetal
julia> q,L,M = 21,53,10_000; modelgpu=TestMetal.Model(q,L,M,gpu=true); modelcpu=TestMetal.Model(q,L,M,gpu=false);
julia> TestMetal.trainmodel!(modelgpu,nepochs=100,verbose=true) # beware that this is where my computer becomes unresponsive

Worth reporting upstream to Metal?
Thanks
A

pxl-th · December 20, 2024, 5:30pm

You may want to keep an eye on the caching allocator PR, then:

github.com/JuliaGPU/GPUArrays.jl

Add caching allocator interface

JuliaGPU:master ← JuliaGPU:pxl-th/cache-alloc

opened 02:22PM - 15 Dec 24 UTC

pxl-th

+373 -142

Since Julia's GC is not aware of GPU memory, in scenarios with lots of allocations we end up either in OOM situations or with excessively high memory usage, even though the program may require only a fraction of it.

To help with GPU memory utilization in a program with repeating blocks of code, we can wrap those regions in a scope that will use the caching allocator every time the program enters this scope during execution.

For example, this is especially useful when training models, where you compute the loss, the gradients w.r.t. the loss, and perform an in-place parameter update of the model.

```julia
model = ...
for i in 1:1000
    GPUArrays.@cache_scope kab :loop begin
        loss, grads = ...
        update!(optimizer, model, grads)
    end
end
```

The caching allocator is defined by its name and is per-device (it will use the current TLS device).

### Example

In the following example we apply the caching allocator at every iteration of the for-loop. Every iteration requires 2 GiB of GPU memory; without the caching allocator the GC wouldn't be able to free arrays in time, resulting in higher memory usage. With the caching allocator, memory usage stays at exactly 2 GiB.

After the loop, we free all cached memory if there is any (e.g. CUDA.jl will bulk-free immediately after execution of the expression inside `@cache_scope`, because it has a performant allocator).

```julia
kab = CUDABackend()
n = 1024^3
CUDA.@sync for i in 1:1000
    GPUArrays.@cache_scope kab :loop begin
        sin.(CUDA.rand(Float32, n))
    end
end
GPUArrays.invalidate_cache_allocator!(kab, :loop)
```

### Backend differences

- Because CUDA has a more performant allocator, CUDA.jl will bulk-free arrays at the end of `expr` execution instead of caching them (`free_immediately=true`).
- AMDGPU.jl instead caches them (`free_immediately=false`) until the user invalidates the cache.

### Performance impact

Executing the [GaussianSplatting.jl](https://github.com/JuliaNeuralGraphics/GaussianSplatting.jl/pull/26) benchmark (1k training iterations) on an RX 7900XTX:

| | Without caching allocator | With caching allocator |
|-|-|-|
| GPU memory utilization | ![image](https://github.com/user-attachments/assets/92e0e802-9784-4e5a-85d0-7dfa7b4e8dbf) | ![image](https://github.com/user-attachments/assets/ab4c68eb-2caa-4a56-9c0a-9176285ef66d) |
| Time | `59.656476` seconds | `46.365646` seconds |

### TODO

- [x] Support for 1.10.
- [x] Support bulk-freeing instead of caching.
- [x] Add PR description.
- [x] Documentation.
- [x] Tests.

### PRs for other GPU backends

- AMDGPU: https://github.com/JuliaGPU/AMDGPU.jl/pull/710
- CUDA: https://github.com/JuliaGPU/CUDA.jl/pull/2593
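
To make the idea of a caching scope concrete, here is a toy CPU-only sketch in plain Julia. It is not the GPUArrays.jl implementation: `ToyCache`, `cached_alloc!`, `release_all!` and `with_cache` are hypothetical names, and ordinary `Array`s stand in for GPU buffers. It only illustrates why repeated entries into the same scope can reuse buffers instead of allocating new ones.

```julia
# Toy buffer cache; all names are hypothetical and CPU Arrays stand in for GPU buffers.
mutable struct ToyCache
    free::Dict{Any,Vector{Any}}   # (eltype, dims) => idle buffers ready for reuse
    busy::Vector{Any}             # buffers handed out inside the current scope
    nallocs::Int                  # number of fresh allocations performed so far
end
ToyCache() = ToyCache(Dict{Any,Vector{Any}}(), Any[], 0)

# Hand out a buffer of the requested type and shape, reusing an idle one if possible.
function cached_alloc!(cache::ToyCache, ::Type{T}, dims::Dims) where {T}
    pool = get!(() -> Any[], cache.free, (T, dims))
    buf = if isempty(pool)
        cache.nallocs += 1
        Array{T}(undef, dims)
    else
        pop!(pool)
    end
    push!(cache.busy, buf)
    return buf
end

# Mark every buffer handed out in the scope as idle again.
function release_all!(cache::ToyCache)
    for buf in cache.busy
        push!(get!(() -> Any[], cache.free, (eltype(buf), size(buf))), buf)
    end
    empty!(cache.busy)
    return cache
end

# Run `f` and return its buffers to the cache afterwards, even if `f` throws.
function with_cache(f, cache::ToyCache)
    try
        return f()
    finally
        release_all!(cache)
    end
end

cache = ToyCache()
for i in 1:100
    with_cache(cache) do
        x = cached_alloc!(cache, Float32, (1024,))
        y = cached_alloc!(cache, Float32, (1024,))
        x .= 1f0
        y .= sin.(x)
    end
end
```

After the loop, `cache.nallocs` is 2: the two buffers created in the first iteration are reused by the remaining 99, so memory use does not depend on how quickly the GC runs.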

Andrea_Pagnani · December 21, 2024, 8:48am

Thanks! I'll keep an eye on your PR.
