OOM when using Flux and loops

The following code works, with actor being a residual network of depth 5 with 64 filters and n=1000 (using Flux).
But if the lines calling the GC are commented out, it fails with an OOM error.
Using CUDA.jl 3.9, with NVIDIA driver version 510.47.03 and CUDA version 11.6.
The problem is that calling the GC like this makes the loop very slow (a less aggressive variant is sketched after the code below).

Thanks in advance.

Any idea where this comes from?

using Flux, CUDA

function optimize(actor, n)
    timings = zeros(n)
    for k in 1:n
        batch = zeros(Float32, 7, 7, 1, k)   # batch size grows with the loop index
        batch_gpu = batch |> gpu
        t = time()
        ev = actor(batch_gpu) |> cpu
        t = time() - t
        CUDA.unsafe_free!(batch_gpu)
        if k % 10 == 0                        # commenting these lines out leads to OOM
            GC.gc()
            GC.gc()
            CUDA.reclaim()
        end
        timings[k] = t / k                    # per-sample time
    end
    argmin(timings)
end
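
A rough sketch of a less aggressive variant, assuming it is enough to only collect when free GPU memory runs low (the 10% threshold is arbitrary, and available_memory() is only a heuristic since it does not account for memory cached by the pool):

using CUDA

# Only force a collection when the driver reports little free memory,
# instead of unconditionally every 10 iterations.
function maybe_reclaim!(threshold = 0.1)
    if CUDA.available_memory() < threshold * CUDA.total_memory()
        GC.gc()
        CUDA.reclaim()
    end
end

Calling maybe_reclaim!() once per iteration instead of the fixed k % 10 == 0 block keeps most iterations free of GC pauses.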

Can you not re-use batch, instead of making a new copy every time?

This is just an MWE; in my real use case that would be problematic, but not impossible.
Still: first, it shouldn't go OOM, only get slower as it handles memory allocation; second, it's the GPU that goes OOM, so I believe reusing batch wouldn't help. Sorry if this wasn't clear.
Besides, to use a single batch you would need to take views, which allocate and are not always well behaved on the GPU side.
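
For what it's worth, roughly what that reuse would look like, assuming a known upper bound max_batch on the batch size (a sketch only; as said, the view may not behave well with every GPU kernel):

using Flux, CUDA

# Sketch: allocate one GPU buffer at the maximum size and slice it each iteration,
# so no new device memory is allocated for the batch itself.
function optimize_reuse(actor, n; max_batch = n)
    buffer_gpu = CUDA.zeros(Float32, 7, 7, 1, max_batch)
    timings = zeros(n)
    for k in 1:n
        batch_gpu = view(buffer_gpu, :, :, :, 1:k)   # contiguous view, no copy
        t = time()
        ev = actor(batch_gpu) |> cpu
        timings[k] = (time() - t) / k
    end
    argmin(timings)
end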

Even stranger, unless I'm missing something obvious: this one doesn't OOM:

function optimize(actor, n)
    timings = zeros(n)
    for k in 1:n
        batch = rand(Float32, 9, 9, 3, 1000)   # batch size fixed at 1000
        batch_gpu = batch |> gpu

        p_gpu, v_gpu = actor(batch_gpu)
        p = p_gpu |> cpu
        v = v_gpu |> cpu

        CUDA.unsafe_free!(batch_gpu)
        CUDA.unsafe_free!(p_gpu)
        CUDA.unsafe_free!(v_gpu)
    end
end

But this one crashes with OOM; the only difference is that the fourth dimension (the batch size) is k, which depends on the loop index… I'm really puzzled. If I set the fourth dimension to a constant, 1, 100 or 1000, it is fine… (for the record my card is an NVIDIA RTX 3080 with 10 GB)

function optimize(actor, n)
    timings = zeros(n)
    for k in 1:n
        batch = rand(Float32, 9, 9, 3, k)   # batch size = loop index k
        batch_gpu = batch |> gpu

        p_gpu, v_gpu = actor(batch_gpu)
        p = p_gpu |> cpu
        v = v_gpu |> cpu

        CUDA.unsafe_free!(batch_gpu)
        CUDA.unsafe_free!(p_gpu)
        CUDA.unsafe_free!(v_gpu)
    end
    argmin(timings)
end

This one also crashes. So this seems to be related to the varying size of batch.

function optimize(actor, n)
    timings = zeros(n)
    for k in 1:n
        batch = rand(Float32, 9, 9, 3, rand(1:1000))   # random batch size each iteration
        batch_gpu = batch |> gpu

        p_gpu, v_gpu = actor(batch_gpu)
        p = p_gpu |> cpu
        v = v_gpu |> cpu

        CUDA.unsafe_free!(batch_gpu)
        CUDA.unsafe_free!(p_gpu)
        CUDA.unsafe_free!(v_gpu)
    end
    argmin(timings)
end

I came across https://github.com/JuliaGPU/CUDA.jl/issues/1461 and tried CUDA.jl 3.8.5: with it, this doesn't crash anymore, but that wasn't enough to solve the same problem in my real code (not the MWE), whereas regressing to 3.8.0 did solve it. So something bad definitely happened between 3.8.0 and 3.8.5, which got worse with 3.9.0.
It seems to be related to the convolution algorithm changing with the batch size.
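
If that is what is going on, a possible workaround (just a sketch, assuming the wasted compute from padding is acceptable) would be to round the batch size up to a small set of fixed sizes, so that only a few distinct convolution shapes ever occur:

# Sketch: pad the batch up to the next power of two, so the convolution only
# ever sees a handful of distinct batch sizes.
function padded_batch(batch::Array{Float32,4})
    k = size(batch, 4)
    kp = nextpow(2, k)
    padded = zeros(Float32, size(batch, 1), size(batch, 2), size(batch, 3), kp)
    padded[:, :, :, 1:k] .= batch
    padded   # run the model on this, then keep only the first k outputs
end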
