OOM when using Flux and loops

The following code works, with actor being a residual network of depth 5 with 64 filters and n=1000 (using Flux).
But if the lines calling the GC are commented out, it fails with an OOM error.
Using CUDA.jl 3.9, with NVIDIA driver version 510.47.03 and CUDA version 11.6.
The problem is that calling the GC like this makes the loop very slow (a less aggressive variant is sketched after the code below).

Thanks in advance.

Any idea where this comes from?

using Flux, CUDA

function optimize(actor, n)
    timings = zeros(n)
    for k in 1:n
        batch = zeros(Float32, 7, 7, 1, k)   # batch size grows with the loop index
        batch_gpu = batch |> gpu
        t = time()
        ev = actor(batch_gpu) |> cpu
        t = time() - t
        CUDA.unsafe_free!(batch_gpu)
        if k % 10 == 0                        # commenting these lines out leads to OOM
            GC.gc()
            GC.gc()
            CUDA.reclaim()
        end
        timings[k] = t / k                    # per-sample time
    end
    argmin(timings)
end
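
A rough sketch of a less aggressive variant, assuming it is enough to only collect when free GPU memory runs low (the 10% threshold is arbitrary, and available_memory() is only a heuristic since it does not account for memory cached by the pool):

using CUDA

# Only force a collection when the driver reports little free memory,
# instead of unconditionally every 10 iterations.
function maybe_reclaim!(threshold = 0.1)
    if CUDA.available_memory() < threshold * CUDA.total_memory()
        GC.gc()
        CUDA.reclaim()
    end
end

Calling maybe_reclaim!() once per iteration instead of the fixed k % 10 == 0 block keeps most iterations free of GC pauses.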

Can you not re-use batch, instead of making a new copy every time?

This is just an MWE; in my real use case that would be problematic, but not impossible.
Still: first, it shouldn't go OOM, only get slower as it handles memory allocation; second, it's the GPU that goes OOM, so I believe reusing batch wouldn't help. Sorry if this wasn't clear.
Besides, to use a single batch you would need to take views, which allocate and are not always well behaved on the GPU side.
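
For what it's worth, roughly what that reuse would look like, assuming a known upper bound max_batch on the batch size (a sketch only; as said, the view may not behave well with every GPU kernel):

using Flux, CUDA

# Sketch: allocate one GPU buffer at the maximum size and slice it each iteration,
# so no new device memory is allocated for the batch itself.
function optimize_reuse(actor, n; max_batch = n)
    buffer_gpu = CUDA.zeros(Float32, 7, 7, 1, max_batch)
    timings = zeros(n)
    for k in 1:n
        batch_gpu = view(buffer_gpu, :, :, :, 1:k)   # contiguous view, no copy
        t = time()
        ev = actor(batch_gpu) |> cpu
        timings[k] = (time() - t) / k
    end
    argmin(timings)
end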

Even stranger, unless I'm missing something obvious: this one doesn't OOM:

function optimize(actor, n)
    timings = zeros(n)
    for k in 1:n
        batch = rand(Float32, 9, 9, 3, 1000)   # batch size fixed at 1000
        batch_gpu = batch |> gpu

        p_gpu, v_gpu = actor(batch_gpu)
        p = p_gpu |> cpu
        v = v_gpu |> cpu

        CUDA.unsafe_free!(batch_gpu)
        CUDA.unsafe_free!(p_gpu)
        CUDA.unsafe_free!(v_gpu)
    end
end

But this one crashes with OOM; the only difference is that the fourth dimension (the batch size) is k, which depends on the loop index… I'm really puzzled. If I set the fourth dimension to a constant, 1, 100 or 1000, it is fine… (for the record my card is an NVIDIA RTX 3080 with 10 GB)

function optimize(actor, n)
    timings = zeros(n)
    for k in 1:n
        batch = rand(Float32, 9, 9, 3, k)   # batch size = loop index k
        batch_gpu = batch |> gpu

        p_gpu, v_gpu = actor(batch_gpu)
        p = p_gpu |> cpu
        v = v_gpu |> cpu

        CUDA.unsafe_free!(batch_gpu)
        CUDA.unsafe_free!(p_gpu)
        CUDA.unsafe_free!(v_gpu)
    end
    argmin(timings)
end

This one also crashes. So this seems to be related to the varying size of batch.

function optimize(actor, n)
    timings = zeros(n)
    for k in 1:n
        batch = rand(Float32, 9, 9, 3, rand(1:1000))   # random batch size each iteration
        batch_gpu = batch |> gpu

        p_gpu, v_gpu = actor(batch_gpu)
        p = p_gpu |> cpu
        v = v_gpu |> cpu

        CUDA.unsafe_free!(batch_gpu)
        CUDA.unsafe_free!(p_gpu)
        CUDA.unsafe_free!(v_gpu)
    end
    argmin(timings)
end

I came across https://github.com/JuliaGPU/CUDA.jl/issues/1461 and tried CUDA.jl 3.8.5: with it, this doesn't crash anymore, but that wasn't enough to solve the same problem in my real code (not the MWE), whereas regressing to 3.8.0 did solve it. So something bad definitely happened between 3.8.0 and 3.8.5, which got worse with 3.9.0.
It seems to be related to the convolution algorithm changing with the batch size.
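
If that is what is going on, a possible workaround (just a sketch, assuming the wasted compute from padding is acceptable) would be to round the batch size up to a small set of fixed sizes, so that only a few distinct convolution shapes ever occur:

# Sketch: pad the batch up to the next power of two, so the convolution only
# ever sees a handful of distinct batch sizes.
function padded_batch(batch::Array{Float32,4})
    k = size(batch, 4)
    kp = nextpow(2, k)
    padded = zeros(Float32, size(batch, 1), size(batch, 2), size(batch, 3), kp)
    padded[:, :, :, 1:k] .= batch
    padded   # run the model on this, then keep only the first k outputs
end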
