GC hitting hard

Hi,
I have a complicated function where, if I call it twice in a row with the exact same arguments, the second call is 50% slower. The function calls many kernels that execute on the GPU.
I noticed that one of the sub-functions allocates a CuArray. On the first call the allocation takes milliseconds, but on the second call the exact same allocation (namely CUDA.zeros(Float32, sizex, sizey)) takes around 0.5s, reported as 99.7% GC time.
Subsequent calls get slower and slower, though not by as much. I tried to pre-allocate or unsafe_free! everything, but the GC keeps hitting.
Any idea where this could come from?
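For concreteness, the pattern can be reproduced with something like this (a sketch; sizex/sizey stand in for the real dimensions, and CUDA.@time reports the GC share of the elapsed time):

```julia
using CUDA

sizex, sizey = 1024, 1024  # placeholder dimensions

# First call: fast, allocation takes milliseconds.
CUDA.@time a = CUDA.zeros(Float32, sizex, sizey)

# Second identical call: this is where the ~0.5s / 99.7% GC time shows up.
CUDA.@time b = CUDA.zeros(Float32, sizex, sizey)
```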

What CUDA version does your driver support? Can you post the output of CUDA.versioninfo()? It should be at least 11.2; if it isn’t, try upgrading your driver.

CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.84.0

Libraries:

  • CUBLAS: 11.3.0
  • CURAND: 10.2.2
  • CUFFT: 10.3.0
  • CUSOLVER: 11.0.1
  • CUSPARSE: 11.3.0
  • CUPTI: 14.0.0
  • NVML: 11.0.0+460.84
  • CUDNN: 8.0.4 (for CUDA 11.1.0)
  • CUTENSOR: 1.2.1 (for CUDA 11.1.0)

Toolchain:

  • Julia: 1.6.1
  • LLVM: 11.0.1
  • PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
  • Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
0: GeForce RTX 3080 (sm_86, 273.062 MiB / 9.780 GiB available)

And the CUDA.jl version? Should be 3.0 or higher.

It is the latest one, 3.3 I think. I updated a few hours ago (the problem existed before updating).
I also tried the older allocator and resetting the device between each execution. It didn’t change anything, and in fact it got worse.

The involved function calls Flux multiple times. If instead of calling Flux I use some fixed array, then the slowdown does not appear, or only very slowly. Is there a workaround, or should I write my own kernels for the neural network (they are simple: no convolutions, only Dense layers)?

Can you try redefining the forward pass of the Dense layer as

(a::Dense)(x) = a.σ.(a.W * x .+ a.b)

And see if the problem persists?

Could you also share some code snippets that can repro this issue? How is your model defined?

Here is the network definition:


mutable struct resnets
    c1
end

Flux.@functor resnets

resnets(n_filter, nfilter) = resnets(Dense(n_filter, nfilter, relu, bias = Flux.Zeros()))

function (m::resnets)(x, training = false)
    if training
        return relu.(x .+ m.c1(x))
    else
        x .= relu.(x .+ m.c1(x))
    end
end
mutable struct networkf
    base
    res
    policy
    value
    feature
end

Flux.@functor networkf

to_cpu(nn::networkf) = nn |> cpu

function (m::networkf)(x; training = false)
    b = m.base(x)
    if training
        for r in m.res
            b = r(b, training)
        end
    else
        for r in m.res
            r(b)
        end
    end
    if training
        return (m.policy(b), m.value(b), m.feature(b))
    else
        return (m.policy(b), m.value(b))
    end
end



function ressimplesf(in, out, n_filter, n_tower)
    return networkf(Dense(in, n_filter, relu, bias = Flux.Zeros()),
        [resnets(n_filter, n_filter) for k in 1:n_tower],
        Dense(n_filter, out),
        Dense(n_filter, 1, tanh),
        Dense(n_filter, 42, tanh)) |> gpu
end

I tried defining a new structure just for the forward pass and converting the network into it. I think the calculation follows your suggestion, which I didn’t try on its own yet.

With the new structure the effect still happens but much later and everything is faster.

Here it is

mutable struct snetwork2{A,B,C,D,E,F}
    base::A
    res::Vector{B}
    policy::C
    policy_bias::D
    value::E
    value_bias::F
end

snetwork2(n::Int, k::Int) = snetwork2(
    CuArray(randn(Float32, (n, 84)) / 512),
    [CuArray(randn(Float32, (n, n)) / 512) for j in 1:k],
    CuArray(randn(Float32, (7, n)) / 512),
    CuArray(randn(Float32, 7) / 512),
    CuArray(randn(Float32, (1, n)) / 512),
    CuArray(randn(Float32, 1) / 512))


function (m::snetwork2)(x::T; training = false) where T <: CuArray
    b = relu.(m.base * x)
    for w in m.res
        b .= relu.(b .+ relu.(w * b))
    end
    #return m.policy * b, tanh.(m.value * b)
    return m.policy * b .+ m.policy_bias, tanh.(m.value * b .+ m.value_bias)
end

function convert_back(net::networkf)
    return snetwork2(net.base.weight,[w.c1.weight for w in net.res],net.policy.weight,net.policy.bias,net.value.weight,net.value.bias)
end

To see the slowdown, just running the following code is enough:

actor=ressimplesf(84,7,512,4)
x=CuArray(rand(Float32,(84,32*1024)))
for i in 1:10000
   p,v=actor(x)
end
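A quick way to watch the GC share grow is to time individual calls inside the loop (a sketch, assuming the actor and x defined above):

```julia
for i in 1:10
    # The GC percentage in the printout climbs as iterations accumulate garbage.
    CUDA.@time actor(x)
end
```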

or just repeat the benchmark lines a few times: at the start GC time is very low, then it starts to kick in and the timings go up, which I didn’t expect.

function test(actor, x)
    p, v = actor(x)
end
@benchmark test($actor, $x)

This is problematic for MCTS with a neural network, as you have to make a lot of forward passes. It used to be very problematic, but with CUDA.jl getting better the effect became light, particularly because evaluation is slow for big networks, so you can call GC.gc(true) frequently without hurting performance too much. But in my use case doing so kills all the performance. In the first place, I don’t understand why a calculation taking place on the GPU should allocate at all (don’t flame me ;).
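The periodic-collection workaround mentioned above looks something like this (a sketch; the interval of 100 iterations is arbitrary):

```julia
for i in 1:10000
    p, v = actor(x)
    if i % 100 == 0
        GC.gc(true)     # full collection of host-side garbage
        CUDA.reclaim()  # return freed device memory to the pool
    end
end
```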

Well, I’m a bit ashamed, but I almost solved the problem.
CUDA and Flux are not guilty.
When doing the calculation on GPU for AlphaGPU you initialize huge buffers on the device. Well, it happens I only thought they were on the device… I did all initialization in the following manner, for example:

features = CuArray(zeros(Float32, (81, 32*1024)))

This had the side effect of also allocating huge and useless buffers on the CPU, which the GC then had to deal with. Since launching a CUDA kernel, using Flux on the GPU, or simply the fact that on-device buffers are themselves managed by the GC can all trigger a collection, the GC was triggered randomly in the supposedly GPU-only part of the program, facing huge amounts of Arrays to free, and thus incurring a lot of pressure.
I changed the initialization to use the dedicated CUDA.jl constructors:

features = CUDA.zeros(Float32, (81, 32*1024))
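To make the difference explicit, here is a sketch contrasting the two initialization patterns; the first materializes the full array on the host before copying it to the device, the second allocates directly on the device:

```julia
using CUDA

# Host-then-copy: Base.zeros builds a ~10 MiB Array on the CPU,
# which CuArray then uploads; the host copy is left for the GC to collect.
a = CuArray(zeros(Float32, (81, 32*1024)))

# Device-only: allocates directly on the GPU, no host-side garbage.
b = CUDA.zeros(Float32, (81, 32*1024))
```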

This was sufficient to suppress almost all GC pressure, leading to more stable and faster iterations.
On the dark side, shame; on the bright side, it can now “solve” 4-in-a-row in around 25 minutes.
Sorry for the misleading first post.
“Ce qui ne nous tue pas nous rend plus fort” (“What doesn’t kill us makes us stronger”)

PS: @dhairyagandhi96,

(a::Dense)(x) = a.σ.(a.W * x .+ a.b)

seems slightly faster and allocates less.


Great to hear you found a solution 🙂 Memory issues like this can be hard to debug; we should probably finish/reboot https://github.com/JuliaLang/julia/pull/33467 at some point.