GC hitting hard

Hi,
I have a complicated function where, if I call it twice in a row with the exact same arguments, the second call is 50% slower. The function calls many kernels that execute on the GPU.
I noticed that one of the sub-functions allocates a CuArray. On the first call the allocation takes milliseconds, but on the second call the exact same allocation (namely CUDA.zeros(Float32, sizex, sizey)) takes around 0.5s, reported as 99.7% GC time.
Subsequent calls get slower and slower, though not by as much. I tried to pre-allocate or unsafe_free! everything, but the GC keeps hitting.
Any idea where this could come from?
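For concreteness, the pattern can be reproduced with something like this (a sketch; sizex/sizey stand in for the real dimensions, and CUDA.@time reports the GC share of the elapsed time):

```julia
using CUDA

sizex, sizey = 1024, 1024  # placeholder dimensions

# First call: fast, allocation takes milliseconds.
CUDA.@time a = CUDA.zeros(Float32, sizex, sizey)

# Second identical call: this is where the ~0.5s / 99.7% GC time shows up.
CUDA.@time b = CUDA.zeros(Float32, sizex, sizey)
```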

What CUDA version does your driver support? Can you post the output of CUDA.versioninfo()? It should be at least 11.2; if it isn’t, try upgrading your driver.

CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.84.0

Libraries:

  • CUBLAS: 11.3.0
  • CURAND: 10.2.2
  • CUFFT: 10.3.0
  • CUSOLVER: 11.0.1
  • CUSPARSE: 11.3.0
  • CUPTI: 14.0.0
  • NVML: 11.0.0+460.84
  • CUDNN: 8.0.4 (for CUDA 11.1.0)
  • CUTENSOR: 1.2.1 (for CUDA 11.1.0)

Toolchain:

  • Julia: 1.6.1
  • LLVM: 11.0.1
  • PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
  • Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
0: GeForce RTX 3080 (sm_86, 273.062 MiB / 9.780 GiB available)

And the CUDA.jl version? Should be 3.0 or higher.

It is the latest one, 3.3 I think. I updated a few hours ago (the problem existed before updating).
I also tried the older allocator and resetting the device between each execution. It didn’t change anything, and in fact it got worse.

The involved function calls Flux multiple times. If instead of calling Flux I use some fixed array, then the slowdown does not appear, or only very slowly. Is there a workaround, or should I write my own kernels for the neural network (they are simple: no convolutions, only Dense layers)?

Can you try redefining the forward pass of the Dense layer as

(a::Dense)(x) = a.σ.(a.W * x .+ a.b)

And see if the problem persists?

Could you also share some code snippets that can repro this issue? How is your model defined?

Here is the network definition:


mutable struct resnets
    c1
end

Flux.@functor resnets

resnets(n_filter, nfilter) = resnets(Dense(n_filter, nfilter, relu, bias = Flux.Zeros()))

function (m::resnets)(x, training = false)
    if training
        return relu.(x .+ m.c1(x))
    else
        x .= relu.(x .+ m.c1(x))
    end
end
mutable struct networkf
    base
    res
    policy
    value
    feature
end

Flux.@functor networkf

to_cpu(nn::networkf) = nn |> cpu

function (m::networkf)(x; training = false)
    b = m.base(x)
    if training
        for r in m.res
            b = r(b, training)
        end
    else
        for r in m.res
            r(b)
        end
    end
    if training
        return (m.policy(b), m.value(b), m.feature(b))
    else
        return (m.policy(b), m.value(b))
    end
end



function ressimplesf(in, out, n_filter, n_tower)
    return networkf(Dense(in, n_filter, relu, bias = Flux.Zeros()),
        [resnets(n_filter, n_filter) for k in 1:n_tower],
        Dense(n_filter, out),
        Dense(n_filter, 1, tanh),
        Dense(n_filter, 42, tanh)) |> gpu
end

I tried defining a new structure just for the forward pass and converting the network into it. I think the calculation follows your suggestion, which I didn’t try on its own yet.

With the new structure the effect still happens but much later and everything is faster.

Here it is

mutable struct snetwork2{A,B,C,D,E,F}
    base::A
    res::Vector{B}
    policy::C
    policy_bias::D
    value::E
    value_bias::F
end

snetwork2(n::Int, k::Int) = snetwork2(
    CuArray(randn(Float32, (n, 84)) / 512),
    [CuArray(randn(Float32, (n, n)) / 512) for j in 1:k],
    CuArray(randn(Float32, (7, n)) / 512),
    CuArray(randn(Float32, 7) / 512),
    CuArray(randn(Float32, (1, n)) / 512),
    CuArray(randn(Float32, 1) / 512))


function (m::snetwork2)(x::T; training = false) where T <: CuArray
    b = relu.(m.base * x)
    for w in m.res
        b .= relu.(b .+ relu.(w * b))
    end
    #return m.policy * b, tanh.(m.value * b)
    return m.policy * b .+ m.policy_bias, tanh.(m.value * b .+ m.value_bias)
end

function convert_back(net::networkf)
    return snetwork2(net.base.weight,[w.c1.weight for w in net.res],net.policy.weight,net.policy.bias,net.value.weight,net.value.bias)
end

To see the slowdown, just running the following code is enough:

actor=ressimplesf(84,7,512,4)
x=CuArray(rand(Float32,(84,32*1024)))
for i in 1:10000
   p,v=actor(x)
end
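A quick way to watch the GC share grow is to time individual calls inside the loop (a sketch, assuming the actor and x defined above):

```julia
for i in 1:10
    # The GC percentage in the printout climbs as iterations accumulate garbage.
    CUDA.@time actor(x)
end
```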

or just repeat the benchmark lines a few times: at the start GC time is very low, then it starts to kick in and the timings go up, which I didn’t expect.

function test(actor, x)
    p, v = actor(x)
end
@benchmark test($actor, $x)

This is problematic for MCTS with a neural network, as you have to make a lot of forward passes. It used to be very problematic, but with CUDA.jl getting better the effect became light, particularly because evaluation is slow for big networks, so you can call GC.gc(true) frequently without hurting performance too much. But in my use case doing so kills all the performance. In the first place, I don’t understand why a calculation taking place on the GPU should allocate at all (don’t flame me ;).
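The periodic-collection workaround mentioned above looks something like this (a sketch; the interval of 100 iterations is arbitrary):

```julia
for i in 1:10000
    p, v = actor(x)
    if i % 100 == 0
        GC.gc(true)     # full collection of host-side garbage
        CUDA.reclaim()  # return freed device memory to the pool
    end
end
```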

Well, I’m a bit ashamed, but I almost solved the problem.
CUDA and Flux are not guilty.
When doing the calculation on GPU for AlphaGPU you initialize huge buffers on the device. Well, it happens I only thought they were on the device… I did all initialization in the following manner, for example:

features = CuArray(zeros(Float32, (81, 32*1024)))

This had the side effect of also allocating huge and useless buffers on the CPU, which the GC then had to deal with. Since launching a CUDA kernel, using Flux on the GPU, or simply the fact that on-device buffers are themselves managed by the GC can all trigger a collection, the GC was triggered randomly in the supposedly GPU-only part of the program, facing huge amounts of Arrays to free, and thus incurring a lot of pressure.
I changed the initialization to use the dedicated CUDA.jl constructors:

features = CUDA.zeros(Float32, (81, 32*1024))
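To make the difference explicit, here is a sketch contrasting the two initialization patterns; the first materializes the full array on the host before copying it to the device, the second allocates directly on the device:

```julia
using CUDA

# Host-then-copy: Base.zeros builds a ~10 MiB Array on the CPU,
# which CuArray then uploads; the host copy is left for the GC to collect.
a = CuArray(zeros(Float32, (81, 32*1024)))

# Device-only: allocates directly on the GPU, no host-side garbage.
b = CUDA.zeros(Float32, (81, 32*1024))
```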

This was sufficient to suppress almost all GC pressure, leading to more stable and faster iterations.
On the dark side, shame; on the bright side, it can now “solve” 4-in-a-row in around 25 minutes.
Sorry for the misleading first post.
“Ce qui ne nous tue pas nous rend plus fort” (“What doesn’t kill us makes us stronger”)

PS: @dhairyagandhi96,

(a::Dense)(x) = a.σ.(a.W * x .+ a.b)

seems slightly faster and allocates less.


Great to hear you found a solution 🙂 Memory issues like this can be hard to debug; we should probably finish/reboot https://github.com/JuliaLang/julia/pull/33467 at some point.