Hi,
I have a complicated function which, if I call it twice in a row with the exact same arguments, is 50% slower on the second call. The function launches many kernels that execute on the GPU.
I noticed that one of the sub-functions allocates a CuArray. On the first call the allocation takes milliseconds, but on the second call the exact same allocation (namely CUDA.zeros(Float32,sizex,sizey)) takes around 0.5 s, reported as 99.7% GC time.
Subsequent calls get slower and slower, though not by as much. I tried to pre-allocate or unsafe_free! everything, but the GC keeps hitting.
Any idea where this could come from?
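For reference, this is roughly the pre-allocate / unsafe_free! pattern I tried (a sketch; the sizes here are just placeholders for my real ones):

using CUDA

sizex, sizey = 84, 32 * 1024                # placeholder problem sizes

buf = CUDA.zeros(Float32, sizex, sizey)     # pre-allocate once and reuse across calls
# ... kernels write into buf instead of allocating a fresh CuArray each call ...
CUDA.unsafe_free!(buf)                      # free eagerly rather than waiting for the GC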
What’s the CUDA capability of your driver? Can you post the output of CUDA.versioninfo()? It should be at least 11.2; if it isn’t, try upgrading your driver.
CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.84.0
Libraries:
- CUBLAS: 11.3.0
- CURAND: 10.2.2
- CUFFT: 10.3.0
- CUSOLVER: 11.0.1
- CUSPARSE: 11.3.0
- CUPTI: 14.0.0
- NVML: 11.0.0+460.84
- CUDNN: 8.0.4 (for CUDA 11.1.0)
- CUTENSOR: 1.2.1 (for CUDA 11.1.0)
Toolchain:
- Julia: 1.6.1
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
1 device:
0: GeForce RTX 3080 (sm_86, 273.062 MiB / 9.780 GiB available)
And the CUDA.jl version? Should be 3.0 or higher.
It is the latest one, 3.3 I think. I updated a few hours ago (the problem existed before updating).
I also tried the older allocator and resetting the device between each execution. It didn’t change anything, and in fact it got worse.
The function involved calls Flux multiple times. If instead of calling Flux I use some fixed array, then the slowdown seems not to appear, or only very slowly. Is there a workaround, or should I write my own kernels for the neural network (they are simple: no convolutions, only Dense layers)?
Can you try redefining the forward pass of the dense layer as
(a::Dense)(x) = a.σ.(a.W * x .+ a.b)
and see if the problem persists?
Could you also share some code snippets that can repro this issue? How is your model defined?
Here is the network definition:
using CUDA, Flux

mutable struct resnets
    c1
end

Flux.@functor resnets

resnets(n_filter, nfilter) = resnets(Dense(n_filter, nfilter, relu, bias = Flux.Zeros()))

function (m::resnets)(x, training = false)
    if training
        return relu.(x .+ m.c1(x))    # out of place during training
    else
        x .= relu.(x .+ m.c1(x))      # in place during inference
    end
end
mutable struct networkf
    base
    res
    policy
    value
    feature
end

Flux.@functor networkf

to_cpu(nn::networkf) = nn |> cpu

function (m::networkf)(x; training = false)
    b = m.base(x)
    if training
        for r in m.res
            b = r(b, training)
        end
    else
        for r in m.res
            r(b)                      # updates b in place (see resnets above)
        end
    end
    if training
        return (m.policy(b), m.value(b), m.feature(b))
    else
        return (m.policy(b), m.value(b))
    end
end

function ressimplesf(in, out, n_filter, n_tower)
    return networkf(Dense(in, n_filter, relu, bias = Flux.Zeros()),
                    [resnets(n_filter, n_filter) for k in 1:n_tower],
                    Dense(n_filter, out),
                    Dense(n_filter, 1, tanh),
                    Dense(n_filter, 42, tanh)) |> gpu
end
I tried defining a new structure just for the forward pass and converting the network to it. I think the calculation follows your suggestion, which I haven’t tried directly yet.
With the new structure the effect still happens, but much later, and everything is faster.
Here it is
mutable struct snetwork2{A,B,C,D,E,F}
    base::A
    res::Vector{B}
    policy::C
    policy_bias::D
    value::E
    value_bias::F
end

snetwork2(n::Int, k::Int) = snetwork2(CuArray(randn(Float32, (n, 84)) / 512),
                                      [CuArray(randn(Float32, (n, n)) / 512) for j in 1:k],
                                      CuArray(randn(Float32, (7, n)) / 512),
                                      CuArray(randn(Float32, 7) / 512),
                                      CuArray(randn(Float32, (1, n)) / 512),
                                      CuArray(randn(Float32, 1) / 512))

function (m::snetwork2)(x::T; training = false) where T <: CuArray
    b = relu.(m.base * x)
    for w in m.res
        b .= relu.(b .+ relu.(w * b))
    end
    #return m.policy*b, tanh.(m.value*b)
    return m.policy * b .+ m.policy_bias, tanh.(m.value * b .+ m.value_bias)
end

function convert_back(net::networkf)
    return snetwork2(net.base.weight, [w.c1.weight for w in net.res],
                     net.policy.weight, net.policy.bias,
                     net.value.weight, net.value.bias)
end
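For completeness, this is roughly how I build and use it, reusing ressimplesf and convert_back from above (a sketch, assuming Flux and CUDA are loaded; the sizes match my setup):

net  = ressimplesf(84, 7, 512, 4)     # original Flux network on the GPU
snet = convert_back(net)              # plain-CuArray version for the forward pass
x    = CUDA.rand(Float32, 84, 32 * 1024)
p, v = snet(x)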
To see the slowdown, just running the following code is enough:
actor = ressimplesf(84, 7, 512, 4)
x = CuArray(rand(Float32, (84, 32*1024)))

for i in 1:10000
    p, v = actor(x)
end
Or just repeat the benchmark lines below a few times: at the start GC time is very low, then it starts to kick in and the timings go up, which I didn’t expect.
using BenchmarkTools

function test(actor, x)
    p, v = actor(x)
end

@benchmark test($actor, $x)
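To watch the GC share grow, plain @time over batches of calls is enough (a sketch, reusing actor, x and test from above):

for batch in 1:5
    @time CUDA.@sync for j in 1:100
        test(actor, x)
    end
    # in my runs, the "% gc time" printed by @time creeps up from batch to batch
end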
This is problematic for MCTS with a neural network, as you have to make a lot of forward passes. It used to be very problematic, but with CUDA.jl getting better the effect became light, particularly because evaluation is slow for a big network, so you can call GC.gc(true) frequently without hurting performance too much. But in my use case doing so kills all the performance. In the first place, I don’t understand why a calculation taking place on the GPU should allocate at all (don’t flame me ;)).
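By calling GC.gc(true) frequently I mean something like this (a sketch of the pattern, with the interval picked arbitrarily):

for i in 1:10000
    p, v = actor(x)
    if i % 500 == 0       # arbitrary interval
        GC.gc(true)       # full collection of host-side objects
        CUDA.reclaim()    # return cached device memory to CUDA
    end
end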
Well I’m a bit ashamed, but I almost solved the problem.
CUDA and Flux are not guilty.
When doing the calculation on the GPU for AlphaGPU, you initialize huge buffers on the device. Well, it turns out I only thought they were on the device… I did all the initialization in the following manner, for example:
features = CuArray(zeros(Float32, (81, 32*1024)))
This had the side effect of also allocating huge, useless buffers on the CPU, which the GC then had to deal with. Since launching CUDA kernels, using Flux on the GPU, or simply the fact that on-device buffers are themselves managed by the GC can all trigger collections, the GC kicked in randomly in the supposedly GPU-only part of the program, facing huge amounts of Arrays to free and thus a lot of pressure.
I changed the initialization to use the dedicated CUDA functions:
features = CUDA.zeros(Float32, (81, 32*1024))
This was sufficient to suppress almost all GC pressure, leading to more stable and faster iterations.
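To make the difference concrete: the first form builds the full array on the host and then copies it over, while the second allocates directly on the device. A quick way to see it (a rough check; the names are just for illustration and the numbers will vary):

using CUDA

# builds an 81×32768 Float32 Array on the host first (~10 MiB reported by @time), then copies it to the GPU:
@time features_bad = CuArray(zeros(Float32, (81, 32*1024)))

# allocates and zeroes directly on the device; no big host Array for the GC to track:
@time features_good = CUDA.zeros(Float32, (81, 32*1024))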
On the dark side, shame; on the bright side, it can now “solve” four-in-a-row in around 25 minutes.
Sorry for the misleading first post.
“What doesn’t kill us makes us stronger.”
PS: @dhairyagandhi96,
(a::Dense)(x) = a.σ.(a.W * x .+ a.b)
seems slightly faster and allocates less.
Great to hear you found a solution! Memory issues like this can be hard to debug; we should probably finish or reboot https://github.com/JuliaLang/julia/pull/33467 at some point.