I have been trying to release a Flux backend for my AlphaZero.jl library, but I am encountering a mysterious CUDNN error (CUDNN_STATUS_INTERNAL_ERROR).
Here is some info:
- The error happens consistently on every run, after about a minute. It never happens at exactly the same time, and it always happens during an inference query of the same kind that had already run successfully thousands of times.
- The error only happens when network inference is run in an asynchronous Julia Task, which is the case when using multiple MCTS workers. (Tasks do not run in parallel, as there is a global lock. Also, only one task performs inference and accesses the GPU.)
- The error only happens when convolution layers are used (at least, it does not happen when replacing the network with an MLP).
- The error happens with both the splitting pool and the binned pool of CuArrays (although it happens slightly sooner with the binned pool on average).
- The error happens with Flux but not with Knet.
- When the error happens, the GPU memory is not necessarily full (there was at least 200MB free in all my experiments).
Does anyone have a hypothesis about what’s happening here? Any indication I can use to help build a minimal working example would be welcome!
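For reference, the task structure described above (several MCTS workers sending queries to a single inference task) looks roughly like the sketch below. It uses only Base Julia and runs on the CPU. The names `fake_network`, `Request`, `worker`, and `run_demo` are hypothetical stand-ins written for this illustration, not the actual AlphaZero.jl code, and the `inference_server` here is a simplified version that does not batch requests.

```julia
# Hypothetical stand-in for the network: any function mapping a batch to outputs.
fake_network(batch::Vector{Float32}) = batch .* 2f0

# Each request carries an input and a channel on which to send the answer back.
struct Request
    input::Float32
    reply::Channel{Float32}
end

# The single task that is allowed to touch the (simulated) GPU.
function inference_server(requests::Channel{Request})
    for req in requests                  # the loop ends once `requests` is closed
        out = fake_network([req.input])  # in the real setup: a Flux forward pass
        put!(req.reply, out[1])
    end
end

# A worker sends queries one at a time and blocks until the server answers.
function worker(requests::Channel{Request}, inputs)
    reply = Channel{Float32}(1)
    return [begin
        put!(requests, Request(x, reply))
        take!(reply)
    end for x in inputs]
end

function run_demo(; nworkers=4, nqueries=100)
    requests = Channel{Request}(nworkers)
    server = @async inference_server(requests)
    workers = [@async worker(requests, Float32.(1:nqueries)) for _ in 1:nworkers]
    results = fetch.(workers)   # wait for every worker to finish
    close(requests)             # lets the server loop terminate
    wait(server)
    return results
end

results = run_demo()
```

Swapping `fake_network` for a small convolutional Flux model moved to the GPU would be the natural next step toward a minimal working example, though whether that alone triggers the error is untested.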
My config
- Julia v1.4.1, CUDAapi v4.0.0, CuArrays v2.2.0, Flux v0.10.4
- Nvidia RTX 2070 (8GB)
To replicate
The bug can be replicated as follows.
git clone -b flux-bug git@github.com:jonathan-laurent/AlphaZero.jl.git
cd AlphaZero.jl
julia --color=yes --project scripts/alphazero.jl --game connect-four train
After about a minute, I get the following:
Initializing a new AlphaZero environment
Initial report
Number of network parameters: 620,552
Number of regularized network parameters: 617,408
Memory footprint per MCTS node: 380 bytes
Running benchmark: AlphaZero against MCTS (1000 rollouts)
Progress: 22%|███████████▏ | ETA: 0:04:02
CUDNNError: CUDNN_STATUS_INTERNAL_ERROR (code 4)
Stacktrace:
[1] throw_api_error(::CuArrays.CUDNN.cudnnStatus_t) at /home/jonathan/.julia/packages/CuArrays/l0gXB/src/dnn/error.jl:19
[2] macro expansion at /home/jonathan/.julia/packages/CuArrays/l0gXB/src/dnn/error.jl:30 [inlined]
[3] cudnnCreate(::Base.RefValue{Ptr{Nothing}}) at /home/jonathan/.julia/packages/CUDAapi/XuSHC/src/call.jl:93
[4] cudnnCreate at /home/jonathan/.julia/packages/CuArrays/l0gXB/src/dnn/base.jl:3 [inlined]
[5] #515 at /home/jonathan/.julia/packages/CuArrays/l0gXB/src/dnn/CUDNN.jl:50 [inlined]
[6] get!(::CuArrays.CUDNN.var"#515#518"{CUDAdrv.CuContext}, ::IdDict{Any,Any}, ::Any) at ./abstractdict.jl:663
[7] handle() at /home/jonathan/.julia/packages/CuArrays/l0gXB/src/dnn/CUDNN.jl:49
[8] macro expansion at /home/jonathan/.julia/packages/CuArrays/l0gXB/src/utils.jl:36 [inlined]
[9] cudnnConvolutionForward(::CuArrays.CuArray{Float32,4,Nothing}, ::CuArrays.CuArray{Float32,4,Nothing}, ::CuArrays.CuArray{Float32,4,Nothing}, ::NNlib.DenseConvDims{2,(3, 3),3,64,(1, 1),(1, 1, 1, 1),(1, 1),false}; algo::Int64, alpha::Int64, beta::Int64) at /home/jonathan/.julia/packages/CuArrays/l0gXB/src/dnn/conv.jl:72
[10] conv!(::CuArrays.CuArray{Float32,4,Nothing}, ::CuArrays.CuArray{Float32,4,Nothing}, ::CuArrays.CuArray{Float32,4,Nothing}, ::NNlib.DenseConvDims{2,(3, 3),3,64,(1, 1),(1, 1, 1, 1),(1, 1),false}; alpha::Int64, algo::Int64) at /home/jonathan/.julia/packages/CuArrays/l0gXB/src/dnn/nnlib.jl:61
[11] conv! at /home/jonathan/.julia/packages/CuArrays/l0gXB/src/dnn/nnlib.jl:58 [inlined]
[12] conv(::CuArrays.CuArray{Float32,4,Nothing}, ::CuArrays.CuArray{Float32,4,Nothing}, ::NNlib.DenseConvDims{2,(3, 3),3,64,(1, 1),(1, 1, 1, 1),(1, 1),false}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/jonathan/.julia/packages/NNlib/FAI3o/src/conv.jl:116
[13] conv(::CuArrays.CuArray{Float32,4,Nothing}, ::CuArrays.CuArray{Float32,4,Nothing}, ::NNlib.DenseConvDims{2,(3, 3),3,64,(1, 1),(1, 1, 1, 1),(1, 1),false}) at /home/jonathan/.julia/packages/NNlib/FAI3o/src/conv.jl:114
[14] (::Flux.Conv{2,2,typeof(identity),CuArrays.CuArray{Float32,4,Nothing},CuArrays.CuArray{Float32,1,Nothing}})(::CuArrays.CuArray{Float32,4,Nothing}) at /home/jonathan/.julia/packages/Flux/Fj3bt/src/layers/conv.jl:61
[15] applychain(::Tuple{Flux.Conv{2,2,typeof(identity),CuArrays.CuArray{Float32,4,Nothing},CuArrays.CuArray{Float32,1,Nothing}},Flux.BatchNorm{typeof(NNlib.relu),CuArrays.CuArray{Float32,1,Nothing},CuArrays.CuArray{Float32,1,Nothing},Float32},Flux.Chain{Tuple{Flux.SkipConnection,AlphaZero.FluxNets.var"#19#20"}},Flux.Chain{Tuple{Flux.SkipConnection,AlphaZero.FluxNets.var"#19#20"}},Flux.Chain{Tuple{Flux.SkipConnection,AlphaZero.FluxNets.var"#19#20"}},Flux.Chain{Tuple{Flux.SkipConnection,AlphaZero.FluxNets.var"#19#20"}},Flux.Chain{Tuple{Flux.SkipConnection,AlphaZero.FluxNets.var"#19#20"}},Flux.Chain{Tuple{Flux.SkipConnection,AlphaZero.FluxNets.var"#19#20"}},Flux.Chain{Tuple{Flux.SkipConnection,AlphaZero.FluxNets.var"#19#20"}}}, ::CuArrays.CuArray{Float32,4,Nothing}) at /home/jonathan/.julia/packages/Flux/Fj3bt/src/layers/basic.jl:36
[16] (::Flux.Chain{Tuple{Flux.Conv{2,2,typeof(identity),CuArrays.CuArray{Float32,4,Nothing},CuArrays.CuArray{Float32,1,Nothing}},Flux.BatchNorm{typeof(NNlib.relu),CuArrays.CuArray{Float32,1,Nothing},CuArrays.CuArray{Float32,1,Nothing},Float32},Flux.Chain{Tuple{Flux.SkipConnection,AlphaZero.FluxNets.var"#19#20"}},Flux.Chain{Tuple{Flux.SkipConnection,AlphaZero.FluxNets.var"#19#20"}},Flux.Chain{Tuple{Flux.SkipConnection,AlphaZero.FluxNets.var"#19#20"}},Flux.Chain{Tuple{Flux.SkipConnection,AlphaZero.FluxNets.var"#19#20"}},Flux.Chain{Tuple{Flux.SkipConnection,AlphaZero.FluxNets.var"#19#20"}},Flux.Chain{Tuple{Flux.SkipConnection,AlphaZero.FluxNets.var"#19#20"}},Flux.Chain{Tuple{Flux.SkipConnection,AlphaZero.FluxNets.var"#19#20"}}}})(::CuArrays.CuArray{Float32,4,Nothing}) at /home/jonathan/.julia/packages/Flux/Fj3bt/src/layers/basic.jl:38
[17] forward(::ResNet{Game}, ::CuArrays.CuArray{Float32,4,Nothing}) at /home/jonathan/test/AlphaZero.jl/src/networks/flux.jl:184
[18] evaluate(::ResNet{Game}, ::CuArrays.CuArray{Float32,4,Nothing}, ::CuArrays.CuArray{Float32,2,Nothing}) at /home/jonathan/test/AlphaZero.jl/src/networks/network.jl:288
[19] evaluate_batch(::ResNet{Game}, ::Array{StaticArrays.SArray{Tuple{7,6},UInt8,2,42},1}) at /home/jonathan/test/AlphaZero.jl/src/networks/network.jl:313
[20] macro expansion at ./util.jl:308 [inlined]
[21] inference_server(::AlphaZero.MCTS.Env{Game,StaticArrays.SArray{Tuple{7,6},UInt8,2,42},ResNet{Game}}) at /home/jonathan/test/AlphaZero.jl/src/mcts.jl:409
[22] macro expansion at /home/jonathan/test/AlphaZero.jl/src/util.jl:64 [inlined]
[23] (::AlphaZero.MCTS.var"#21#23"{AlphaZero.MCTS.Env{Game,StaticArrays.SArray{Tuple{7,6},UInt8,2,42},ResNet{Game}}})() at ./task.jl:358