CUDNNError: CUDNN_STATUS_NOT_INITIALIZED (code 1) during Pkg.test("CUDA")

Using Julia v1.6.1, I get several errors when testing the CUDA package (v3.1.0). Most of them trace back to CUDNNError: CUDNN_STATUS_NOT_INITIALIZED (code 1); more details below.

The corresponding test runs through without a glitch using Julia v1.5.4 and CUDA v2.4.3.

Am I missing something?

julia> using CUDA
julia> import Pkg

julia> [CUDA.capability(dev) for dev in CUDA.devices()]
1-element Vector{VersionNumber}:

julia> Pkg.test("CUDA")
     Testing CUDA
     Testing Running tests...
┌ Info: System information:
│ CUDA toolkit 10.2.89, artifact installation
│ CUDA driver 10.2.0
│ NVIDIA driver 440.44.0
│ Libraries: 
│ - CUBLAS: 10.2.2
│ - CURAND: 10.1.2
│ - CUFFT: 10.1.2
│ - CUSOLVER: 10.3.0
│ - CUSPARSE: 10.3.1
│ - CUPTI: 12.0.0
│ - NVML: 10.0.0+440.44
│ - CUDNN: 8.10.0 (for CUDA 10.2.0)
│ - CUTENSOR: 1.2.1 (for CUDA 10.2.0)
│ Toolchain:
│ - Julia: 1.6.1
│ - LLVM: 11.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5
│ - Device support: sm_30, sm_32, sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
│ 1 device:
└   0: Tesla V100-PCIE-16GB (sm_70, 14.612 GiB / 15.782 GiB available)
[ Info: Testing using 1 device(s): 1. Tesla V100-PCIE-16GB (UUID df8dbe6c-02af-2641-1396-1f916a568a3b)
                                         |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                            (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
initialization                       (2) |     5.37 |   0.00 |  0.0 |       0.00 |   305.00 |   0.12 |  2.2 |     405.22 |   910.82 |
apiutils                             (2) |     0.17 |   0.00 |  0.0 |       0.00 |   305.00 |   0.00 |  0.0 |       1.81 |   910.82 |
array                                (2) |   108.26 |   0.02 |  0.0 |       5.29 |   425.00 |   6.10 |  5.6 |   14619.60 |  1089.48 |
broadcast                            (2) |    28.90 |   0.00 |  0.0 |       0.00 |   347.00 |   1.38 |  4.8 |    3225.32 |  1169.78 |
codegen                              (2) |     9.62 |   0.00 |  0.0 |       0.00 |   501.00 |   0.46 |  4.7 |    1263.18 |  1316.51 |
cublas                               (2) |   106.60 |   0.02 |  0.0 |      14.50 |   501.00 |   5.42 |  5.1 |   14330.21 |  1846.28 |
cufft                                (2) |    26.37 |   0.01 |  0.0 |     110.15 |   581.00 |   1.04 |  4.0 |    2830.28 |  2170.42 |
curand                               (2) |     0.15 |   0.00 |  0.0 |       0.00 |   415.00 |   0.04 | 25.6 |       6.94 |  2170.42 |
cusparse                             (2) |    32.67 |   0.01 |  0.0 |       8.45 |   585.00 |   1.07 |  3.3 |    2871.67 |  2388.04 |
examples                             (2) |   125.62 |   0.00 |  0.0 |       0.00 |   409.00 |   0.03 |  0.0 |      30.52 |  2388.04 |
exceptions                           (2) |    76.20 |   0.00 |  0.0 |       0.00 |   409.00 |   0.00 |  0.0 |      38.38 |  2388.04 |
      From worker 2:	WARNING: Method definition #5674#kernel(Any) in module Main at /home/arndt/.julia/packages/CUDA/k52QH/test/execution.jl:316 overwritten at /home/arndt/.julia/packages/CUDA/k52QH/test/execution.jl:324.
execution                            (2) |    88.04 |   0.00 |  0.0 |       0.95 |   535.00 |   3.82 |  4.3 |   10200.83 |  2453.33 |
iterator                             (2) |     2.29 |   0.00 |  0.0 |       1.16 |   411.00 |   0.12 |  5.4 |     367.05 |  2453.33 |
nvml                                 (2) |     0.36 |   0.00 |  0.0 |       0.00 |   409.00 |   0.04 | 11.1 |      26.51 |  2453.33 |
nvtx                                 (2) |     0.22 |   0.00 |  0.0 |       0.00 |   409.00 |   0.00 |  0.0 |      26.91 |  2453.33 |
pointer                              (2) |     0.23 |   0.00 |  0.2 |       0.00 |   411.00 |   0.00 |  0.0 |      14.08 |  2453.33 |
pool                                 (2) |     4.90 |   0.00 |  0.0 |       0.00 |   409.00 |   0.99 | 20.2 |     787.15 |  2453.33 |
random                               (2) |     9.84 |   0.13 |  1.3 |       0.02 |   415.00 |   0.42 |  4.3 |    1097.25 |  2453.33 |
sorting                              (2) |   155.96 |   0.01 |  0.0 |     401.81 |  4201.00 |   5.27 |  3.4 |   20167.87 |  6535.18 |
statistics                           (2) |    46.98 |   0.00 |  0.0 |       1.21 |   479.00 |   1.96 |  4.2 |    5024.35 |  6535.18 |
texture                              (2) |    60.13 |   0.00 |  0.0 |       0.09 |   419.00 |   2.46 |  4.1 |    7300.62 |  6535.18 |
threading                            (2) |     4.96 |   0.00 |  0.1 |      10.94 |   483.00 |   0.19 |  3.8 |     280.41 |  6535.18 |
utils                                (2) |     0.81 |   0.00 |  0.0 |       0.00 |   479.00 |   0.05 |  5.7 |      60.88 |  6535.18 |
cudadrv/context                      (2) |     1.11 |   0.00 |  0.0 |       0.00 |   409.00 |   0.00 |  0.0 |       5.80 |  6535.18 |
cudadrv/devices                      (2) |     0.25 |   0.00 |  0.0 |       0.00 |   409.00 |   0.00 |  0.0 |      27.17 |  6535.18 |
cudadrv/errors                       (2) |     0.11 |   0.00 |  0.0 |       0.00 |   409.00 |   0.00 |  0.0 |      10.21 |  6535.18 |
cudadrv/events                       (2) |     0.08 |   0.00 |  0.0 |       0.00 |   409.00 |   0.00 |  0.0 |       4.13 |  6535.18 |
cudadrv/execution                    (2) |     0.57 |   0.00 |  0.1 |       0.00 |   411.00 |   0.00 |  0.0 |      48.27 |  6535.18 |
cudadrv/memory                       (2) |     2.89 |   0.00 |  0.0 |       0.02 |   413.00 |   0.11 |  3.8 |     215.71 |  6535.18 |
cudadrv/module                       (2) |     0.34 |   0.00 |  0.0 |       0.00 |   411.00 |   0.00 |  0.0 |      27.03 |  6535.18 |
cudadrv/occupancy                    (2) |     0.11 |   0.00 |  0.0 |       0.00 |   409.00 |   0.00 |  0.0 |       6.48 |  6535.18 |
cudadrv/pool                         (2) |         failed at 2021-05-02T20:07:35.854
cudadrv/profile                      (3) |     0.27 |   0.00 |  0.0 |       0.00 |   305.00 |   0.01 |  4.3 |      56.65 |   910.82 |
cudadrv/stream                       (3) |     0.09 |   0.00 |  0.0 |       0.00 |   305.00 |   0.00 |  0.0 |       6.31 |   910.82 |
cudadrv/version                      (3) |     0.02 |   0.00 |  0.0 |       0.00 |   305.00 |   0.00 |  0.0 |       0.08 |   910.82 |
cudnn/activation                     (3) |         failed at 2021-05-02T20:08:00.603
cudnn/convolution                    (4) |     0.10 |   0.00 |  0.0 |       0.00 |   305.00 |   0.00 |  0.0 |       7.08 |   910.82 |
cudnn/dropout                        (4) |         failed at 2021-05-02T20:08:17.856
cudnn/inplace                        (5) |         failed at 2021-05-02T20:08:34.127
cudnn/multiheadattn                  (6) |         failed at 2021-05-02T20:09:05.614
cudnn/normalization                  (7) |         failed at 2021-05-02T20:09:50.015
cudnn/optensor                       (8) |         failed at 2021-05-02T20:10:11.980
cudnn/pooling                        (9) |         failed at 2021-05-02T20:10:38.808
cudnn/reduce                        (10) |         failed at 2021-05-02T20:11:04.523
cudnn/rnn                           (11) |         failed at 2021-05-02T20:11:31.604
cudnn/softmax                       (12) |         failed at 2021-05-02T20:11:51.517
cudnn/tensor                        (13) |     1.84 |   0.46 | 24.9 |       0.00 |   311.00 |   0.17 |  9.3 |     227.36 |   910.82 |
cusolver/dense                      (13) |   100.96 |   0.06 |  0.1 |    1147.73 |   541.00 |   5.01 |  5.0 |   14323.69 |  1338.73 |
cusolver/multigpu                   (13) |     7.09 |   0.00 |  0.1 |     403.17 |  1668.00 |   0.30 |  4.3 |     812.35 |  1399.38 |
cusolver/sparse                     (13) |     8.57 |   0.00 |  0.0 |       0.19 |   531.00 |   0.25 |  2.9 |     737.23 |  1638.07 |
cusparse/interfaces                 (13) |    15.05 |   0.00 |  0.0 |       0.17 |   449.00 |   0.45 |  3.0 |    1734.17 |  1739.48 |
cutensor/base                       (13) |     0.13 |   0.00 |  0.2 |       1.11 |   307.00 |   0.00 |  0.0 |      13.01 |  1739.48 |
cutensor/contractions               (13) |    46.21 |   0.02 |  0.0 |   32033.88 |   515.00 |   2.58 |  5.6 |    7569.16 |  1877.41 |
cutensor/elementwise_binary         (13) |    20.88 |   0.01 |  0.0 |      54.99 |   445.00 |   0.83 |  4.0 |    3101.33 |  1994.87 |
cutensor/elementwise_trinary        (13) |    24.05 |   0.00 |  0.0 |      24.44 |   405.00 |   1.11 |  4.6 |    3651.29 |  2290.24 |
cutensor/permutations               (13) |     3.48 |   0.00 |  0.1 |      12.22 |   385.00 |   0.14 |  4.0 |     502.30 |  2325.46 |
cutensor/reductions                 (13) |    14.89 |   0.00 |  0.0 |      41.72 |   375.00 |   0.55 |  3.7 |    1636.56 |  2325.85 |
device/array                        (13) |     4.47 |   0.00 |  0.0 |       0.00 |   347.00 |   0.20 |  4.6 |     591.53 |  2325.85 |
device/intrinsics                   (13) |   253.97 |   0.00 |  0.0 |       0.01 |  2127.00 |   9.81 |  3.9 |   29907.82 |  2485.50 |
device/ldg                          (13) |    10.60 |   0.00 |  0.0 |       0.00 |   347.00 |   0.49 |  4.6 |    1504.53 |  2509.30 |
device/random                       (13) |    45.53 |   0.00 |  0.0 |       0.14 |   351.00 |   1.51 |  3.3 |    4896.64 |  2721.20 |
device/wmma                         (13) |    66.30 |   0.00 |  0.0 |       0.38 |   349.00 |   2.24 |  3.4 |    7019.36 |  2797.96 |
gpuarrays/math                      (13) |     5.12 |   0.00 |  0.0 |       0.00 |   347.00 |   0.21 |  4.1 |     669.62 |  2826.11 |
gpuarrays/indexing scalar           (13) |    13.25 |   0.00 |  0.0 |       0.00 |   347.00 |   0.50 |  3.8 |    1482.82 |  2841.47 |
gpuarrays/input output              (13) |     1.80 |   0.00 |  0.0 |       0.00 |   307.00 |   0.11 |  6.1 |     187.77 |  2841.47 |
gpuarrays/value constructors        (13) |    10.37 |   0.00 |  0.0 |       0.00 |   347.00 |   0.33 |  3.2 |    1025.49 |  2841.47 |
gpuarrays/indexing multidimensional (13) |    38.73 |   0.00 |  0.0 |       0.69 |   347.00 |   1.98 |  5.1 |    4448.97 |  2883.22 |
gpuarrays/interface                 (13) |     7.19 |   0.00 |  0.0 |       0.00 |   347.00 |   0.19 |  2.7 |     814.65 |  2938.44 |
gpuarrays/iterator constructors     (13) |     2.53 |   0.00 |  0.0 |       0.02 |   347.00 |   0.16 |  6.1 |     232.22 |  2948.52 |
gpuarrays/uniformscaling            (13) |    17.91 |   0.00 |  0.0 |       0.01 |   347.00 |   0.54 |  3.0 |    1413.19 |  2964.54 |
gpuarrays/linear algebra            (13) |   141.42 |   0.01 |  0.0 |       1.24 |   417.00 |   6.04 |  4.3 |   19084.92 |  3396.66 |
gpuarrays/conversions               (13) |     2.86 |   0.00 |  0.0 |       0.01 |   307.00 |   0.13 |  4.6 |     363.67 |  3419.54 |
gpuarrays/constructors              (13) |     1.40 |   0.00 |  0.1 |       0.03 |   307.00 |   0.00 |  0.0 |      89.28 |  3420.58 |
gpuarrays/random                    (13) |    24.63 |   0.00 |  0.0 |       0.03 |   349.00 |   0.99 |  4.0 |    2743.27 |  3510.09 |
gpuarrays/base                      (13) |    27.75 |   0.00 |  0.0 |      17.44 |   379.00 |   1.43 |  5.2 |    3955.00 |  3590.78 |
gpuarrays/mapreduce essentials      (13) |   170.25 |   0.01 |  0.0 |       3.19 |   351.00 |   6.46 |  3.8 |   19538.75 |  3902.67 |
gpuarrays/broadcasting              (13) |   130.10 |   0.00 |  0.0 |       1.19 |   349.00 |   5.63 |  4.3 |   14696.51 |  3947.41 |
gpuarrays/mapreduce derivatives     (13) |   330.18 |   0.01 |  0.0 |       3.06 |   353.00 |  10.40 |  3.1 |   29137.73 |  5158.99 |
Worker 2 failed running test cudadrv/pool:
Some tests did not pass: 0 passed, 0 failed, 1 errored, 0 broken.
cudadrv/pool: Error During Test at /home/arndt/.julia/packages/CUDA/k52QH/test/setup.jl:56
  Got exception outside of a @test
  LoadError: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
    [1] throw_api_error(res::CUDA.cudaError_enum)
      @ CUDA ~/.julia/packages/CUDA/k52QH/lib/cudadrv/error.jl:88
    [2] macro expansion
      @ ~/.julia/packages/CUDA/k52QH/lib/cudadrv/error.jl:96 [inlined]
    [3] cuDeviceGetAttribute
      @ ~/.julia/packages/CUDA/k52QH/lib/utils/call.jl:26 [inlined]
    [4] attribute(dev::CuDevice, code::CUDA.CUdevice_attribute_enum)
      @ CUDA ~/.julia/packages/CUDA/k52QH/lib/cudadrv/devices.jl:83
    [5] top-level scope
      @ ~/.julia/packages/CUDA/k52QH/test/cudadrv/pool.jl:2
    [6] include
      @ ./client.jl:444 [inlined]
    [7] #9
      @ ~/.julia/packages/CUDA/k52QH/test/runtests.jl:79 [inlined]
    [8] macro expansion
      @ ~/.julia/packages/CUDA/k52QH/test/setup.jl:57 [inlined]
    [9] macro expansion
      @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1151 [inlined]
   [10] macro expansion
      @ ~/.julia/packages/CUDA/k52QH/test/setup.jl:57 [inlined]
   [11] macro expansion
      @ ~/.julia/packages/CUDA/k52QH/src/utilities.jl:28 [inlined]
   [12] macro expansion
      @ ~/.julia/packages/CUDA/k52QH/src/pool.jl:572 [inlined]
   [13] top-level scope
      @ ~/.julia/packages/CUDA/k52QH/test/setup.jl:56
   [14] eval
      @ ./boot.jl:360 [inlined]
   [15] runtests(f::Function, name::String, time_source::Symbol, snoop::Nothing)
      @ Main ~/.julia/packages/CUDA/k52QH/test/setup.jl:68
   [16] (::Distributed.var"#106#108"{Distributed.CallMsg{:call_fetch}})()
      @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:278
   [17] run_work_thunk(thunk::Distributed.var"#106#108"{Distributed.CallMsg{:call_fetch}}, print_error::Bool)
      @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:63
   [18] macro expansion
      @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Distributed/src/process_messages.jl:278 [inlined]
   [19] (::Distributed.var"#105#107"{Distributed.CallMsg{:call_fetch}, Distributed.MsgHeader, Sockets.TCPSocket})()
      @ Distributed ./task.jl:411
  in expression starting at /home/arndt/.julia/packages/CUDA/k52QH/test/cudadrv/pool.jl:2
Worker 3 failed running test cudnn/activation:
Some tests did not pass: 3 passed, 0 failed, 40 errored, 0 broken.
cudnn/activation: Error During Test at /home/arndt/.julia/packages/CUDA/k52QH/test/cudnn/activation.jl:47
  Test threw exception
  Expression: y0 ≈ cudnnActivationForward(cx; mode, nanOpt, coef, alpha) |> Array
    [1] throw_api_error(res::CUDA.CUDNN.cudnnStatus_t)
      @ CUDA.CUDNN ~/.julia/packages/CUDA/k52QH/lib/cudnn/error.jl:22
    [2] macro expansion
      @ ~/.julia/packages/CUDA/k52QH/lib/cudnn/error.jl:39 [inlined]
    [3] cudnnCreate()
      @ CUDA.CUDNN ~/.julia/packages/CUDA/k52QH/lib/cudnn/base.jl:3
    [4] #686
      @ ~/.julia/packages/CUDA/k52QH/lib/cudnn/CUDNN.jl:68 [inlined]
    [5] (::CUDA.APIUtils.var"#2#5"{CUDA.CUDNN.var"#686#690", CUDA.APIUtils.HandleCache{CuContext, Ptr{Nothing}}, CuContext})()
      @ CUDA.APIUtils ~/.julia/packages/CUDA/k52QH/lib/utils/cache.jl:22
    [6] lock(f::CUDA.APIUtils.var"#2#5"{CUDA.CUDNN.var"#686#690", CUDA.APIUtils.HandleCache{CuContext, Ptr{Nothing}}, CuContext}, l::ReentrantLock)
      @ Base ./lock.jl:187
    [7] check_cache
      @ ~/.julia/packages/CUDA/k52QH/lib/utils/cache.jl:20 [inlined]
    [8] pop!
      @ ~/.julia/packages/CUDA/k52QH/lib/utils/cache.jl:41 [inlined]
    [9] (::CUDA.CUDNN.var"#685#689"{NamedTuple{(:device, :context, :stream, :math_mode, :math_precision), Tuple{CuDevice, CuContext, CuStream, CUDA.MathMode, Symbol}}})()
      @ CUDA.CUDNN ~/.julia/packages/CUDA/k52QH/lib/cudnn/CUDNN.jl:67
   [10] get!
      @ ./iddict.jl:163 [inlined]
   [11] handle()
      @ CUDA.CUDNN ~/.julia/packages/CUDA/k52QH/lib/cudnn/CUDNN.jl:66
   [12] #cudnnActivationForwardAD#645
      @ ~/.julia/packages/CUDA/k52QH/lib/cudnn/activation.jl:48 [inlined]
   [13] #cudnnActivationForwardWithDefaults#644
      @ ~/.julia/packages/CUDA/k52QH/lib/cudnn/activation.jl:42 [inlined]
   [14] #cudnnActivationForward#640
      @ ~/.julia/packages/CUDA/k52QH/lib/cudnn/activation.jl:21 [inlined]
   [15] (::var"#activationtest#10"{var"#activationtest#8#11"{CuArray{Float64, 1}, CuArray{Float64, 1}, Vector{Float64}, Vector{Float64}}})(; mode::cudnnActivationMode_t, nanOpt::cudnnNanPropagation_t, coef::Int64, alpha::Int64, beta::Int64)
      @ Main ~/.julia/packages/CUDA/k52QH/test/cudnn/activation.jl:47
   [16] macro expansion
      @ ~/.julia/packages/CUDA/k52QH/test/cudnn/activation.jl:53 [inlined]
   [17] macro expansion
      @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1151 [inlined]
   [18] top-level scope
      @ ~/.julia/packages/CUDA/k52QH/test/cudnn/activation.jl:24

.... (more errors of the same kind) ....

cudnnCreate throwing CUDNN_STATUS_NOT_INITIALIZED is almost a bug, and we can’t do much about it. Are you running close to out-of-memory, perhaps?

I do not think that memory is a problem here. I have 250G on the main board and 15G on the GPU (all free and available).
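For anyone who wants to rule out memory pressure on their own machine, here is a minimal sketch of how the free device memory could be checked from Julia before running the tests. It assumes a working CUDA.jl installation with a visible GPU; the variable names free_gib and total_gib are only illustrative.

```julia
using CUDA

# Print a short summary of used vs. total memory on the current device.
CUDA.memory_status()

# The same figures as byte counts, converted to GiB for readability.
free_gib  = CUDA.available_memory() / 2^30
total_gib = CUDA.total_memory() / 2^30
println("free: ", round(free_gib; digits=2), " GiB of ",
        round(total_gib; digits=2), " GiB")
```

If the free figure is close to the total, as in the "14.612 GiB / 15.782 GiB available" line of the report above, an out-of-memory condition is an unlikely explanation.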

I just tested again with the same setup, but today the tests first downloaded artifacts:

     Testing Running tests...
  Downloaded artifact: CUDA112
  Downloaded artifact: CUDNN_CUDA112
  Downloaded artifact: CUTENSOR_CUDA112

which solved the problem.

Were you using a local CUDA installation (JULIA_CUDA_USE_BINARYBUILDER=false)? If so, it could have been that your CUDNN library was incompatible with your CUDA toolkit, which could explain the error.
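A quick way to see which toolkit and CUDNN build are actually in use is to query CUDA.jl directly. This is only a sketch, assuming the CUDA.jl 3.x layout in which CUDNN is a submodule of CUDA; other releases may organise this differently.

```julia
using CUDA

# Full report: the same "System information" block the test suite prints.
CUDA.versioninfo()

# Narrower checks (assumption: CUDA.jl 3.x, where CUDNN lives in CUDA.CUDNN).
println("toolkit: ", CUDA.toolkit_version())
if CUDA.has_cudnn()
    println("CUDNN:   ", CUDA.CUDNN.version())
else
    println("CUDNN is not available in this installation")
end
```

In the versioninfo output, the "(for CUDA x.y.z)" suffix on the CUDNN line is the part to compare against the "CUDA toolkit" line.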

I did not set that variable.

I just tried setting it, but then the testing fails completely in the first set of tests.

I guess it is the new CUDA toolkit installation. Previously it was:

┌ Info: System information:
│ CUDA toolkit 10.2.89, artifact installation
│ CUDA driver 10.2.0
│ NVIDIA driver 440.44.0
│ Libraries: 
│ - CUBLAS: 10.2.2
│ - CURAND: 10.1.2
│ - CUFFT: 10.1.2
│ - CUSOLVER: 10.3.0
│ - CUSPARSE: 10.3.1
│ - CUPTI: 12.0.0
│ - NVML: 10.0.0+440.44
│ - CUDNN: 8.10.0 (for CUDA 10.2.0)
│ - CUTENSOR: 1.2.1 (for CUDA 10.2.0)

and now it is:

┌ Info: System information:
│ CUDA toolkit 11.2.2, artifact installation
│ CUDA driver 11.2.0
│ NVIDIA driver 460.39.0
│ Libraries: 
│ - CUBLAS: 11.4.1
│ - CURAND: 10.2.3
│ - CUFFT: 10.4.1
│ - CUSOLVER: 11.1.0
│ - CUSPARSE: 11.4.1
│ - CUPTI: 14.0.0
│ - NVML: 11.0.0+460.39
│ - CUDNN: 8.10.0 (for CUDA 11.2.0)
│ - CUTENSOR: 1.2.2 (for CUDA 11.1.0)

This obviously helped. I guess the new version came with the artifacts that were downloaded in the meantime. How was this download triggered?
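To make the difference between the two reports easier to see, here is a small sketch that compares them entry by entry. The version numbers are copied from the two "System information" blocks above (NVML is left out because of its build suffix); the names before, after, components and unchanged are only illustrative. It is plain Julia and needs no GPU.

```julia
# Versions from the first report (failing run).
before = Dict(
    "CUDA toolkit"  => v"10.2.89",
    "CUDA driver"   => v"10.2.0",
    "NVIDIA driver" => v"440.44.0",
    "CUBLAS"        => v"10.2.2",
    "CURAND"        => v"10.1.2",
    "CUFFT"         => v"10.1.2",
    "CUSOLVER"      => v"10.3.0",
    "CUSPARSE"      => v"10.3.1",
    "CUPTI"         => v"12.0.0",
    "CUDNN"         => v"8.10.0",
    "CUTENSOR"      => v"1.2.1",
)

# Versions from the second report (passing run).
after = Dict(
    "CUDA toolkit"  => v"11.2.2",
    "CUDA driver"   => v"11.2.0",
    "NVIDIA driver" => v"460.39.0",
    "CUBLAS"        => v"11.4.1",
    "CURAND"        => v"10.2.3",
    "CUFFT"         => v"10.4.1",
    "CUSOLVER"      => v"11.1.0",
    "CUSPARSE"      => v"11.4.1",
    "CUPTI"         => v"14.0.0",
    "CUDNN"         => v"8.10.0",
    "CUTENSOR"      => v"1.2.2",
)

components = sort(collect(keys(before)))

# Entries whose version number is identical in both reports.
unchanged = [c for c in components if before[c] == after[c]]

for c in components
    status = before[c] == after[c] ? "same" : "changed"
    println(rpad(c, 15), rpad(status, 9), before[c], " -> ", after[c])
end
```

Only the CUDNN version number (8.10.0) is the same in both reports, although the reports show it built for CUDA 10.2.0 in the first and for CUDA 11.2.0 in the second. The NVIDIA driver line also differs (440.44.0 vs 460.39.0), so the two runs were not on an identical driver setup.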