CUDNNError: CUDNN_STATUS_NOT_SUPPORTED (code 9) with Transformers.jl

Hi,

I am optimizing a prompt for a llama2 model with Transformers.jl, and I occasionally see this error:

CUDNNError: CUDNN_STATUS_NOT_SUPPORTED (code 9)
Stacktrace:
  [1] throw_api_error
    @ ~/.julia/packages/cuDNN/YkZhm/src/libcudnn.jl:11
  [2] check
    @ ~/.julia/packages/cuDNN/YkZhm/src/libcudnn.jl:21 [inlined]
  [3] cudnnSetTensorNdDescriptorEx
    @ ~/.julia/packages/CUDA/tVtYo/lib/utils/call.jl:26
  [4] cudnnTensorDescriptor
    @ ~/.julia/packages/cuDNN/YkZhm/src/descriptors.jl:40
  [5] #cudnnTensorDescriptor#607
    @ ~/.julia/packages/cuDNN/YkZhm/src/tensor.jl:9 [inlined]
  [6] #cudnnSoftmaxForward!#688
    @ ~/.julia/packages/cuDNN/YkZhm/src/softmax.jl:17 [inlined]
  [7] cudnnSoftmaxForward!
    @ ~/.julia/packages/cuDNN/YkZhm/src/softmax.jl:17 [inlined]
  [8] #softmax!#50
    @ ~/.julia/packages/NNlibCUDA/C6t0p/src/cudnn/softmax.jl:73
  [9] softmax!
    @ ~/.julia/packages/NNlibCUDA/C6t0p/src/cudnn/softmax.jl:70 [inlined]
 [10] softmax!
    @ ~/.julia/packages/NNlibCUDA/C6t0p/src/cudnn/softmax.jl:70
 [11] #_collapseddims#15
    @ ~/.julia/packages/NeuralAttentionlib/3zeYG/src/matmul/collapseddims.jl:141
 [12] _collapseddims
    @ ~/.julia/packages/NeuralAttentionlib/3zeYG/src/matmul/collapseddims.jl:138 [inlined]
...

The stacktrace is not complete, to avoid clutter, but I think it covers the important part. I do not know what to make of it, though. Could it be due to being close to the memory limit of the GPU?

Unlikely; that would manifest as a different error. It seems like NNlib is invoking cuDNN with invalid parameters here. Maybe try running with JULIA_DEBUG=cuDNN and inspecting the arguments/inputs to the API call that fails. If you cross-reference the NVIDIA docs for cudnnSetTensorNdDescriptorEx, you might learn what is being set incorrectly here.
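
Something along these lines (untested sketch; the random array is just a placeholder for whatever tensor actually reaches the failing softmax in your run):

    # Enable debug logging for the cuDNN wrapper so the arguments passed to the
    # failing cudnnSetTensorNdDescriptorEx call get printed.
    ENV["JULIA_DEBUG"] = "cuDNN"

    using CUDA, NNlib, NNlibCUDA

    # Placeholder input; substitute the real array (same eltype and size) that
    # triggers the error in Transformers.jl.
    x = CUDA.rand(Float32, 512, 512, 32)

    # Dispatches to the cudnnSoftmaxForward! path from the stacktrace above; the
    # logged descriptor dimensions can then be compared against the limits in the
    # NVIDIA documentation for cudnnSetTensorNdDescriptorEx.
    NNlib.softmax(x; dims = 1)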

Thanks Tim, I will try to hunt this down. This is good advice.

Did you find the cause of this? I’m asking because I’m getting the exact same error. Unlike the above case, my code does not involve Transformers.jl, but, just as above, the error only occurs when operating close to the memory limit of the GPU.

Hi Per,

I think in the end it was some basic problem, but I do not remember which one. Do you use more than one GPU? One of the things I was playing with was spreading the model across multiple GPUs, which might have been the cause. The second thing I realized was that to take the gradient with respect to llama2, I had to use a GPU with 80 GB of memory.
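
If you suspect memory pressure, a quick check around the failing call might also help; this is just plain CUDA.jl reporting, nothing specific to my setup:

    using CUDA

    # How much device memory is currently used vs. free?
    CUDA.memory_status()

    # Release cached allocations and try again; if the error goes away after
    # this, memory pressure is likely involved after all.
    GC.gc(true)
    CUDA.reclaim()
    CUDA.memory_status()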

Tomas

Hi Tomas, no, I only use one GPU, with 24 GB of RAM. (I’ve not yet tried to make an MWE.)