Fresh CUDA and LuxCUDA install: ERROR: could not load symbol "cublasLtMatmulDescCreate"

While following the Lux polynomial fitting tutorial, I get this error during the training step:

ERROR: could not load symbol "cublasLtMatmulDescCreate":
The specified procedure could not be found.

I also ran the CUDA.jl test suite and got a failure:

core/nvml                                     (3) |         failed 
versioninfo()
Julia Version 1.10.3
Commit 0b4590a550 (2024-04-30 10:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 28 × 13th Gen Intel(R) Core(TM) i7-13850HX
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, goldmont)
Threads: 10 default, 0 interactive, 5 GC (on 28 virtual cores)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 10
CUDA.versioninfo()
CUDA runtime 12.4, artifact installation
CUDA driver 12.2
NVIDIA driver 538.27.0

CUDA libraries:
- CUBLAS: 12.4.5
- CURAND: 10.3.5
- CUFFT: 11.2.1
- CUSOLVER: 11.6.1
- CUSPARSE: 12.3.1
- CUPTI: 22.0.0
- NVML: 12.0.0+538.27

Julia packages:
- CUDA: 5.3.4
- CUDA_Driver_jll: 0.8.1+0
- CUDA_Runtime_jll: 0.12.1+0

Toolchain:
- Julia: 1.10.3
- LLVM: 15.0.7

1 device:
  0: NVIDIA RTX 3500 Ada Generation Laptop GPU (sm_89, 10.931 GiB / 11.994 GiB available)

Should I specify a path to the library that provides cublasLtMatmulDescCreate? If so, where should it be set?

@maleadt CuBLASLt should be present in all the new CUDA versions, right?

Yeah, but we seem to be ccalling libcublas instead of libcublasLt. Can you open an issue?
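
To check which library actually exports the symbol, something along these lines might help. It is only a sketch using the Libdl standard library; `exports_symbol` is a made-up helper name, and the library names in the usage comment are assumptions about how the CUDA artifacts are exposed on a given install, so substitute the real paths from your system.

```julia
using Libdl

# Hypothetical helper: returns true if the shared library at `lib`
# can be opened and exports `sym`, false otherwise.
function exports_symbol(lib::AbstractString, sym::Symbol)
    handle = Libdl.dlopen(lib; throw_error = false)
    handle === nothing && return false      # library could not be opened
    try
        return Libdl.dlsym(handle, sym; throw_error = false) !== nothing
    finally
        Libdl.dlclose(handle)
    end
end

# Usage (paths are assumptions; point these at the DLLs on your machine):
# exports_symbol(path_to_libcublas,   :cublasLtMatmulDescCreate)
# exports_symbol(path_to_libcublasLt, :cublasLtMatmulDescCreate)
```

If the symbol only shows up in the cuBLASLt library, that would be consistent with the ccall targeting the wrong library.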

Thanks for looking into it.

Note that this seems to be a Windows-only issue, since it works on WSL on the same machine. (However, on WSL the GPU is much slower than the CPU during training, even for large models where the GPU overhead should be negligible.)

Can you set CUDA.allowscalar(false) and check whether any of the dispatches are hitting generic kernels?
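
For reference, a minimal sketch of what that check looks like, assuming CUDA.jl is installed and a GPU is available. `scalar_indexing_errors` is just an illustrative name. With scalar indexing disallowed, any code path that falls back to element-by-element access on a GPU array throws instead of silently running slowly.

```julia
using CUDA

# Illustrative helper: returns true if scalar indexing on a CuArray throws,
# or nothing when no functional GPU is available.
function scalar_indexing_errors()
    CUDA.functional() || return nothing
    CUDA.allowscalar(false)          # scalar indexing of GPU arrays now errors
    x = CUDA.rand(Float32, 4)
    y = x .* 2f0                     # broadcasting runs as a GPU kernel, fine
    try
        x[1]                         # scalar access, should throw
        return false
    catch err
        @info "Scalar indexing was caught" err
        return true
    end
end

scalar_indexing_errors()
```

Running the training step after `CUDA.allowscalar(false)` works the same way: if it errors with a scalar indexing message, the stack trace shows which operation fell back to a generic implementation.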

I tested on both Windows and WSL (I was not sure which one you meant).
It produces the same error on Windows and does not change the timing on WSL.

My bad: on WSL the GPU does indeed become faster than the CPU once I add more layers.

Now, while precompiling (on Windows), I got this warning:

│  ┌ Warning: cuBLASLt is not functional on this system. We won't be able to use optimized implementations of certain matmul operations.

Yes, we try to detect if the current cuBLASLt setting works for the given system, and if it doesn’t, we just use cuBLAS.

I got the same error using the LuxCUDA.jl package on Windows for a tutorial on NeuralODEs. It occurs only when I try to solve the problem on the GPU.

What’s the reason for that? CUBLASLt should work fine on Windows.

It is this error:

ERROR: could not load symbol "cublasLtMatmulDescCreate":
The specified procedure could not be found.

It probably stems from the ccall into libcublas instead of libcublasLt (the problem you mentioned). I added a check in __init__() that verifies whether a simple matmul works via cuBLASLt and, if it does not, avoids using cuBLASLt at runtime.
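
Roughly, the pattern is: probe the optimized path once at load time, remember the result, and dispatch on it afterwards. The sketch below only illustrates that idea with plain CPU matrices. Every name in it (`FAST_MATMUL_OK`, `fast_matmul`, `probe_fast_matmul!`, `matmul`) is hypothetical, and it is not the actual package source.

```julia
# Flag recording whether the optimized path works on this system.
const FAST_MATMUL_OK = Ref(false)

# Stand-in for an optimized kernel that is unavailable on this system.
fast_matmul(A, B) = error("could not load symbol")

# Run once at load time (e.g. from a module's __init__): try a tiny matmul
# through the fast path and record whether it succeeded.
function probe_fast_matmul!(fast)
    A = ones(Float32, 2, 2)
    FAST_MATMUL_OK[] = try
        fast(A, A) ≈ A * A
    catch err
        @warn "Fast matmul is not functional; using the generic path" exception = err
        false
    end
    return FAST_MATMUL_OK[]
end

# At runtime, only take the fast path if the probe succeeded.
matmul(A, B; fast = fast_matmul) = FAST_MATMUL_OK[] ? fast(A, B) : A * B

probe_fast_matmul!(fast_matmul)     # emits the warning, sets the flag to false
matmul(ones(2, 2), ones(2, 2))      # falls back to the generic A * B
```

With this structure a broken fast path costs one warning at load time instead of an error in the middle of training.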

Are you getting the error on the latest releases? Or are you simply getting a warning, in which case you should ignore it?

I think I was getting that on the latest releases. As I remember, it was not just a warning; the operations on the GPU did not work.

This should be fixed in the latest version of CUDA.jl.