Note that this seems to be a Windows-only issue, since it works for me under WSL (however, the GPU is much slower than the CPU while training, even for large models where the GPU overhead should be negligible).
I tested on both Windows and WSL (not sure which you were referring to).
It does produce the same error on Windows and does not change the timing in WSL.
I got the same error using the LuxCUDA.jl package on Windows for a tutorial on NeuralODEs. It occurs only when I try to solve the problem on the GPU.
ERROR: could not load symbol "cublasLtMatmulDescCreate":
The specified procedure could not be found.
It probably stems from the ccall going into libcublas instead of libcublaslt (the problem you mentioned). I added a check in __init__() that verifies whether a simple matmul works via cuBLASLt; if it does not, cuBLASLt is not used at runtime.
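For anyone who wants to confirm which library actually exports the symbol on their machine, here is a rough sketch using only the Libdl standard library. This is not the check that was added to the package. `has_symbol` is a hypothetical helper, and the library names passed to it are assumptions: on Windows the DLLs usually carry a version suffix, and CUDA.jl normally loads them from its own artifacts, so you may need to pass a full path instead.

```julia
using Libdl

# Hypothetical helper (not part of any package): returns true only if
# `libname` can be loaded AND it exports `symbol`.
function has_symbol(libname::AbstractString, symbol::AbstractString)
    # throw_error=false makes dlopen return `nothing` instead of throwing
    handle = Libdl.dlopen(libname; throw_error=false)
    handle === nothing && return false
    try
        # dlsym also returns `nothing` when the symbol is missing
        return Libdl.dlsym(handle, Symbol(symbol); throw_error=false) !== nothing
    finally
        Libdl.dlclose(handle)
    end
end

# Assumed library names; replace with the full path to the DLL/.so if needed.
for lib in ("libcublas", "libcublasLt")
    found = has_symbol(lib, "cublasLtMatmulDescCreate")
    println(lib, " exports cublasLtMatmulDescCreate: ", found)
end
```

If the symbol only shows up in the cuBLASLt library, that would be consistent with the ccall targeting the wrong library. A `false` for both most likely just means the libraries were not found under those names.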