Help with CUDA Error

Hello!

I’m still relatively new to Julia, so I apologize if this is a stupid question. I am currently trying to implement a GPU backend for training universal differential equations. I just updated to Julia 1.10.3 and have started receiving the error below. From what I can gather, there is an issue with cuBLAS. I have tried uninstalling and reinstalling all of my NVIDIA drivers, and when that didn’t work, I did a fresh install of my entire Julia environment as well as the NVIDIA drivers. I can also include an MWE if needed, but I am entirely unsure whether this is caused by my code or by my installation.

ERROR: LoadError: could not load symbol "cublasLtMatmulDescCreate":
The specified procedure could not be found. 
Stacktrace:
  [1] macro expansion
    @ C:\Users\rokko\.julia\packages\CUDA\jdJ7Z\lib\utils\call.jl:217 [inlined]
  [2] macro expansion
    @ C:\Users\rokko\.julia\packages\CUDA\jdJ7Z\lib\cublas\libcublas.jl:6314 [inlined]
  [3] (::CUDA.CUBLAS.var"#1132#1133"{Base.RefValue{Ptr{CUDA.CUBLAS.cublasLtMatmulDescOpaque_t}}, CUDA.CUBLAS.cublasComputeType_t, CUDA.cudaDataType})()
    @ CUDA.CUBLAS C:\Users\rokko\.julia\packages\CUDA\jdJ7Z\lib\utils\call.jl:31
  [4] retry_reclaim
    @ C:\Users\rokko\.julia\packages\CUDA\jdJ7Z\src\pool.jl:383 [inlined]
  [5] check
    @ C:\Users\rokko\.julia\packages\CUDA\jdJ7Z\lib\cublas\libcublas.jl:24 [inlined]
  [6] cublasLtMatmulDescCreate
    @ C:\Users\rokko\.julia\packages\CUDA\jdJ7Z\lib\utils\call.jl:30 [inlined]
  [7] _cublaslt_matmul_fused!(transy::Bool, y::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, σ::typeof(tanh_fast), transw::Bool, w::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, transx::Bool, x::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, b::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, aux::Nothing)
    @ LuxLibCUDAExt C:\Users\rokko\.julia\packages\LuxLib\tGkrf\ext\LuxLibCUDAExt\cublaslt.jl:66
  [8] _cublaslt_matmul_fused!
    @ C:\Users\rokko\.julia\packages\LuxLib\tGkrf\ext\LuxLibCUDAExt\cublaslt.jl:13 [inlined]
  [9] _cublaslt_matmul_fused!
    @ C:\Users\rokko\.julia\packages\LuxLib\tGkrf\ext\LuxLibCUDAExt\cublaslt.jl:10 [inlined]
 [10] __fused_dense_bias_activation_impl(act::typeof(tanh_fast), weight::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, x::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, b::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
    @ LuxLibCUDAExt C:\Users\rokko\.julia\packages\LuxLib\tGkrf\ext\LuxLibCUDAExt\fused_dense.jl:11
 [11] fused_dense_bias_activation
    @ C:\Users\rokko\.julia\packages\LuxLib\tGkrf\src\api\dense.jl:46 [inlined]
 [12] fused_dense_bias_activation
    @ C:\Users\rokko\.julia\packages\LuxLib\tGkrf\src\api\dense.jl:38 [inlined]
 [13] Dense
    @ C:\Users\rokko\.julia\packages\Lux\HPvHB\src\layers\basic.jl:218 [inlined]
 [14] Dense
    @ C:\Users\rokko\.julia\packages\Lux\HPvHB\src\layers\basic.jl:214 [inlined]
 [15] apply(model::Dense{true, typeof(tanh_fast), typeof(glorot_uniform), typeof(zeros32)}, x::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ps::ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))}}}, st::@NamedTuple{})
    @ LuxCore C:\Users\rokko\.julia\packages\LuxCore\qiHPC\src\LuxCore.jl:179
 [16] macro expansion
    @ C:\Users\rokko\.julia\packages\Lux\HPvHB\src\layers\containers.jl:0 [inlined]
 [17] applychain(layers::@NamedTuple{layer_1::Dense{true, typeof(tanh_fast), typeof(glorot_uniform), typeof(zeros32)}, layer_2::Dense{true, typeof(relu), typeof(glorot_uniform), typeof(zeros32)}}, x::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ps::ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))}}}, st::@NamedTuple{layer_1::@NamedTuple{}, layer_2::@NamedTuple{}})
    @ Lux C:\Users\rokko\.julia\packages\Lux\HPvHB\src\layers\containers.jl:478
 [18] (::Chain{@NamedTuple{layer_1::Dense{true, typeof(tanh_fast), typeof(glorot_uniform), typeof(zeros32)}, layer_2::Dense{true, typeof(relu), typeof(glorot_uniform), typeof(zeros32)}}})(x::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ps::ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))}}}, st::@NamedTuple{layer_1::@NamedTuple{}, layer_2::@NamedTuple{}})  
    @ Lux C:\Users\rokko\.julia\packages\Lux\HPvHB\src\layers\containers.jl:476
 [19] extended_ude!(du::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, u::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, theta::ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(NNparams = ViewAxis(1:62, Axis(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))), ODEparams = 63:69, u0 = 70:71)}}}, t::Float64)
    @ Main C:\Users\rokko\Desktop\Lab\UniversalDiffEq Testing\GPU testing.jl:43
 [20] ODEFunction
    @ C:\Users\rokko\.julia\packages\SciMLBase\JUp1I\src\scimlfunctions.jl:2296 [inlined]
 [21] initialize!(integrator::OrdinaryDiffEq.ODEIntegrator{Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}, true, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Nothing, Float64, ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(NNparams = ViewAxis(1:62, Axis(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))), ODEparams = 63:69, u0 = 70:71)}}}, Float64, Float32, Float32, Float64, Vector{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ODESolution{Float32, 2, Vector{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Nothing, Nothing, Vector{Float64}, Vector{Vector{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, ODEProblem{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Float64, Float64}, true, ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(NNparams = ViewAxis(1:62, Axis(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))), ODEparams = 63:69, u0 = 70:71)}}}, ODEFunction{true, SciMLBase.AutoSpecialize, typeof(extended_ude!), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing, Nothing, Nothing}, @Kwargs{}, SciMLBase.StandardODEProblem}, Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}, OrdinaryDiffEq.InterpolationData{ODEFunction{true, SciMLBase.AutoSpecialize, typeof(extended_ude!), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, 
Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing, Nothing, Nothing}, Vector{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Vector{Float64}, Vector{Vector{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Nothing, OrdinaryDiffEq.Tsit5Cache{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}, Nothing}, SciMLBase.DEStats, Nothing, Nothing, Nothing}, ODEFunction{true, SciMLBase.AutoSpecialize, typeof(extended_ude!), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing, Nothing, Nothing}, OrdinaryDiffEq.Tsit5Cache{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}, OrdinaryDiffEq.DEOptions{Float64, Float64, Float32, Float64, PIController{Rational{Int64}}, typeof(DiffEqBase.ODE_DEFAULT_NORM), typeof(LinearAlgebra.opnorm), Nothing, CallbackSet{Tuple{}, Tuple{}}, typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), typeof(DiffEqBase.ODE_DEFAULT_UNSTABLE_CHECK), DataStructures.BinaryHeap{Float64, DataStructures.FasterForward}, DataStructures.BinaryHeap{Float64, DataStructures.FasterForward}, Nothing, Nothing, Int64, Tuple{}, Tuple{Int64, Int64}, Tuple{}}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, Nothing, OrdinaryDiffEq.DefaultInit, Nothing}, cache::OrdinaryDiffEq.Tsit5Cache{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False})       
    @ OrdinaryDiffEq C:\Users\rokko\.julia\packages\OrdinaryDiffEq\tAI61\src\perform_step\low_order_rk_perform_step.jl:799
 [22] __init(prob::ODEProblem{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Float64, Float64}, true, ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(NNparams = ViewAxis(1:62, Axis(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))), ODEparams = 63:69, u0 = 70:71)}}}, ODEFunction{true, SciMLBase.AutoSpecialize, typeof(extended_ude!), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing, Nothing, Nothing}, @Kwargs{}, SciMLBase.StandardODEProblem}, alg::Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}, timeseries_init::Tuple{}, ts_init::Tuple{}, ks_init::Tuple{}, recompile::Type{Val{true}}; saveat::Tuple{Int64, Int64}, tstops::Tuple{}, d_discontinuities::Tuple{}, save_idxs::Nothing, save_everystep::Bool, save_on::Bool, save_start::Bool, save_end::Nothing, callback::Nothing, dense::Bool, calck::Bool, dt::Float64, dtmin::Float64, dtmax::Float64, force_dtmin::Bool, adaptive::Bool, gamma::Rational{Int64}, abstol::Float64, reltol::Float64, qmin::Rational{Int64}, qmax::Int64, qsteady_min::Int64, qsteady_max::Int64, beta1::Nothing, beta2::Nothing, qoldinit::Rational{Int64}, controller::Nothing, fullnormalize::Bool, failfactor::Int64, maxiters::Int64, internalnorm::typeof(DiffEqBase.ODE_DEFAULT_NORM), internalopnorm::typeof(LinearAlgebra.opnorm), isoutofdomain::typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), unstable_check::typeof(DiffEqBase.ODE_DEFAULT_UNSTABLE_CHECK), verbose::Bool, timeseries_errors::Bool, dense_errors::Bool, advance_to_tstop::Bool, stop_at_next_tstop::Bool, initialize_save::Bool, progress::Bool, progress_steps::Int64, progress_name::String, 
progress_message::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), progress_id::Symbol, userdata::Nothing, allow_extrapolation::Bool, initialize_integrator::Bool, alias_u0::Bool, alias_du0::Bool, initializealg::OrdinaryDiffEq.DefaultInit, kwargs::@Kwargs{tspan::Tuple{Int64, Int64}})
    @ OrdinaryDiffEq C:\Users\rokko\.julia\packages\OrdinaryDiffEq\tAI61\src\solve.jl:518
 [23] __init (repeats 5 times)
    @ C:\Users\rokko\.julia\packages\OrdinaryDiffEq\tAI61\src\solve.jl:11 [inlined]
 [24] #__solve#805
    @ C:\Users\rokko\.julia\packages\OrdinaryDiffEq\tAI61\src\solve.jl:6 [inlined]
 [25] __solve
    @ C:\Users\rokko\.julia\packages\OrdinaryDiffEq\tAI61\src\solve.jl:1 [inlined]
 [26] solve_call(_prob::ODEProblem{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Float64, Float64}, true, ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(NNparams = ViewAxis(1:62, Axis(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))), ODEparams = 63:69, u0 = 70:71)}}}, ODEFunction{true, SciMLBase.AutoSpecialize, typeof(extended_ude!), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing, Nothing, Nothing}, @Kwargs{}, SciMLBase.StandardODEProblem}, args::Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}; merge_callbacks::Bool, kwargshandle::Nothing, kwargs::@Kwargs{saveat::Tuple{Int64, Int64}, tspan::Tuple{Int64, Int64}, abstol::Float64, reltol::Float64})
    @ DiffEqBase C:\Users\rokko\.julia\packages\DiffEqBase\yM6LF\src\solve.jl:612
 [27] solve_up(prob::ODEProblem{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Float64, Float64}, true, ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(NNparams = ViewAxis(1:62, Axis(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))), ODEparams = 63:69, u0 = 70:71)}}}, ODEFunction{true, SciMLBase.AutoSpecialize, typeof(extended_ude!), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing, Nothing, Nothing}, @Kwargs{}, SciMLBase.StandardODEProblem}, sensealg::Nothing, u0::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, p::ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(NNparams = ViewAxis(1:62, Axis(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))), ODEparams = 63:69, u0 = 70:71)}}}, args::Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}; kwargs::@Kwargs{saveat::Tuple{Int64, Int64}, tspan::Tuple{Int64, Int64}, abstol::Float64, reltol::Float64})
    @ DiffEqBase C:\Users\rokko\.julia\packages\DiffEqBase\yM6LF\src\solve.jl:1080
 [28] solve_up
    @ C:\Users\rokko\.julia\packages\DiffEqBase\yM6LF\src\solve.jl:1066 [inlined]
 [29] solve(prob::ODEProblem{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Float64, Float64}, true, ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(NNparams = ViewAxis(1:62, Axis(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))), ODEparams = 63:69, u0 = 70:71)}}}, ODEFunction{true, SciMLBase.AutoSpecialize, typeof(extended_ude!), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing, Nothing, Nothing}, @Kwargs{}, SciMLBase.StandardODEProblem}, args::Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}; sensealg::Nothing, u0::Nothing, p::Nothing, wrap::Val{true}, kwargs::@Kwargs{saveat::Tuple{Int64, Int64}, tspan::Tuple{Int64, Int64}, abstol::Float64, reltol::Float64})
    @ DiffEqBase C:\Users\rokko\.julia\packages\DiffEqBase\yM6LF\src\solve.jl:1003
 [30] _concrete_solve_adjoint(::ODEProblem{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Float64, Float64}, true, ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(NNparams = ViewAxis(1:62, Axis(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))), ODEparams = 63:69, u0 = 70:71)}}}, ODEFunction{true, SciMLBase.AutoSpecialize, typeof(extended_ude!), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing, Nothing, Nothing}, @Kwargs{}, SciMLBase.StandardODEProblem}, ::Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}, ::ForwardDiffSensitivity{0, nothing}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(NNparams = ViewAxis(1:62, Axis(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))), ODEparams = 63:69, u0 = 70:71)}}}, ::SciMLBase.ChainRulesOriginator; saveat::Tuple{Int64, Int64}, kwargs::@Kwargs{tspan::Tuple{Int64, Int64}, abstol::Float64, reltol::Float64})
    @ SciMLSensitivity C:\Users\rokko\.julia\packages\SciMLSensitivity\rXkM4\src\concrete_solve.jl:711
 [31] _solve_adjoint(prob::ODEProblem{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Float64, Float64}, true, ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(NNparams = ViewAxis(1:62, Axis(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))), ODEparams = 63:69, u0 = 70:71)}}}, ODEFunction{true, SciMLBase.AutoSpecialize, typeof(extended_ude!), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing, Nothing, Nothing}, @Kwargs{}, SciMLBase.StandardODEProblem}, sensealg::ForwardDiffSensitivity{0, nothing}, u0::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, p::ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(NNparams = ViewAxis(1:62, Axis(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))), ODEparams = 63:69, u0 = 70:71)}}}, originator::SciMLBase.ChainRulesOriginator, args::Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}; merge_callbacks::Bool, kwargs::@Kwargs{tspan::Tuple{Int64, Int64}, saveat::Tuple{Int64, Int64}, abstol::Float64, reltol::Float64})
    @ DiffEqBase C:\Users\rokko\.julia\packages\DiffEqBase\yM6LF\src\solve.jl:1537
 [32] _solve_adjoint
    @ C:\Users\rokko\.julia\packages\DiffEqBase\yM6LF\src\solve.jl:1510 [inlined]
 [33] #rrule#6
    @ C:\Users\rokko\.julia\packages\DiffEqBase\yM6LF\ext\DiffEqBaseChainRulesCoreExt.jl:26 [inlined]
 [34] rrule
    @ C:\Users\rokko\.julia\packages\DiffEqBase\yM6LF\ext\DiffEqBaseChainRulesCoreExt.jl:22 [inlined]
 [35] rrule
    @ C:\Users\rokko\.julia\packages\ChainRulesCore\zgT0R\src\rules.jl:140 [inlined]
 [36] chain_rrule_kw
    @ C:\Users\rokko\.julia\packages\Zygote\nsBv0\src\compiler\chainrules.jl:235 [inlined]
 [37] macro expansion
    @ C:\Users\rokko\.julia\packages\Zygote\nsBv0\src\compiler\interface2.jl:0 [inlined]
 [38] _pullback
    @ C:\Users\rokko\.julia\packages\Zygote\nsBv0\src\compiler\interface2.jl:87 [inlined]
 [39] _apply
    @ .\boot.jl:838 [inlined]
 [40] adjoint
    @ C:\Users\rokko\.julia\packages\Zygote\nsBv0\src\lib\lib.jl:203 [inlined]
 [41] _pullback
    @ C:\Users\rokko\.julia\packages\ZygoteRules\M4xmc\src\adjoint.jl:67 [inlined]
 [42] #solve#51
    @ C:\Users\rokko\.julia\packages\DiffEqBase\yM6LF\src\solve.jl:1003 [inlined]
 [43] _pullback(::Zygote.Context{false}, ::DiffEqBase.var"##solve#51", ::ForwardDiffSensitivity{0, nothing}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(NNparams = ViewAxis(1:62, Axis(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))), ODEparams = 63:69, u0 = 70:71)}}}, ::Val{true}, ::@Kwargs{tspan::Tuple{Int64, Int64}, saveat::Tuple{Int64, Int64}, abstol::Float64, reltol::Float64}, ::typeof(solve), ::ODEProblem{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Float64, Float64}, true, ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(NNparams = ViewAxis(1:62, Axis(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))), ODEparams = 63:69, u0 = 70:71)}}}, ODEFunction{true, SciMLBase.AutoSpecialize, typeof(extended_ude!), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing, Nothing, Nothing}, @Kwargs{}, SciMLBase.StandardODEProblem}, ::Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False})  
    @ Zygote C:\Users\rokko\.julia\packages\Zygote\nsBv0\src\compiler\interface2.jl:0
 [44] _apply(::Function, ::Vararg{Any})
    @ Core .\boot.jl:838
 [45] adjoint
    @ C:\Users\rokko\.julia\packages\Zygote\nsBv0\src\lib\lib.jl:203 [inlined]
 [46] _pullback
    @ C:\Users\rokko\.julia\packages\ZygoteRules\M4xmc\src\adjoint.jl:67 [inlined]
 [47] solve
    @ C:\Users\rokko\.julia\packages\DiffEqBase\yM6LF\src\solve.jl:993 [inlined]
 [48] _pullback(::Zygote.Context{false}, ::typeof(Core.kwcall), ::@NamedTuple{u0::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, p::ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(NNparams = ViewAxis(1:62, Axis(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))), ODEparams = 63:69, u0 = 70:71)}}}, tspan::Tuple{Int64, Int64}, saveat::Tuple{Int64, Int64}, abstol::Float64, reltol::Float64, sensealg::ForwardDiffSensitivity{0, nothing}}, ::typeof(solve), ::ODEProblem{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Float64, Float64}, true, ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(NNparams = ViewAxis(1:62, Axis(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))), ODEparams = 63:69, u0 = 70:71)}}}, ODEFunction{true, SciMLBase.AutoSpecialize, typeof(extended_ude!), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing, Nothing, Nothing}, @Kwargs{}, SciMLBase.StandardODEProblem}, ::Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False})       
    @ Zygote C:\Users\rokko\.julia\packages\Zygote\nsBv0\src\compiler\interface2.jl:0
 [49] predict
    @ C:\Users\rokko\Desktop\Lab\UniversalDiffEq.jl\src\ProcessModels.jl:57 [inlined]
 [50] _pullback(::Zygote.Context{false}, ::Main.UniversalDiffEq.var"#predict#52"{ODEProblem{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Float64, Float64}, true, ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(NNparams = ViewAxis(1:62, Axis(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))), ODEparams = 63:69, u0 = 70:71)}}}, ODEFunction{true, SciMLBase.AutoSpecialize, typeof(extended_ude!), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing, Nothing, Nothing}, @Kwargs{}, SciMLBase.StandardODEProblem}}, ::CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::Int64, ::Int64, ::ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(NNparams = ViewAxis(1:62, Axis(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))), ODEparams = 63:69, u0 = 70:71)}}}, ::LuxCUDADevice{Nothing})
    @ Zygote C:\Users\rokko\.julia\packages\Zygote\nsBv0\src\compiler\interface2.jl:0
 [51] loss_function
    @ C:\Users\rokko\Desktop\Lab\UniversalDiffEq.jl\src\helpers.jl:98 [inlined]
 [52] _pullback(ctx::Zygote.Context{false}, f::Main.UniversalDiffEq.var"#loss_function#28"{LinearAlgebra.Transpose{Float64, Matrix{Float64}}, Vector{Int64}, Main.UniversalDiffEq.LinkFunction, Main.UniversalDiffEq.LossFunction, Main.UniversalDiffEq.ProcessModel, Main.UniversalDiffEq.LossFunction, Main.UniversalDiffEq.Regularization, Main.UniversalDiffEq.Regularization, LuxCUDADevice{Nothing}}, args::ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(uhat = ViewAxis(1:51, ShapedAxis((3, 17))), process_model = ViewAxis(52:122, Axis(NNparams = ViewAxis(1:62, Axis(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))), ODEparams = 63:69, u0 = 70:71)), process_loss = 123:122, observation_model = 123:122, observation_loss = 123:122, process_regularization = 123:122, observation_regularization = 123:122)}}})
    @ Zygote C:\Users\rokko\.julia\packages\Zygote\nsBv0\src\compiler\interface2.jl:0
 [53] #324
    @ C:\Users\rokko\Desktop\Lab\UniversalDiffEq.jl\src\Optimizers.jl:9 [inlined]
 [54] _pullback(::Zygote.Context{false}, ::Main.UniversalDiffEq.var"#324#329"{Main.UniversalDiffEq.UDE}, ::ComponentVector{Float32, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Tuple{Axis{(uhat = ViewAxis(1:51, ShapedAxis((3, 17))), process_model = ViewAxis(52:122, Axis(NNparams = ViewAxis(1:62, Axis(layer_1 = ViewAxis(1:40, Axis(weight = ViewAxis(1:30, ShapedAxis((10, 3))), bias = ViewAxis(31:40, ShapedAxis((10, 1))))), layer_2 = ViewAxis(41:62, Axis(weight = ViewAxis(1:20, ShapedAxis((2, 10))), bias = ViewAxis(21:22, ShapedAxis((2, 1))))))), ODEparams = 63:69, u0 = 70:71)), process_loss = 123:122, observation_model = 123:122, observation_loss = 123:122, process_regularization = 123:122, observation_regularization = 123:122)}}}, ::SciMLBase.NullParameters)
    @ Zygote C:\Users\rokko\.julia\packages\Zygote\nsBv0\src\compiler\interface2.jl:0

(This is not the full stack trace; I ran out of characters, so I included what I could.)
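A minimal reproducer for the failing call might look like the sketch below. It is an assumption reconstructed from frames [7] to [18] of the trace (a two-layer Lux Chain with the same shapes and activations, evaluated on a Float32 CuArray vector), not the poster's actual code. On the affected setup it may hit the same cublasLt-based fused dense path; on a working setup it should simply return a length-2 vector.

```julia
# Hypothetical minimal reproducer, reconstructed from the stack trace.
# Assumes the Lux v0.5 API and a functional CUDA GPU.
using Lux, LuxCUDA, Random

rng = Random.default_rng()

# Same layer shapes and activations as in the trace: 3 => 10 (tanh), 10 => 2 (relu)
model = Chain(Dense(3 => 10, tanh), Dense(10 => 2, relu))
ps, st = Lux.setup(rng, model)

# Move parameters, state, and input to the GPU device selected by Lux
dev = gpu_device()
ps = ps |> dev
st = st |> dev
x = rand(rng, Float32, 3) |> dev

# The forward pass goes through LuxLib's fused dense + bias + activation path,
# which is where cublasLtMatmulDescCreate was called in the reported trace.
y, _ = model(x, ps, st)
println(Array(y))
```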

What GPU are you using?

I’m using an NVIDIA RTX 2060.

You should always mention which version of CUDA.jl you are using, and what the output of CUDA.versioninfo() is. In this case, I suspect you’re using an outdated version, so please upgrade your packages.
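The information asked for here can be gathered in one go with a snippet like the following. This is a sketch: pkgversion requires Julia 1.9 or later, and the 5.4.2 threshold in the check comes only from this thread, where upgrading to CUDA.jl 5.4.2 is what fixed the error.

```julia
using CUDA, Pkg

CUDA.versioninfo()   # CUDA runtime/driver, library versions, Julia packages, devices
Pkg.status()         # packages in the active environment, with ⌃/⌅ markers

# Warn if CUDA.jl is older than 5.4.2, the release that resolved the error in this thread
cuda_ver = pkgversion(CUDA)
cuda_ver < v"5.4.2" && @warn "CUDA.jl $cuda_ver is older than 5.4.2; consider upgrading"
```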

My CUDA.jl version is currently 5.3.3, which is out of date; however, I am unable to update it to 5.4.2 due to compatibility constraints. I’ve included the output of CUDA.versioninfo() below.

CUDA.versioninfo()

CUDA runtime 12.4, artifact installation
CUDA driver 12.5
NVIDIA driver 555.85.0

CUDA libraries:
- CUBLAS: 12.4.5
- CURAND: 10.3.5
- CUFFT: 11.2.1
- CUSOLVER: 11.6.1
- CUSPARSE: 12.3.1
- CUPTI: 22.0.0
- NVML: 12.0.0+555.85

Julia packages:
- CUDA: 5.3.3
- CUDA_Driver_jll: 0.8.1+0
- CUDA_Runtime_jll: 0.12.1+0

Toolchain:
- Julia: 1.10.3
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce RTX 2060 with Max-Q Design (sm_75, 5.815 GiB / 6.000 GiB available)

Pkg.status()

⌅ [47edcb42] ADTypes v0.2.7
  [764a87c0] BoundaryValueDiffEq v5.7.1
  [336ed68f] CSV v0.10.14
⌅ [052768ef] CUDA v5.3.3
  [b0b7db55] ComponentArrays v0.15.13
  [a93c6f00] DataFrames v1.6.1
  [8bb1440f] DelimitedFiles v1.9.1
⌃ [aae7a2af] DiffEqFlux v3.4.0
  [071ae1c0] DiffEqGPU v3.4.1
  [0c46a032] DifferentialEquations v7.13.0
  [31c24e10] Distributions v0.25.109
  [6a86dc24] FiniteDiff v2.23.1
  [a98d9a8b] Interpolations v0.15.1
  [b964fa9f] LaTeXStrings v1.3.1
⌃ [b2108857] Lux v0.5.47
  [d0bbae9a] LuxCUDA v0.3.2
  [2774e3e8] NLsolve v4.5.1
⌃ [7f7a1694] Optimization v3.24.3
  [36348300] OptimizationOptimJL v0.3.1
  [42dfb2eb] OptimizationOptimisers v0.2.1
  [91a5bcdd] Plots v1.40.4
  [f2b01f46] Roots v2.1.5
  [2913bbd2] StatsBase v0.34.3
  [e88e6eb3] Zygote v0.6.70
⌃ [02a925ec] cuDNN v1.3.1
  [37e2e46d] LinearAlgebra
  [9a3f8284] Random

From my own exploration, my version of CUDA.jl is limited by my version of cuDNN, which is in turn limited by ADTypes (through Optimization and OptimizationBase). Running pkg> status --outdated yields the following:

⌅ [47edcb42] ADTypes v0.2.7 (<v1.2.1): BoundaryValueDiffEq, DiffEqFlux, NonlinearSolve, Optimization, OptimizationBase, SciMLSensitivity, SparseDiffTools
⌅ [052768ef] CUDA v5.3.3 (<v5.4.2): cuDNN
⌃ [aae7a2af] DiffEqFlux v3.4.0 (<v3.5.0)
⌃ [b2108857] Lux v0.5.47 (<v0.5.52)
⌃ [7f7a1694] Optimization v3.24.3 (<v3.25.1)
⌅ [bca83a33] OptimizationBase v0.0.5 (<v1.0.2): Optimization
⌃ [02a925ec] cuDNN v1.3.1 (<v1.3.2)
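The same investigation can be done through the Pkg API. In the output above, ⌅ marks a package with a newer release that cannot be installed because of other packages' compat bounds (those packages are listed after the colon), while ⌃ marks one that could simply be upgraded. The sketch below assumes Julia 1.9 or later (for Pkg.why). Asking the resolver for the newer CUDA.jl series directly makes it report the conflicting compat entries; note that if the request succeeds, it modifies the active environment.

```julia
using Pkg

# Equivalent of `pkg> status --outdated`
Pkg.status(; outdated = true)

# Show the dependency paths through which ADTypes enters the environment
Pkg.why("ADTypes")

# Request the newer CUDA.jl series explicitly; on a conflict the resolver
# raises an error that names the incompatible compat constraints.
try
    Pkg.add(name = "CUDA", version = "5.4")
catch err
    showerror(stdout, err)
    println()
end
```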

Removing DifferentialEquations from the environment allowed me to update CUDA.jl to the latest version (5.4.2), which solved the issue.
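For reference, the fix described here amounts to something like the sketch below. It assumes the project only needs the ODE solvers from the DifferentialEquations metapackage. The trace shows Tsit5 coming from OrdinaryDiffEq, so adding OrdinaryDiffEq directly (and changing using DifferentialEquations to using OrdinaryDiffEq in the code) is a hypothetical substitute, not something stated in the thread. If a re-added package holds CUDA.jl back again, pkg> status --outdated will show it.

```julia
using Pkg

# Drop the DifferentialEquations metapackage; removing it is what
# unblocked the CUDA.jl update in this thread.
Pkg.rm("DifferentialEquations")

# Hypothetical replacement: add only the ODE solver package actually used
# (Tsit5 lives in OrdinaryDiffEq, as the stack trace shows).
Pkg.add("OrdinaryDiffEq")

# Let the resolver move CUDA.jl, cuDNN, ADTypes, etc. to newer releases
Pkg.update()

# In a fresh Julia session (an already-loaded CUDA.jl is not reloaded),
# confirm the upgrade took effect and re-check the CUDA stack.
using CUDA
@assert pkgversion(CUDA) >= v"5.4.2"
CUDA.versioninfo()
```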