I’m trying to run a parameter sweep across various different forcing functions defined by interpolations through discrete data. My discrete data U_all is of shape N x Nu x Nt where N is the number of forcing functions I have, Nu is the state size of the control input, and Nt is the number of points in time I have collected. In order to use it to force my system, I form an interpolant function for each state in U_all and plug that into my diff eq definition. I am then trying to form an ensemble over forcing function (particularly on GPU). Here is the chunk of code that does this (for those familiar, this the continuous form of a Reservoir Computer):

```
using DifferentialEquations
using Dierckx #interpolation
using LinearAlgebra
using CUDA, DiffeqGPU
### load data, define parameters, etc.
# U_all = ...
# CRC = ... # holds system parameters
###
# form interpolants
interps_all = [[Spline1D(t, U_all[j, i, :]) for i in 1:size(U_all, 2)] for j in 1:size(U_all, 1)] # Num ICs x Nu
# cache constants, CRC holds all the parameters for the DE
Wrr::Vector{Float64} = zeros(CRC_base.Nr)
W_inu::Vector{Float64} = zeros(CRC_base.Nr)
# DE definition
function res_train!(dr, r, p, t_eval)
interps, Nu, Wr, W_in, tau, alpha, b, f = p
u = [interps[i](t_eval) for i in 1:Nu]
BLAS.gemv!('N', 1.0, Wr, r, 0.0, Wrr)
BLAS.gemv!('N', 1.0, W_in, u, 0.0, W_inu)
dr .= 1/tau .* (-alpha .* r .+ f.(Wrr .+ b) .+ W_inu)
end
params = (interps_all[1], size(U_all, 2), CRC_base.Wr, CRC_base.W_in, CRC_base.tau, CRC_base.alpha, CRC_base.b, CRC_base.f)# interps will be varied
prob = ODEProblem(res_train!, CRC_base.r, (0.0, t[end]), params)
# ensemble => edit interp for each traj
prob_func_interp = let p = params
(prob, i, repeat) -> begin
remake(prob, p = (interps_all[i], p[2:end]...))
end
end
# run ensemble
ensemble_prob = EnsembleProblem(prob, prob_func = prob_func_interp)
sol_all = solve(ensemble_prob, CRC_base.solve_alg, EnsembleGPUArray(CUDA.CUDABackend()), trajectories = NUM_IC, saveat = t)
```

First off, is there any way I can better optimize a single solve on CPU? I’m worried there is some performance footgun I’m missing through this interpolation passing

Second, this works with EnsembleDistributed() (provided I stick @everywhere before function definitions and such), but I am having trouble distributing across a GPU. I get a large and confusing error that starts off with

```
ERROR: InvalidIRError: compiling MethodInstance for DiffEqGPU.gpu_gpu_kernel(::KernelAbstractions.CompilerMetadata{…}, ::typeof(res_train!), ::CuDeviceMatrix{…}, ::CuDeviceMatrix{…}, ::CuDeviceMatrix{…}, ::Float64) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to ijl_lazy_load_and_lookup)
...
```

Thanks for your time!