Parallelize differential equation solve with interpolated forcing function

I’m trying to run a parameter sweep across forcing functions defined by interpolating discrete data. My data U_all has shape N x Nu x Nt, where N is the number of forcing functions, Nu is the state size of the control input, and Nt is the number of time points I have collected. To use it to force my system, I form an interpolant for each state in U_all and plug that into my diff eq definition. I then try to form an ensemble over the forcing functions (ideally on GPU). Here is the chunk of code that does this (for those familiar, this is the continuous form of a Reservoir Computer):

using DifferentialEquations 
using Dierckx #interpolation
using LinearAlgebra
using CUDA, DiffEqGPU

### load data, define parameters, etc. 
# U_all = ...
# CRC_base = ...  # holds system parameters 
###

# form interpolants
interps_all = [[Spline1D(t, U_all[j, i, :]) for i in 1:size(U_all, 2)] for j in 1:size(U_all, 1)] # Num ICs x Nu

# preallocated work buffers (typed globals) used inside the RHS; CRC_base holds all the DE parameters
Wrr::Vector{Float64} = zeros(CRC_base.Nr)
W_inu::Vector{Float64} = zeros(CRC_base.Nr)

# DE definition
function res_train!(dr, r, p, t_eval)
    interps, Nu, Wr, W_in, tau, alpha, b, f = p
    u = [interps[i](t_eval) for i in 1:Nu]     # evaluate each input spline at t_eval
    BLAS.gemv!('N', 1.0, Wr, r, 0.0, Wrr)      # Wrr = Wr * r (into preallocated buffer)
    BLAS.gemv!('N', 1.0, W_in, u, 0.0, W_inu)  # W_inu = W_in * u
    dr .= 1/tau .* (-alpha .* r .+ f.(Wrr .+ b) .+ W_inu)
end
params = (interps_all[1], size(U_all, 2), CRC_base.Wr, CRC_base.W_in, CRC_base.tau, CRC_base.alpha, CRC_base.b, CRC_base.f) # interps will be varied per trajectory
prob = ODEProblem(res_train!, CRC_base.r, (0.0, t[end]), params)

# ensemble => edit interp for each traj
prob_func_interp = let p = params
    (prob, i, repeat) -> begin
        remake(prob, p = (interps_all[i], p[2:end]...))
    end
end

# run ensemble
ensemble_prob = EnsembleProblem(prob, prob_func = prob_func_interp)
sol_all = solve(ensemble_prob, CRC_base.solve_alg, EnsembleGPUArray(CUDA.CUDABackend()), trajectories = NUM_IC, saveat = t)
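
In case it helps to run this end to end, hypothetical stand-in data with the right shapes would be something like the following (the real U_all, t, and CRC_base come from my experiments; the numbers here are made up):

# Hypothetical stand-in data, only so the snippet above is self-contained:
# 5 forcing functions, 3 input channels, 200 time samples.
N, Nu, Nt = 5, 3, 200
t = collect(range(0.0, 10.0, length = Nt))
U_all = randn(N, Nu, Nt)
NUM_IC = N   # one trajectory per forcing function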

First off, is there any way I can better optimize a single solve on the CPU? I’m worried there is some performance footgun I’m missing in the way I pass the interpolants around.
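
One variant I’ve been sketching (not sure it actually helps) replaces the typed-global buffers and the per-call u array with scratch buffers carried inside p, so nothing is allocated in the RHS. The names u_buf, Wrr_buf, and W_inu_buf are placeholders of mine, and I realize mutable buffers in p may not play nicely with the GPU ensemble later:

# Sketch: same RHS as above, but with preallocated scratch buffers passed
# through p instead of typed globals (u_buf, Wrr_buf, W_inu_buf are my names).
function res_train_buf!(dr, r, p, t_eval)
    interps, Nu, Wr, W_in, tau, alpha, b, f, u_buf, Wrr_buf, W_inu_buf = p
    for i in 1:Nu
        u_buf[i] = interps[i](t_eval)    # evaluate each input spline in place
    end
    mul!(Wrr_buf, Wr, r)                 # Wrr_buf = Wr * r
    mul!(W_inu_buf, W_in, u_buf)         # W_inu_buf = W_in * u
    dr .= 1/tau .* (-alpha .* r .+ f.(Wrr_buf .+ b) .+ W_inu_buf)
end

params_buf = (interps_all[1], size(U_all, 2), CRC_base.Wr, CRC_base.W_in,
              CRC_base.tau, CRC_base.alpha, CRC_base.b, CRC_base.f,
              zeros(size(U_all, 2)), zeros(CRC_base.Nr), zeros(CRC_base.Nr))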

Second, this works with EnsembleDistributed() (provided I stick @everywhere before function definitions and such), but I am having trouble running the ensemble on the GPU. I get a long, confusing error that starts off with:

ERROR: InvalidIRError: compiling MethodInstance for DiffEqGPU.gpu_gpu_kernel(::KernelAbstractions.CompilerMetadata{…}, ::typeof(res_train!), ::CuDeviceMatrix{…}, ::CuDeviceMatrix{…}, ::CuDeviceMatrix{…}, ::Float64) resulted in invalid LLVM IR

Reason: unsupported call to an unknown function (call to ijl_lazy_load_and_lookup)
...
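
My naive guess is that the Spline1D objects and the BLAS calls simply can’t be compiled into a GPU kernel, and that I’d have to pre-sample the forcing onto plain Float64 arrays before anything reaches the device, roughly like the sketch below (u_grid_all and dt_grid are names I made up, and I’m assuming my time grid is uniform). But I’m not sure what the RHS and the per-trajectory parameters would then have to look like for EnsembleGPUArray, so any pointers would be appreciated.

# Sketch of pre-sampling the splines back onto a plain grid, so only
# ordinary arrays would need to reach the GPU (u_grid_all[j] is Nu x Nt).
u_grid_all = [[interps_all[j][i](tk) for i in 1:size(U_all, 2), tk in t]
              for j in 1:size(U_all, 1)]
dt_grid = t[2] - t[1]   # assuming a uniformly spaced time grid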

Thanks for your time!
