Operator-split chemistry solve in GPU-resident CFD code using DiffEqGPU

Hi,

I’m looking to see if there is any way I could solve an ensemble of ODEs similar to using the low-level API of EnsembleGPUKernel in DiffEqGPU, but where the ensemble trajectories are pulled from a multi-dimensional CuArray of inputs already on the GPU, and the final solution needs to be written back to said CuArray directly.

Big picture: I am using a CFD code (Oceananigans.jl) that runs on GPUs using KernelAbstractions and MPI, and stores its solution in individual CuArrays (when running on NVIDIA GPUs) for the velocities, pressure, and tracers, and then uses its own RK3 loop over all its kernelized algorithms. What I want is to add very stiff finite-rate chemical kinetics to the simulation by Strang splitting the fluid transport and reactions, and then integrate the stiff reactions using a separate ODE solver, ideally from a package like DifferentialEquations.jl so I can find the optimal solver and tolerances.

Small picture: I have this ODEProblem where, on a CPU, or when offloading data stored on CPU to GPU, I can copy the individual species data from each spatial point as a new ensemble in prob_func:

using OrdinaryDiffEq
using DiffEqGPU
using Metal
# using CUDA
# using BenchmarkTools: @btime
using StaticArrays: SVector

GPU = Metal.MetalBackend()
const FP = Float32

# Chemistry defined here, particulars not important
include("carbonate_system.jl")

CO₂ = 10.0
HCO₃ = 1670.0
CO₃ = 315.0
OH = 20.0
BOH₃ = 297.0
BOH₄ = 119.0
c0 = SVector{6, FP}(CO₂, HCO₃, CO₃, OH, BOH₃, BOH₄)

T = 25.0 # degrees Celsius
S = 35.0 # salinity in g/kg
p = SVector{2, FP}(T, S)

tspan = SVector{2, FP}(0.0, 1.0)
prob = ODEProblem(carbonate_rhs, c0, tspan, p)
sol = solve(prob, Rodas5(), reltol = 1e-6, abstol = 1e-10, save_everystep = false) 

nx, ny, nz = (32, 32, 32)
ntraj = nx * ny * nz
i2s = CartesianIndices((nx, ny, nz))

P1 = rand(nx, ny, nz) .* T; # values from 0 to 25
C1 = (0.2 .* rand(nx, ny, nz) .+ 0.9) * CO₂; # values from 0.9 to 1.1 x baseline
C2 = (0.2 .* rand(nx, ny, nz) .+ 0.9) * HCO₃;
C3 = (0.2 .* rand(nx, ny, nz) .+ 0.9) * CO₃;
C4 = (0.2 .* rand(nx, ny, nz) .+ 0.9) * OH;
C5 = (0.2 .* rand(nx, ny, nz) .+ 0.9) * BOH₃;
C6 = (0.2 .* rand(nx, ny, nz) .+ 0.9) * BOH₄;

function prob_func(prob, i, repeat)
    ijk = i2s[i]
    C_n = SVector{6, FP}(C1[ijk], C2[ijk], C3[ijk], C4[ijk], C5[ijk], C6[ijk])
    p = SVector{2, FP}(P1[ijk], S) # S is constant
    return remake(prob; u0 = C_n, p)
end

# for testing purposes, use different output arrays
C1′ = zeros(FP, nx, ny, nz);
C2′ = zeros(FP, nx, ny, nz);
C3′ = zeros(FP, nx, ny, nz);
C4′ = zeros(FP, nx, ny, nz);
C5′ = zeros(FP, nx, ny, nz);
C6′ = zeros(FP, nx, ny, nz);

function output_func(sol, i) 
    ijk = i2s[i]
    C_np1 = sol.u[end]
    C1′[ijk] = C_np1[1]
    C2′[ijk] = C_np1[2]
    C3′[ijk] = C_np1[3]
    C4′[ijk] = C_np1[4]
    C5′[ijk] = C_np1[5]
    C6′[ijk] = C_np1[6]
    return (nothing, false)
end

vprob = EnsembleProblem(prob; output_func, prob_func, safetycopy = false)
vsol = solve(vprob, GPURodas4(), EnsembleGPUKernel(GPU), trajectories = ntraj, 
             reltol = 1f-6, abstol = 1f-10, save_everystep = false)

Is there any way to adapt this example for when my “ensemble” data is already on the GPU as CuArrays, with the obvious problem being I can’t simply index into the CuArrays? Or am I stuck with re-writing my chemical system to look something like the Brusselator example (GPU-Acceleration of a Stiff Nonlinear Partial Differential Equation · DiffEqGPU.jl), where I broadcast the single-point ODE func onto a big 4D solution array (nx, ny, nz, nspecies), passed as a Ref? I’d like to avoid that if possible, as I’m not sure if I’d be able to keep memory allocations reasonable.

1 Like

Will this kind of approach work for you Using the Lower Level API for Decreased Overhead with GPU acclerated Ensembles · DiffEqGPU.jl ? You just need an array of ODEProblem on the GPU.

Thanks for the quick reply!

It seemed to me I’d still need to index into the CuArray during the call to remake, and that’s basically my only stumbling block, but my hope was to use something like the low-level API to keep everything on the GPU and never communicate back to the CPU.

Ahhh, sorry, I’m new enough to Julia that I didn’t recognize the lines in the low-level example

## Building different problems for different parameters
probs = map(1:trajectories) do i
    DiffEqGPU.make_prob_compatible(remake(prob, p = (@SVector rand(Float32, 3)) .* p))
end

was returning an array of ODEProblems. I thought probs would be a function signature like prob_func, and therefore was the exact same issue as using the higher-level API.

But, spitballing here, can I write something like prob_func that treats each species array as a scalar, then do something like map(prob_func, C1, C2, C3, C4, C5, C6, P1) to generate a collection of ODEProblems on the GPU?