Operator-split chemistry solve in GPU-resident CFD code using DiffEqGPU

ColinTowery · November 5, 2025, 8:16pm

Hi,

I’m looking to see if there is any way I could solve an ensemble of ODEs similar to using the low-level API of EnsembleGPUKernel in DiffEqGPU, but where the ensemble trajectories are pulled from a multi-dimensional CuArray of inputs already on the GPU, and the final solution needs to be written back to said CuArray directly.

Big picture: I am using a CFD code (Oceananigans.jl) that runs on GPUs using KernelAbstractions and MPI, and stores its solution in individual CuArrays (when running on NVIDIA GPUs) for the velocities, pressure, and tracers, and then uses its own RK3 loop over all its kernelized algorithms. What I want is to add very stiff finite-rate chemical kinetics to the simulation by Strang splitting the fluid transport and reactions, and then integrate the stiff reactions using a separate ODE solver, ideally from a package like DifferentialEquations.jl so I can find the optimal solver and tolerances.

Small picture: I have this ODEProblem where, on a CPU, or when offloading data stored on CPU to GPU, I can copy the individual species data from each spatial point as a new ensemble in prob_func:

using OrdinaryDiffEq
using DiffEqGPU
using Metal
# using CUDA
# using BenchmarkTools: @btime
using StaticArrays: SVector

GPU = Metal.MetalBackend()
const FP = Float32

# Chemistry defined here, particulars not important
include("carbonate_system.jl")

CO₂ = 10.0
HCO₃ = 1670.0
CO₃ = 315.0
OH = 20.0
BOH₃ = 297.0
BOH₄ = 119.0
c0 = SVector{6, FP}(CO₂, HCO₃, CO₃, OH, BOH₃, BOH₄)

T = 25.0 # degrees Celsius
S = 35.0 # salinity in g/kg
p = SVector{2, FP}(T, S)

tspan = SVector{2, FP}(0.0, 1.0)
prob = ODEProblem(carbonate_rhs, c0, tspan, p)
sol = solve(prob, Rodas5(), reltol = 1e-6, abstol = 1e-10, save_everystep = false) 

nx, ny, nz = (32, 32, 32)
ntraj = nx * ny * nz
i2s = CartesianIndices((nx, ny, nz))

P1 = rand(nx, ny, nz) .* T; # values from 0 to 25
C1 = (0.2 .* rand(nx, ny, nz) .+ 0.9) * CO₂; # values from 0.9 to 1.1 x baseline
C2 = (0.2 .* rand(nx, ny, nz) .+ 0.9) * HCO₃;
C3 = (0.2 .* rand(nx, ny, nz) .+ 0.9) * CO₃;
C4 = (0.2 .* rand(nx, ny, nz) .+ 0.9) * OH;
C5 = (0.2 .* rand(nx, ny, nz) .+ 0.9) * BOH₃;
C6 = (0.2 .* rand(nx, ny, nz) .+ 0.9) * BOH₄;

function prob_func(prob, i, repeat)
    ijk = i2s[i]
    C_n = SVector{6, FP}(C1[ijk], C2[ijk], C3[ijk], C4[ijk], C5[ijk], C6[ijk])
    p = SVector{2, FP}(P1[ijk], S) # S is constant
    return remake(prob; u0 = C_n, p)
end

# for testing purposes, use different output arrays
C1′ = zeros(FP, nx, ny, nz);
C2′ = zeros(FP, nx, ny, nz);
C3′ = zeros(FP, nx, ny, nz);
C4′ = zeros(FP, nx, ny, nz);
C5′ = zeros(FP, nx, ny, nz);
C6′ = zeros(FP, nx, ny, nz);

function output_func(sol, i) 
    ijk = i2s[i]
    C_np1 = sol.u[end]
    C1′[ijk] = C_np1[1]
    C2′[ijk] = C_np1[2]
    C3′[ijk] = C_np1[3]
    C4′[ijk] = C_np1[4]
    C5′[ijk] = C_np1[5]
    C6′[ijk] = C_np1[6]
    return (nothing, false)
end

vprob = EnsembleProblem(prob; output_func, prob_func, safetycopy = false)
vsol = solve(vprob, GPURodas4(), EnsembleGPUKernel(GPU), trajectories = ntraj, 
             reltol = 1f-6, abstol = 1f-10, save_everystep = false)

Is there any way to adapt this example for when my “ensemble” data is already on the GPU as CuArrays, with the obvious problem being I can’t simply index into the CuArrays? Or am I stuck with re-writing my chemical system to look something like the Brusselator example (GPU-Acceleration of a Stiff Nonlinear Partial Differential Equation · DiffEqGPU.jl), where I broadcast the single-point ODE func onto a big 4D solution array (nx, ny, nz, nspecies), passed as a Ref? I’d like to avoid that if possible, as I’m not sure if I’d be able to keep memory allocations reasonable.

utkarsh530 · November 5, 2025, 8:32pm

Will this kind of approach work for you Using the Lower Level API for Decreased Overhead with GPU acclerated Ensembles · DiffEqGPU.jl ? You just need an array of ODEProblem on the GPU.

ColinTowery · November 5, 2025, 8:40pm

Thanks for the quick reply!

It seemed to me I’d still need to index into the CuArray during the call to remake, and that’s basically my only stumbling block, but my hope was to use something like the low-level API to keep everything on the GPU and never communicate back to the CPU.

ColinTowery · November 5, 2025, 8:58pm

Ahhh, sorry, I’m new enough to Julia that I didn’t recognize the lines in the low-level example

## Building different problems for different parameters
probs = map(1:trajectories) do i
    DiffEqGPU.make_prob_compatible(remake(prob, p = (@SVector rand(Float32, 3)) .* p))
end

was returning an array of ODEProblems. I thought probs would be a function signature like prob_func, and therefore was the exact same issue as using the higher-level API.

But, spitballing here, can I write something like prob_func that treats each species array as a scalar, then do something like map(prob_func, C1, C2, C3, C4, C5, C6, P1) to generate a collection of ODEProblems on the GPU?

ColinTowery · November 11, 2025, 1:15am

Working through the problem further, I’m not sure the SciMLBase.remake function will compile directly on GPUs.

If I test the low-level EnsembleGPUKernel API with data on the CPU and broadcasting remake

    C1 = (0.2 .* rand(nx, ny, nz) .+ 0.9) * CO₂; # values from 0.9 to 1.1 x baseline
    C2 = (0.2 .* rand(nx, ny, nz) .+ 0.9) * HCO₃;
    C3 = (0.2 .* rand(nx, ny, nz) .+ 0.9) * CO₃;
    C4 = (0.2 .* rand(nx, ny, nz) .+ 0.9) * OH;
    C5 = (0.2 .* rand(nx, ny, nz) .+ 0.9) * BOH₃;
    C6 = (0.2 .* rand(nx, ny, nz) .+ 0.9) * BOH₄;

    function remake_compatible(prob, c1, c2, c3, c4, c5, c6, p1, p2)
        u0 = SVector{6, FP}(c1, c2, c3, c4, c5, c6)
        p = SVector{2, FP}(p1, p2)
        return DiffEqGPU.make_prob_compatible(remake(prob; u0, p))
    end
    
    probs = remake_compatible.(Ref(prob), C1, C2, C3, C4, C5, C6, P1, S)

then the low-level API works just fine, transferring data from CPU to GPU, and the solution output from

_, us = DiffEqGPU.vectorized_asolve(probs, prob, GPURodas5P(),  reltol = 1f-6, abstol = 1f-10)

comes out as, e.g., a MtlMatrix of SVectors on my Apple Laptop.

However, when I redo the test with MtlArrays on my laptop or CuArrays on an HPC machine, such as

    P1 = cu(rand(nx, ny, nz) .* T); # values from 0 to 25
    C1 = cu((0.2 .* rand(nx, ny, nz) .+ 0.9) * CO₂); # values from 0.9 to 1.1 x baseline
    C2 = cu((0.2 .* rand(nx, ny, nz) .+ 0.9) * HCO₃);
    C3 = cu((0.2 .* rand(nx, ny, nz) .+ 0.9) * CO₃);
    C4 = cu((0.2 .* rand(nx, ny, nz) .+ 0.9) * OH);
    C5 = cu((0.2 .* rand(nx, ny, nz) .+ 0.9) * BOH₃);
    C6 = cu((0.2 .* rand(nx, ny, nz) .+ 0.9) * BOH₄);

Then broadcasting the remake – probs = remake_compatible.(Ref(prob), C1, C2, C3, C4, C5, C6, P1, S) – fails with long error messages, but in a nutshell the error is ERROR: InvalidIRError and, among the reasons is Reason: unsupported dynamic function invocation, which if I understand correctly means that somewhere in the remake function chain there is a failure to infer types statically at compile-time, and GPUs cannot dynamically invoke typing at runtime in a broadcasting operation.

I also tested with map(remake_compatible, Ref(prob), C1, C2, C3, C4, C5, C6, P1, S) instead of broadcasting, and this fails straightforwardly with ERROR: Scalar indexing is disallowed on both Metal and CUDA.

If you have any thoughts on how to get remake to compile on GPUs, I’d love some expert help there. I tried making all the inputs const since I was testing with a global script, and also invoking the original problem as ODEProblem{false} vs without the {false}, but that’s as far as my knowledge extends to getting things to be type stable.

Otherwise I’ll just suck it up and figure out how to use a within-method style for my chemistry model that integrates the whole CuArrays at once and doesn’t explode with intermediate memory allocations and copies.

ColinTowery · November 12, 2025, 7:57pm

Working further through the problem using a MWE based off the Lorenz problem example, I’m pretty sure somewhere in the guts of remake (specifically _remake_internal) things aren’t type-stable for GPU compilation, so I’ll try a new approach.

using DiffEqGPU, OrdinaryDiffEq, StaticArrays, DiffEqBase, Metal
using Cthulhu

trajectories = 10_000

function lorenz(u::SVector{3, Float32}, p::SVector{3, Float32}, t::Float32)::SVector{3, Float32}
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

function main()
    u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
    tspan = (0.0f0, 10.0f0)
    p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
    prob = ODEProblem{false}(lorenz, u0, tspan, p)

    # # This all runs just fine
    # probs = map(1:trajectories) do i
    #     DiffEqGPU.make_prob_compatible(remake(prob, p = (@SVector rand(Float32, 3)) .* p))
    # end;
    # probs = mtl(probs);
    # ts, us = DiffEqGPU.vectorized_solve(probs, prob, GPUTsit5(); save_everystep = false, dt = 0.1f0)

    nx, ny, nz = (32, 32, 32)
    P1 = mtl(rand(nx, ny, nz) .* p[1]); 
    P2 = mtl(rand(nx, ny, nz) .* p[2]); 
    P3 = mtl(rand(nx, ny, nz) .* p[3]); 

    # ERROR HERE
    probs = map(P1, P2, P3) do p1, p2, p3
        DiffEqGPU.make_prob_compatible(remake(prob, u0 = u0, p = (@SVector [p1, p2, p3])))
    end
end

@device_code_warntype interactive=true main()

Topic		Replies	Views
]test DiffEqGPU errors with " unsupported call to the Julia runtime" GPU first-steps	19	1679	September 20, 2019
DIfferentialEquations and GPU Performance gpu	15	5066	December 7, 2020
DifferentialEquations inside CUDA kernels General Usage gpu	3	612	August 26, 2019
DifferentialEquations + RecursiveArrayTools Broadcasting / CuArrays Kernel compilation error -- GPU diffeq	6	894	February 4, 2020
Struggling to train a UDE with DiffEqGPU General Usage diffeq , cuda , optimization , diffeqflux	9	858	March 7, 2023

Operator-split chemistry solve in GPU-resident CFD code using DiffEqGPU

Related topics