Hi all,
As part of a broader project to run SDEs on the GPU, I first took a stab at the example on the SciML website, slightly modified with the number of trajectories set to 250_000 so I could get a better idea of how much time each step takes. My hardware is a modern Ryzen 16C/32T CPU paired with an Ada-generation NVIDIA RTX GPU (Julia 1.10.1 on Linux, NVIDIA proprietary driver 525, CUDA.jl 5.2.0).
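For context, the problem setup I started from is the multiplicative-noise Lorenz SDE ensemble from the DiffEqGPU tutorial. A sketch of it below, written from memory of the docs, so treat the exact constants as assumptions rather than a verbatim copy:

```julia
using DiffEqGPU, StochasticDiffEq, CUDA

# Lorenz drift, in-place, Float32 throughout for the GPU
function lorenz(du, u, p, t)
    du[1] = p[1] * (u[2] - u[1])
    du[2] = u[1] * (p[2] - u[3]) - u[2]
    du[3] = u[1] * u[2] - p[3] * u[3]
end

# Diagonal multiplicative noise
function multiplicative_noise(du, u, p, t)
    du[1] = 0.1f0 * u[1]
    du[2] = 0.1f0 * u[2]
    du[3] = 0.1f0 * u[3]
end

u0 = [1.0f0, 0.0f0, 0.0f0]
tspan = (0.0f0, 10.0f0)
p = [10.0f0, 28.0f0, 8.0f0 / 3.0f0]
prob = SDEProblem(lorenz, multiplicative_noise, u0, tspan, p)
monteprob = EnsembleProblem(prob)
```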
The last line of the example (re-run to avoid first-compilation overhead) was as follows:
julia> @time sol = solve(monteprob, SOSRI(), EnsembleGPUArray(CUDA.CUDABackend()), trajectories = 250_000,
saveat = 1.0f0)
65.599072 seconds (28.80 M allocations: 449.389 GiB, 9.09% gc time)
EnsembleSolution Solution of length 250000 with uType:
RODESolution{Float32, 2, uType, Nothing, Nothing, Vector{Float32}, randType, SDEProblem{Vector{Float32}, Tuple{Float32, Float32}, true, Vector{Float32}, Nothing, SDEFunction{true, SciMLBase.FullSpecialize, typeof(lorenz), typeof(multiplicative_noise), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, typeof(multiplicative_noise), Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}, Nothing}, SOSRI, IType, SciMLBase.DEStats, Nothing} where {uType, randType, IType}
What I observed during those 65 s was, sequentially:
- ~40 s with 1 CPU thread active and nothing happening on the GPU
- ~5 s with all 32 CPU threads active and nothing happening on the GPU
- ~20 s with the GPU fully busy (and 1 active CPU thread)
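For reference, the per-phase GPU activity above was observed by polling nvidia-smi alongside a CPU monitor; something like this (the sampling interval is arbitrary):

```shell
# Sample GPU utilization and memory use once per second while the solve runs
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```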
After that, I tried a CPU-only run with the same settings:
julia> @time sol = solve(monteprob, SOSRI(); trajectories = 250_000,
saveat = 1.0f0)
23.182337 seconds (31.38 M allocations: 2.262 GiB, 0.96% gc time, 0.28% compilation time)
EnsembleSolution Solution of length 250000 with uType:
RODESolution{Float32, 2, Vector{Vector{Float32}}, Nothing, Nothing, Vector{Float32}, DiffEqNoiseProcess.NoiseProcess{Float32, 2, Float32, Vector{Float32}, Vector{Float32}, Vector{Vector{Float32}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float32, Vector{Float32}, Vector{Float32}}, true}, ResettableStacks.ResettableStack{Tuple{Float32, Vector{Float32}, Vector{Float32}}, true}, DiffEqNoiseProcess.RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float32}, Tuple{Float32, Float32}, true, Vector{Float32}, Nothing, SDEFunction{true, SciMLBase.FullSpecialize, typeof(lorenz), typeof(multiplicative_noise), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, typeof(multiplicative_noise), Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}, Nothing}, SOSRI, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float32}}, Vector{Float32}}, SciMLBase.DEStats, Nothing}
The full 23 s were spent with all 32 CPU threads working.
My question is: what is happening during the different phases of the GPU run, and why is it slower overall than the CPU run?