DiffEqGPU - slow parallel solving of SDEs on GPU

Hi all,

As part of a broader project to run SDEs on the GPU, I first took a stab at the example on the SciML website, slightly modified so that the number of trajectories is 250,000, to get a better idea of how much time each step takes. My hardware is a modern Ryzen 16C/32T CPU paired with an Ada-generation NVIDIA RTX GPU (Julia 1.10.1 on Linux, NVIDIA proprietary driver 525, CUDA.jl 5.2.0).
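For reference, the setup is essentially the Lorenz SDE from the docs, sketched from memory below (the noise coefficients and whether a prob_func randomizes the parameters may differ slightly from the website version):

using StochasticDiffEq, DiffEqGPU, CUDA

# In-place drift and diagonal multiplicative noise, as in the website example
function lorenz(du, u, p, t)
    du[1] = p[1] * (u[2] - u[1])
    du[2] = u[1] * (p[2] - u[3]) - u[2]
    du[3] = u[1] * u[2] - p[3] * u[3]
end

function multiplicative_noise(du, u, p, t)
    du[1] = 0.1f0 * u[1]
    du[2] = 0.1f0 * u[2]
    du[3] = 0.1f0 * u[3]
end

u0 = [1.0f0, 0.0f0, 0.0f0]
tspan = (0.0f0, 10.0f0)
p = [10.0f0, 28.0f0, 8 / 3.0f0]
prob = SDEProblem(lorenz, multiplicative_noise, u0, tspan, p)
monteprob = EnsembleProblem(prob)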

The last line of the example (re-run to avoid measuring first-compilation time) was as follows:

julia> @time sol = solve(monteprob, SOSRI(), EnsembleGPUArray(CUDA.CUDABackend()), trajectories = 250_000,
           saveat = 1.0f0)
 65.599072 seconds (28.80 M allocations: 449.389 GiB, 9.09% gc time)
EnsembleSolution Solution of length 250000 with uType:
RODESolution{Float32, 2, uType, Nothing, Nothing, Vector{Float32}, randType, SDEProblem{Vector{Float32}, Tuple{Float32, Float32}, true, Vector{Float32}, Nothing, SDEFunction{true, SciMLBase.FullSpecialize, typeof(lorenz), typeof(multiplicative_noise), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, typeof(multiplicative_noise), Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}, Nothing}, SOSRI, IType, SciMLBase.DEStats, Nothing} where {uType, randType, IType}

What I observed during those 65 seconds was, in sequence:

  • 40s with 1 CPU thread active, nothing happening on the GPU
  • 5s with 32 CPU threads active, nothing happening on the GPU
  • 20s with the GPU being fully active (and 1 active thread on the CPU)

After that, I tried a CPU run with the same settings (no ensemble algorithm specified, which defaults to multithreaded EnsembleThreads):

julia> @time sol = solve(monteprob, SOSRI(); trajectories = 250_000,
           saveat = 1.0f0)
 23.182337 seconds (31.38 M allocations: 2.262 GiB, 0.96% gc time, 0.28% compilation time)
EnsembleSolution Solution of length 250000 with uType:
RODESolution{Float32, 2, Vector{Vector{Float32}}, Nothing, Nothing, Vector{Float32}, DiffEqNoiseProcess.NoiseProcess{Float32, 2, Float32, Vector{Float32}, Vector{Float32}, Vector{Vector{Float32}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float32, Vector{Float32}, Vector{Float32}}, true}, ResettableStacks.ResettableStack{Tuple{Float32, Vector{Float32}, Vector{Float32}}, true}, DiffEqNoiseProcess.RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float32}, Tuple{Float32, Float32}, true, Vector{Float32}, Nothing, SDEFunction{true, SciMLBase.FullSpecialize, typeof(lorenz), typeof(multiplicative_noise), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, typeof(multiplicative_noise), Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}, Nothing}, SOSRI, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float32}}, Vector{Float32}}, SciMLBase.DEStats, Nothing}

Those 23 seconds were spent with all 32 CPU threads busy.

My question is: what happens during the different phases of the GPU run, and why is it slower than the CPU run?

Did you do EnsembleGPUKernel? That’s known to be a lot faster.

(We should update that page, it’s really old)

Hi Chris. I haven’t found an algorithm that works with EnsembleGPUKernel() without complaining about something else (adaptivity, dt, etc.). If you have a suggestion I’m happy to try it.

My only successful attempt (GPUEM() with an explicit dt, adaptive = false, and out-of-place drift and noise functions) throws a warning that it’s running the kernel on the CPU.

Are you using the right backend? Try something like EnsembleGPUKernel(CUDA.CUDABackend())
Meanwhile, please share some code if things fail, as compiling stuff with EnsembleGPUKernel has some restrictions: EnsembleGPUKernel · DiffEqGPU.jl
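For an SDE, the call would look roughly like this (untested sketch; dt is a placeholder, and the problem must already be in the kernel-compatible out-of-place form described on that page):

sol = solve(monteprob, GPUEM(), EnsembleGPUKernel(CUDA.CUDABackend()),
            trajectories = 250_000, dt = 0.01f0, adaptive = false, saveat = 1.0f0)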


Thanks for the link. The page has useful information on what EnsembleGPUKernel allows in the context of an ODE (information that unfortunately was not present on the SDE page I had found). Adapting it to my SDE worked; a sketch of the result is below. I will do further experiments on speed, thank you.
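In case it helps others, the adaptation that ended up working for me looks roughly like this (out-of-place, SVector-based drift and noise with fixed-dt GPUEM; simplified from my actual code, so treat the values as placeholders):

using StochasticDiffEq, DiffEqGPU, CUDA, StaticArrays

# Out-of-place, SVector-returning drift and noise, as required by EnsembleGPUKernel
lorenz(u, p, t) = SVector(p[1] * (u[2] - u[1]),
                          u[1] * (p[2] - u[3]) - u[2],
                          u[1] * u[2] - p[3] * u[3])

multiplicative_noise(u, p, t) = SVector(0.1f0 * u[1], 0.1f0 * u[2], 0.1f0 * u[3])

u0 = @SVector [1.0f0, 0.0f0, 0.0f0]
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = SDEProblem(lorenz, multiplicative_noise, u0, (0.0f0, 10.0f0), p)
monteprob = EnsembleProblem(prob)

# Fixed-dt Euler-Maruyama on the GPU kernel backend
sol = solve(monteprob, GPUEM(), EnsembleGPUKernel(CUDA.CUDABackend()),
            trajectories = 250_000, dt = 0.01f0, adaptive = false, saveat = 1.0f0)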

@utkarsh530 can you make a new SDE tutorial? That seems like it would be useful.