Hi all,
As part of a broader project to run SDEs on the GPU, I first took a stab at the example on the SciML website, slightly modified with the number of trajectories set to 250_000 so I could get a better idea of how much time each step takes. My hardware is a modern Ryzen 16C/32T CPU paired with an Ada-generation NVIDIA RTX GPU (Julia 1.10.1 on Linux, NVIDIA proprietary driver 525, CUDA.jl 5.2.0).
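For context, the problem setup I started from is the multiplicative-noise Lorenz SDE ensemble from the DiffEqGPU tutorial. A sketch of it below, written from memory of the docs, so treat the exact constants as assumptions rather than a verbatim copy:

```julia
using DiffEqGPU, StochasticDiffEq, CUDA

# Lorenz drift, in-place, Float32 throughout for the GPU
function lorenz(du, u, p, t)
    du[1] = p[1] * (u[2] - u[1])
    du[2] = u[1] * (p[2] - u[3]) - u[2]
    du[3] = u[1] * u[2] - p[3] * u[3]
end

# Diagonal multiplicative noise
function multiplicative_noise(du, u, p, t)
    du[1] = 0.1f0 * u[1]
    du[2] = 0.1f0 * u[2]
    du[3] = 0.1f0 * u[3]
end

u0 = [1.0f0, 0.0f0, 0.0f0]
tspan = (0.0f0, 10.0f0)
p = [10.0f0, 28.0f0, 8.0f0 / 3.0f0]
prob = SDEProblem(lorenz, multiplicative_noise, u0, tspan, p)
monteprob = EnsembleProblem(prob)
```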
The last line of the example (re-run to avoid first-compilation overhead) was as follows:
julia> @time sol = solve(monteprob, SOSRI(), EnsembleGPUArray(CUDA.CUDABackend()), trajectories = 250_000,
saveat = 1.0f0)
65.599072 seconds (28.80 M allocations: 449.389 GiB, 9.09% gc time)
EnsembleSolution Solution of length 250000 with uType:
RODESolution{Float32, 2, uType, Nothing, Nothing, Vector{Float32}, randType, SDEProblem{Vector{Float32}, Tuple{Float32, Float32}, true, Vector{Float32}, Nothing, SDEFunction{true, SciMLBase.FullSpecialize, typeof(lorenz), typeof(multiplicative_noise), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, typeof(multiplicative_noise), Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}, Nothing}, SOSRI, IType, SciMLBase.DEStats, Nothing} where {uType, randType, IType}
What I observed during those 65 s was, sequentially:
- ~40 s with 1 CPU thread active and nothing happening on the GPU
- ~5 s with all 32 CPU threads active and nothing happening on the GPU
- ~20 s with the GPU fully busy (and 1 active CPU thread)
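For reference, the per-phase GPU activity above was observed by polling nvidia-smi alongside a CPU monitor; something like this (the sampling interval is arbitrary):

```shell
# Sample GPU utilization and memory use once per second while the solve runs
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```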
After that, I tried a CPU-only run with the same settings:
julia> @time sol = solve(monteprob, SOSRI(); trajectories = 250_000,
saveat = 1.0f0)
23.182337 seconds (31.38 M allocations: 2.262 GiB, 0.96% gc time, 0.28% compilation time)
EnsembleSolution Solution of length 250000 with uType:
RODESolution{Float32, 2, Vector{Vector{Float32}}, Nothing, Nothing, Vector{Float32}, DiffEqNoiseProcess.NoiseProcess{Float32, 2, Float32, Vector{Float32}, Vector{Float32}, Vector{Vector{Float32}}, typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_DIST), typeof(DiffEqNoiseProcess.INPLACE_WHITE_NOISE_BRIDGE), true, ResettableStacks.ResettableStack{Tuple{Float32, Vector{Float32}, Vector{Float32}}, true}, ResettableStacks.ResettableStack{Tuple{Float32, Vector{Float32}, Vector{Float32}}, true}, DiffEqNoiseProcess.RSWM{Float64}, Nothing, RandomNumbers.Xorshifts.Xoroshiro128Plus}, SDEProblem{Vector{Float32}, Tuple{Float32, Float32}, true, Vector{Float32}, Nothing, SDEFunction{true, SciMLBase.FullSpecialize, typeof(lorenz), typeof(multiplicative_noise), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing, Nothing}, typeof(multiplicative_noise), Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}, Nothing}, SOSRI, StochasticDiffEq.LinearInterpolationData{Vector{Vector{Float32}}, Vector{Float32}}, SciMLBase.DEStats, Nothing}
The full 23 s were spent with all 32 CPU threads working.
My question is: what is happening during the different phases of the GPU run, and why is it slower overall than the CPU run?