Performance of Ensemble Simulations on GPUs

We are currently analyzing the performance of ensemble simlations on CPUs and GPUs. In our case, we noticed that the ensemble simulation on the GPU with EnsembleGPUArray only stresses the GPU for about 6s. Nevertheless, the overall calculation time using EnsembleGPUArray is around 115s, which is shown below. It seems that the preparation for the computations on the GPU and the copying of the data from host to deivce and the other way around takes a great amount of time.

CPU Ensemble Calculations
187.014356 seconds (109.21 M allocations: 14.871 GiB, 2.22% gc time, 0.00% compilation time)
180.962341 seconds (85.79 M allocations: 13.464 GiB, 2.74% gc time)
GPU Ensemble Calculations
190.710367 seconds (130.19 M allocations: 231.045 GiB, 31.04% gc time, 3.68% compilation time)
115.335076 seconds (31.41 M allocations: 225.337 GiB, 15.32% gc time)

Is it possible to improve the performance for EnsembleGPUArray since the GPU is only stressed for a small fraction of the overall calculation time? Are we missing something?

The code executing the ensemble simulations is shown below. The function ROMO is a vehicle model and is omitted here due to readability reasons. The used GPU is a NVIDIA GeForce RTX 3080.

using OrdinaryDiffEq
using DifferentialEquations
using Distributed
using DiffEqGPU
using CUDA
using Plots

function ROMO(dx,x,p,t)
    ...
end

ENV["JULIA_NUM_THREADS"]

x0 = Float32[0.0; 20.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 100; 100; 0.0; 0.0]

tspan = (0.0, 400.0)
savetime = 1.0f0
p = [1013.0f0, 1130.0f0, 0.8f0]
prob = ODEProblem(ROMO,x0,tspan,p)

prob_func = (prob,i,repeat) -> remake(prob,p=p+0.2*rand(Float32,3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy=false)

println("CPU Ensemble Calculations")
@time sol = solve(monteprob,RK4(),EnsembleThreads(),trajectories=100_000,saveat=savetime, dt=0.001);
sol=0
GC.gc()
@time sol = solve(monteprob,RK4(),EnsembleThreads(),trajectories=100_000,saveat=savetime, dt=0.001);
sol=0
GC.gc()

println("GPU Ensemble Calculations")
@time sol = solve(monteprob,RK4(),EnsembleGPUArray(),trajectories=100_000,saveat=savetime, dt=0.001);
sol=0
GC.gc()
@time sol = solve(monteprob,RK4(),EnsembleGPUArray(),trajectories=100_000,saveat=savetime, dt=0.001);
sol=0
GC.gc()

I can’t run your code, but yes on non-stiff ODEs it has lower utilization than I’d like (on stiff ODEs it does a lot better). I’m planning to work with @vchuravy to do a bit of a different architecture for non-stiff ODEs that should improve it, but for now, you’ll want to pick even more trajectories on to fill the kernels.

Thank you very much for your reply. Below I added some code which also leads us to the same observations (EnsembleThreads() faster than EnsembleGPUArray()) if you want run it. I am looking forward to the improvements you mentioned in your reply.

Are the possibly some other settings for now which could make it run more efficiently on the GPU?

using OrdinaryDiffEq
using DifferentialEquations
using Distributed
using DiffEqGPU
using CUDA
using Plots

function FUN(dx,x,p,t)
    T1 = 0.1
    T2 = 0.2
    T3 = 0.3
    T4 = 0.4
    T5 = 0.5

    dx[1] = (sin(2*pi*p[1]*t) - x[1])/T1
    dx[2] = (sin(2*pi*p[2]*t) - x[2])/T2
    dx[3] = (sin(2*pi*p[3]*t) - x[3])/T3
    dx[4] = (sin(2*pi*p[4]*t) - x[4])/T4
    dx[5] = (sin(2*pi*p[5]*t) - x[5])/T5
end

ENV["JULIA_NUM_THREADS"]

x0 = Float32[0.0; 0.0; 0.0; 0.0; 0.0]
tspan = (0.0, 200.0)
p = [1.0, 1.0, 1.0, 1.0, 1.0]
prob = ODEProblem(FUN,x0,tspan,p)

savetime = 1.0f0

println("CPU Calculations without Ensembles")

prob_func = (prob,i,repeat) -> remake(prob,p=p+0.5*rand(Float32,5).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy=false)

println("CPU Ensemble Calculations")

@time sol = solve(monteprob,RK4(),EnsembleThreads(),trajectories=100_000,saveat=savetime, dt=0.001);
sol=0
GC.gc()
@time sol = solve(monteprob,RK4(),EnsembleThreads(),trajectories=100_000,saveat=savetime, dt=0.001);
sol=0
GC.gc()

println("GPU Ensemble Calculations")

@time sol = solve(monteprob,RK4(),EnsembleGPUArray(),trajectories=100_000,saveat=savetime, dt=0.001);
sol=0
GC.gc()
@time sol = solve(monteprob,RK4(),EnsembleGPUArray(),trajectories=100_000,saveat=savetime, dt=0.001);
sol=0
GC.gc()

Is this section

supposed to look like this?

    dx[1] = (sin(2*pi*p[1]*t) - x[1])/T1
    dx[2] = (sin(2*pi*p[2]*t) - x[2])/T2
    dx[3] = (sin(2*pi*p[3]*t) - x[3])/T3
    dx[4] = (sin(2*pi*p[4]*t) - x[4])/T4
    dx[5] = (sin(2*pi*p[5]*t) - x[5])/T5

Because otherwise you’re just setting the first two derivatives over again.

Yes, you are right, I corrected the typo. Thanks for the hint.