Performance of Ensemble Simulations on GPUs

knn · May 19, 2021, 9:19am

We are currently analyzing the performance of ensemble simlations on CPUs and GPUs. In our case, we noticed that the ensemble simulation on the GPU with EnsembleGPUArray only stresses the GPU for about 6s. Nevertheless, the overall calculation time using EnsembleGPUArray is around 115s, which is shown below. It seems that the preparation for the computations on the GPU and the copying of the data from host to deivce and the other way around takes a great amount of time.

CPU Ensemble Calculations
187.014356 seconds (109.21 M allocations: 14.871 GiB, 2.22% gc time, 0.00% compilation time)
180.962341 seconds (85.79 M allocations: 13.464 GiB, 2.74% gc time)
GPU Ensemble Calculations
190.710367 seconds (130.19 M allocations: 231.045 GiB, 31.04% gc time, 3.68% compilation time)
115.335076 seconds (31.41 M allocations: 225.337 GiB, 15.32% gc time)

Is it possible to improve the performance for EnsembleGPUArray since the GPU is only stressed for a small fraction of the overall calculation time? Are we missing something?

The code executing the ensemble simulations is shown below. The function ROMO is a vehicle model and is omitted here due to readability reasons. The used GPU is a NVIDIA GeForce RTX 3080.

using OrdinaryDiffEq
using DifferentialEquations
using Distributed
using DiffEqGPU
using CUDA
using Plots

function ROMO(dx,x,p,t)
    ...
end

ENV["JULIA_NUM_THREADS"]

x0 = Float32[0.0; 20.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 100; 100; 0.0; 0.0]

tspan = (0.0, 400.0)
savetime = 1.0f0
p = [1013.0f0, 1130.0f0, 0.8f0]
prob = ODEProblem(ROMO,x0,tspan,p)

prob_func = (prob,i,repeat) -> remake(prob,p=p+0.2*rand(Float32,3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy=false)

println("CPU Ensemble Calculations")
@time sol = solve(monteprob,RK4(),EnsembleThreads(),trajectories=100_000,saveat=savetime, dt=0.001);
sol=0
GC.gc()
@time sol = solve(monteprob,RK4(),EnsembleThreads(),trajectories=100_000,saveat=savetime, dt=0.001);
sol=0
GC.gc()

println("GPU Ensemble Calculations")
@time sol = solve(monteprob,RK4(),EnsembleGPUArray(),trajectories=100_000,saveat=savetime, dt=0.001);
sol=0
GC.gc()
@time sol = solve(monteprob,RK4(),EnsembleGPUArray(),trajectories=100_000,saveat=savetime, dt=0.001);
sol=0
GC.gc()

ChrisRackauckas · May 19, 2021, 11:19am

I can’t run your code, but yes on non-stiff ODEs it has lower utilization than I’d like (on stiff ODEs it does a lot better). I’m planning to work with @vchuravy to do a bit of a different architecture for non-stiff ODEs that should improve it, but for now, you’ll want to pick even more trajectories on to fill the kernels.

knn · May 27, 2021, 6:38pm

Thank you very much for your reply. Below I added some code which also leads us to the same observations (EnsembleThreads() faster than EnsembleGPUArray()) if you want run it. I am looking forward to the improvements you mentioned in your reply.

Are the possibly some other settings for now which could make it run more efficiently on the GPU?

using OrdinaryDiffEq
using DifferentialEquations
using Distributed
using DiffEqGPU
using CUDA
using Plots

function FUN(dx,x,p,t)
    T1 = 0.1
    T2 = 0.2
    T3 = 0.3
    T4 = 0.4
    T5 = 0.5

    dx[1] = (sin(2*pi*p[1]*t) - x[1])/T1
    dx[2] = (sin(2*pi*p[2]*t) - x[2])/T2
    dx[3] = (sin(2*pi*p[3]*t) - x[3])/T3
    dx[4] = (sin(2*pi*p[4]*t) - x[4])/T4
    dx[5] = (sin(2*pi*p[5]*t) - x[5])/T5
end

ENV["JULIA_NUM_THREADS"]

x0 = Float32[0.0; 0.0; 0.0; 0.0; 0.0]
tspan = (0.0, 200.0)
p = [1.0, 1.0, 1.0, 1.0, 1.0]
prob = ODEProblem(FUN,x0,tspan,p)

savetime = 1.0f0

println("CPU Calculations without Ensembles")

prob_func = (prob,i,repeat) -> remake(prob,p=p+0.5*rand(Float32,5).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy=false)

println("CPU Ensemble Calculations")

@time sol = solve(monteprob,RK4(),EnsembleThreads(),trajectories=100_000,saveat=savetime, dt=0.001);
sol=0
GC.gc()
@time sol = solve(monteprob,RK4(),EnsembleThreads(),trajectories=100_000,saveat=savetime, dt=0.001);
sol=0
GC.gc()

println("GPU Ensemble Calculations")

@time sol = solve(monteprob,RK4(),EnsembleGPUArray(),trajectories=100_000,saveat=savetime, dt=0.001);
sol=0
GC.gc()
@time sol = solve(monteprob,RK4(),EnsembleGPUArray(),trajectories=100_000,saveat=savetime, dt=0.001);
sol=0
GC.gc()

jonniedie · May 27, 2021, 8:47pm

Is this section

knn:

    dx[1] = (sin(2*pi*p[1]*t) - x[1])/T1
    dx[2] = (sin(2*pi*p[2]*t) - x[2])/T2
    dx[1] = (sin(2*pi*p[3]*t) - x[3])/T3
    dx[2] = (sin(2*pi*p[4]*t) - x[4])/T4
    dx[1] = (sin(2*pi*p[5]*t) - x[5])/T5

supposed to look like this?

    dx[1] = (sin(2*pi*p[1]*t) - x[1])/T1
    dx[2] = (sin(2*pi*p[2]*t) - x[2])/T2
    dx[3] = (sin(2*pi*p[3]*t) - x[3])/T3
    dx[4] = (sin(2*pi*p[4]*t) - x[4])/T4
    dx[5] = (sin(2*pi*p[5]*t) - x[5])/T5

Because otherwise you’re just setting the first two derivatives over again.

knn · May 28, 2021, 7:13am

Yes, you are right, I corrected the typo. Thanks for the hint.

Topic		Replies	Views
Performance problems with parallel ensemble simulation on GPU Modelling & Simulations question , gpu , performance	3	1369	February 14, 2020
Why does EnsembleGPUArray not save at the given time points the solution? GPU	1	356	June 4, 2022
EnsembleThreads slower than EnsembleSerial or EnsembleDistributed Modelling & Simulations package , diffeq	3	804	April 29, 2021
DiffEqGPU - slow parallel solving of SDEs on GPU GPU	6	387	March 3, 2024
Errors on 2nd Iteration of EnsembleGPUArrays when using Lux Modelling & Simulations	2	97	June 28, 2024

Performance of Ensemble Simulations on GPUs

Related topics