Performance problems with parallel ensemble simulation on GPU

We are doing some performance analysis of different ODE solvers on GPUs. When I compare the run times of the Julia code to the odeint and MPGOS implementations of the same system, Julia does not perform very well. We are solving the Lorenz equations for different numbers of parameter values, using double-precision floating-point numbers.
My code is mainly based on this example. The diagram shows the run time of 1000 RK4 steps (with fixed dt) for different numbers of parameter values of the same ensemble problem.


I expected better results from Julia; am I missing something important? The solve call has a large number of allocations, even though I'm not saving any data. Is it possible to reduce this?

Here is my code:

using DifferentialEquations, DiffEqGPU, BenchmarkTools
using CUDAnative, CUDAdrv

#settings
numberOfParameters = 7680
gpuID = 1 # NVIDIA Titan Black device

device!(CuDevice(gpuID))
println("Running on "*string(CuDevice(gpuID)))

function lorenz!(du,u,p,t)
  @inbounds begin
    du[1] = 10.0*(u[2]-u[1])
    du[2] = p[1]*u[1]-u[2]-u[1]*u[3]
    du[3] = u[1]*u[2]-(8.0/3.0)*u[3]
  end
  nothing
end

parameterList = range(0.0,stop = 21.0,length=numberOfParameters)
p = [0.0]
tspan = (0.0,10.0)
u0 = Float64[10.0,10.0,10.0]

lorenzProblem =  ODEProblem(lorenz!,u0,tspan,p)
#parameterChange = (prob,i,repeat) -> remake(prob,p=parameterList[i]) #it has slightly more allocations
function parameterChange(prob,i,repeat)
  prob.p[1] = parameterList[i]
  prob
end

ensembleProb = EnsembleProblem(lorenzProblem,prob_func=parameterChange)

data = @benchmark solve(ensembleProb,RK4(),EnsembleGPUArray(),trajectories=numberOfParameters,
  save_everystep = false,save_start = false,save_end = false, adaptive = false, dt = 0.01)

println("Parameter number: "*string(numberOfParameters))
println("Minimum time: "*string(minimum(data.times)/1e6)*" ms")
println("Allocs: "*string(data.allocs))

Output of the code is:

Running on CuDevice(1): GeForce GTX TITAN Black
Parameter number: 7680
Minimum time: 229.473072 ms
Allocs: 764004

@maleadt can probably comment a bit more as he knows the profiling tools a lot better than me, but what’s going on seems to be a few things:

  1. Our implementation is very different. It's much more general, which is why it already supports things like adaptive SDE integrators, but it has different performance characteristics. Basically, the right way to use the one we have is to pack as big of an ODE as you can into each trajectory (see the sketch after this list), so Lorenz is a good worst-case scenario for us. We are experimenting with an alternative implementation that is more like the others, but making it work will take some compiler tricks. Direct Compilation methods · Issue #3 · SciML/DiffEqGPU.jl · GitHub
  2. The slowest part right now is actually the CuArrays mapreduce that shows up in the adaptivity norm. That could be greatly improved if someone has time. I had a student show that it could be improved a ton, but that hasn’t gotten into a PR to CuArrays yet. I think @maleadt had an effort somewhere to revive this?
  3. A lot of our current optimizations have been for stiff systems. We should have a pretty massive advantage over naively using odeint for a large stiff system, and MPGOS doesn't handle stiff problems. For stiff systems, the Jacobian and nonlinear solver work takes up all of the time and dominates the profiles, so the issue we see here with explicit RK (ERK) methods is just overshadowed by that.
  4. For extremely small systems like Lorenz, it would be helpful for us to fuse the full perform_step, but that’s hard in the current form of code generation and only applies to “small enough” ODEs.
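
To make the first point concrete, here is a rough sketch (not from your script; lorenz_packed! and N are made-up names) of what packing a larger ODE into each trajectory could look like. Note that all copies within a trajectory share one parameter value here, so this only illustrates the throughput characteristic (more floating-point work per kernel launch), not your exact parameter sweep:

# Sketch: pack N independent Lorenz copies into one trajectory so that each
# GPU kernel launch has more work to amortize the per-step launch overhead.
const N = 64

function lorenz_packed!(du,u,p,t)
  @inbounds for k in 0:N-1
    i = 3k
    du[i+1] = 10.0*(u[i+2]-u[i+1])
    du[i+2] = p[1]*u[i+1]-u[i+2]-u[i+1]*u[i+3]
    du[i+3] = u[i+1]*u[i+2]-(8.0/3.0)*u[i+3]
  end
  nothing
end

u0_packed = repeat(Float64[10.0,10.0,10.0],N) # 3N states per trajectory
packedProblem = ODEProblem(lorenz_packed!,u0_packed,tspan,p)
packedEnsemble = EnsembleProblem(packedProblem,prob_func=parameterChange)

# fewer trajectories are needed for the same total amount of work
solve(packedEnsemble,RK4(),EnsembleGPUArray(),trajectories=numberOfParameters÷N,
  save_everystep = false,save_start = false,save_end = false, adaptive = false, dt = 0.01)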

So indeed, we know about some performance issues and are continuing to work through them. The first thing for us was really to get our stiff solvers onto GPUs in a good enough automated manner, and then figure out where that leaves us performance-wise.


That work fell through, so mapreduce is still a high-priority target to get people working on.

There are probably some other low-hanging optimization opportunities here; if you're interested, I would suggest capturing a trace and visualizing the execution timeline. Look for obvious gaps, or for kernels that are launched iteratively when a batched alternative could do the same work. I recently documented the procedure for getting a trace: https://juliagpu.gitlab.io/CUDA.jl/development/profiling/
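
As a rough sketch of that workflow (the exact macro and profiler invocation depend on your package versions, and the script name is just an example), you could wrap only the region of interest so the timeline stays readable:

# Start Julia under the external profiler with profiling initially disabled,
# e.g.  nvprof --profile-from-start off julia lorenz_gpu.jl
# so that only the region wrapped by @profile ends up in the trace.
using CUDAdrv

# warm up once so kernel compilation does not show up in the timeline
solve(ensembleProb,RK4(),EnsembleGPUArray(),trajectories=numberOfParameters,
  save_everystep = false,save_start = false,save_end = false, adaptive = false, dt = 0.01)

CUDAdrv.@profile solve(ensembleProb,RK4(),EnsembleGPUArray(),trajectories=numberOfParameters,
  save_everystep = false,save_start = false,save_end = false, adaptive = false, dt = 0.01)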


Thank you for your answers.

I know Julia has some great capabilities. Right now I'm trying to achieve the best performance for this simple equation, and it seems to me that I cannot improve this code further at the user level. I might try profiling the code to find the bottleneck.