DIfferentialEquations and GPU




According to the following blog post http://www.stochasticlifestyle.com/solving-systems-stochastic-pdes-using-gpus-julia/ it is very easy to turn on GPU parallelism with DifferentialEquations.jl. The only thing one needs to do is to replace ordinary arrays with CLArrays or CuArrays. However, the following code shows that the GPU version is orders of magnitude slower:

using DifferentialEquations
using CuArrays

function func(du, u, p, t)
    @inbounds @. du = u
    return nothing

tspan = (0., 10.)
alg = BS3()

# ******************************************************************************

u0 = fill(1., 500)
prob = ODEProblem(func, u0, tspan)

@time sol = solve(prob, alg, saveat=0.01)
@time sol = solve(prob, alg, saveat=0.01)
@time sol = solve(prob, alg, saveat=0.01)

# ******************************************************************************

# CuArrays.allowscalar(false)

u0_gpu = CuArray(convert(Array{Float32, 1}, u0))
prob_gpu = ODEProblem(func, u0_gpu, tspan)

@time sol_gpu = solve(prob_gpu, alg, saveat=0.01)
@time sol_gpu = solve(prob_gpu, alg, saveat=0.01)
@time sol_gpu = solve(prob_gpu, alg, saveat=0.01)

The output on my machine with Nvidia GTX 1070 is the following:

  3.627708 seconds (14.59 M allocations: 756.548 MiB, 5.28% gc time)
  0.002929 seconds (1.11 k allocations: 4.082 MiB)
  0.003089 seconds (1.11 k allocations: 4.082 MiB)
 32.020281 seconds (43.64 M allocations: 2.092 GiB, 1.71% gc time)
 24.563494 seconds (20.42 M allocations: 951.305 MiB, 0.49% gc time)
 27.890973 seconds (20.42 M allocations: 951.711 MiB, 0.48% gc time)

Can you please explain me what am I doing wrong?

Thank you.


Did you run with that line uncommented? If that errors, there’s a scalar operation that isn’t supported by CuArrays.jl (yet) and will completely ruin performance.


Yes, it gives the error.

Well, then I will try to reformulate the question. Are there any examples which show how DifferentialEquations can be speed up using GPU? So far I have found only the above blog post which discusses this possibility.


Maybe not completely what you request but see


not to my knowledge. A few points though:

  1. it will be difficult to speed up real problems on gpu, even big ones, without having to write a custom kernel somewhere
    differential equations solvers are typically iterative in nature, so the speedup from the GPU needs to be the individual array operations themselves. This is a hard problem to speed up on GPU unless the operations involved are very large and very computationally intensive
  2. for your example, the GPU probably won’t beat the CPU by much even if func is written as a custom kernel. you’re pretty much just testing the L2 cache bandwidth of a single CPU core vs L2 cache on a GPU.


Thank you all!


You’re timing the worst case scenario for GPUs, which is where all of your kernel calls are trivial. Make your ODE something with a matrix multiplication or something else with a lot of linear algebra and you’ll see a different result. BLAS1 computations (“element-wise” or broadcast) is usually not generally enough to warrant GPU usage.


We’re working on this. We will have a GSoC that optimizes GPU code and gets it running with methods for stiff ODEs, focusing on getting solvers for PDEs.