Diffusion using CUDA.jl, DifferentialEquations.jl, and DiffEqGPU.jl

I want to apply diffusion to a two-dimensional plate using the DiffEqGPU package.
When I implemented the Euler method as a for loop it worked.
My code looks like this:

module Playground

using CUDA
using CairoMakie
using ColorSchemes

const α  = 1e-4                                                                            # Diffusivity
const L  = 0.1                                                                             # Length
const W  = 0.1                                                                             # Width
const Nx = 66                                                                              # No.of steps in x-axis
const Ny = 66                                                                              # No.of steps in y-axis
const Δx = L/(Nx-1)                                                                        # x-grid spacing
const Δy = W/(Ny-1)                                                                        # y-grid spacing
const Δt = Δx^2 * Δy^2 / (2.0 * α * (Δx^2 + Δy^2))                                         # Largest stable time step

function diffuse!(du, u, p,t)
    dijij = view(du, 2:Nx-1, 2:Ny-1)
    dij  = view(u, 2:Nx-1, 2:Ny-1)
    di1j = view(u, 1:Nx-2, 2:Ny-1)
    dij1 = view(u, 2:Nx-1, 1:Ny-2)
    di2j = view(u, 3:Nx  , 2:Ny-1)
    dij2 = view(u, 2:Nx-1, 3:Ny  )                                                  # Stencil Computations

    @. dijij = α  * (
        (di1j - 2 * dij + di2j)/Δx^2 +
        (dij1 - 2 * dij + dij2)/Δy^2)                                               # Apply diffusion

    du[1, :] += α  * (2*u[2, :] - 2*u[1, :])/Δx^2
    du[Nx, :] += α  * (2*u[Nx-1, :] - 2*u[Ny, :])/Δx^2
    du[:, 1] += α   * (2*u[:, 2]-2*u[:, 1])/Δy^2
    du[:, Ny] += α * (2*u[:, Nx-1]-2*u[:, Ny])/Δy^2                  # update boundary condition (Neumann BCs)

end


u_GPU= CUDA.zeros(Nx,Ny)
du_GPU = similar(u_GPU)     
u_GPU[25:35, 25:35] .= 50
for i in 1:1000                                 # Apply diffuse! 1000 times to let the heat spread along the plate

    diffuse!(du_GPU, u_GPU,0,0)

    u_GPU = u_GPU + Δt * du_GPU

end

Then I tried using the DifferentialEquations and DiffEqGPU packages.
It works, but slowly, with sol = solve(prob, Euler(), dt=Δt). When I tried sol2 = solve(prob, QNDF()) I got ERROR: DimensionMismatch("array could not be broadcast to match destination"), and sol3 = solve(prob, FBDF()) gave ERROR: This object is not a GPU array. The last thing I tried was
sol4 = solve(prob, alg_hints=[:stiff], save_everystep=false), and it takes a lot of time without producing a result.

using DifferentialEquations, DiffEqGPU
tspan = (0.0, 1.0)
prob = ODEProblem(diffuse!, u_GPU, tspan)
sol = solve(prob, Euler(), dt=Δt)
sol2 = solve(prob, QNDF()) 
sol3 = solve(prob, FBDF())
sol4 = solve(prob, alg_hints=[:stiff], save_everystep=false)

Unfortunately, no matter which solver I choose, it is either very slow or it doesn't work.
Thanks for your help :slight_smile:


A couple of remarks.

  1. Use Float32 consistently for the dependent variable, parameters, and time on the GPU, especially on consumer-grade hardware.
  2. I assume you don't want to increment in the calculation of the boundary conditions, but rather assign: du[1, :] = α * (...) etc.
  3. You are also missing broadcasts @. in the boundary conditions (see the one-line sketch after this list).
  4. Taking that into account, solve(prob, Euler(), dt=Δt) runs on my machine as fast as the for-loop, which is about 10x slower than on the CPU, hinting that the problem is too small for the GPU.
  5. Explicit algorithms like Euler, Tsit5 and Vern7/9 work.
  6. Implicit algorithms fail consistently, mostly with
    ERROR: MethodError: no method matching fast_materialize!(::Static.False, ::Static.True, ::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})

    Edit: After restarting the kernel, this error went away.
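
For concreteness, remarks 2 and 3 together amount to something like this for the first boundary row (just a sketch of the pattern; the full corrected function appears further down the thread):

@views @. du[1, :] = α * (2*u[2, :] - 2*u[1, :]) / Δx^2   # assign (not +=) and broadcast in place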

@ChrisRackauckas Am I making a mistake here somewhere (likely), or is there an issue with FastBroadcast?


Long story short: to use implicit solvers, the Jacobian must be (approximately) inverted at each step. Hence, DE must form the Jacobian (you can also pass it yourself). Here you can compute the Jacobian J, and it will be sparse. That's where you hit a weak spot of the GPU: computing J \ x with LU (for example). The only option left is then a preconditioned matrix-free method, which DE provides. This is possible; I managed to do it with a recent DE.jl.
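
In case it is useful to the OP, here is a minimal sketch of what a matrix-free Newton-Krylov setup can look like with a recent OrdinaryDiffEq/LinearSolve stack (the choice of TRBDF2, the GMRES linear solver, the absence of a preconditioner, and the prob_mf/sol_mf names are assumptions for illustration, not something prescribed above):

using OrdinaryDiffEq, LinearSolve

# The implicit solver never materializes the Jacobian; J*v products are
# approximated internally, and each linear solve is done with GMRES on the GPU arrays.
prob_mf = ODEProblem(diffuse!, u_GPU, (0f0, 1f-2))
sol_mf  = solve(prob_mf, TRBDF2(linsolve = KrylovJL_GMRES()), save_everystep = false)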


Thank you, this is a very useful remark.

While it may not be efficient, I don't see any reason (which is more of a statement about my knowledge than anything else) why it shouldn't nevertheless work. CUDA.jl dispatches the LinearAlgebra operations on CuArrays, AFAIK.
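
For what it's worth, a quick standalone probe of that dispatch (dense rather than sparse, and purely illustrative) would be:

using CUDA, LinearAlgebra

A = CUDA.rand(Float32, 256, 256)
b = CUDA.rand(Float32, 256)
x = A \ b            # dense solve, dispatched to CUSOLVER on the device
norm(A * x - b)      # small residual if the GPU dispatch is working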

The error message makes me think that FastBroadcast doesn’t play nicely with CuArrays in this situation.

Below I optimized it. It’s all using Float32s now, and I got rid of a lot of intermediate allocations which weren’t necessary.

However, the speed has nothing to do with DiffEq. If you just run diffuse! on the GPU and the CPU, you see that the CPU is faster.

using CUDA

const α  = 1f-4                                                                             # Diffusivity
const L  = 1f-1                                                                             # Length
const W  = 1f-1                                                                             # Width
const Nx = 66                                                                               # No.of steps in x-axis
const Ny = 66                                                                               # No.of steps in y-axis
const Δx = Float32(L/(Nx-1))                                                                # x-grid spacing
const Δy = Float32(W/(Ny-1))                                                                # y-grid spacing
const Δt = Float32(Δx^2 * Δy^2 / (2f0 * α * (Δx^2 + Δy^2)))                                 # Largest stable time step

function diffuse!(du, u, p,t)
    dijij = view(du, 2:Nx-1, 2:Ny-1)
    dij  = view(u, 2:Nx-1, 2:Ny-1)
    di1j = view(u, 1:Nx-2, 2:Ny-1)
    dij1 = view(u, 2:Nx-1, 1:Ny-2)
    di2j = view(u, 3:Nx  , 2:Ny-1)
    dij2 = view(u, 2:Nx-1, 3:Ny  )                                                  # Stencil Computations

    @. dijij = α  * (
        (di1j - 2 * dij + di2j)/Δx^2 +
        (dij1 - 2 * dij + dij2)/Δy^2)                                               # Apply diffusion

    @views @. du[1, :]  = α * (2*u[2, :] - 2*u[1, :])/Δx^2
    @views @. du[Nx, :] = α * (2*u[Nx-1, :] - 2*u[Nx, :])/Δx^2
    @views @. du[:, 1]  = α * (2*u[:, 2] - 2*u[:, 1])/Δy^2
    @views @. du[:, Ny] = α * (2*u[:, Ny-1] - 2*u[:, Ny])/Δy^2                 # update boundary conditions (Neumann BCs)

end

u = zeros(Float32, Nx, Ny)
du = similar(u)     
u[25:35, 25:35] .= 50f0;

@time for i in 1:1000                           # Apply diffuse! 1000 times to let the heat spread along the plate

    diffuse!(du, u,0,0)

    @. u = u + Δt * du

end;

u_GPU= CUDA.zeros(Nx,Ny)
du_GPU = similar(u_GPU)     
u_GPU[25:35, 25:35] .= 50f0;

@time for i in 1:1000                           # Apply diffuse! 1000 times to let the heat spread along the plate

    diffuse!(du_GPU, u_GPU,0,0)

    @. u_GPU = u_GPU + Δt * du_GPU

end;

The kernels are just too small to benefit from parallelization. Just isolate it:

f(du,u) = du .+= 2 .* u
@time for i in 1:10000; f(du,u) end # 0.021198 seconds
@time for i in 1:10000; f(du_GPU,u_GPU) end # 0.163096 seconds

If the actual kernels you're calling are slower on the GPU than on the CPU, then how is the ODE solver supposed to make that faster? Note that GPUs have thousands of cores. If you're not doing thousands of "stuff" in parallel, it's not going to be fast. GPUs are not good for serial code. This code is effectively serial.
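
If you want to see roughly where the crossover happens, one option is to rerun the isolated kernel from above on a larger grid and synchronize before timing (the 512x512 size, the variable names, and the iteration count here are arbitrary choices for illustration):

n = 512
a = rand(Float32, n, n); da = zeros(Float32, n, n)
a_gpu = CuArray(a); da_gpu = CuArray(da)
@time for i in 1:1000; f(da, a) end                      # CPU
@time CUDA.@sync for i in 1:1000; f(da_gpu, a_gpu) end   # GPU; @sync so the asynchronous kernels are actually timed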

As for the DiffEq algorithms, ROCK2 or ROCK4 would almost certainly be best here. Almost all work:

using OrdinaryDiffEq

tspan = (0f0, 1f-2)
prob = ODEProblem(diffuse!, u_GPU, tspan)
sol = solve(prob, Euler(), dt=Δt) # Success
sol = solve(prob, Rodas5(), dt=Δt) # Success
sol = solve(prob, TRBDF2(), dt=Δt) # Success
sol = solve(prob, ROCK2(), dt=Δt) # Success
sol2 = solve(prob, QNDF()) 
sol3 = solve(prob, FBDF())

The two that don't are QNDF and FBDF; I guess they just haven't been checked for compatibility yet. We'll get that fixed. All of the SDIRKs, RKs, etc. should be fine, though.

But as @rveltz mentions, sparse LU is not a very parallel algorithm and so it's not good for GPUs. Generally you'd prefer ROCK over implicit methods on GPUs for this kind of case (and even on CPUs ROCK should win here).

No, FastBroadcast.jl just falls back to plain broadcast when the arguments aren't Arrays. The issue is that QNDF and FBDF are doing some special things, and some of those special things may accidentally be CPU arrays. Hence the issue, and something for us to fix.


Thanks for such a wonderfully comprehensive answer, even though I am not the OP :slight_smile:

The fast_materialize error went away after restarting the kernel :thinking: and now ROCK2 etc. do work. That’s what tripped me up.

Thank you so much, it's really helpful. :grinning: