Diffusion using CUDA.jl, DifferentialEquations.jl, and DiffEqGPU.jl

I want to apply diffusion to a two-dimensional plate using the DiffEqGPU package.
When I implemented the Euler method as a for loop it worked.
My code looks like this:

module Playground

using CUDA
using CairoMakie
using ColorSchemes

const α  = 1e-4                                                                            # Diffusivity
const L  = 0.1                                                                             # Length
const W  = 0.1                                                                             # Width
const Nx = 66                                                                              # No.of steps in x-axis
const Ny = 66                                                                              # No.of steps in y-axis
const Δx = L/(Nx-1)                                                                        # x-grid spacing
const Δy = W/(Ny-1)                                                                        # y-grid spacing
const Δt = Δx^2 * Δy^2 / (2.0 * α * (Δx^2 + Δy^2))                                         # Largest stable time step

function diffuse!(du, u, p,t)
    dijij = view(du, 2:Nx-1, 2:Ny-1)
    dij  = view(u, 2:Nx-1, 2:Ny-1)
    di1j = view(u, 1:Nx-2, 2:Ny-1)
    dij1 = view(u, 2:Nx-1, 1:Ny-2)
    di2j = view(u, 3:Nx  , 2:Ny-1)
    dij2 = view(u, 2:Nx-1, 3:Ny  )                                                  # Stencil Computations

    @. dijij = α  * (
        (di1j - 2 * dij + di2j)/Δx^2 +
        (dij1 - 2 * dij + dij2)/Δy^2)                                               # Apply diffusion

    du[1, :] += α  * (2*u[2, :] - 2*u[1, :])/Δx^2
    du[Nx, :] += α  * (2*u[Nx-1, :] - 2*u[Ny, :])/Δx^2
    du[:, 1] += α   * (2*u[:, 2]-2*u[:, 1])/Δy^2
    du[:, Ny] += α * (2*u[:, Nx-1]-2*u[:, Ny])/Δy^2                  # update boundary condition (Neumann BCs)

end


u_GPU= CUDA.zeros(Nx,Ny)
du_GPU = similar(u_GPU)     
u_GPU[25:35, 25:35] .= 50
for i in 1:1000                                 # Apply diffuse! 1000 times to let the heat spread along the plate

    diffuse!(du_GPU, u_GPU,0,0)

    u_GPU = u_GPU + Δt * du_GPU

end

Then I tried using the DifferentialEquations and DiffEqGPU packages.
It works, but slowly, with sol = solve(prob, Euler(), dt=Δt). When I tried sol2 = solve(prob, QNDF()) I got ERROR: DimensionMismatch("array could not be broadcast to match destination"), and sol3 = solve(prob, FBDF()) gave ERROR: This object is not a GPU array. The last thing I tried was
sol4 = solve(prob, alg_hints=[:stiff], save_everystep=false), and it takes a lot of time without producing a result.

using DifferentialEquations, DiffEqGPU
tspan = (0.0, 1.0)
prob = ODEProblem(diffuse!, u_GPU, tspan)
sol = solve(prob, Euler(), dt=Δt)
sol2 = solve(prob, QNDF()) 
sol3 = solve(prob, FBDF())
sol4 = solve(prob, alg_hints=[:stiff], save_everystep=false)

Unfortunately, no matter which solver I choose, it is either very slow or it doesn't work.
Thanks for your help :slight_smile:


A couple of remarks.

  1. Use Float32 consistently for the dependent variable, parameters, and time on the GPU, especially on consumer-grade hardware.
  2. I assume you don't want to increment in the calculation of the boundary conditions, but rather assign: du[1, :] = α * (...) etc.
  3. You are also missing broadcasts @. in the boundary conditions (see the one-line sketch after this list).
  4. Taking that into account, solve(prob, Euler(), dt=Δt) runs on my machine as fast as the for-loop, which is about 10x slower than on the CPU, hinting that the problem is too small for the GPU.
  5. Explicit algorithms like Euler, Tsit5 and Vern7/9 work.
  6. Implicit algorithms fail consistently, mostly with
    ERROR: MethodError: no method matching fast_materialize!(::Static.False, ::Static.True, ::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})

    Edit: After restarting the kernel, this error went away.
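
For concreteness, remarks 2 and 3 together amount to something like this for the first boundary row (just a sketch of the pattern; the full corrected function appears further down the thread):

@views @. du[1, :] = α * (2*u[2, :] - 2*u[1, :]) / Δx^2   # assign (not +=) and broadcast in place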

@ChrisRackauckas Am I making a mistake here somewhere (likely), or is there an issue with FastBroadcast?


Long story short: to use implicit solvers, the Jacobian must be (approximately) inverted at each step. Hence, DE must form the Jacobian (you can also pass it yourself). Here you can compute the Jacobian J, and it will be sparse. That's where you hit a weak spot of the GPU: computing J \ x with LU (for example). The only option left is then a preconditioned matrix-free method, which DE provides. This is possible; I managed to do it with a recent DE.jl.
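
In case it is useful to the OP, here is a minimal sketch of what a matrix-free Newton-Krylov setup can look like with a recent OrdinaryDiffEq/LinearSolve stack (the choice of TRBDF2, the GMRES linear solver, the absence of a preconditioner, and the prob_mf/sol_mf names are assumptions for illustration, not something prescribed above):

using OrdinaryDiffEq, LinearSolve

# The implicit solver never materializes the Jacobian; J*v products are
# approximated internally, and each linear solve is done with GMRES on the GPU arrays.
prob_mf = ODEProblem(diffuse!, u_GPU, (0f0, 1f-2))
sol_mf  = solve(prob_mf, TRBDF2(linsolve = KrylovJL_GMRES()), save_everystep = false)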


Thank you, this is a very useful remark.

While it may not be efficient, I don't see any reason (which is more of a statement about my knowledge than anything else) why it shouldn't nevertheless work. CUDA.jl dispatches the LinearAlgebra operations on CuArrays, AFAIK.
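
For what it's worth, a quick standalone probe of that dispatch (dense rather than sparse, and purely illustrative) would be:

using CUDA, LinearAlgebra

A = CUDA.rand(Float32, 256, 256)
b = CUDA.rand(Float32, 256)
x = A \ b            # dense solve, dispatched to CUSOLVER on the device
norm(A * x - b)      # small residual if the GPU dispatch is working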

The error message makes me think that FastBroadcast doesn’t play nicely with CuArrays in this situation.

Below I optimized it. It’s all using Float32s now, and I got rid of a lot of intermediate allocations which weren’t necessary.

However, the speed has nothing to do with DiffEq. If you just run diffuse! on the GPU and the CPU, you see that the CPU is faster.

using CUDA

const α  = 1f-4                                                                             # Diffusivity
const L  = 1f-1                                                                             # Length
const W  = 1f-1                                                                             # Width
const Nx = 66                                                                               # No.of steps in x-axis
const Ny = 66                                                                               # No.of steps in y-axis
const Δx = Float32(L/(Nx-1))                                                                # x-grid spacing
const Δy = Float32(W/(Ny-1))                                                                # y-grid spacing
const Δt = Float32(Δx^2 * Δy^2 / (2f0 * α * (Δx^2 + Δy^2)))                                 # Largest stable time step

function diffuse!(du, u, p,t)
    dijij = view(du, 2:Nx-1, 2:Ny-1)
    dij  = view(u, 2:Nx-1, 2:Ny-1)
    di1j = view(u, 1:Nx-2, 2:Ny-1)
    dij1 = view(u, 2:Nx-1, 1:Ny-2)
    di2j = view(u, 3:Nx  , 2:Ny-1)
    dij2 = view(u, 2:Nx-1, 3:Ny  )                                                  # Stencil Computations

    @. dijij = α  * (
        (di1j - 2 * dij + di2j)/Δx^2 +
        (dij1 - 2 * dij + dij2)/Δy^2)                                               # Apply diffusion

    @views @. du[1, :]  = α * (2*u[2, :] - 2*u[1, :])/Δx^2
    @views @. du[Nx, :] = α * (2*u[Nx-1, :] - 2*u[Nx, :])/Δx^2
    @views @. du[:, 1]  = α * (2*u[:, 2] - 2*u[:, 1])/Δy^2
    @views @. du[:, Ny] = α * (2*u[:, Ny-1] - 2*u[:, Ny])/Δy^2                 # update boundary conditions (Neumann BCs)

end

u = zeros(Float32, Nx, Ny)
du = similar(u)     
u[25:35, 25:35] .= 50f0;

@time for i in 1:1000                           # Apply diffuse! 1000 times to let the heat spread along the plate

    diffuse!(du, u,0,0)

    @. u = u + Δt * du

end;

u_GPU= CUDA.zeros(Nx,Ny)
du_GPU = similar(u_GPU)     
u_GPU[25:35, 25:35] .= 50f0;

@time for i in 1:1000                           # Apply diffuse! 1000 times to let the heat spread along the plate

    diffuse!(du_GPU, u_GPU,0,0)

    @. u_GPU = u_GPU + Δt * du_GPU

end;

The kernels are just too small to benefit from parallelization. Just isolate it:

f(du,u) = du .+= 2 .* u
@time for i in 1:10000; f(du,u) end # 0.021198 seconds
@time for i in 1:10000; f(du_GPU,u_GPU) end # 0.163096 seconds

If the actual kernels you're calling are slower on the GPU than on the CPU, then how is the ODE solver supposed to make that faster? Note that GPUs have thousands of cores. If you're not doing thousands of "stuff" in parallel, it's not going to be fast. GPUs are not good for serial code. This code is effectively serial.
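
If you want to see roughly where the crossover happens, one option is to rerun the isolated kernel from above on a larger grid and synchronize before timing (the 512x512 size, the variable names, and the iteration count here are arbitrary choices for illustration):

n = 512
a = rand(Float32, n, n); da = zeros(Float32, n, n)
a_gpu = CuArray(a); da_gpu = CuArray(da)
@time for i in 1:1000; f(da, a) end                      # CPU
@time CUDA.@sync for i in 1:1000; f(da_gpu, a_gpu) end   # GPU; @sync so the asynchronous kernels are actually timed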

As for the DiffEq algorithms, ROCK2 or ROCK4 would almost certainly be best here. Almost all work:

using OrdinaryDiffEq

tspan = (0f0, 1f-2)
prob = ODEProblem(diffuse!, u_GPU, tspan)
sol = solve(prob, Euler(), dt=Δt) # Success
sol = solve(prob, Rodas5(), dt=Δt) # Success
sol = solve(prob, TRBDF2(), dt=Δt) # Success
sol = solve(prob, ROCK2(), dt=Δt) # Success
sol2 = solve(prob, QNDF()) 
sol3 = solve(prob, FBDF())

The two that don't are QNDF and FBDF; I guess they just haven't been checked for compatibility yet. We'll get that fixed. All of the SDIRKs, RKs, etc. should be fine, though.

But as @rveltz mentions, sparse LU is not a very parallel algorithm and so it's not good for GPUs. Generally you'd prefer ROCK over implicit methods on GPUs for this kind of case (and even on CPUs ROCK should win here).

No, FastBroadcast.jl just falls back to plain broadcast when the arguments aren't Arrays. The issue is that QNDF and FBDF are doing some special things, and some of those special things may accidentally be CPU arrays. Hence the issue, and something for us to fix.


Thanks for such a wonderfully comprehensive answer, even though I am not the OP :slight_smile:

The fast_materialize error went away after restarting the kernel :thinking: and now ROCK2 etc. do work. That’s what tripped me up.

Thank you so much, it's really helpful. :grinning: