DifferentialEquations + RecursiveArrayTools Broadcasting / CuArrays Kernel compilation error --

Hi,

Finally have access to a CUDA capable GPU, so I’m trying to accelerate some finite-difference PDE solves using CuArrays and DifferentialEquations.

I began with Chris’ simple reaction-diffusion demonstration here, and while I can get the CPU computation to complete for nearly any solver by removing the RecursiveArrayTools, I can not get any GPU computation to work for any solver. (Yes, the whole CUDA stack is installed and passed all tests.)

Setting CuArrays.allowscalar(false) and running the solver yields a forbidden indexing error, while permitting scalar operations generates a pretty inscrutable GPU compilation error message including “KernelError: passing and using non-bitstype argument”, (similar: here and here).

GPU compilation of #25(CuArrays.CuKernelState, CUDAnative.CuDeviceArray{Float64,2,CUDAnative.AS.Global}, Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(+),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(-),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(-),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(+),Tuple{Base.Broadcast.Extruded{Array{Float64,2},Tuple{Bool,Bool},Tuple{Int64,Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}}}},Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(*),Tuple{Float64,Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}}}}}},Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(*),Tuple{Float64,Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}}}}}},Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(*),Tuple{Float64,Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}}}}}}) failed
KernelError: passing and using non-bitstype argument

Argument 4 to your kernel function is of type Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(+),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(-),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(-),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(+),Tuple{Base.Broadcast.Extruded{Array{Float64,2},Tuple{Bool,Bool},Tuple{Int64,Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}}}},Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(*),Tuple{Float64,Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}}}}}},Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(*),Tuple{Float64,Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}}}}}},Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(*),Tuple{Float64,Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}}}}}}.
That type is not isbits, and such arguments are only allowed when they are unused by the kernel.  .args is of type Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(-),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(-),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(+),Tuple{Base.Broadcast.Extruded{Array{Float64,2},Tuple{Bool,Bool},Tuple{Int64,Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}}}},Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(*),Tuple{Float64,Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}}}}}},Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(*),Tuple{Float64,Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}}}}}},Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(*),Tuple{Float64,Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}}}}} which is not isbits.
    .1 is of type Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(-),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(-),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(+),Tuple{Base.Broadcast.Extruded{Array{Float64,2},Tuple{Bool,Bool},Tuple{Int64,Int64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}}}},Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(*),Tuple{Float64,Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Tuple{Bool,Bool},Tuple{Int64,Int64}}}}}},Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,typeof(*),Tuple{Float64,Base.B...

(I can’t figure out how to format the error in a more clear way, sorry.)

Complete code below (very minor changes from Chris’ blog):

using DifferentialEquations, RecursiveArrayTools, LinearAlgebra

# Define the constants for the PDE
const α₂ = 1.0
const α₃ = 1.0
const β₁ = 1.0
const β₂ = 1.0
const β₃ = 1.0
const r₁ = 1.0
const r₂ = 1.0
const D = 100.0
const γ₁ = 0.1
const γ₂ = 0.1
const γ₃ = 0.1
const N = 256
const X = reshape([i for i in 1:N for j in 1:N],N,N)
const Y = reshape([j for i in 1:N for j in 1:N],N,N)
const α₁ = 1.0.*(X.>=4*N/5)

const Mx = Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1])
const My = copy(Mx)
Mx[2,1] = 2.0
Mx[end-1,end] = 2.0
My[1,2] = 2.0
My[end,end-1] = 2.0

# Define the initial condition as normal arrays
A = zeros(N,N); B  = zeros(N,N); C = zeros(N,N); u0 = ArrayPartition((A,B,C))
#u0 = zeros(Float64,N,N,3)

const MyA = zeros(N,N);
const AMx = zeros(N,N);
const DA = zeros(N,N);
# Define the discretized PDE as an ODE function
function f(du,u,p,t)
  A,B,C = u.x
  dA,dB,dC = du.x
  # A = @view  u[:,:,1]
  # B = @view  u[:,:,2]
  # C = @view  u[:,:,3]
  #dA = @view du[:,:,1]
  #dB = @view du[:,:,2]
  #dC = @view du[:,:,3]
  mul!(MyA,My,A)
  mul!(AMx,A,Mx)
  @. DA = D*(MyA + AMx)
  @. dA = DA + α₁ - β₁*A - r₁*A*B + r₂*C
  @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
  @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
end

# Solve the ODE
prob = ODEProblem(f,u0,(0.0,100.0))
sol = solve(prob,BS3(),progress=true,save_everystep=false,save_start=false)

#using Plots; pyplot()
#p1 = surface(X,Y,sol[end][:,:,1],title = "[A]")
#p2 = surface(X,Y,sol[end][:,:,2],title = "[B]")
#p3 = surface(X,Y,sol[end][:,:,3],title = "[C]")
#plot(p1,p2,p3,layout=grid(3,1))

#######################################################
### Solve the PDE using CuArrays
#######################################################

using CuArrays
gA = CuArray(Float32.(A)); gB = CuArray(Float32.(B)); gC = CuArray(Float32.(C)); gu0 = ArrayPartition((gA,gB,gC))
#gu0 = CuArray(Float32.(u0))
const gMx = CuArray(Float32.(Mx))
const gMy = CuArray(Float32.(My))
const gα₁ = CuArray(Float32.(α₁))


const gMyA = CuArray(zeros(Float32,N,N))
const gAMx = CuArray(zeros(Float32,N,N))
const gDA = CuArray(zeros(Float32,N,N))

function gf(du,u,p,t)
  A,B,C = u.x
  dA,dB,dC = du.x
  # A = @view  u[:,:,1]
  # B = @view  u[:,:,2]
  # C = @view  u[:,:,3]
  #dA = @view du[:,:,1]
  #dB = @view du[:,:,2]
  #dC = @view du[:,:,3]
  mul!(gMyA,gMy,A)
  mul!(gAMx,A,gMx)
  @. DA = D*(gMyA + gAMx)
  @. dA = DA + gα₁ - β₁*A - r₁*A*B + r₂*C
  @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
  @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
end

prob2 = ODEProblem(gf,gu0,(0.0,100.0))
#CuArray.allowslow(false) # makes sure none of the slow fallbacks are used
CuArrays.allowscalar(true) # so that the error generates
sol = solve(prob2,BS3(),save_everystep=false,save_start=false)

I would very much appreciate any help understanding why this doesn’t seem to work at all, since it’s been a headlining feature for the better part of a year.

You forgot to change the cache for DA to the GPU. These two lines should be:

  @. gDA = D*(gMyA + gAMx)
  @. gdA = gDA + gα₁ - β₁*A - r₁*A*B + r₂*C

Then this script works. As for the allowscalar issue, I just put a patch in for RecursiveArrayTools: it was an ArrayPartition thing. Also, I now set this up on our GPU CI so it’ll be tracked.It should all be good to go.

Of course, I wouldn’t recommend implementing this exact problem like this, but it’ll work.

For reference, I just updated the blog post to now use the ROCK2 method which will be much faster than BS3 here. Cheers.

Thanks, Chris. Missing a device cache is a little embarrassing but I’m glad it was a simple fix.

In terms of legibility, is there a way to make the error message more descriptive for this sort of error?

Unfortunately v2.0.5 of RecursiveArrayTools yields the same error as before with ROCK2/ROCK4 (but BS3 works as expected), but thanks for having a look.

What are you running? The code here works for me:

using OrdinaryDiffEq, RecursiveArrayTools, LinearAlgebra, Test, SparseArrays, SparseDiffTools, Sundials

# Define the constants for the PDE
const α₂ = 1.0
const α₃ = 1.0
const β₁ = 1.0
const β₂ = 1.0
const β₃ = 1.0
const r₁ = 1.0
const r₂ = 1.0
const D = 100.0
const γ₁ = 0.1
const γ₂ = 0.1
const γ₃ = 0.1
const N = 256
const X = reshape([i for i in 1:N for j in 1:N],N,N)
const Y = reshape([j for i in 1:N for j in 1:N],N,N)
const α₁ = 1.0.*(X.>=4*N/5)

const Mx = Tridiagonal([1.0 for i in 1:N-1],[-2.0 for i in 1:N],[1.0 for i in 1:N-1])
const My = copy(Mx)
Mx[2,1] = 2.0
Mx[end-1,end] = 2.0
My[1,2] = 2.0
My[end,end-1] = 2.0

# Define the initial condition as normal arrays
u0 = zeros(N,N,3)

const MyA = zeros(N,N);
const AMx = zeros(N,N);
const DA = zeros(N,N);
# Define the discretized PDE as an ODE function
function f(du,u,p,t)
   A = @view  u[:,:,1]
   B = @view  u[:,:,2]
   C = @view  u[:,:,3]
  dA = @view du[:,:,1]
  dB = @view du[:,:,2]
  dC = @view du[:,:,3]
  mul!(MyA,My,A)
  mul!(AMx,A,Mx)
  @. DA = D*(MyA + AMx)
  @. dA = DA + α₁ - β₁*A - r₁*A*B + r₂*C
  @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
  @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
end

Iy = SparseMatrixCSC(I,N,N)
Ix = SparseMatrixCSC(I,N,N)
fJ = ones(3,3)
Dz = [1 0 0
      0 0 0
      0 0 0]
jacsparsity = kron(Dz,Iy,sparse(Mx)) + kron(Dz,sparse(My),Ix) + kron(fJ,Iy,Ix)
@time colorvec = matrix_colors(jacsparsity)

# Solve the ODE
ff = ODEFunction(f,colorvec=colorvec,jac_prototype=jacsparsity)
prob = ODEProblem(ff,u0,(0.0,100.0))
sol = solve(prob,BS3(),progress=true,save_everystep=false,save_start=false)
sol = solve(prob,ROCK2(),progress=true,save_everystep=false,save_start=false)
sol = solve(prob,TRBDF2(autodiff=false),progress=true,save_everystep=false,save_start=false)
@time sol = solve(prob,CVODE_BDF(linear_solver=:GMRES),progress=true,save_everystep=false)

println("CPU Times")
println("BS3")
@time sol = solve(prob,BS3(),progress=true,save_everystep=false)
println("ROCK2")
@time sol = solve(prob,ROCK2(),progress=true,save_everystep=false)
println("ROCK4")
@time sol = solve(prob,ROCK4(),progress=true,save_everystep=false)
println("TRBDF2")
@time sol = solve(prob,TRBDF2(autodiff=false),progress=true,save_everystep=false)
println("Rodas5")
@time sol = solve(prob,Rodas5(autodiff=false),progress=true,save_everystep=false)
println("CVODE_BDF")
@time sol = solve(prob,CVODE_BDF(linear_solver=:GMRES),progress=true,save_everystep=false)

using CuArrays
gu0 = CuArray(Float32.(u0))
const gMx = CuArray(Float32.(Mx))
const gMy = CuArray(Float32.(My))
const gα₁ = CuArray(Float32.(α₁))
const gMyA = CuArray(zeros(Float32,N,N))
const gAMx = CuArray(zeros(Float32,N,N))
const gDA = CuArray(zeros(Float32,N,N))

function gf(du,u,p,t)
   A = @view  u[:,:,1]
   B = @view  u[:,:,2]
   C = @view  u[:,:,3]
  dA = @view du[:,:,1]
  dB = @view du[:,:,2]
  dC = @view du[:,:,3]
  mul!(gMyA,gMy,A)
  mul!(gAMx,A,gMx)
  @. gDA = D*(gMyA + gAMx)
  @. dA = gDA + gα₁ - β₁*A - r₁*A*B + r₂*C
  @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
  @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
end

prob2 = ODEProblem(gf,gu0,(0.0,100.0))
CuArrays.allowscalar(false)
sol = solve(prob2,BS3(),save_everystep=false,save_start=false)
sol = solve(prob2,ROCK2(),save_everystep=false,save_start=false)
@test sol.t[end] == 100.0

println("GPU Times")
println("BS3")
@time sol = solve(prob2,BS3(),progress=true,save_everystep=false,save_start=false)
println("ROCK2")
@time sol = solve(prob2,ROCK2(),progress=true,save_everystep=false,save_start=false)

Oh I see what you meant and I see where you got the issue from. I updated the blog post code and it should be good to go. Sorry for the issue! Cheers.

2 Likes