# Some help with ArrayFire + DifferentialEquations

Hellos,

I noticed that with one line definition for `ArrayFire.jl` I can make it run together with `DifferentialEquations.jl`, but it is not working as expected.

DifferentialEquations is calling a generic multiplication function `generic_matvecmul!` instead of using ArrayFire `*(matmul)` and I would like to know how to fix it (if possible) to get all the performance of the GPU.

Here a minimal example of my problem:

``````# Don't worry about this part
# It is necessary to define the matrix "M"  below
function _sph2cart(θ, ϕ, r)
N = length(θ)
x = Array{Float64}(undef, N)
y = Array{Float64}(undef, N)
z = Array{Float64}(undef, N)

x = r*cos(θ)*cos(ϕ)
y = r*sin(θ)*cos(ϕ)
z = r*sin(ϕ)
return x,y,z
end

X = Array{Float64}(undef, N)
Y = Array{Float64}(undef, N)
Z = Array{Float64}(undef, N)
dist_3 = MinDist^3
iLoop = 1
nValid = 0
while nValid < N && iLoop < 10^6
rvals = 2*rand() - 1
elevation = asin(rvals)
azimuth = 2π*rand()

newX, newY, newZ = _sph2cart(azimuth, elevation, radii)
if all(( (X[1:nValid].- newX).^2 + (Y[1:nValid].- newY).^2 + (Z[1:nValid].- newZ).^2 ) .>= dist_3)
nValid = nValid + 1
X[nValid] = newX
Y[nValid] = newY
Z[nValid] = newZ
end
iLoop = iLoop + 1
end
if nValid < N
error("Could not generate all data after 10M interactions")
end

X_jk = X*ones(1,N) - ones(N,1)*X';
Y_jk = Y*ones(1,N) - ones(N,1)*Y';
Z_jk = Z*ones(1,N) - ones(N,1)*Z';
R_jk = sqrt.(X_jk.^2 + Y_jk.^2 + Z_jk.^2);
X_jk = 1;Y_jk=1; Z_jk=1; GC.gc()
return [X Y Z],R_jk
end
# --------------------------------------------

using ArrayFire, BenchmarkTools, DifferentialEquations, LinearAlgebra

# We need to define any()  with AFArray
Base.any(f::Function, a::AFArray) = Bool(sum(f.(a)))

# Minimal code for what I want to do
function Foo!(du,u,p,t)
M = p
Ω = p
du[:] = M*u + Ω
end

# just another attempt
Goo(u, p,t) = M_gpu*u + Ω_gpu

N = 10 # matrix size

r, R_jk = _RandPoints_Sphere(N, 10, 0 )
M = Array{Complex{Float64}}(undef, N,N)
@. M = -(1/2)*exp(im*R_jk)/(im*R_jk)
M[diagind(M)] .= 1im*1 - 1/2
Ω = -(im/2)*exp.(1im.*r[:,3])

M_gpu = AFArray(M)
Ω_gpu = AFArray(Ω)

p = []
push!(p, M)
push!(p, Ω)
p_gpu = []
push!(p_gpu, M_gpu)
push!(p_gpu, Ω_gpu)

println("----")
u0 = zeros(ComplexF64,length(Ω))
tspan = (0.0,100.0)
prob = ODEProblem(Foo!,u0,tspan,p)
@btime sol = DifferentialEquations.solve(prob,Tsit5())

u0_gpu = AFArray(zeros(ComplexF64,length(Ω)))
tspan = (0.0,100.0)
prob_gpu1 = ODEProblem(Foo!,u0,tspan, p_gpu)
@btime sol_gpu1 = DifferentialEquations.solve(prob_gpu1,Tsit5())

u0_gpu = AFArray(zeros(ComplexF64,length(Ω)))
tspan = (0.0,100.0)
prob_gpu2 = ODEProblem(Goo,u0,tspan)
@btime sol_gpu2 = DifferentialEquations.solve(prob_gpu2,Tsit5())
``````

Results:

``````
ArrayFire v3.6.4 (OpenCL, 64-bit Linux, build 1b8030c)
 AMD: gfx900, 8176 MB
----
158.905 μs (1412 allocations: 293.16 KiB)
830.935 ms (91112 allocations: 5.08 MiB)
825.654 ms (99245 allocations: 5.99 MiB)
``````

ArrayFire has a memory pool where, instead of mutating them, it’s usually better to create new ones.

But anyways, your problem seems to be a typo:

You didn’t use `u0_gpu`, and instead had it on the CPU.

Also, you might want to use 32-bit Floats.

(edit)
Using `F32` and fixing my typo, I got `StackOverflowError` using the `Foo!` and Input type mismatch with `Goo`

What are your input types into that operation? Print out `typeof`.

To print the type:

``````function Goo(u, p,t)
println("M_gpu: ",typeof(M_gpu))
println("u    : ",typeof(u))
println("Ω_gpu: ",typeof(Ω_gpu))
return M_gpu*u + Ω_gpu
end
``````

and I run the code:

``````M_gpu: AFArray{Complex{Float32},2}
u    : AFArray{Complex{Float32},1}
Ω_gpu: AFArray{Complex{Float32},1}
M_gpu: AFArray{Complex{Float32},2}
u    : AFArray{Complex{Float32},1}
Ω_gpu: AFArray{Complex{Float32},1}
M_gpu: AFArray{Complex{Float32},2}
u    : AFArray{Complex{Float64},1}
Ω_gpu: AFArray{Complex{Float32},1}
``````

at some moment, the type of `u` changes and creates the issue.

Converting everything again for ComplexFloat64 (and still using the `Goo` code) produces the same `StackOverflowError` as before with `Foo!`.

I don’t see a point where in does in your printout?

In the last 3 printlns, the type of u becomes Float64, and before it was Float32

``````M_gpu: AFArray{Complex{Float32},2}
u    : AFArray{Complex{Float**64**},1}
Ω_gpu: AFArray{Complex{Float32},1}
``````

Make time `Float32`, i.e. `tspan = (0f0,100f0)`

Doing `tspan` with `Float32` and rebooting my PC to make sure that nothing strange was happening, I still get the same `StackOverflowError`

Whereat?

See what happens if you make `abstol=1f-6,reltol=1f-3`.

I tested all the Solvers: RK4() I got the same error of type mismatch because `u` changes to `Float64`. Most all the others Explicity Runge-Kutta I got `StackOverflowError`, however the the simplest `Euler()` with `dt=(1/2)^5` works.

It looks to be something in adaptive time stepping. What’s your current code? I may be able to run it soon

``````using ArrayFire, BenchmarkTools, LinearAlgebra
# We need to define any()  wit AFArray
Base.any(f::Function, a::AFArray) = Bool(sum(f.(a)))

using DifferentialEquations
function Foo!(du,u,p,t)
M = p
Ω = p
du[:] = M*u + Ω
end

function Goo(u, p, t)
# println("M_gpu: ",typeof(M_gpu))
# println("u    : ",typeof(u))
# println("Ω_gpu: ",typeof(Ω_gpu))
return M_gpu*u + Ω_gpu
end

N = 10
# Create Matrix for this example
r, R_jk = _RandPoints_Sphere(N, 10, 0 )
M = Array{Complex{Float32}}(undef, N,N)
@. M = -(1/2)*exp(im*R_jk)/(im*R_jk)
M[diagind(M)] .= 1im*1 - 1/2
Ω = ComplexF32.(-(im/2)*exp.(im.*r[:,3]))

M_gpu = AFArray(M)
Ω_gpu = AFArray(Ω)

p = []
push!(p, M)
push!(p, Ω)

println("----")
u0 = zeros(ComplexF32,length(Ω))
tspan = (0f0,100f0)
prob = ODEProblem(Foo!,u0,tspan,p)
@time sol = DifferentialEquations.solve(prob,Tsit5())

u0_gpu = AFArray(zeros(ComplexF32,length(Ω)))
tspan = (0f0,100f0)
prob_gpu2 = ODEProblem(Goo,u0_gpu,tspan)
@time sol_gpu2 = DifferentialEquations.solve(prob_gpu2, Euler(), dt=(1/2)^5)
``````

I got warnings that I cannot see, but looks like both results (cpu and gpu) are working as expected ( to see this I need further codes, and they are not relevant here)

``````
ArrayFire v3.6.4 (OpenCL, 64-bit Linux, build 1b8030c)
 AMD: gfx900, 8176 MB
----
3.061343 seconds (13.05 M allocations: 624.245 MiB, 5.90% gc time)
3 warnings generated.
1.506639 seconds (2.33 M allocations: 119.601 MiB, 1.70% gc time)
``````

Yeah that stack overflow…

That broadcast implementation is really not a good idea. I’m surprised any code works. The fix should be:

``````using DiffEqBase
@inline function DiffEqBase.calculate_residuals(ũ::AFArray, u₀::AFArray, u₁::AFArray,
α, ρ, internalnorm,t)
@. ũ / (α + max(norm(u₀,t), norm(u₁,t)) * ρ)
end
``````

but, because their broadcast implementation doesn’t really work, it breaks the fusion of the expression:

``````MethodError: no method matching /(::AFArray{Complex{Float32},1}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(+),Tuple{Float32,Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(*),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(max),Tuple{Float32,Float32}},Float32}}}})
Closest candidates are:
/(::AFArray, !Matched::Number) at C:\Users\accou\.julia\packages\ArrayFire\nkN6J\src\array.jl:125
/(::AbstractArray, !Matched::Number) at arraymath.jl:55
/(::AbstractArray{T,1} where T, !Matched::Union{UnitLowerTriangular, UnitUpperTriangular}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.3\LinearAlgebra\src\triangular.jl:1948
...
calculate_residuals(::AFArray{Complex{Float32},1}, ::AFArray{Complex{Float32},1}, ::AFArray{Complex{Float32},1}, ::Float32, ::Float32, ::Function, ::Float32) at test.jl:158
perform_step!(::OrdinaryDiffEq.ODEIntegrator{Tsit5,false,AFArray{Complex{Float32},1},Float32,DiffEqBase.NullParameters,Float32,Float32,Float32,Array{AFArray{Complex{Float32},1},1},ODESolution{Complex{Float32},2,Array{AFArray{Complex{Float32},1},1},Nothing,Nothing,Array{Float32,1},Array{Array{AFArray{Complex{Float32},1},1},1},ODEProblem{AFArray{Complex{Float32},1},Tuple{Float32,Float32},false,DiffEqBase.NullParameters,ODEFunction{false,typeof(Goo),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem},Tsit5,OrdinaryDiffEq.InterpolationData{ODEFunction{false,typeof(Goo),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Array{AFArray{Complex{Float32},1},1},Array{Float32,1},Array{Array{AFArray{Complex{Float32},1},1},1},OrdinaryDiffEq.Tsit5ConstantCache{Complex{Float32},Float32}},DiffEqBase.DEStats},ODEFunction{false,typeof(Goo),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},OrdinaryDiffEq.Tsit5ConstantCache{Complex{Float32},Float32},OrdinaryDiffEq.DEOptions{Float32,Float32,Float32,Float32,typeof(DiffEqBase.ODE_DEFAULT_NORM),typeof(opnorm),CallbackSet{Tuple{},Tuple{}},typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN),typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE),typeof(DiffEqBase.ODE_DEFAULT_UNSTABLE_CHECK),DataStructures.BinaryHeap{Float32,DataStructures.LessThan},DataStructures.BinaryHeap{Float32,DataStructures.LessThan},Nothing,Nothing,Int64,Array{Float32,1},Array{Float32,1},Array{Float32,1}},AFArray{Complex{Float32},1},Complex{Float32},Nothing}, ::OrdinaryDiffEq.Tsit5ConstantCache{Complex{Float32},Float32}, ::Bool) at low_order_rk_perform_step.jl:598
perform_step! at low_order_rk_perform_step.jl:578 [inlined]
solve!(::OrdinaryDiffEq.ODEIntegrator{Tsit5,false,AFArray{Complex{Float32},1},Float32,DiffEqBase.NullParameters,Float32,Float32,Float32,Array{AFArray{Complex{Float32},1},1},ODESolution{Complex{Float32},2,Array{AFArray{Complex{Float32},1},1},Nothing,Nothing,Array{Float32,1},Array{Array{AFArray{Complex{Float32},1},1},1},ODEProblem{AFArray{Complex{Float32},1},Tuple{Float32,Float32},false,DiffEqBase.NullParameters,ODEFunction{false,typeof(Goo),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem},Tsit5,OrdinaryDiffEq.InterpolationData{ODEFunction{false,typeof(Goo),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Array{AFArray{Complex{Float32},1},1},Array{Float32,1},Array{Array{AFArray{Complex{Float32},1},1},1},OrdinaryDiffEq.Tsit5ConstantCache{Complex{Float32},Float32}},DiffEqBase.DEStats},ODEFunction{false,typeof(Goo),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},OrdinaryDiffEq.Tsit5ConstantCache{Complex{Float32},Float32},OrdinaryDiffEq.DEOptions{Float32,Float32,Float32,Float32,typeof(DiffEqBase.ODE_DEFAULT_NORM),typeof(opnorm),CallbackSet{Tuple{},Tuple{}},typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN),typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE),typeof(DiffEqBase.ODE_DEFAULT_UNSTABLE_CHECK),DataStructures.BinaryHeap{Float32,DataStructures.LessThan},DataStructures.BinaryHeap{Float32,DataStructures.LessThan},Nothing,Nothing,Int64,Array{Float32,1},Array{Float32,1},Array{Float32,1}},AFArray{Complex{Float32},1},Complex{Float32},Nothing}) at solve.jl:373
#__solve#331 at solve.jl:5 [inlined]
__solve at solve.jl:4 [inlined]
#solve_call#442(::Base.Iterators.Pairs{Union{},Union{},Tuple{},Na...
``````

I would open an issue that it’s not possible to do the operation

``````t = 1f0
ρ = 1f0
α = 1f0
u₀ = AFArray(rand(ComplexF32,length(Ω)))
u₁ = AFArray(rand(ComplexF32,length(Ω)))
ũ =  AFArray(rand(ComplexF32,length(Ω)))
@. ũ / (α + max(norm(u₀,t), norm(u₁,t)) * ρ)
``````

with ArrayFire.jl arrays.

1 Like

BTW, is there any reason why you’re using ArrayFire instead of CuArrays?

1 Like

They’re using an AMD GPU. Hopefully soon ROCArrays will work well enough to run their code (if it doesn’t already).

2 Likes

I see. Okay, so we should probably get the ArrayFire broadcast implementation fixed anyways. What it’s trying to do is make sure it calls the right kernels, but it doesn’t actually turn off fusion. It needs to do this kind of thing:

on `*`, `norm`, etc.

As stated before, I have a AMD GPU and ArrayFire is enough for me.