Some help with ArrayFire + DifferentialEquations

Hellos,

I noticed that with one line definition for ArrayFire.jl I can make it run together with DifferentialEquations.jl, but it is not working as expected.

DifferentialEquations is calling a generic multiplication function generic_matvecmul! instead of using ArrayFire *(matmul) and I would like to know how to fix it (if possible) to get all the performance of the GPU.

Here a minimal example of my problem:

# Don't worry about this part
# It is necessary to define the matrix "M"  below
function _sph2cart(θ, ϕ, r)
    N = length(θ)
    x = Array{Float64}(undef, N)
    y = Array{Float64}(undef, N)
    z = Array{Float64}(undef, N)

    x = r*cos(θ)*cos(ϕ)
    y = r*sin(θ)*cos(ϕ)
    z = r*sin(ϕ)
    return x,y,z
end

function _RandPoints_Sphere(N::Int64, Radius, MinDist )
    X = Array{Float64}(undef, N)
    Y = Array{Float64}(undef, N)
    Z = Array{Float64}(undef, N)
    dist_3 = MinDist^3
    iLoop = 1
    nValid = 0
    while nValid < N && iLoop < 10^6
        rvals = 2*rand() - 1
        elevation = asin(rvals)
        azimuth = 2π*rand()
        radii = Radius*(rand()^(1 ./3.))

        newX, newY, newZ = _sph2cart(azimuth, elevation, radii)
        if all(( (X[1:nValid].- newX).^2 + (Y[1:nValid].- newY).^2 + (Z[1:nValid].- newZ).^2 ) .>= dist_3)
            nValid = nValid + 1
            X[nValid] = newX
            Y[nValid] = newY
            Z[nValid] = newZ
        end
        iLoop = iLoop + 1
    end
    if nValid < N
        error("Could not generate all data after 10M interactions")
    end

    X_jk = X*ones(1,N) - ones(N,1)*X';
    Y_jk = Y*ones(1,N) - ones(N,1)*Y';
    Z_jk = Z*ones(1,N) - ones(N,1)*Z';
    R_jk = sqrt.(X_jk.^2 + Y_jk.^2 + Z_jk.^2);
    X_jk = 1;Y_jk=1; Z_jk=1; GC.gc()
    return [X Y Z],R_jk
end
# --------------------------------------------


using ArrayFire, BenchmarkTools, DifferentialEquations, LinearAlgebra

# We need to define any()  with AFArray
Base.any(f::Function, a::AFArray) = Bool(sum(f.(a)))

# Minimal code for what I want to do
function Foo!(du,u,p,t)
	M = p[1]
	Ω = p[2]
	du[:] = M*u + Ω
end

# just another attempt
Goo(u, p,t) = M_gpu*u + Ω_gpu


N = 10 # matrix size

r, R_jk = _RandPoints_Sphere(N, 10, 0 )
M = Array{Complex{Float64}}(undef, N,N)
@. M = -(1/2)*exp(im*R_jk)/(im*R_jk)
M[diagind(M)] .= 1im*1 - 1/2
Ω = -(im/2)*exp.(1im.*r[:,3]) 

M_gpu = AFArray(M)
Ω_gpu = AFArray(Ω)

p = []
push!(p, M)
push!(p, Ω)
p_gpu = []
push!(p_gpu, M_gpu)
push!(p_gpu, Ω_gpu)


println("----")
u0 = zeros(ComplexF64,length(Ω))
tspan = (0.0,100.0)
prob = ODEProblem(Foo!,u0,tspan,p)
@btime sol = DifferentialEquations.solve(prob,Tsit5())


u0_gpu = AFArray(zeros(ComplexF64,length(Ω)))
tspan = (0.0,100.0)
prob_gpu1 = ODEProblem(Foo!,u0,tspan, p_gpu)
@btime sol_gpu1 = DifferentialEquations.solve(prob_gpu1,Tsit5())


u0_gpu = AFArray(zeros(ComplexF64,length(Ω)))
tspan = (0.0,100.0)
prob_gpu2 = ODEProblem(Goo,u0,tspan)
@btime sol_gpu2 = DifferentialEquations.solve(prob_gpu2,Tsit5())

Results:


ArrayFire v3.6.4 (OpenCL, 64-bit Linux, build 1b8030c)
[0] AMD: gfx900, 8176 MB
----
  158.905 μs (1412 allocations: 293.16 KiB)
  830.935 ms (91112 allocations: 5.08 MiB)
  825.654 ms (99245 allocations: 5.99 MiB)

ArrayFire has a memory pool where, instead of mutating them, it’s usually better to create new ones.

But anyways, your problem seems to be a typo:

You didn’t use u0_gpu, and instead had it on the CPU.

Also, you might want to use 32-bit Floats.

(edit)
Using F32 and fixing my typo, I got StackOverflowError using the Foo! :cry:

and Input type mismatch with Goo

What are your input types into that operation? Print out typeof.

To print the type:

function Goo(u, p,t)
        println("M_gpu: ",typeof(M_gpu))
	println("u    : ",typeof(u))
	println("Ω_gpu: ",typeof(Ω_gpu))
	return M_gpu*u + Ω_gpu
end

and I run the code:

M_gpu: AFArray{Complex{Float32},2}
u    : AFArray{Complex{Float32},1}
Ω_gpu: AFArray{Complex{Float32},1}
M_gpu: AFArray{Complex{Float32},2}
u    : AFArray{Complex{Float32},1}
Ω_gpu: AFArray{Complex{Float32},1}
M_gpu: AFArray{Complex{Float32},2}
u    : AFArray{Complex{Float64},1}
Ω_gpu: AFArray{Complex{Float32},1}

at some moment, the type of u changes and creates the issue.

Converting everything again for ComplexFloat64 (and still using the Goo code) produces the same StackOverflowError as before with Foo!.

I don’t see a point where in does in your printout?

In the last 3 printlns, the type of u becomes Float64, and before it was Float32

M_gpu: AFArray{Complex{Float32},2}
u    : AFArray{Complex{Float**64**},1}
Ω_gpu: AFArray{Complex{Float32},1}

Make time Float32, i.e. tspan = (0f0,100f0)

Doing tspan with Float32 and rebooting my PC to make sure that nothing strange was happening, I still get the same StackOverflowError

Whereat?

See what happens if you make abstol=1f-6,reltol=1f-3.

I tested all the Solvers: RK4() I got the same error of type mismatch because u changes to Float64. Most all the others Explicity Runge-Kutta I got StackOverflowError, however the the simplest Euler() with dt=(1/2)^5 works.

It looks to be something in adaptive time stepping. What’s your current code? I may be able to run it soon

using ArrayFire, BenchmarkTools, LinearAlgebra
# We need to define any()  wit AFArray
Base.any(f::Function, a::AFArray) = Bool(sum(f.(a)))

using DifferentialEquations
function Foo!(du,u,p,t)
	M = p[1]
	Ω = p[2]
	du[:] = M*u + Ω
end

function Goo(u, p, t)
	# println("M_gpu: ",typeof(M_gpu))
	# println("u    : ",typeof(u))
	# println("Ω_gpu: ",typeof(Ω_gpu))
	return M_gpu*u + Ω_gpu
end



N = 10
# Create Matrix for this example
r, R_jk = _RandPoints_Sphere(N, 10, 0 )
M = Array{Complex{Float32}}(undef, N,N)
@. M = -(1/2)*exp(im*R_jk)/(im*R_jk)
M[diagind(M)] .= 1im*1 - 1/2
Ω = ComplexF32.(-(im/2)*exp.(im.*r[:,3]))

M_gpu = AFArray(M)
Ω_gpu = AFArray(Ω)

p = []
push!(p, M)
push!(p, Ω)

println("----")
u0 = zeros(ComplexF32,length(Ω))
tspan = (0f0,100f0)
prob = ODEProblem(Foo!,u0,tspan,p)
@time sol = DifferentialEquations.solve(prob,Tsit5())
 

u0_gpu = AFArray(zeros(ComplexF32,length(Ω)))
tspan = (0f0,100f0)
prob_gpu2 = ODEProblem(Goo,u0_gpu,tspan)
@time sol_gpu2 = DifferentialEquations.solve(prob_gpu2, Euler(), dt=(1/2)^5)

I got warnings that I cannot see, but looks like both results (cpu and gpu) are working as expected ( to see this I need further codes, and they are not relevant here)


ArrayFire v3.6.4 (OpenCL, 64-bit Linux, build 1b8030c)
[0] AMD: gfx900, 8176 MB
----
  3.061343 seconds (13.05 M allocations: 624.245 MiB, 5.90% gc time)
3 warnings generated.
  1.506639 seconds (2.33 M allocations: 119.601 MiB, 1.70% gc time)

Yeah that stack overflow…

https://github.com/JuliaGPU/ArrayFire.jl/blob/master/src/array.jl#L217-L225

That broadcast implementation is really not a good idea. I’m surprised any code works. The fix should be:

using DiffEqBase
@inline function DiffEqBase.calculate_residuals(ũ::AFArray, u₀::AFArray, u₁::AFArray,
                                    α, ρ, internalnorm,t)
    @. ũ / (α + max(norm(u₀,t), norm(u₁,t)) * ρ)
end

but, because their broadcast implementation doesn’t really work, it breaks the fusion of the expression:

MethodError: no method matching /(::AFArray{Complex{Float32},1}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(+),Tuple{Float32,Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(*),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(max),Tuple{Float32,Float32}},Float32}}}})
Closest candidates are:
  /(::AFArray, !Matched::Number) at C:\Users\accou\.julia\packages\ArrayFire\nkN6J\src\array.jl:125
  /(::AbstractArray, !Matched::Number) at arraymath.jl:55
  /(::AbstractArray{T,1} where T, !Matched::Union{UnitLowerTriangular, UnitUpperTriangular}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.3\LinearAlgebra\src\triangular.jl:1948
  ...
broadcasted(::Function, ::AFArray{Complex{Float32},1}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(+),Tuple{Float32,Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(*),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(max),Tuple{Float32,Float32}},Float32}}}}) at array.jl:220
calculate_residuals(::AFArray{Complex{Float32},1}, ::AFArray{Complex{Float32},1}, ::AFArray{Complex{Float32},1}, ::Float32, ::Float32, ::Function, ::Float32) at test.jl:158
perform_step!(::OrdinaryDiffEq.ODEIntegrator{Tsit5,false,AFArray{Complex{Float32},1},Float32,DiffEqBase.NullParameters,Float32,Float32,Float32,Array{AFArray{Complex{Float32},1},1},ODESolution{Complex{Float32},2,Array{AFArray{Complex{Float32},1},1},Nothing,Nothing,Array{Float32,1},Array{Array{AFArray{Complex{Float32},1},1},1},ODEProblem{AFArray{Complex{Float32},1},Tuple{Float32,Float32},false,DiffEqBase.NullParameters,ODEFunction{false,typeof(Goo),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem},Tsit5,OrdinaryDiffEq.InterpolationData{ODEFunction{false,typeof(Goo),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Array{AFArray{Complex{Float32},1},1},Array{Float32,1},Array{Array{AFArray{Complex{Float32},1},1},1},OrdinaryDiffEq.Tsit5ConstantCache{Complex{Float32},Float32}},DiffEqBase.DEStats},ODEFunction{false,typeof(Goo),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},OrdinaryDiffEq.Tsit5ConstantCache{Complex{Float32},Float32},OrdinaryDiffEq.DEOptions{Float32,Float32,Float32,Float32,typeof(DiffEqBase.ODE_DEFAULT_NORM),typeof(opnorm),CallbackSet{Tuple{},Tuple{}},typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN),typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE),typeof(DiffEqBase.ODE_DEFAULT_UNSTABLE_CHECK),DataStructures.BinaryHeap{Float32,DataStructures.LessThan},DataStructures.BinaryHeap{Float32,DataStructures.LessThan},Nothing,Nothing,Int64,Array{Float32,1},Array{Float32,1},Array{Float32,1}},AFArray{Complex{Float32},1},Complex{Float32},Nothing}, ::OrdinaryDiffEq.Tsit5ConstantCache{Complex{Float32},Float32}, ::Bool) at low_order_rk_perform_step.jl:598
perform_step! at low_order_rk_perform_step.jl:578 [inlined]
solve!(::OrdinaryDiffEq.ODEIntegrator{Tsit5,false,AFArray{Complex{Float32},1},Float32,DiffEqBase.NullParameters,Float32,Float32,Float32,Array{AFArray{Complex{Float32},1},1},ODESolution{Complex{Float32},2,Array{AFArray{Complex{Float32},1},1},Nothing,Nothing,Array{Float32,1},Array{Array{AFArray{Complex{Float32},1},1},1},ODEProblem{AFArray{Complex{Float32},1},Tuple{Float32,Float32},false,DiffEqBase.NullParameters,ODEFunction{false,typeof(Goo),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem},Tsit5,OrdinaryDiffEq.InterpolationData{ODEFunction{false,typeof(Goo),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Array{AFArray{Complex{Float32},1},1},Array{Float32,1},Array{Array{AFArray{Complex{Float32},1},1},1},OrdinaryDiffEq.Tsit5ConstantCache{Complex{Float32},Float32}},DiffEqBase.DEStats},ODEFunction{false,typeof(Goo),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},OrdinaryDiffEq.Tsit5ConstantCache{Complex{Float32},Float32},OrdinaryDiffEq.DEOptions{Float32,Float32,Float32,Float32,typeof(DiffEqBase.ODE_DEFAULT_NORM),typeof(opnorm),CallbackSet{Tuple{},Tuple{}},typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN),typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE),typeof(DiffEqBase.ODE_DEFAULT_UNSTABLE_CHECK),DataStructures.BinaryHeap{Float32,DataStructures.LessThan},DataStructures.BinaryHeap{Float32,DataStructures.LessThan},Nothing,Nothing,Int64,Array{Float32,1},Array{Float32,1},Array{Float32,1}},AFArray{Complex{Float32},1},Complex{Float32},Nothing}) at solve.jl:373
#__solve#331 at solve.jl:5 [inlined]
__solve at solve.jl:4 [inlined]
#solve_call#442(::Base.Iterators.Pairs{Union{},Union{},Tuple{},Na...

I would open an issue that it’s not possible to do the operation

t = 1f0
ρ = 1f0
α = 1f0
u₀ = AFArray(rand(ComplexF32,length(Ω)))
u₁ = AFArray(rand(ComplexF32,length(Ω)))
ũ =  AFArray(rand(ComplexF32,length(Ω)))
@. ũ / (α + max(norm(u₀,t), norm(u₁,t)) * ρ)

with ArrayFire.jl arrays.

1 Like

BTW, is there any reason why you’re using ArrayFire instead of CuArrays?

1 Like

They’re using an AMD GPU. Hopefully soon ROCArrays will work well enough to run their code (if it doesn’t already).

2 Likes

I see. Okay, so we should probably get the ArrayFire broadcast implementation fixed anyways. What it’s trying to do is make sure it calls the right kernels, but it doesn’t actually turn off fusion. It needs to do this kind of thing:

https://github.com/JuliaLang/julia/blob/master/base/broadcast.jl#L1026-L1082

on *, norm, etc.

As stated before, I have a AMD GPU and ArrayFire is enough for me.