Code works on CPU but not on GPU

Hi,
I’m trying to implement a PINN (physics-informed neural network) to solve the heat equation in Julia. I come from PyTorch, so my code probably doesn’t look very Julia-like. I managed to get a version of my code running on the CPU, but now I want to run it on a GPU and I’m running into a lot of issues. Here’s the code I have. First I initialise some constants (it’s not clear to me whether I should send them all to the GPU if I want my computation to happen there):

using LinearAlgebra
using Parameters,BenchmarkTools
using Flux
using PyPlot
using Sobol

σ = Float32(0.1) |> gpu
d = Int32(2) |> gpu
const Δx = Float32(1e-2) |> gpu
const Δt = Float32(1e-2) |> gpu

Then I define some functions that I’ll need later:

function fExpl0(x) 
    sol = 1/(4*π*σ)^(d/2)*exp(-norm(x)^2/(4*σ))
    return Float32(sol)
end

function ∂t(f,point)
    t = point[1:1]
    x = point[2:end]
    g(t,x) = f([t,x...]) 

    return ∂t(g,t,x)
end

function ∂t(f,t,x)
    return (f(t .+ Δt,x) .- f(t,x))./(Δt)
end

function Δ(f,point)
    t = point[1:1]
    x = point[2:end]
    g(t,x) = f([t,x...]) 

    return Δ(g,t,x)
end

function Δ(f,t,x)
    d = length(x)
    Δf = eltype(x)(0.0) 
    for i = 1:d
        Δf += (f(t,x .+ Δx*I(d)[i,:]) .- 2*f(t,x) .+  f(t,x .+ (-Δx)*I(d)[i,:]))./Δx^2
    end
    return Δf
end

I didn’t manage to use automatic differentiation inside my loss function, and from here I understood that it could be easier to use finite differences (if you have a way to make autodiff work here I’d greatly appreciate it).
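For clarity, the functions above implement the standard stencils that the loss below will penalize: a forward difference in time and a central second difference in space,

$$
\partial_t f(t,x) \approx \frac{f(t+\Delta t,\,x) - f(t,x)}{\Delta t}, \qquad
\Delta f(t,x) \approx \sum_{i=1}^{d} \frac{f(t,\,x+\Delta x\,e_i) - 2 f(t,x) + f(t,\,x-\Delta x\,e_i)}{\Delta x^2},
$$

so the loss is the squared residual of the heat equation, $(\partial_t f - \sigma\,\Delta f)^2$, evaluated at the sampled points.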
Then I define a custom neural network with Flux that matches the initial condition exactly (at t = 0 it reduces to f0(x)):

struct NNInitialCondition
    chain::Chain
    f0::Function
end

function (m::NNInitialCondition)(point)
    t = point[1:1]
    x = point[2:end]
    return t .* m.chain(point)[1:1] .+ (1 .- t) .* m.f0(x)
end

Flux.@functor NNInitialCondition

NN = Chain(
    Dense(d+1 => 32, tanh),
    Dense(32 => 32, tanh),
    Dense(32 => 32, tanh),
    Dense(32 => 1)
) |> gpu

NNCustom = NNInitialCondition(NN,fExpl0)

Finally I define my loss function and the evaluation points:

function loss(f,point)
    lossVal = (∂t(f,point) .- σ*Δ(f,point) ).^2
    return lossVal
end

NPtTrain = 1000

s = SobolSeq(d+1)

evalPointsTrain = [Float32.(next!(s)) for i in 1:NPtTrain] |> gpu

Then I call:

NNCustom(evalPointsTrain[1])
loss(NNCustom,evalPointsTrain[1])

I have three questions at the moment:
1 - If I do not wrap the output of fExpl0(x) in Float32(), my custom net returns a Float64 value. Am I doing this correctly? Is this the best way to get a Float32 output from my custom model?
2 - After getting some warnings from CUDA about scalar indexing, I started writing things like t = point[1:1] rather than t = point[1]. Is this the correct way to proceed?
3 - When I call loss(NNCustom,evalPointsTrain[1]) I get the error:

ERROR: MethodError: no method matching +(::CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, ::Float32)
For element-wise addition, use broadcasting with dot syntax: array .+ scalar

I think it comes from the line return (f(t .+ Δt,x) .- f(t,x))./(Δt), which already uses .+, so I don’t understand why it’s complaining. I’m guessing it’s this line because I get the same error if I call ∂t(NNCustom,evalPointsTrain[1]).

Don’t splat. It’s bad for performance and it’s what breaks the GPU code here. Instead of [t,x...], do [t;x].
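To illustrate the difference (a rough sketch, assuming CUDA.jl is loaded and a GPU is available; the variable names are just for the example):

using CUDA

t = CUDA.ones(Float32, 1)   # length-1 "time" slice, like point[1:1]
x = CUDA.ones(Float32, 2)   # spatial coordinates, like point[2:end]

# Splatting iterates the CuArray element by element (scalar indexing) and
# assembles the result on the CPU; depending on the CUDA.allowscalar setting
# this warns or errors:
# bad = [t, x...]

# vcat stays on the GPU and returns a CuArray{Float32}:
good = [t; x]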


Cool, that helped. After a lot of iterations I now have the following code.
Importing packages and defining parameters:

using LinearAlgebra
using Parameters,BenchmarkTools
using Flux
using PyPlot
using Sobol
using CUDA
using OneHotArrays 

σ = Float32(0.1) |> gpu
d = Int32(2) |> gpu
BATCH_SIZE = 1000 |> gpu
const Δx = Float32(1e-2) |> gpu
const Δt = Float32(1e-2) |> gpu

Define initial condition and derivatives:

function fExpl0(x) 
    sol = 1/(4*π*σ)^(d/2)*exp.(-sum(x.^2, dims=1) ./(4*σ))
    return Float32.(sol)
end

function ∂t(f,point)
    t = point[1:1,:]
    x = point[2:end,:]
    return (f([t .+ Δt;x]) .- f([t;x]))./(Δt)
end

function Δ(f,point)
    t = point[1:1,:]
    x = point[2:end,:]
    Δf = CUDA.zeros(1,size(point)[end])
    for i = 1:d
        direction = OneHotVector(i,Int64(d))
        Δf = Δf .+ (f([t;x .+ Δx .* direction]) .- 2*f([t;x]) .+ f([t;x .+ (-Δx) .* direction]))./Δx^2
    end
    return Δf
end

Define the network and loss function

struct NNInitialCondition
    chain::Chain
    f0::Function
end

function (m::NNInitialCondition)(point)
    t = point[1:1,:]
    x = point[2:end,:]
    return t .* m.chain(point) .+ (1 .- t) .* m.f0(x)
end

Flux.@functor NNInitialCondition

NN = Chain(
    Dense(d+1 => 32, tanh),
    Dense(32 => 32, tanh),
    Dense(32 => 32, tanh),
    Dense(32 => 1)
) |> gpu

function loss(f,point)
    lossVal = sum((∂t(f,point) .- σ*Δ(f,point) ).^2)/size(point)[2]
    return lossVal
end

Define the data points and the network, evaluate the loss function, and compute the gradient:

NPtTrain = 1000
s = SobolSeq(d+1)
evalPointsTrain = vcat([Float32.(next!(s)) for i in 1:NPtTrain]'...) |> gpu

NNCustom = NNInitialCondition(NN,fExpl0)
loss(NNCustom,evalPointsTrain')
grads =  Flux.gradient( m -> loss(m,evalPointsTrain'), NNCustom)

I have 2 issues:
1 - Julia keeps complaining about scalar indexing; what am I doing wrong?
2 - I cannot compute the gradients because I get the very cryptic error below.
Also, am I approaching the problem in a completely wrong way? It took me 30 minutes at most to do the same thing in PyTorch, so I’m wondering whether I’m using the wrong approach here.

ERROR: `llvmcall` must be compiled to be called
Stacktrace:
  [1] macro expansion
    @ C:\Users\MARCO\.julia\packages\Zygote\HTsWj\src\compiler\interface2.jl:101 [inlined]
  [2] _pullback(::Zygote.Context{false}, ::Core.IntrinsicFunction, ::String, ::Type{Int64}, ::Type{Tuple{Ptr{Int64}}}, ::Ptr{Int64})
    @ Zygote C:\Users\MARCO\.julia\packages\Zygote\HTsWj\src\compiler\interface2.jl:101
  [3] _pullback
    @ .\atomics.jl:358 [inlined]
  [4] _pullback
    @ C:\Users\MARCO\.julia\packages\CUDA\pCcGc\lib\utils\threading.jl:25 [inlined]
  [5] _pullback
    @ C:\Users\MARCO\.julia\packages\CUDA\pCcGc\lib\utils\threading.jl:24 [inlined]
  [6] macro expansion
    @ C:\Users\MARCO\.julia\packages\CUDA\pCcGc\lib\utils\memoization.jl:66 [inlined]
  [7] _pullback
    @ C:\Users\MARCO\.julia\packages\CUDA\pCcGc\src\pool.jl:160 [inlined]
  [8] _pullback(ctx::Zygote.Context{false}, f::typeof(CUDA.stream_ordered), args::CuDevice)
    @ Zygote C:\Users\MARCO\.julia\packages\Zygote\HTsWj\src\compiler\interface2.jl:0
  [9] macro expansion
    @ C:\Users\MARCO\.julia\packages\CUDA\pCcGc\src\pool.jl:414 [inlined]
 [10] macro expansion
    @ .\timing.jl:393 [inlined]
 [11] _pullback
    @ C:\Users\MARCO\.julia\packages\CUDA\pCcGc\src\pool.jl:413 [inlined]
 [12] _pullback
    @ C:\Users\MARCO\.julia\packages\CUDA\pCcGc\src\pool.jl:408 [inlined]
 [13] _pullback
    @ C:\Users\MARCO\.julia\packages\CUDA\pCcGc\src\pool.jl:398 [inlined]
 [14] _pullback
    @ C:\Users\MARCO\.julia\packages\CUDA\pCcGc\src\pool.jl:392 [inlined]
 [15] _pullback
    @ C:\Users\MARCO\.julia\packages\CUDA\pCcGc\src\array.jl:93 [inlined]
 [16] _pullback(::Zygote.Context{false}, ::Type{CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}, ::UndefInitializer, ::Tuple{Int64, Int64})
    @ Zygote C:\Users\MARCO\.julia\packages\Zygote\HTsWj\src\compiler\interface2.jl:0
 [17] _pullback
    @ C:\Users\MARCO\.julia\packages\CUDA\pCcGc\src\array.jl:176 [inlined]
 [18] _pullback
    @ C:\Users\MARCO\.julia\packages\CUDA\pCcGc\src\array.jl:187 [inlined]
 [19] _pullback
    @ C:\Users\MARCO\.julia\packages\CUDA\pCcGc\src\array.jl:189 [inlined]
 [20] _pullback(::Zygote.Context{false}, ::Type{CuArray{Float32}}, ::UndefInitializer, ::Int64, ::Int64)
    @ Zygote C:\Users\MARCO\.julia\packages\Zygote\HTsWj\src\compiler\interface2.jl:0
 [21] _apply(::Function, ::Vararg{Any})
    @ Core .\boot.jl:838
 [22] adjoint
    @ C:\Users\MARCO\.julia\packages\Zygote\HTsWj\src\lib\lib.jl:203 [inlined]
 [23] _pullback
    @ C:\Users\MARCO\.julia\packages\ZygoteRules\OgCVT\src\adjoint.jl:66 [inlined]
 [24] _pullback
    @ C:\Users\MARCO\.julia\packages\CUDA\pCcGc\src\array.jl:655 [inlined]
 [25] _pullback(::Zygote.Context{false}, ::typeof(CUDA.zeros), ::Type{Float32}, ::Int64, ::Int64)
    @ Zygote C:\Users\MARCO\.julia\packages\Zygote\HTsWj\src\compiler\interface2.jl:0
 [26] _apply(::Function, ::Vararg{Any})
    @ Core .\boot.jl:838
 [27] adjoint
    @ C:\Users\MARCO\.julia\packages\Zygote\HTsWj\src\lib\lib.jl:203 [inlined]
 [28] _pullback
    @ C:\Users\MARCO\.julia\packages\ZygoteRules\OgCVT\src\adjoint.jl:66 [inlined]
 [29] _pullback
    @ C:\Users\MARCO\.julia\packages\CUDA\pCcGc\src\array.jl:657 [inlined]
 [30] _pullback(::Zygote.Context{false}, ::typeof(CUDA.zeros), ::Int64, ::Int64)
    @ Zygote C:\Users\MARCO\.julia\packages\Zygote\HTsWj\src\compiler\interface2.jl:0
 [31] _pullback
    @ c:\Users\MARCO\Dropbox\cricci francischello\neuralPDEMasterEq\tests\HeatEQ\testLossFD.jl:39 [inlined]
 [32] _pullback(::Zygote.Context{false}, ::typeof(Δ), ::NNInitialCondition, ::Adjoint{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}})
    @ Zygote C:\Users\MARCO\.julia\packages\Zygote\HTsWj\src\compiler\interface2.jl:0
 [33] _pullback
    @ c:\Users\MARCO\Dropbox\cricci francischello\neuralPDEMasterEq\tests\HeatEQ\testLossFD.jl:87 [inlined]
 [34] _pullback(::Zygote.Context{false}, ::typeof(loss), ::NNInitialCondition, ::Adjoint{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}})
    @ Zygote C:\Users\MARCO\.julia\packages\Zygote\HTsWj\src\compiler\interface2.jl:0
 [35] _pullback
    @ c:\Users\MARCO\Dropbox\cricci francischello\neuralPDEMasterEq\tests\HeatEQ\testLossFD.jl:99 [inlined]
 [36] _pullback(ctx::Zygote.Context{false}, f::var"#17#18", args::NNInitialCondition)
    @ Zygote C:\Users\MARCO\.julia\packages\Zygote\HTsWj\src\compiler\interface2.jl:0
 [37] pullback(f::Function, cx::Zygote.Context{false}, args::NNInitialCondition)
    @ Zygote C:\Users\MARCO\.julia\packages\Zygote\HTsWj\src\compiler\interface.jl:44
 [38] pullback
    @ C:\Users\MARCO\.julia\packages\Zygote\HTsWj\src\compiler\interface.jl:42 [inlined]
 [39] gradient(f::Function, args::NNInitialCondition)
    @ Zygote C:\Users\MARCO\.julia\packages\Zygote\HTsWj\src\compiler\interface.jl:96
 [40] top-level scope
    @ c:\Users\MARCO\Dropbox\cricci francischello\neuralPDEMasterEq\tests\HeatEQ\testLossFD.jl:99

TBH this stuff all looks super weird to me. Have you taken a peek at the model zoo?

Yeah, thanks. The model itself is not very complicated tbh, but I’ll have a look to see if there’s something similar. The problem seems to lie in the Δ(f,point) function; the rest I can happily take the gradient of. I thought of using FiniteDiff.jl rather than doing the finite differences by hand, but in the future I’d like to solve a PDE with, say, 45 dimensions where I only need 5 second-order derivatives, and it’s not clear from the docs whether there’s a way to compute just those 5 entries of the Hessian with that package. I tried to look at the source but it’s pretty obscure to me :eyes:

How big is your function, in terms of total number of operations? If it is less than 10^6 then FastDifferentiation.jl (https://github.com/brianguenter/FastDifferentiation.jl) might work for you.

This package unrolls matrix multiplications into scalar operations, so if your matrices are big it won’t be practical. But if your problem size is in the range that FastDifferentiation can handle, you should get very good performance.

I don’t think there is currently a way to compute just a few entries in the Hessian. Create an issue on the repo and I’ll take a look. It shouldn’t be hard to add.

The error comes from trying to differentiate through CUDA.zeros, which Zygote (the AD engine Flux uses by default) doesn’t support. There are two solutions here. The quickest would be to use the functionality provided by API · ChainRules to tell the AD not to recurse into CUDA.zeros (or any other code which is problematic but doesn’t need gradients to flow through it). The other, and arguably better, option is to replace explicit use of CUDA.jl functions like CUDA.zeros with more general code such as fill!(similar(point, 1, size(point)[end]), 0). Besides not upsetting Zygote, this has the benefit of working on both CPU and GPU.
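To make the second option concrete, the Δ from your post could look roughly like this (an untested sketch; only the accumulator line changes):

function Δ(f,point)
    t = point[1:1,:]
    x = point[2:end,:]
    # allocate the accumulator with the same array type and eltype as `point`,
    # so the same code works on CPU and GPU without referencing CUDA.zeros
    Δf = fill!(similar(point, 1, size(point)[end]), 0)
    for i = 1:d
        direction = OneHotVector(i, Int64(d))
        Δf = Δf .+ (f([t; x .+ Δx .* direction]) .- 2*f([t; x]) .+ f([t; x .+ (-Δx) .* direction]))./Δx^2
    end
    return Δf
end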