NeuralPDE on GPU throws NaNs when I use a source term raised to a power

Hello everyone,

I have a problem with NeuralPDE when trying to run the following problem on the GPU:

using NeuralPDE
using Optimization, OptimizationOptimJL, OptimizationOptimisers
import ModelingToolkit: Interval
using Lux, LuxCUDA, ComponentArrays, Random
using QuasiMonteCarlo


println("using CUDA: ", CUDA.functional())
move_to_gpu = gpu_device()

@parameters q μ
@variables P(..)

# Differential operators
∂q = Differential(q)
∂qq = Differential(q)^2
∂μμ = Differential(μ)^2

# Source term
Pc = 0.3405074
G(P) = 8 * (P - Pc)^3 * (P > Pc)

# Pc = 0.6
# G(P) = 4 * (P - Pc) * (P > Pc)

# G(P) = 0

# PDE
eq = 2 * q^3 * ∂q(P(q, μ)) + q^4 * ∂qq(P(q, μ)) +  q^2 * (1 - μ^2) * ∂μμ(P(q, μ)) ~ -G(P(q, μ))

# Domain limits
q_min = 0.0
q_max = 1.0
μ_min = -1.0
μ_max = 1.0

# Initial and boundary conditions
bcs = [
       P(q_min, μ) ~ 0.0
       , P(q_max, μ) ~ q_max * (1 - μ^2)
       , P(q, μ_min) ~ 0.0
       , P(q, μ_max) ~ 0.0
      ]

# Space and time domains
domains = [
           q ∈ Interval(q_min, q_max),
           μ ∈ Interval(μ_min, μ_max)
         #   ϕ ∈ Interval(ϕ_min, ϕ_max)
          ]
    

@named pde_system = PDESystem(eq, bcs, domains, [q, μ], [P(q, μ)])


inner = 20
chain = Chain(Dense(2, inner, tanh),
              Dense(inner, inner, tanh),
            #   Dense(inner, inner, tanh),
            #   Dense(inner, inner, tanh),
              Dense(inner, 1))

ps, st = Lux.setup(Random.default_rng(), chain)   # single setup call so parameters and states match
# ps = ps |> ComponentArray .|> Float64           # CPU variant
ps = ps |> ComponentArray |> move_to_gpu .|> Float64

iteration = Base.RefValue{Int}(1)

callback = function (p, l)
   
   if iteration[] % 100 == 0 || iteration[] == 1
      println("iteration ", iteration[], " loss: ", l)
   end
   
   iteration[] += 1

   return false
end

points = 1000
strategy = NeuralPDE.QuasiRandomTraining(points
                                         , bcs_points = points ÷ 4
                                         , sampling_alg = SobolSample()
                                        )

discretization = PhysicsInformedNN(chain
                                   , strategy
                                   , init_params = ps
                                  )

prob = discretize(pde_system, discretization)

res = Optimization.solve(prob, OptimizationOptimisers.Adam(0.05);
                         callback = callback, 
                         maxiters = 500)
prob = remake(prob, u0 = res.u)
res = Optimization.solve(prob, BFGS(); 
                         callback = callback, 
                         maxiters = 1000)

phi = discretization.phi

After the first iteration, the reported loss is NaN.
If I skip the GPU and run everything on the CPU, the network trains as it should and I get the desired result.
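
To narrow it down, here is the quick check I used (my own debugging sketch, not part of the script above; I am assuming the discretized OptimizationFunction is callable as prob.f(u, p), and that phi accepts a 2×N matrix of (q, μ) points — please correct me if the call signature is wrong):

# Debugging sketch (assumption: prob.f is callable as (u, p);
# phi(x, θ) takes a 2×N matrix of points and the parameter vector).
loss0 = prob.f(prob.u0, prob.p)
println("initial loss: ", loss0)

# sample a few points inside the domain (q ∈ [0,1], μ ∈ [-1,1]) on the GPU
xs = move_to_gpu(vcat(rand(1, 16), 2 .* rand(1, 16) .- 1))
ys = discretization.phi(xs, prob.u0)
println("any NaN in network output: ", any(isnan, Array(ys)))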

My attempts to find a solution have led me to the conclusion that the problem lies in the source term, which raises an expression to a power.

# Source term
Pc = 0.3405074
G(P) = 8 * (P - Pc)^3 * (P > Pc)

# Pc = 0.6
# G(P) = 4 * (P - Pc) * (P > Pc)

If I use another source term that is linear (the commented lines), I get the desired result without NaNs.
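
Part of why I suspect the power: even though the (P > Pc) factor should zero out the region where P - Pc is negative, a multiplicative mask cannot undo a NaN. If the cube itself ever evaluates to NaN on the GPU, multiplying by zero does not rescue it (plain-Julia illustration, nothing NeuralPDE-specific):

x = -0.1                 # stands in for P - Pc when P < Pc
mask = x > 0.0           # false, i.e. 0 when used in a product
println(0.0 * NaN)       # NaN — multiplying by zero does not remove a NaN
println(NaN * mask)      # NaN — so a NaN from the power term poisons the whole residual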

I found an issue on GitHub that seems to describe the same problem, but the suggested solution cannot work in my case.

Is there an alternative to the power operator that avoids this issue? Note that I have tried multiplying (P - Pc) by itself three times, but I hit the same problem.
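
For completeness, another formulation I considered is branching with ifelse instead of a multiplicative mask, so that even if the cube evaluates to NaN in the masked-out region, the selected value is still 0.0 (ifelse selects rather than multiplies). I have not verified that this traces cleanly through the Symbolics/NeuralPDE pipeline on the GPU, and NaNs in the discarded branch can still leak into gradients with some AD backends, so treat it only as a sketch:

# Untested sketch: ifelse-based source term instead of the boolean-mask product.
G_ifelse(P) = ifelse(P > Pc, 8 * (P - Pc)^3, 0.0)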

Thank you in advance for your help.

I have found the solution to the problem; I am posting it here in case anyone else encounters something similar in the future.

Replacing

(P - Pc)^3

with

abs(P - Pc)^3

does the trick for any exponent.
I am not sure why this is the case, and I doubt this fix is suitable for every use case, but it is what worked for me.
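
For anyone wondering whether abs changes the solution: it should not, as long as the (P > Pc) factor is there. For P > Pc the two expressions are identical, and for P ≤ Pc the mask is zero and abs merely keeps the base of the power non-negative, so the masked-out branch is a finite number instead of (apparently) a NaN. A quick plain-Julia check of the equivalence:

Pc = 0.3405074
G_orig(P) = 8 * (P - Pc)^3 * (P > Pc)
G_abs(P)  = 8 * abs(P - Pc)^3 * (P > Pc)

for P in (0.0, 0.2, Pc, 0.5, 1.0)
    @assert isapprox(G_orig(P), G_abs(P); atol = 1e-12)
end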

Hi, I applied our binary instrumentation tool GPU-FPX to this example (see my homepage at Utah [www dot cs dot utah dot edu / tilde ganesh] for a talk I recently gave at the ARRAY Workshop that describes it). It does shed some light on which kernels are involved in producing the exceptions.

I also ran your “fix”. It seems to produce an integer divide-by-zero after iteration 500. Did you see that? The loss is indeed “fixed” (no NaN), as you say, but even in this “fixed” version NaNs are being generated before iteration 1 itself.

Since I don’t know your code (I was merely looking for examples to try our tool on), please email me at the address on the webpage above, and perhaps we can develop not only an understanding of the issue but also help improve our tool? Thank you. Ganesh.

I won’t provide more than a few lines of what GPU-FPX produces below.

#GPU-FPX: Initializing GPU context…
No enable_kernels.txt !
No disable_kernels.txt !
#GPU-FPX: Instrument all kernels.
Running #GPU-FPX: kernel [gpu_broadcast_kernel_linear(CompilerMetadata] …
Running #GPU-FPX: kernel [gpu_fill_kernel_(CompilerMetadata] …
Running #GPU-FPX: kernel [gpu_linear_copy_kernel_(CompilerMetadata] …
Running #GPU-FPX: kernel [partial_mapreduce_grid(identity, _, Bool, CartesianIndices] …
Running #GPU-FPX: kernel [gpu_getindex_kernel(CompilerMetadata] …
Running #GPU-FPX: kernel [gpu_setindex_kernel(CompilerMetadata] …
Running #GPU-FPX: kernel [gpu_broadcast_kernel_cartesian(CompilerMetadata] …
#GPU-FPX-ANA SHARED REGISTER: Before executing the instruction @ ./int.jl:295 Instruction: MUFU.RCP64H R21, R19 ; We have 4 registers in total. Register 0 is VAL. Register 1 is NaN. Register 2 is VAL. Register 3 is VAL.

#GPU-FPX-ANA SHARED REGISTER: Before executing the instruction @ ./float.jl:491 Instruction: DFMA R12, RZ, R12, R16 ; We have 4 registers in total. Register 0 is VAL. Register 1 is NaN. Register 2 is VAL. Register 3 is VAL.

#GPU-FPX-ANA SHARED REGISTER: After executing the instruction @ ./float.jl:491 Instruction: DFMA R12, RZ, R12, R16 ; We have 4 registers in total. Register 0 is VAL. Register 1 is NaN. Register 2 is VAL. Register 3 is VAL.

…more omitted…