NeuralPDE features and GPU compatibility

Hello everyone. I am new to Julia, but somewhat experienced in physics-informed neural networks.
I have some questions about several NeuralPDE features, their compatibility with GPUs, and some choices made in the example in the documentation here.
Specifically:

  1. The code from the documentation does not use my GPU at all out of the box. After searching, I found that replacing using CUDA with using LuxCUDA solves the issue. Are the two packages equivalent or is there something wrong with my system?

  2. If I choose the GridTraining strategy (as in the example), 100% of my GPU is used. However, if I choose QuasiRandomTraining with the default sampling algorithm, 0% (zero) of my GPU is used and all the calculations are carried out by a single CPU core. If I change the sampling algorithm to SobolSample(), my GPU is indeed used, but only at ~30%. Is there any reason for this behaviour?

  3. In the example, after the network is trained, the following lines are used to calculate the solution on a grid for plotting and analysis:

phi = discretization.phi
ts, xs, ys = [infimum(d.domain):0.1:supremum(d.domain) for d in domains]
u_real = [analytic_sol_func(t, x, y) for t in ts for x in xs for y in ys]
u_predict = [first(Array(phi(gpu([t, x, y]), res.u))) for t in ts for x in xs for y in ys]

If I understand correctly, the grid points t, x, y are moved to the GPU (where the parameters res.u also reside) and then the solution is evaluated there. However, this is extremely slow. Instead, if I replace the last line with

u_predict = [first(Array(phi([t, x, y], cpu(res.u)))) for t in ts for x in xs for y in ys]

meaning that I am moving the parameters res.u to the CPU, I get a significant gain in speed. Is there any reason to prefer one over the other, and if not, why does the documentation use the slow version? (I also sketch a batched alternative right below.)
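Specifically, I would have expected a single batched call over the whole grid to avoid the per-point overhead entirely. Here is a rough sketch of what I mean (it reuses phi, ts, xs, ys, gpu and res.u from the MWE below and assumes phi accepts a matrix with one point per column; I have not verified this against the documented API):

# Hedged sketch: build the whole grid as a 3×M matrix (y varies fastest, matching
# the comprehensions above) and evaluate phi once instead of once per point.
grid = stack([t, x, y] for t in ts for x in xs for y in ys)   # requires Julia ≥ 1.9
u_predict = vec(Array(phi(gpu(grid), res.u)))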
Below is an MWE, which is basically the example from the documentation with a few lines added by me regarding the above points (my fixes are commented out):

using NeuralPDE, Lux, Random, ComponentArrays
using Optimization
using OptimizationOptimisers
import ModelingToolkit: Interval
using QuasiMonteCarlo
using BenchmarkTools
using CUDA           # This does not recognise my GPU
# using LuxCUDA      # This does

@parameters t x y
@variables u(..)
Dxx = Differential(x)^2
Dyy = Differential(y)^2
Dt = Differential(t)
t_min = 0.0
t_max = 2.0
x_min = 0.0
x_max = 2.0
y_min = 0.0
y_max = 2.0

# 2D PDE
eq = Dt(u(t, x, y)) ~ Dxx(u(t, x, y)) + Dyy(u(t, x, y))

analytic_sol_func(t, x, y) = exp(x + y) * cos(x + y + 4t)
# Initial and boundary conditions
bcs = [u(t_min, x, y) ~ analytic_sol_func(t_min, x, y),
    u(t, x_min, y) ~ analytic_sol_func(t, x_min, y),
    u(t, x_max, y) ~ analytic_sol_func(t, x_max, y),
    u(t, x, y_min) ~ analytic_sol_func(t, x, y_min),
    u(t, x, y_max) ~ analytic_sol_func(t, x, y_max)]

# Space and time domains
domains = [t ∈ Interval(t_min, t_max),
    x ∈ Interval(x_min, x_max),
    y ∈ Interval(y_min, y_max)]

# Neural network
inner = 25
chain = Chain(Dense(3, inner, Lux.σ),
              Dense(inner, inner, Lux.σ),
              Dense(inner, inner, Lux.σ),
              Dense(inner, inner, Lux.σ),
              Dense(inner, 1))

# strategy = GridTraining(0.05)         #This uses 100% of my GPU                          
strategy = NeuralPDE.QuasiRandomTraining(500, resampling=true)       #This uses 0% of my GPU
# strategy = NeuralPDE.QuasiRandomTraining(500, sampling_alg = SobolSample(), resampling=true)       #This uses 30% of my GPU
ps = Lux.setup(Random.default_rng(), chain)[1]
ps = ps |> ComponentArray |> gpu .|> Float64
discretization = PhysicsInformedNN(chain,
                                   strategy,
                                   init_params = ps)

@named pde_system = PDESystem(eq, bcs, domains, [t, x, y], [u(t, x, y)])
prob = discretize(pde_system, discretization)
symprob = symbolic_discretize(pde_system, discretization)

callback = function (p, l)
    println("Current loss is: $l")
    return false
end

res = Optimization.solve(prob, Adam(0.01); callback = callback, maxiters = 100)
phi = discretization.phi
ts, xs, ys = [infimum(d.domain):0.1:supremum(d.domain) for d in domains]
u_real = [analytic_sol_func(t, x, y) for t in ts for x in xs for y in ys]
u_predict = [first(Array(phi(gpu([t, x, y]), res.u))) for t in ts for x in xs for y in ys]      #This is slow
# u_predict = [first(Array(phi([t, x, y], cpu(res.u)))) for t in ts for x in xs for y in ys]    #This is faster

@btime u_predict = [first(Array(phi(gpu([t, x, y]), res.u))) for t in ts for x in xs for y in ys]
@btime u_predict = [first(Array(phi([t, x, y], cpu(res.u)))) for t in ts for x in xs for y in ys]

And some additional info:


(@v1.9) pkg> st
Status `~/.julia/environments/v1.9/Project.toml`
  [c29ec348] AbstractDifferentiation v0.5.3
  [6e4b80f9] BenchmarkTools v1.3.2
  [052768ef] CUDA v5.0.0
⌃ [13f3f980] CairoMakie v0.10.10
⌅ [b0b7db55] ComponentArrays v0.14.1
  [587475ba] Flux v0.14.6
  [59287772] Formatting v0.4.2
  [f6369f11] ForwardDiff v0.10.36
⌃ [e9467ef8] GLMakie v0.8.10
  [033835bb] JLD2 v0.4.35
  [b964fa9f] LaTeXStrings v1.3.0
⌃ [b2108857] Lux v0.5.5
  [d0bbae9a] LuxCUDA v0.3.1
  [dde8697e] MakiePublication v0.3.2
⌃ [961ee093] ModelingToolkit v8.70.0
⌃ [315f7962] NeuralPDE v5.8.0
⌅ [3bd65402] Optimisers v0.2.20
  [7f7a1694] Optimization v3.19.1
⌃ [36348300] OptimizationOptimJL v0.1.9
⌃ [42dfb2eb] OptimizationOptimisers v0.1.5
  [91a5bcdd] Plots v1.39.0
  [d330b81b] PyPlot v2.11.2
⌅ [8a4e6c94] QuasiMonteCarlo v0.2.19
  [e88e6eb3] Zygote v0.6.65
  [02a925ec] cuDNN v1.2.0
nvidia-smi
Wed Oct 11 19:51:23 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:2E:00.0 Off |                  N/A |
| N/A   46C    P8     2W /  60W |    351MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1378      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A    306393      C   ...ros/julia-1.9.3/bin/julia      344MiB |
+-----------------------------------------------------------------------------+

Sorry for the long post, but I wanted to be as detailed as possible.
Thank you in advance for your help.

@avikpal is that some change from before?

It will depend on the number of samples taken.

It’s a demonstration. Of course smaller networks (and, on many GPUs, using Float64) will be slower on the GPU. You need the right kind of GPU (or use Float32) and large enough networks and batches to get any speedup from GPUs. This is pretty standard. Did you run a matmul smoke test?
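For example, a quick check of that precision gap could look like the sketch below (the sizes are only illustrative; consumer GeForce cards have far less Float64 than Float32 throughput, so the cutoff point moves a lot with precision):

using CUDA, BenchmarkTools
A32, B32 = CUDA.rand(Float32, 2048, 2048), CUDA.rand(Float32, 2048, 2048)
A64, B64 = CUDA.rand(Float64, 2048, 2048), CUDA.rand(Float64, 2048, 2048)
@btime CUDA.@sync $A32 * $B32   # Float32 matmul
@btime CUDA.@sync $A64 * $B64   # Float64 matmul: typically much slower on GeForce cards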

Hi @ChrisRackauckas and thank you for your reply.

Does this mean that if I use more samples, more of my GPU will be used? Note that the percentages given do not refer to GPU memory but GPU usage.

I see. So for small-scale simulations, training should be done on the GPU (I do in fact get a significant speedup when I train on the GPU), but analysis on the CPU. Is that the idea?

Apologies if this is something straightforward, but I am new to Julia and I don’t know what this is.

Thanks for the help.

Yes. Neural networks essentially evaluate N×B matmuls, where N is the node size and B is the batch size. By default, the batch size equals the number of sampled points.
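A minimal sketch of that scaling (the sizes are only illustrative, roughly matching the inner = 25 network and the 500-point sampler from the MWE): the larger B is, the more work each kernel launch does and the busier the GPU gets.

using CUDA, BenchmarkTools
N = 25                                   # node size ("inner" in the MWE)
W = CUDA.rand(Float32, N, N)             # one hidden layer's weight matrix
X_small = CUDA.rand(Float32, N, 500)     # B = 500 sampled points
X_large = CUDA.rand(Float32, N, 50_000)  # B = 50_000 sampled points
@btime CUDA.@sync $W * $X_small          # tiny batch: launch overhead dominates
@btime CUDA.@sync $W * $X_large          # large batch: the GPU actually gets busy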

Yes. Analysis is generally made up of O(n)-scale operations, as opposed to matmuls, which are O(n^3) (well, that’s for square matrices, so adjust for non-square, but you get the point).
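As a rough illustration of why per-point GPU evaluation loses out (the example is mine, not from the thread): for a single 3-element point, the transfer and kernel-launch overhead dwarfs the arithmetic.

using CUDA, BenchmarkTools
v = rand(Float64, 3)                 # one grid point, as in the plotting loop
@btime sum($v)                       # CPU: essentially free
@btime CUDA.@sync sum(CuArray($v))   # GPU: dominated by transfer + launch overhead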

To know if something will be fast on the GPU, outline the sizes and just do a quick test.

using CUDA, BenchmarkTools
x, y = rand(Float32, 1000, 1000), rand(Float32, 1000, 1000)  # pick sizes matching your problem
@btime $x .* $y                   # CPU
cux, cuy = cu(x), cu(y)
@btime CUDA.@sync $cux .* $cuy    # GPU (sync so the kernel itself is timed)

etc. If you want to know whether a neural network will be faster on the GPU than on the CPU, just do the matmul in isolation and check. The cutoff point is different for each GPU and depends on things like precision, but it’s relatively straightforward to check whether you should expect a speedup. Normally, if you don’t get one, this kind of quick check makes it very easy to see that the operation is just slower on the GPU than on the CPU.


Thank you for the detailed answers.

I have many more questions and doubts regarding NeuralPDE, and I will be opening new topics and possibly GitHub issues for each separate question.

So expect to hear from me a lot in the next months.

The code from the documentation does not use my GPU at all out of the box. After searching, I found that replacing using CUDA with using LuxCUDA solves the issue. Are the two packages equivalent or is there something wrong with my system?

That was the v0.5 breaking change. We also needed cuDNN (and, for a brief moment, CUDAKernels.jl via KernelAbstractions), so LuxCUDA hosts all the dependencies required to trigger CUDA compatibility.

FWIW, currently the following message is printed if you try to use gpu_device() without loading the trigger packages:

julia> using LuxDeviceUtils

julia> gpu_device()
┌ Warning: No functional GPU backend found! Defaulting to CPU.
│ 
│ 1. If no GPU is available, nothing needs to be done.
│ 2. If GPU is available, load the corresponding trigger package.
└ @ LuxDeviceUtils /mnt/julia/packages/LuxDeviceUtils/rMeCf/src/LuxDeviceUtils.jl:158
(::LuxCPUDevice) (generic function with 5 methods)

(If helpful, I can list the names of the trigger packages in the warning itself.)
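For completeness, a minimal sketch of the intended usage once a trigger package is loaded (ps here stands in for the parameters from the MWE above; the exact device type printed may differ between Lux/LuxCUDA versions):

using LuxDeviceUtils, LuxCUDA   # LuxCUDA is the trigger package (it loads CUDA.jl and cuDNN)
dev = gpu_device()              # now returns a CUDA device instead of falling back to the CPU
ps_gpu = ps |> dev              # move the parameters to the GPU with the device object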

Hi @avikpal, thank you for your response.

Yes, this was precisely the warning that I was getting. Listing the trigger packages in the warning itself would be helpful. More importantly, though, in my opinion the NeuralPDE example in the docs should be updated to load the trigger packages. The Lux manual does load them in its GPU usage examples.

Can you help update the NeuralPDE docs there?

Possibly. What would I need to do?

I meant @avikpal, since it’s a breaking change; if he can make the docs update, I’ll finish it, but I’m not 100% sure if I need to do more.