Hello everyone. I am new to Julia, but somewhat experienced with physics-informed neural networks.
I have some questions about several features of NeuralPDE, their compatibility with GPUs, and some choices made in the example in the documentation here.
Specifically:

The code from the documentation does not use my GPU at all out of the box. After some searching, I found that replacing `using CUDA` with `using LuxCUDA` solves the issue. Are the two packages equivalent, or is there something wrong with my setup?
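For reference, here is the minimal check I would run to see whether the GPU is actually visible from each package (using `CUDA.functional()` and `LuxCUDA.functional()`, which both packages export):

```julia
using CUDA
@show CUDA.functional()      # true if CUDA.jl itself can use the GPU

using LuxCUDA
@show LuxCUDA.functional()   # true if the Lux GPU backend is usable
```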
If I choose the `GridTraining` strategy (as in the example), 100% of my GPU is used. However, if I choose `QuasiRandomTraining` with the default sampling algorithm, 0% (zero) of my GPU is used and all the calculations are carried out by a single CPU core. If I change the sampling algorithm to `SobolSample()`, my GPU is indeed used, but only at ~30%. Is there any reason for this behaviour?
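If it is relevant, my understanding is that `QuasiRandomTraining` defaults to `LatinHypercubeSample()` and, with `resampling = true`, redraws the training set on the CPU at every iteration. As a sketch (my own diagnostic, not from the docs), one can compare the per-iteration cost of the samplers themselves like this:

```julia
using QuasiMonteCarlo, BenchmarkTools

# The (t, x, y) domain of the MWE below: [0, 2]^3
lb, ub = zeros(3), 2 .* ones(3)

# Cost of drawing 500 quasi-random points with each sampling algorithm
@btime QuasiMonteCarlo.sample(500, $lb, $ub, LatinHypercubeSample())
@btime QuasiMonteCarlo.sample(500, $lb, $ub, SobolSample())
```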
In the example, after the network is trained, the following lines are used to evaluate the solution on a grid for plotting and analysis:
phi = discretization.phi
ts, xs, ys = [infimum(d.domain):0.1:supremum(d.domain) for d in domains]
u_real = [analytic_sol_func(t, x, y) for t in ts for x in xs for y in ys]
u_predict = [first(Array(phi(gpu([t, x, y]), res.u))) for t in ts for x in xs for y in ys]
If I understand correctly, each grid point (t, x, y) is moved to the GPU (where the parameters res.u also reside) and the solution is evaluated there. However, this is extremely slow. Instead, if I replace the last line with
u_predict = [first(Array(phi([t, x, y], cpu(res.u)))) for t in ts for x in xs for y in ys]
meaning that I move the parameters res.u to the CPU, I get a significant gain in speed. Is there any reason to prefer one over the other, and if not, why does the documentation use the slow version?
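Incidentally, a possible third option (a sketch on my part, assuming `phi` accepts a matrix of column-wise points, as Lux chains do) would be to keep everything on the GPU but evaluate the whole grid in a single batched call, instead of launching one small evaluation per point:

```julia
# Build a 3×N matrix whose columns are all (t, x, y) grid points,
# then evaluate phi once on the whole batch.
grid = reduce(hcat, [[t, x, y] for t in ts for x in xs for y in ys])
u_predict = vec(Array(phi(gpu(grid), res.u)))
```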
Below is an MWE, which is basically the example from the documentation with a few lines added regarding the points above (my alternatives are commented out):
using NeuralPDE, Lux, Random, ComponentArrays
using Optimization
using OptimizationOptimisers
import ModelingToolkit: Interval
using QuasiMonteCarlo
using BenchmarkTools
using CUDA # This does not recognise my GPU
# using LuxCUDA # This does
@parameters t x y
@variables u(..)
Dxx = Differential(x)^2
Dyy = Differential(y)^2
Dt = Differential(t)
t_min = 0.0
t_max = 2.0
x_min = 0.0
x_max = 2.0
y_min = 0.0
y_max = 2.0
# 2D PDE
eq = Dt(u(t, x, y)) ~ Dxx(u(t, x, y)) + Dyy(u(t, x, y))
analytic_sol_func(t, x, y) = exp(x + y) * cos(x + y + 4t)
# Initial and boundary conditions
bcs = [u(t_min, x, y) ~ analytic_sol_func(t_min, x, y),
u(t, x_min, y) ~ analytic_sol_func(t, x_min, y),
u(t, x_max, y) ~ analytic_sol_func(t, x_max, y),
u(t, x, y_min) ~ analytic_sol_func(t, x, y_min),
u(t, x, y_max) ~ analytic_sol_func(t, x, y_max)]
# Space and time domains
domains = [t ∈ Interval(t_min, t_max),
x ∈ Interval(x_min, x_max),
y ∈ Interval(y_min, y_max)]
# Neural network
inner = 25
chain = Chain(Dense(3, inner, Lux.σ),
Dense(inner, inner, Lux.σ),
Dense(inner, inner, Lux.σ),
Dense(inner, inner, Lux.σ),
Dense(inner, 1))
# strategy = GridTraining(0.05) #This uses 100% of my GPU
strategy = NeuralPDE.QuasiRandomTraining(500, resampling=true) #This uses 0% of my GPU
# strategy = NeuralPDE.QuasiRandomTraining(500, sampling_alg = SobolSample(), resampling=true) #This uses 30% of my GPU
ps = Lux.setup(Random.default_rng(), chain)[1]
ps = ps |> ComponentArray |> gpu .|> Float64
discretization = PhysicsInformedNN(chain,
strategy,
init_params = ps)
@named pde_system = PDESystem(eq, bcs, domains, [t, x, y], [u(t, x, y)])
prob = discretize(pde_system, discretization)
symprob = symbolic_discretize(pde_system, discretization)
callback = function (p, l)
println("Current loss is: $l")
return false
end
res = Optimization.solve(prob, Adam(0.01); callback = callback, maxiters = 100)
phi = discretization.phi
ts, xs, ys = [infimum(d.domain):0.1:supremum(d.domain) for d in domains]
u_real = [analytic_sol_func(t, x, y) for t in ts for x in xs for y in ys]
u_predict = [first(Array(phi(gpu([t, x, y]), res.u))) for t in ts for x in xs for y in ys] #This is slow
# u_predict = [first(Array(phi([t, x, y], cpu(res.u)))) for t in ts for x in xs for y in ys] #This is faster
@btime u_predict = [first(Array(phi(gpu([t, x, y]), res.u))) for t in ts for x in xs for y in ys]
@btime u_predict = [first(Array(phi([t, x, y], cpu(res.u)))) for t in ts for x in xs for y in ys]
And some additional info about my setup:
(@v1.9) pkg> st
Status `~/.julia/environments/v1.9/Project.toml`
[c29ec348] AbstractDifferentiation v0.5.3
[6e4b80f9] BenchmarkTools v1.3.2
[052768ef] CUDA v5.0.0
⌃ [13f3f980] CairoMakie v0.10.10
⌅ [b0b7db55] ComponentArrays v0.14.1
[587475ba] Flux v0.14.6
[59287772] Formatting v0.4.2
[f6369f11] ForwardDiff v0.10.36
⌃ [e9467ef8] GLMakie v0.8.10
[033835bb] JLD2 v0.4.35
[b964fa9f] LaTeXStrings v1.3.0
⌃ [b2108857] Lux v0.5.5
[d0bbae9a] LuxCUDA v0.3.1
[dde8697e] MakiePublication v0.3.2
⌃ [961ee093] ModelingToolkit v8.70.0
⌃ [315f7962] NeuralPDE v5.8.0
⌅ [3bd65402] Optimisers v0.2.20
[7f7a1694] Optimization v3.19.1
⌃ [36348300] OptimizationOptimJL v0.1.9
⌃ [42dfb2eb] OptimizationOptimisers v0.1.5
[91a5bcdd] Plots v1.39.0
[d330b81b] PyPlot v2.11.2
⌅ [8a4e6c94] QuasiMonteCarlo v0.2.19
[e88e6eb3] Zygote v0.6.65
[02a925ec] cuDNN v1.2.0
nvidia-smi
Wed Oct 11 19:51:23 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06    Driver Version: 525.125.06    CUDA Version: 12.0   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:2E:00.0 Off |                  N/A |
| N/A   46C    P8     2W /  60W |    351MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1378      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A    306393      C   ...ros/julia1.9.3/bin/julia       344MiB |
+-----------------------------------------------------------------------------+
Sorry for the long post, but I wanted to be as detailed as possible.
Thank you in advance for your help.