GPU compilation error when using the Lux framework in NeuralPDE

ERROR: LoadError: GPU compilation of MethodInstance for (::GPUArrays.var"#broadcast_kernel#26")(::CUDA.CuKernelContext, ::CuDeviceMatrix{Float32, 1}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(-), Tuple{Base.Broadcast.Extruded{Matrix{Float32}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{LinearAlgebra.Adjoint{Float32, CuDeviceVector{Float32, 1}}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, ::Int64) failed
KernelError: passing and using non-bitstype argument

Argument 4 to your kernel function is of type Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(-), Tuple{Base.Broadcast.Extruded{Matrix{Float32}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{LinearAlgebra.Adjoint{Float32, CuDeviceVector{Float32, 1}}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, which is not isbits:
.args is of type Tuple{Base.Broadcast.Extruded{Matrix{Float32}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{LinearAlgebra.Adjoint{Float32, CuDeviceVector{Float32, 1}}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}} which is not isbits.
.1 is of type Base.Broadcast.Extruded{Matrix{Float32}, Tuple{Bool, Bool}, Tuple{Int64, Int64}} which is not isbits.
.x is of type Matrix{Float32} which is not isbits.

# Function to calculate the Data loss
function loss_data(phi,θ,p_,coord,data,depVars)
    # First data frame corresponding to the velocity data
    #println(θ.depvar[:U])
    ld = sum([Statistics.mean(abs2,phi[i](coord,θ.depvar[depVars[i]]) .- data[i,:]')/Statistics.mean(abs2,data[i,:]) for i ∈ 1:8])
    return ld
end

depVars = [:U, :V, :Θ, :u′u′_mean, :u′v′_mean, :v′v′_mean, :u′θ′_mean, :v′θ′_mean, :P]
loss_data_(phi,θ,p_) = loss_data(phi,θ,p_,coord,data,depVars)

# Discretization using PhysicsInformedNN constructor
discretization = PhysicsInformedNN(chain, strategy, init_params = ps, additional_loss = loss_data_, adaptive_loss = adaptiveLossWeights)

The error says that the fourth argument to the kernel, which comes from the data loss function, is not isbits because of a broadcasting problem. What exactly is the problem? Can somebody help me out, please?

The specific GPU compilation error mentioned here comes from the line "phi[i](coord,θ.depvar[depVars[i]]) .- data[i,:]'" in the data loss. @avikpal, @ChrisRackauckas, the two terms of the broadcasted subtraction are not both on the GPU. Why is this the case? When I print the individual components that I'm subtracting, each one looks fine on its own. Please help me out.

One of the inputs to your broadcast is a CPU array.
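
For illustration, here is a minimal, self-contained sketch (not from the original post) that reproduces this kind of mixed-device broadcast and shows that putting both operands on the same device resolves it:

using CUDA

a = rand(Float32, 8, 100)        # CPU Matrix, like a prediction computed on the host
b = CUDA.rand(Float32, 100)      # GPU vector, like a data column already on the device

# a .- b'                        # errors: the CPU Matrix is not isbits inside the GPU kernel
cu(a) .- b'                      # works: both operands now live on the GPU
Array(a) .- Array(b)'            # also works: both operands on the CPU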

Why is it that phi[i](coord,θ.depvar[depVars[i]]) is not transferred to the GPU automatically when using a Lux chain? I see that the coord variable is moved to the GPU anyway. The NeuralPDE documentation for GPU usage needs a major overhaul in terms of clarity and usage examples. Also, I don't see a significant performance boost when using the GPU with the NeuralPDE framework. I'm only putting the initial parameters and the data on the GPU. The problem seems to be that in the subtraction, data[i,:] is on the GPU but the prediction mentioned above is on the CPU, hence the error. Why is this the case and how can I solve it? Please, somebody help me out.
You can review the code here:

# Creating the network architecture (same for both the GPU and CPU case). Note: there is no need to put the network itself on the GPU;
# for GPU processing with Lux, only the initial parameters need to be on the GPU.

#Random.seed!(123)
if (length(numLayers) != 1) && (length(numNeurons) != 1)
    chain = [eval(Meta.parse("Lux.Chain(Lux.Dense($input,$(numNeurons[i]),activation),"*("Lux.Dense($(numNeurons[i]),$(numNeurons[i]),activation),"^numLayers[i])*"Lux.Dense($(numNeurons[i]),1))")) for i in 1:9] 
else
    chain = [eval(Meta.parse("Lux.Chain(Lux.Dense($input,$numNeurons,activation),"*("Lux.Dense($numNeurons,$numNeurons,activation),"^numLayers)*"Lux.Dense($numNeurons,1))")) for _ in 1:9]  
end 
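
As a side note, the chains can also be built programmatically without eval/Meta.parse; a rough sketch under the same assumptions (input, numNeurons, numLayers, and activation defined as above, one chain per dependent variable):

# Build one Lux.Chain per output; hidden width and depth may differ per chain.
function build_chain(input, width, depth, activation)
    hidden = [Lux.Dense(width, width, activation) for _ in 1:depth]
    return Lux.Chain(Lux.Dense(input, width, activation), hidden..., Lux.Dense(width, 1))
end

chain = if (length(numLayers) != 1) && (length(numNeurons) != 1)
    [build_chain(input, numNeurons[i], numLayers[i], activation) for i in 1:9]
else
    [build_chain(input, numNeurons, numLayers, activation) for _ in 1:9]
end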

# If the precision parameter is Float64, convert the initial parameters to Float64
if precision == Float64
    if useGPU
        ps = Lux.setup(Random.default_rng(),chain)[1] 
        ps = ps |> ComponentArray |> device .|> Lux.f64
    else
        ps = Lux.setup(Random.default_rng(),chain)[1] 
        ps = ps |> ComponentArray .|> Lux.f64
    end
else
    if useGPU
        ps = Lux.setup(Random.default_rng(),chain)[1] 
        ps = ps |> ComponentArray |> device .|> Lux.f32
    else
        ps = Lux.setup(Random.default_rng(),chain)[1] 
        ps = ps |> ComponentArray .|> Lux.f32
    end
end
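
A quick sanity check right after this block can save a lot of confusion later; a minimal sketch (assuming CUDA is loaded and that getdata, the ComponentArrays accessor for the underlying flat buffer, is available):

using ComponentArrays, CUDA

# The flat parameter buffer should be a CuArray when useGPU is true;
# if it is a plain Array, every network evaluation will run on the CPU.
if useGPU
    @assert getdata(ps) isa CuArray "Initial parameters are still on the CPU"
end
println(typeof(ps))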
#######################################################################################################################################################
println("The network was created successfully")
println("####################################################################################################")
#######################################################################################################################################################

# Function to calculate the Data loss
function loss_data(phi,θ,p_,coord,data,depVars)
    # First data frame corresponding to the velocity data
    #println(θ.depvar[:U])
    println(typeof(coord))                                  # which device are the inputs on?
    println(typeof(phi[1](coord,θ.depvar[depVars[1]])))     # which device is the prediction on?
    println(typeof(θ))                                      # which device are the parameters on?
    exit()                                                  # stop here while inspecting the types
    ld = sum([Statistics.mean(abs2,phi[i](coord,θ.depvar[depVars[i]]) .- data[i,:]')/Statistics.mean(abs2,data[i,:]) for i ∈ 1:8])
    return ld
end

depVars = [:U, :V, :Θ, :u′u′_mean, :u′v′_mean, :v′v′_mean, :u′θ′_mean, :v′θ′_mean, :P]
loss_data_(phi,θ,p_) = loss_data(phi,θ,p_,coord,data,depVars)
#######################################################################################################################################################
println("The data loss function is defined successfully")
println("####################################################################################################")
# Create the training strategy and adaptive loss weighting strategy
# Define the weights associated with the losses
if !useGPU
    strategy = NeuralPDE.QuadratureTraining()
elseif useGPU && (trainingStrategy == "QuadratureTraining")
    error("Cannot use Quadrature training with GPU processing! Please select either GridTraining or QuasiRandomTraining strategy")
elseif useGPU && (trainingStrategy == "QuasiRandomTraining")
    strategy = NeuralPDE.QuasiRandomTraining(numTrainPoints,sampling_alg = LatinHypercubeSample())
else
    error("Please choose from Quadrature, QuasiRandomTraining strategies")
end

# Giving equal weights to both Physics and BC's
if !adaptive_weights
    ζ = (1-α)
    adaptiveLossWeights = NeuralPDE.NonAdaptiveLoss(pde_loss_weights = α, bc_loss_weights = α, additional_loss_weights = ζ)
elseif adaptive_weights && adaptive_weight_alg == "MinMax"
    α = 1.0
    δ = 1.0
    ζ = 1.0
    adaptiveLossWeights = NeuralPDE.MiniMaxAdaptiveLoss(reweight_freq, pde_max_optimiser = Flux.ADAM(1e-3), bc_max_optimiser = Flux.ADAM(1e-3), pde_loss_weights = α, bc_loss_weights = δ, additional_loss_weights = ζ)
elseif adaptive_weights && adaptive_weight_alg == "Gradient"
    α = 1.0
    δ = 1.0
    ζ = 1.0
    adaptiveLossWeights = NeuralPDE.GradientScaleAdaptiveLoss(reweight_freq, weight_change_inertia = 0.99, pde_loss_weights = α, bc_loss_weights = δ, additional_loss_weights = ζ)
end

discretization = PhysicsInformedNN(chain, strategy, init_params = ps, additional_loss = loss_data_, adaptive_loss = adaptiveLossWeights)

It seems that the initial parameters are moved to the GPU, but when I print typeof(θ) inside the loss, it shows the parameters back on the CPU, and the predictions from phi are computed on the CPU. Why is this the case?

Also, I use CUDA.allowscalar(true) in my program, which I think is causing massive slowdowns when using the GPU. @ChrisRackauckas, @maleadt, can you help me with this: how can I solve the problem without using scalar indexing on the GPU? I think the scalar indexing shows up in the data loss I define, and it is slowing down the computations on the GPU. What is the workaround for this?

allowscalar is there to protect you against silently falling back to generic methods that don't actually execute on the GPU, so you want it disabled if you care about performance. At the points where it errors, you need to figure out how to avoid the problematic code, e.g. by implementing the missing array abstractions in a GPU-compatible manner, or by slightly changing the code so that it invokes operations that are GPU-compatible.
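
For concreteness, a small sketch of that workflow (not from the original code): disable scalar fallbacks globally, prefer vectorized operations and reductions, and opt back in explicitly only where a scalar access is truly unavoidable:

using CUDA

CUDA.allowscalar(false)             # error loudly on slow scalar fallbacks

x = CUDA.rand(Float32, 1000)
# x[1] + x[2]                       # would throw a scalar-indexing error now
m = sum(abs2, x) / length(x)        # GPU-friendly reduction instead

first_val = CUDA.@allowscalar x[1]  # explicit, local opt-in where unavoidable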

@ChrisRackauckas and @maleadt, can you please help me restructure the code to use the GPU effectively? I also get the broadcasting error where one term of the subtraction is on the CPU and the other is on the GPU. This happens when I use the Lux framework. Please help me out with this issue.

Thanks in advance

Sorry, I’m not familiar with Lux.

Can somebody help me with the Lux framework, please? The network parameters that are passed to the data loss function are somehow on the CPU when using Lux, instead of being on the GPU. Why is this happening?
The other thing I wanted to ask: when the NeuralPDE package uses the GPU, is it free from scalar indexing issues? I believe it doesn't use scalar indexing, because if it did, it would run into performance problems. Can someone comment on this, please?

At least, can you tell me how this works for the Flux framework? Please.
Thanks a ton

Maybe the code you posted isn’t the full MWE, but I don’t see you loading any of the GPU glue packages that the Lux docs say are required for GPU support?
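
For reference, a minimal sketch of what that setup usually looks like (this assumes a recent Lux version where the LuxCUDA glue package and the gpu_device helper are available, and is not a drop-in fix for the code above):

using Lux, LuxCUDA, ComponentArrays, Random   # LuxCUDA is the CUDA "glue" package

gdev = gpu_device()                           # returns a CUDA device when one is functional

model = Lux.Chain(Lux.Dense(2, 16, tanh), Lux.Dense(16, 1))
ps, st = Lux.setup(Random.default_rng(), model)
ps = ComponentArray(ps) |> gdev               # parameters now live on the GPU

x = rand(Float32, 2, 32) |> gdev              # inputs must be moved as well
y, _ = model(x, ps, st |> gdev)               # forward pass runs on the GPU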

I don't understand why there is no improvement in performance when I use the GPU as opposed to the CPU. The ADAM step is fast on the CPU but very slow on the GPU.

Your code isn’t runnable. It doesn’t show the layer sizes. Last time you showed the layer sizes it was obvious why that was slower on GPU. Is it the same network?

Yes it is the same network…

Okay, and you recently tested why a length-20 array wouldn't be faster on the GPU, if I'm not mistaken? And what is numTrainPoints?

using CUDA, BenchmarkTools
A = rand(Float32,20,20)
B = rand(Float32,20,1000)
gA = cu(A)
gB = cu(B)

@btime A*B
@btime gA * gB
  122.800 μs (2 allocations: 78.17 KiB)
  18.400 μs (30 allocations: 896 bytes)

Okay so at this size, quasi-random sampling can be a little faster (need to account for data exchange overhead). There’s your smoke test.
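
One way to fold that overhead into the timing above (a sketch, assuming the same A, B, gA, gB as in the benchmark): synchronize the device so the kernel actually finishes, and optionally time the host-to-device copies as well:

@btime CUDA.@sync $gA * $gB          # wait for the GPU kernel to complete
@btime CUDA.@sync cu($A) * cu($B)    # include the host-to-device transfer cost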

Now, do you have an MWE of what you’re running into? Your other code has hundreds of irrelevant options. What’s the core of it? The simplest form of a runnable example?