I’ve been trying to exploit GPU capabilities for solving a NeuralPDE with GalacticOptim.solve, but I’ve been stumbling into a problem related to scalar operations on GPU arrays.
At line 123 I use the |> gpu operator, and at line 179, with CUDA.allowscalar(true), I get:
Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with allowscalar(false)
Indeed, if I set CUDA.allowscalar(false), I get an error:
`scalar getindex is disallowed`
Is there a way to exploit GPU’s speedup with these settings?
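For reference, the error itself is easy to reproduce outside NeuralPDE (a minimal example of my own, not taken from my actual code):

using CUDA

CUDA.allowscalar(false)
x = CUDA.rand(Float32, 4)
x[1]    # ERROR: scalar getindex is disallowed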
There’s no ‘setting’ to automatically speed these operations up. You need to avoid the problematic calls and replace them with something vectorized that the array infrastructure can work with.
Hi, thanks for posting the explanation about this on the home page. I have come across the same issue. As you mentioned in that example, we need to avoid indexing a CuArray.
This is just an example I am testing: I have a whole implementation that works fine on the CPU but gets the “scalar getindex is disallowed” error when I set CUDA.allowscalar(false) in the GPU version. Thanks in advance!
If you want to mutate the array, use a view. If you actually want to access a single element, scalar indexing is fine, and you can allow it locally using @allowscalar.
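For example (a minimal sketch, independent of your code):

using CUDA

x = CUDA.rand(Float32, 10)

# mutate a slice: take a view and broadcast into it, no scalar indexing needed
view(x, 1:3) .= 0f0

# read a single element: allow scalar indexing locally, just for this access
val = CUDA.@allowscalar x[1]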
Thanks for pointing that out. It’s okay, this is just a simple example, not exactly what I want.
I think I can give @allowscalar a try. But will it slow the whole program down significantly? (Maybe that is a task-dependent question and we can only tell after I try it out.)
No. It’s a slow operation because it’s a memory copy, so don’t use it in a loop to process items one by one. But if you only need to fetch a single item after a bunch of operations, it’s fine.
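Roughly (an illustrative sketch only):

using CUDA

x = CUDA.rand(Float32, 1000)

# slow: one device-to-host copy per element
# s = 0f0
# for i in 1:length(x)
#     s += CUDA.@allowscalar x[i]
# end

# fine: do the work with a vectorized reduction on the GPU, then fetch the single result
s = sum(x)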
Thanks again for your reply! Then I think what I am doing is fetching it item by item, which, as you said, is going to be slow.
Actually, I am trying to implement the DiffEqFlux example from https://diffeqflux.sciml.ai/dev/examples/exogenous_input/ on the GPU. In this case I have an input which is an exogenous signal, and I would like to feed it into the ODE layer at each time instant, so I modified the dudt function to be:
function dudt(u, p, t)
    #in_vect = vcat(u[1])
    #nn_model(gpu(in_vect), p)
    #nn_dudt(vcat(u,CUDA.@allowscalar(ex[1])), p)
    nn_dudt(vcat(u, CUDA.@allowscalar(ex[Int(round(t*10))])), p)
end
When I do @allowscalar as we just discussed, the error still occurs. Do you have any suggestions on that? Thank you very much.
This is the code on the tutorial page:
function dudt(u, p, t)
    #input_val = u_vals[Int(round(t*10)+1)]
    nn_model(vcat(u[1], ex[Int(round(10*0.1))]), p)
end
Yeah, this is not going to work like that. You should use vectorized functions, or custom kernels, that perform these operations on the GPU. Unless dudt is called in a kernel, of course, but that isn’t the case here or you wouldn’t run into the scalar iteration error on the host. @chrisrackauckas will know more.
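One workaround that sidesteps the GPU indexing entirely (just a sketch, assuming ex is a Float32 vector that can stay on the host and that the index arithmetic matches your time grid; the ex_cpu name is only illustrative) is to look up the exogenous value on the CPU and only assemble the GPU input afterwards:

ex_cpu = Array(ex)   # plain CPU vector; scalar indexing here is cheap

function dudt(u, p, t)
    ex_val = ex_cpu[Int(round(t * 10)) + 1]      # host-side scalar lookup, no CuArray indexing
    nn_dudt(vcat(u, CUDA.fill(ex_val, 1)), p)    # vcat of two CuVectors stays on the GPU
end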
Thanks for replying. I have just changed it to make it a CuArray (as seen in the code below), but there is still a scalar indexing error.
When you say vcat() won’t run on the GPU, do you mean that I need to use other functions that can perform this operation?
function dudt(u, p, t)
    #in_vect = vcat(u[1])
    #nn_model(gpu(in_vect), p)
    #nn_dudt(vcat(u,CUDA.@allowscalar(ex[1])), p)
    i = vcat(u[1], CUDA.@allowscalar(ex[Int(round(10*0.1))])) |> gpu
    nn_dudt(i, p)
end
Yeah, I think so. As suggested in the documentation, if we want to use FastChain, both u0 and p need to be on the GPU. Then I would suppose the output after solving the ODE problem will also be a CuArray (I used gpu(solve(…))). The relevant code is shown below.
This is actually the same as what you posted in the tutorial; I only modified it a bit to make it a GPU version. The error only occurs on the line with vcat().
ode_data = gpu(Float32.(hammerstein_system(ex)))

nn_dudt = FastChain(
    FastDense(2, 8, tanh),
    FastDense(8, 1))

u0 = Float32[0.0] |> gpu
p = initial_params(nn_dudt) |> gpu

#dudt2_(u, p, t) = dudt2(u, p)
function dudt(u, p, t)
    #in_vect = vcat(u[1])
    #nn_model(gpu(in_vect), p)
    #nn_dudt(vcat(u,CUDA.@allowscalar(ex[1])), p)
    i = vcat(u[1], CUDA.@allowscalar(ex[Int(round(10*0.1))])) |> gpu
    nn_dudt(i, p)
end

prob_gpu = ODEProblem(dudt, u0, tspan, nothing)

# Runs on a GPU
function predict_neuralode(p)
    _prob_gpu = remake(prob_gpu, p = p)
    gpu(solve(_prob_gpu, Tsit5(), saveat = tsteps, abstol = 1e-8, reltol = 1e-6))
end

function loss_neuralode(p)
    pred = predict_neuralode(p)
    N = length(pred)
    l = sum(abs2, ode_data[1:N] .- pred) / N
    return l, pred
end

res0 = DiffEqFlux.sciml_train(loss_neuralode, p, ADAM(0.01), maxiters = 10)
Also, I just tested to make sure u works fine by discarding the input ex signal. Only this part was changed, and it all works fine. I think the issue is still with ex[Int(round(10*0.1))] (even when @allowscalar is applied).
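One more thing that might be worth checking (an assumption on my side, not something I have verified): u[1] is also a scalar getindex on a CuArray, and it is not covered by the @allowscalar that only wraps ex[...]. A variant that wraps both accesses would be:

function dudt(u, p, t)
    # untested variant: cover both scalar accesses (u[1] and ex[...]) with one @allowscalar
    i = CUDA.@allowscalar(vcat(u[1], ex[Int(round(10*0.1))])) |> gpu
    nn_dudt(i, p)
end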
Hi, you can find my code here; this is where I came across the “scalar getindex is disallowed” issue. You can also find the CPU version of the code here. The CPU code works fine, and it is what I am expecting to get when running on the GPU. Thank you very much for your time, I really appreciate it!