Overcoming Slow Scalar Operations on GPU Arrays

Hi all,

I’ve been trying to exploit GPU capabilities for solving a NeuralPDE with GalacticOptim.solve, but I’ve been running into a problem related to scalar operations on GPU arrays.

On line 123 I use the |> gpu operator, and on line 179, with CUDA.allowscalar(true), I get:

Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with allowscalar(false)

Indeed, if I set CUDA.allowscalar(false) I get an error:
`scalar getindex is disallowed`

Is there a way to exploit GPU’s speedup with these settings?

Here you can find my code, thanks :slightly_smiling_face:

Copying my reply from Slack for archival purposes:

this is too large of an example to have a quick look at

see https://juliagpu.gitlab.io/CUDA.jl/usage/workflow/#UsageWorkflowScalar, scalar indexing is something you want to avoid because it kills performance

there’s no ‘setting’ to automatically speed these operations up. you need to avoid the problematic calls and replace them by something vectorized the array infrastructure can work with
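
for example, replacing an element-wise loop with a broadcast that runs as a single kernel on the device (a minimal sketch, not taken from the code linked above):

using CUDA

a = CUDA.rand(1024)

# this indexes the GPU array one element at a time and triggers the warning
# for i in eachindex(a)
#     a[i] = 2 * a[i]
# end

# the vectorized equivalent runs as one broadcast kernel on the GPU
a .= 2 .* a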


Hi, thanks for uploading the explanation about this on the home page. I have come across the same issue here. As you mentioned in this example, we need to avoid indexing a CuArray.

However, if I only want to add 1 to the second element of the CuArray, what should I do?
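
What I’m testing is roughly this (a minimal sketch; a just stands in for my actual CuArray):

using CUDA

a = cu(ones(3))   # 3-element CuVector{Float32}
a[2] .+ 1         # hoping this adds 1 to the second element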


This is just a simple example I am testing, since I have a whole implementation that works fine on CPU but gives the “scalar getindex is disallowed” error when I set CUDA.allowscalar(false) in the GPU version. Thanks in advance!

Cheers.

That broadcast doesn’t do what you expect:

julia> a = ones(3)
3-element Array{Float64,1}:
 1.0
 1.0
 1.0

julia> a[2] .+ 1
2.0

julia> a
3-element Array{Float64,1}:
 1.0
 1.0
 1.0

If you want to mutate the array, use a view. If you actually want to access a single element, scalar indexing is fine, and you can allow it locally using @allowscalar.
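
For example (a minimal sketch, assuming CUDA.jl is loaded):

using CUDA

a = CUDA.ones(3)

# mutate a single element without scalar indexing: broadcast onto a one-element view
a[2:2] .+= 1             # same as view(a, 2:2) .+= 1, runs as a kernel

# read a single element back: allow the scalar access explicitly and locally
CUDA.@allowscalar a[2]   # 2.0f0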


Thanks for pointing that out. It’s okay; this was just a simple example, not exactly what I want.

I think I can try using @allowscalar. But will it slow the whole program down significantly? (Maybe that is a task-dependent question and we can only tell after I try it out.)

No. It’s a slow operation because it’s a memory copy, so don’t use it in a loop to process items one by one. But if you only need to fetch a single item after a bunch of operations, it’s fine.
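
In other words (a sketch):

using CUDA

a = CUDA.rand(10_000)

# fine: all the work happens on the GPU, then a single value is copied back
s = sum(a)                     # GPU reduction, returns a host scalar
x = CUDA.@allowscalar a[end]   # one device-to-host copy

# not fine: one tiny copy per loop iteration
# s = 0f0
# for i in eachindex(a)
#     s += CUDA.@allowscalar a[i]
# end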

Thanks again for your reply! Then I think what I am doing is fetching items one by one, which, as you said, is going to be slow.
Actually, I am trying to implement the DiffEqFlux example from https://diffeqflux.sciml.ai/dev/examples/exogenous_input/ on the GPU. In this case I have an input that is an exogenous signal, and I would like to feed it into the ODE layer at each time instant, so I modified the dudt function to be:

function dudt(u, p, t)
    #in_vect = vcat(u[1])
    #nn_model(gpu(in_vect), p)
    #nn_dudt(vcat(u,CUDA.@allowscalar(ex[1])), p)
    nn_dudt(vcat(u,CUDA.@allowscalar(ex[Int(round(t*10))])), p)
end

When I do @allowscalar as we just discussed, the error still occurs. Do you have any suggestions on that? Thank you very much.

This is the code on the tutorial page:

function dudt(u, p, t)
    #input_val = u_vals[Int(round(t*10)+1)]
    nn_model(vcat(u[1], ex[Int(round(10*0.1))]), p)
end

Yeah, this is not going to work like that. You should use vectorized functions, or custom kernels, that perform these operations on the GPU. Unless dudt is called in a kernel, of course, but that isn’t the case here or you wouldn’t run into the scalar indexing error on the host. @chrisrackauckas will know more.

You have to put the resulting array on the GPU. vcat(u[1], ex[Int(round(10*0.1))]) won’t be on the GPU.
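
That is, roughly this happens (a sketch, with made-up stand-ins for u and ex):

using CUDA

u  = cu(Float32[0.0])         # stand-in for the state
ex = Float32[0.5, 0.7, 0.9]   # stand-in exogenous signal, living on the CPU

# the scalars pulled out of the arrays are plain Float32s, so vcat builds a CPU Vector
i_cpu = vcat(CUDA.@allowscalar(u[1]), ex[2])   # Vector{Float32}

# it has to be moved to the GPU explicitly before being fed to the network
i_gpu = cu(i_cpu)                              # CuVector{Float32}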

Thanks for replying. I have just changed it to make it a CuArray (as seen in the code below), but there is still a scalar indexing error.
When you say vcat() won’t be on the GPU, do you mean that I need to use other functions that can perform this operation?

function dudt(u, p, t)
    #in_vect = vcat(u[1])
    #nn_model(gpu(in_vect), p)
    #nn_dudt(vcat(u,CUDA.@allowscalar(ex[1])), p)
    i = vcat(u[1], CUDA.@allowscalar(ex[Int(round(10*0.1))])) |> gpu
    nn_dudt(i, p)
end

Is u a CuArray too?

Yeah, I think so. As suggested in the documentation, if we want to use FastChain, both u0 and p need to be on the GPU. Then I would suppose the output after solving the ODE problem will also be a CuArray (I used gpu(solve(…))). The relevant code is shown below.

This is actually the same as what you posted in the tutorial; I only modified it a bit to make it a GPU version. The error only occurs on the line with vcat().

ode_data = gpu(Float32.(hammerstein_system(ex)))
nn_dudt = FastChain(
                  FastDense(2, 8, tanh),
                  FastDense(8, 1))
u0 = Float32[0.0] |> gpu
p = initial_params(nn_dudt) |> gpu
#dudt2_(u, p, t) = dudt2(u,p)
function dudt(u, p, t)
    #in_vect = vcat(u[1])
    #nn_model(gpu(in_vect), p)
    #nn_dudt(vcat(u,CUDA.@allowscalar(ex[1])), p)
    i = vcat(u[1], CUDA.@allowscalar(ex[Int(round(10*0.1))])) |> gpu
    nn_dudt(i, p)
end
prob_gpu = ODEProblem(dudt, u0, tspan, nothing)
# Runs on a GPU
function predict_neuralode(p)
    _prob_gpu = remake(prob_gpu, p = p)
    gpu(solve(_prob_gpu, Tsit5(), saveat = tsteps, abstol = 1e-8, reltol = 1e-6))
end

function loss_neuralode(p)
    pred = predict_neuralode(p)
    N = length(pred)
    l = sum(abs2, ode_data[1:N] .- pred) / N
    return l, pred
end
res0 = DiffEqFlux.sciml_train(loss_neuralode, p, ADAM(0.01), maxiters = 10)

Also, I just tested to make sure u works fine by discarding the input ex signal. Only this part is changed, and it all works fine. I think the issue is still with ex[Int(round(10*0.1))] (even when @allowscalar is applied).

nn_dudt = FastChain(
                  FastDense(1, 8, tanh),
                  FastDense(8, 1))
u0 = Float32[0.0] |> gpu
p = initial_params(nn_dudt) |> gpu
#dudt2_(u, p, t) = dudt2(u,p)
function dudt(u, p, t)
    #in_vect = vcat(u[1])
    #nn_model(gpu(in_vect), p)
    nn_dudt(vcat(u), p)
end

Can you give me code I can copy-paste and run? I don’t know what you’re actually running, and I won’t be able to recreate it myself without that.

Hi, you can find my code here; this is where I came across the “scalar getindex is disallowed” issue. You can also find the CPU version of the code here. The CPU code works fine, and it is what I am expecting to get when running on the GPU. Thank you very much for your time; I really appreciate it!

Cheers.

Note I still plan to look at this, but for the holidays I have been away from GPUs.

Is u also indexed by a scalar here? Isn’t it on the GPU?

Not a problem. Thank you very much for your help!!

Yeah, that’s what I thought before as well. But the error shows that it is still on the CPU.

@ayaoyao214 your issue is that you are the one directly indexing a GPU array with a scalar :wink:.

ex = gpu(ex)

and then

ex[Int(round(t*10))]

so there you go. You’d need to CUDA.@allowscalar(ex[...]), or, since it’s just scalar indexing, put it on the CPU.
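
i.e. either of these should work (a sketch; the ex and t values here are just stand-ins):

using CUDA
using Flux: gpu

ex = Float32.(0.1:0.1:1.0)   # stand-in exogenous signal
t  = 0.3f0                   # stand-in solver time

# option 1: keep ex on the GPU and mark the single lookup explicitly
ex_gpu = gpu(ex)
val = CUDA.@allowscalar ex_gpu[Int(round(t * 10))]

# option 2: it is only ever a scalar read, so just keep ex on the CPU
val = ex[Int(round(t * 10))]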

Now on the CPU, your issue turns out to be an instance of the bug https://github.com/SciML/DiffEqFlux.jl/issues/401, but I hope to get that fixed this week. @dhairyagandhi96