Overcoming Slow Scalar Operations on GPU Arrays

Hi all,

I’ve been trying to exploit GPU capabilities for solving a NeuralPDE with GalacticOptim.solve, but I’ve been stumbling into a problem related to scalar operations on GPU arrays.

On line 123 I use the |> gpu operator, and on line 179, with CUDA.allowscalar(true) set, I get:

Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with allowscalar(false)

Indeed, if I set CUDA.allowscalar(false) I get an error:
`scalar getindex is disallowed`

Is there a way to exploit GPU’s speedup with these settings?

Here you can find my code, thanks 🙂

Copying my reply from Slack for archival purposes:

this is too large of an example to have a quick look at

see https://juliagpu.gitlab.io/CUDA.jl/usage/workflow/#UsageWorkflowScalar; scalar indexing is something you want to avoid because it kills performance

there’s no ‘setting’ to automatically speed these operations up. you need to avoid the problematic calls and replace them by something vectorized the array infrastructure can work with
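To illustrate, a minimal sketch (hypothetical function names; shown with plain Arrays, but the broadcast form is exactly the kind of vectorized call that runs as a single kernel on a CuArray):

```julia
# Element-by-element indexing: on a CuArray, every x[i] is a separate
# host<->device round trip, which is what triggers the scalar warning.
function scale_loop!(y, x)
    for i in eachindex(x)
        y[i] = 2 * x[i] + 1
    end
    return y
end

# Vectorized form: one fused broadcast, which CUDA.jl compiles to a
# single GPU kernel when x and y are CuArrays.
scale_broadcast!(y, x) = (y .= 2 .* x .+ 1)
```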


Hi, thanks for adding the explanation about this to the documentation. I have come across the same issue. As you mentioned in that example, we need to avoid indexing a CuArray.

However, if I only want to add 1 to the second element of the CuArray, what should I do? I tried a broadcast like a[2] .+ 1.

This is just an example I am testing, since I have a whole implementation that works fine on the CPU but gives the “scalar getindex is disallowed” error when I set CUDA.allowscalar(false) for the GPU version. Thanks in advance!

Cheers.

That broadcast doesn’t do what you expect:

julia> a = ones(3)
3-element Array{Float64,1}:
 1.0
 1.0
 1.0

julia> a[2] .+ 1
2.0

julia> a
3-element Array{Float64,1}:
 1.0
 1.0
 1.0

If you want to mutate the array, use a view. If you actually want to access a single element, scalar indexing is fine, and you can allow it locally using @allowscalar.
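For example (shown on a plain Array so it runs anywhere; the same lines work on a CuArray without triggering scalar indexing):

```julia
a = ones(3)

# Mutate only the second element through a one-element view; on a CuArray
# this broadcast runs as a kernel instead of a scalar setindex!.
view(a, 2:2) .+= 1
# a is now [1.0, 2.0, 1.0]

# Reading a single element is fine when done deliberately; on a CuArray,
# wrap it as CUDA.@allowscalar a[2].
b = a[2]
```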


Thanks for pointing that out. It’s okay; this is just a simple example, not exactly what I want.

I think I can give @allowscalar a try. But will it slow the whole program down significantly? (Maybe that is a task-dependent question and we can only tell after I try it out.)

No. It’s a slow operation because it’s a memory copy, so don’t use it in a loop to process item per item. But if you only need to fetch a single item after a bunch of operations, it’s fine.
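In other words, a pattern like this is fine (a sketch with plain Arrays; on a CuArray the final read would be wrapped in CUDA.@allowscalar):

```julia
# Do the heavy lifting with vectorized operations ...
x = rand(Float32, 1024)
y = tanh.(2 .* x)    # fused broadcast: one kernel on a CuArray
s = sum(y)           # reduction: also runs on the device

# ... and only fetch the one scalar you need at the end.
# On a CuArray: first_val = CUDA.@allowscalar y[1]
first_val = y[1]
```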

Thanks again for your reply! Then I think what I am doing is fetching items one by one, which, as you said, is going to be slow.
Actually, I am trying to implement the DiffEqFlux example from https://diffeqflux.sciml.ai/dev/examples/exogenous_input/ on the GPU. In this case I have an exogenous input signal that I would like to feed into the ODE layer at each time instant, so I modified the dudt function to be

function dudt(u, p, t)
    #in_vect = vcat(u[1])
    #nn_model(gpu(in_vect), p)
    #nn_dudt(vcat(u,CUDA.@allowscalar(ex[1])), p)
    nn_dudt(vcat(u,CUDA.@allowscalar(ex[Int(round(t*10))])), p)
end

When I use @allowscalar as we just discussed, the error still occurs. Do you have any suggestions? Thank you very much.

This is the code on the tutorial page:

function dudt(u, p, t)
    #input_val = u_vals[Int(round(t*10)+1)]
    nn_model(vcat(u[1], ex[Int(round(10*0.1))]), p)
end

Yeah, this is not going to work like that. You should use vectorized functions, or custom kernels, that perform these operations on the GPU. Unless dudt is called from within a kernel, of course, but that isn’t the case here or you wouldn’t run into the scalar indexing error on the host. @chrisrackauckas will know more.

You have to put the resulting array on the GPU. vcat(u[1], ex[Int(round(10*0.1))]) won’t be on the GPU.

Thanks for replying. I have just changed it to make it a CuArray (as seen in the code below), but there is still a scalar indexing error.
When you say vcat() won’t be on the GPU, do you mean that I need to use other functions to perform this operation?

function dudt(u, p, t)
    #in_vect = vcat(u[1])
    #nn_model(gpu(in_vect), p)
    #nn_dudt(vcat(u,CUDA.@allowscalar(ex[1])), p)
    i = vcat(u[1], CUDA.@allowscalar(ex[Int(round(10*0.1))])) |> gpu
    nn_dudt(i, p)
end

u is a CuArray too?

Yeah, I think so. As suggested in the documentation, if we want to use FastChain, both u0 and p need to be on the GPU. Then I would suppose the output after solving the ODE problem will also be a CuArray (I used gpu(solve(…))). The relevant code is shown below.

This is actually the same as what you posted in the tutorial. I only modified it a bit to make a GPU version. The error only occurs on the line with vcat().

ode_data = gpu(Float32.(hammerstein_system(ex)))
nn_dudt = FastChain(
                  FastDense(2, 8, tanh),
                  FastDense(8, 1))
u0 = Float32[0.0]|> gpu
p = initial_params(nn_dudt)|> gpu
#dudt2_(u, p, t) = dudt2(u,p)
function dudt(u, p, t)
    #in_vect = vcat(u[1])
    #nn_model(gpu(in_vect), p)
    #nn_dudt(vcat(u,CUDA.@allowscalar(ex[1])), p)
    i = vcat(u[1], CUDA.@allowscalar(ex[Int(round(10*0.1))])) |> gpu
    nn_dudt(i, p)
end
prob_gpu = ODEProblem(dudt, u0, tspan, nothing)
# Runs on a GPU
function predict_neuralode(p)
     _prob_gpu = remake(prob_gpu,p=p)
     gpu(solve(_prob_gpu, Tsit5(), saveat = tsteps, abstol = 1e-8, reltol = 1e-6))
end

function loss_neuralode(p)
    pred = predict_neuralode(p)
    N = length(pred)
    l = sum(abs2, ode_data[1:N] .- pred)/N
    return l, pred
end
res0 = DiffEqFlux.sciml_train(loss_neuralode,p ,ADAM(0.01), maxiters=10)

Also, I just tested to make sure u works fine by discarding the input ex signal. Only this part is changed, and it works all good. I think the issue is still with ex[Int(round(10*0.1))] (even when @allowscalar is applied).

nn_dudt = FastChain(
                  FastDense(1, 8, tanh),
                  FastDense(8, 1))
u0 = Float32[0.0]|> gpu
p = initial_params(nn_dudt)|> gpu
#dudt2_(u, p, t) = dudt2(u,p)
function dudt(u, p, t)
    #in_vect = vcat(u[1])
    #nn_model(gpu(in_vect), p)
    nn_dudt(vcat(u), p)
end

Can you give a code I can copy-paste to run? I don’t know what you’re actually running and won’t be able to recreate it myself without that.

Hi, you can find my code here; this is where I came across the `scalar getindex is disallowed` issue. You can also find the CPU version of the code here. The CPU code works alright, and it is what I expect to get when running on the GPU. Thank you very much for your time; I really appreciate it!

Cheers.

Note I still plan to look at this, but for the holidays I have been away from GPUs.

Here u is also indexed by a scalar? Isn’t it on the GPU?

Not a problem. Thank you very much for your help!!

Yeah, that’s what I thought before as well. But the error shows that it is still on the CPU.

@ayaoyao214 your issue is that you are the one directly indexing a GPU array with a scalar 😉.

ex = gpu(ex)

and then

ex[Int(round(t*10))]

so there you go. You’d need CUDA.@allowscalar(ex[...]), or, since it’s only ever scalar indexing, keep ex on the CPU.
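A sketch of the second option (untested on the full model; keeps the signal on the host, since dudt only ever reads one scalar from it per call, and uses CUDA.fill to put that scalar back on the device for the vcat):

```julia
# Keep the exogenous signal on the CPU: one upfront device->host copy,
# then plain CPU scalar indexing inside dudt, which emits no warning.
ex_cpu = Array(ex)

function dudt(u, p, t)
    val = ex_cpu[Int(round(t * 10))]
    # Promote the scalar to a 1-element CuArray so vcat(u, ...) stays a
    # GPU operation rather than falling back to a generic CPU method.
    nn_dudt(vcat(u, CUDA.fill(val, 1)), p)
end
```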

Now on the CPU, your issue turns out to be an instance of the bug Mixed CPU+GPU adjoints · Issue #401 · SciML/DiffEqFlux.jl · GitHub, but I hope to get that fixed this week. @dhairyagandhi96