Where do scalar-valued results of GPU operations go?

Hello everyone. It does not seem possible to push a pure scalar to the GPU, but many common linear algebra operations result in scalars - most prominently, the inner product. So where do these scalars live? Do they go back to the CPU?

I ask because a performance-critical core of my current application relies on combining such scalar results and other GPU arrays. I can only assume these scalars being sent back and forth between CPU and GPU would hurt performance. So should I be worried about things of the following kind?

using Flux, CUDA, LinearAlgebra 


n = 3
v1 = rand(Float32,n) |> gpu
v2 = rand(Float32,n)'|> gpu #Adjoint
v3 = rand(Float32,n) |> gpu
v4 = rand(Float32,n) |> gpu

v2*v1*v3 + v4 #scalar Ă— vector + vector 
sum(v1)*v3 + v4 #scalar Ă— vector + vector 

I think you are correct, and that you can expect that a type determines where the instances can be stored. This is what CUDA.jl’s docs section Scalar indexing says:

Many array operations in Julia are implemented using loops, processing one element at a time. Doing so with GPU arrays is very ineffective, as the loop won’t actually execute on the GPU, but transfer one element at a time and process it on the CPU.

The scalars returned by sum etc are sent to the CPU (aka the host), but the host also runs your program, and communicates with the GPU to launch kernels, so it’s probably not that bad to “send the scalar” with the kernel as you’re having to communicate anyway, and a scalar is a very small amount of memory.

Usually the only way to see if this is an issue is to benchmark and profile (e.g. using CUDA.jl docs as a guide) the code to see if the GPU kernel launches are far apart and it sitting idle while waiting for new kernels. If that’s the case, you should increase the workload size, if appropriate, or consider writing a custom kernel that fuses the operations together, but this is much harder.