Hello everyone. It does not seem possible to push a pure scalar to the GPU, but many common linear algebra operations result in scalars, most prominently the inner product. So where do these scalars live? Do they go back to the CPU?
I ask because a performance-critical core of my current application combines such scalar results with other GPU arrays. I can only assume that sending these scalars back and forth between CPU and GPU would hurt performance. So should I be worried about code like the following?
```julia
using Flux, CUDA, LinearAlgebra

n = 3
v1 = rand(Float32, n)  |> gpu
v2 = rand(Float32, n)' |> gpu  # adjoint (row) vector
v3 = rand(Float32, n)  |> gpu
v4 = rand(Float32, n)  |> gpu

v2 * v1 * v3 + v4   # scalar × vector + vector (inner product produces the scalar)
sum(v1) * v3 + v4   # scalar × vector + vector (reduction produces the scalar)
```
I think you are correct, and you can expect that a value's type determines where it can be stored. This is what the Scalar indexing section of the CUDA.jl docs says:
Many array operations in Julia are implemented using loops, processing one element at a time. Doing so with GPU arrays is very ineffective, as the loop won’t actually execute on the GPU, but transfer one element at a time and process it on the CPU.
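To make that concrete, here is a minimal sketch (not from the original post; requires a CUDA-capable GPU) of how CUDA.jl surfaces this: indexing a single element of a CuArray is exactly the "one element at a time" pattern, so CUDA.jl warns about it or, if you disable it, raises an error, and `CUDA.@allowscalar` lets you opt in explicitly for a deliberate one-off transfer:

```julia
using CUDA

v = CUDA.rand(Float32, 3)

# Reading one element pulls it back to the host; by default CUDA.jl
# flags this slow pattern with a warning (or an error).
CUDA.allowscalar(false)     # make scalar indexing a hard error
# v[1]                      # would now throw an error

# Opt in explicitly when a one-off transfer is genuinely intended:
x = CUDA.@allowscalar v[1]  # x is a plain Float32 on the CPU
```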
The scalars returned by sum etc. are sent back to the CPU (i.e. the host), but the host also runs your program and communicates with the GPU to launch kernels, so "sending the scalar" along with the next kernel launch is probably not that bad: you have to communicate anyway, and a scalar is a tiny amount of memory. The bigger cost is that fetching the scalar forces the host to wait for the GPU to finish the preceding work (a synchronization), which can matter if it happens in a tight loop.
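If the round trip does turn out to matter, one way to avoid it entirely (a sketch, with made-up variable names) is to keep the reduction result on the GPU as a one-element array via the `dims` keyword, and then combine it with the other arrays through broadcasting:

```julia
using CUDA

n = 3
v1 = CUDA.rand(Float32, n)
v3 = CUDA.rand(Float32, n)
v4 = CUDA.rand(Float32, n)

# sum(v1) returns a Float32 on the host, but sum with dims returns
# a 1-element CuArray, so the result never leaves the GPU:
s = sum(v1; dims=1)     # 1-element CuArray{Float32}

# Broadcasting combines the scalar-like array with the vectors,
# so everything stays on the device with no host round trip:
result = s .* v3 .+ v4
```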
Usually the only way to see whether this is an issue is to benchmark and profile the code (e.g. using the CUDA.jl docs as a guide) and check whether the kernel launches are far apart and the GPU is sitting idle waiting for new kernels. If that is the case, you should increase the workload size, if appropriate, or consider writing a custom kernel that fuses the operations together, but this is much harder.
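A minimal measurement sketch along those lines (hypothetical sizes and variable names; requires a CUDA-capable GPU): `CUDA.@sync` makes timings include the asynchronous device work rather than just the launch overhead, and `CUDA.@profile` breaks down host vs. device time:

```julia
using CUDA, BenchmarkTools

n = 1_000_000
v1 = CUDA.rand(Float32, n)
v3 = CUDA.rand(Float32, n)
v4 = CUDA.rand(Float32, n)

# Time the scalar-on-host version vs. the scalar-on-device version:
@btime CUDA.@sync sum($v1) * $v3 .+ $v4
@btime CUDA.@sync sum($v1; dims=1) .* $v3 .+ $v4

# CUDA.@profile reports time spent on the host vs. the device and the
# individual kernel launches, which shows whether the GPU sits idle:
CUDA.@profile sum(v1) * v3 .+ v4
```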