Scalar Indexing with CUDA

I have a portion of my code that I haven’t quite figured out how to do without scalar indexing. I was messing around in the REPL and noticed that, while something like
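(the original snippet wasn’t preserved; a plausible minimal example of a scalar read, assuming `a` is a `CuArray`:)

```julia
using CUDA

a = CUDA.rand(Float32, 10)
a[1]  # scalar getindex: warns in the REPL (errors in non-interactive code)
```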


will give the scalar indexing warning, something like
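(again, the original snippet wasn’t preserved; a plausible vectorized equivalent:)

```julia
using CUDA

a = CUDA.rand(Float32, 10)
a[1:1]  # vectorized getindex: returns a 1-element CuArray, no warning
```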


does not. My question is:

Will these both incur the same performance hit that comes from scalar indexing, or is there something different about the second compared to the first in terms of performance?


The performance of both will be bad: there’s quite some overhead associated with fetching GPU memory, much larger than the actual transfer time of a single element. The reason we don’t error on the second expression is that we detect scalar transfers by hooking the relevant scalar getindex methods, not the vectorized one you’re using in the second example.

If you know that the performance overhead of this operation isn’t problematic (e.g., because you only perform it rarely), you can annotate the expression with CUDA.@allowscalar. If you do need frequent scalar accesses to GPU memory, e.g. because you’re porting a CPU application but haven’t ported all the algorithms yet, consider using unified memory (see CUDA.jl 5.4: Memory management mayhem ⋅ JuliaGPU).
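For example (a sketch; `a` is an assumed `CuArray`, and the `unified` keyword is described in the linked CUDA.jl 5.4 post):

```julia
using CUDA

a = CUDA.rand(Float32, 10)

# Explicitly permit the (slow) scalar transfer for this expression only:
x = CUDA.@allowscalar a[1]

# Alternatively, allocate in unified memory so the driver pages data
# between host and device on demand:
b = cu(rand(Float32, 10); unified=true)
```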


I see, that makes sense. This is more for my personal understanding now, but I do have a few more questions.

What is the difference between a getindex on the CPU and one that is compiled into a kernel? Is it that in a kernel, the getindex is compiled and run on the GPU as well?

Also, roughly how many elements would it take for a vectorized setindex! to be worth it in a simple assignment (e.g. a[1:n] .= 1)?

Finally, does this change in the case of a view, such as @views b .= a[1:n]?
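For reference, the patterns being asked about might look like this (an illustrative sketch; actual crossover points depend on the GPU, so no timings are claimed):

```julia
using CUDA

a = CUDA.zeros(Float32, 1024)
n = 512

# One element at a time: each iteration is a separate host-device round trip.
CUDA.@allowscalar for i in 1:n
    a[i] = 1f0
end

# Vectorized: a single broadcast kernel sets all n elements on the device.
a[1:n] .= 1f0

# With @views, the slice a[1:n] is not materialized as a temporary array;
# the broadcast copies directly from the view into b.
b = CUDA.zeros(Float32, n)
@views b .= a[1:n]
```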


In NNlib.jl, for example, the gather method seems to work by compiling a call to getindex into a GPU kernel:
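(the referenced code isn’t included here; schematically, such a kernel has each GPU thread call getindex on device memory, along these lines — a simplified sketch, not NNlib’s actual implementation:)

```julia
using CUDA

# Simplified gather: dst[i] = src[idx[i]], one thread per output element.
function gather_kernel!(dst, src, idx)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(dst)
        @inbounds dst[i] = src[idx[i]]  # getindex runs on the GPU here
    end
    return
end

src = CUDA.rand(Float32, 100)
idx = CuArray(collect(1:2:99))
dst = CUDA.zeros(Float32, length(idx))
@cuda threads=64 blocks=cld(length(dst), 64) gather_kernel!(dst, src, idx)
```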

Would it be faster to use unified memory instead of calling this method?


Correct: getindex in a kernel is fast; it’s only slow when executed from the CPU, where each access needs to fetch memory from the device. If it’s already executing on the GPU (i.e., in a kernel), these accesses are fast.

Not necessarily. Unified memory comes with its own drawbacks. For one, you’re now relying on the kernel driver to do the memory management, which may be suboptimal. The operation also becomes synchronous, i.e., it blocks while the page fault is handled, whereas a regular kernel launch is asynchronous, which makes it possible for the CPU to do other work while waiting for the GPU to complete.
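To make the asynchrony point concrete (a sketch; timings vary by hardware, so none are shown):

```julia
using CUDA

a = CUDA.rand(Float32, 1_000_000)
b = similar(a)

# Kernel launches (including broadcasts) return immediately; the work is
# merely enqueued on the GPU's stream.
b .= a .* 2f0

# ... the CPU is free to do other work here ...

CUDA.synchronize()  # block until the GPU has actually finished
```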