Scalar Indexing with CUDA

I have a portion of my code that I haven’t quite figured out how to do without scalar indexing. I was messing around in the REPL and noticed that, while something like
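(the original snippet wasn’t preserved; a plausible minimal example of a scalar read, assuming `a` is a `CuArray`:)

```julia
using CUDA

a = CUDA.rand(Float32, 10)
a[1]  # scalar getindex: warns in the REPL (errors in non-interactive code)
```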


will give the scalar indexing warning, something like
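(again, the original snippet wasn’t preserved; a plausible vectorized equivalent:)

```julia
using CUDA

a = CUDA.rand(Float32, 10)
a[1:1]  # vectorized getindex: returns a 1-element CuArray, no warning
```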


does not. My question is:

Will these both incur the same performance hit that comes from scalar indexing, or is there something different about the second compared to the first in terms of performance?


The performance of both will be bad: there’s quite some overhead associated with fetching GPU memory, much larger than the actual transfer time of a single element. The reason we don’t error on the second expression is that we detect scalar transfers by hooking the relevant scalar getindex methods, not the vectorized one you’re using in the second example.

If you know that the performance overhead of this operation isn’t problematic (e.g., because you only perform it rarely), you can annotate the expression with CUDA.@allowscalar. If you do need frequent scalar accesses to GPU memory, e.g. because you’re porting a CPU application but haven’t ported all the algorithms yet, consider using unified memory (see CUDA.jl 5.4: Memory management mayhem ⋅ JuliaGPU).
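For example (a sketch; `a` is an assumed `CuArray`, and the `unified` keyword is described in the linked CUDA.jl 5.4 post):

```julia
using CUDA

a = CUDA.rand(Float32, 10)

# Explicitly permit the (slow) scalar transfer for this expression only:
x = CUDA.@allowscalar a[1]

# Alternatively, allocate in unified memory so the driver pages data
# between host and device on demand:
b = cu(rand(Float32, 10); unified=true)
```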


I see, that makes sense. This is more for my personal understanding now, but I do have a few more questions.

What is the difference between a getindex on the CPU and one that is compiled into a kernel? Is it that in a kernel, the getindex is compiled and run on the GPU as well?

Also, roughly how many elements would it take for a vectorized setindex! to be worth it in a simple assignment (e.g. a[1:n] .= 1)?

Finally, does this change in the case of a view, such as @views b .= a[1:n]?
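For reference, the patterns being asked about might look like this (an illustrative sketch; actual crossover points depend on the GPU, so no timings are claimed):

```julia
using CUDA

a = CUDA.zeros(Float32, 1024)
n = 512

# One element at a time: each iteration is a separate host-device round trip.
CUDA.@allowscalar for i in 1:n
    a[i] = 1f0
end

# Vectorized: a single broadcast kernel sets all n elements on the device.
a[1:n] .= 1f0

# With @views, the slice a[1:n] is not materialized as a temporary array;
# the broadcast copies directly from the view into b.
b = CUDA.zeros(Float32, n)
@views b .= a[1:n]
```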


In NNlib.jl, for example, the gather method seems to work by compiling a call to getindex into a GPU kernel:
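(the referenced code isn’t included here; schematically, such a kernel has each GPU thread call getindex on device memory, along these lines — a simplified sketch, not NNlib’s actual implementation:)

```julia
using CUDA

# Simplified gather: dst[i] = src[idx[i]], one thread per output element.
function gather_kernel!(dst, src, idx)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(dst)
        @inbounds dst[i] = src[idx[i]]  # getindex runs on the GPU here
    end
    return
end

src = CUDA.rand(Float32, 100)
idx = CuArray(collect(1:2:99))
dst = CUDA.zeros(Float32, length(idx))
@cuda threads=64 blocks=cld(length(dst), 64) gather_kernel!(dst, src, idx)
```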

Would it be faster to use unified memory instead of calling this method?


Correct: getindex in a kernel is fast; it’s only slow when executed from the CPU, where each access needs to fetch memory from the device. If it’s already executing on the GPU (i.e., in a kernel), these accesses are fast.

Not necessarily. Unified memory comes with its own drawbacks. For one, you’re now relying on the kernel driver to do the memory management, which may be suboptimal. The operation also becomes synchronous, i.e., it blocks while the page fault is handled, whereas a regular kernel launch is asynchronous, which makes it possible for the CPU to do other work while waiting for the GPU to complete.
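To make the asynchrony point concrete (a sketch; timings vary by hardware, so none are shown):

```julia
using CUDA

a = CUDA.rand(Float32, 1_000_000)
b = similar(a)

# Kernel launches (including broadcasts) return immediately; the work is
# merely enqueued on the GPU's stream.
b .= a .* 2f0

# ... the CPU is free to do other work here ...

CUDA.synchronize()  # block until the GPU has actually finished
```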