Presentation on effective use of CUDAnative/CuArrays


Since our GPU stack is pretty ill-documented, I figured it can’t hurt to cross-post whatever content there is. Here’s a recent presentation of mine that explains how the relevant packages work, with demos, as well as tips, tricks, and tools for using them effectively:


I don’t suppose there’s a video of that? Looks extremely useful, but for me a little hard to follow


Sadly, no. I did add presenter notes though, and I could add some more if certain parts are unclear.


This is beautiful! Thanks for sharing! We should have more of those… :slight_smile:


Hi @maleadt, thanks a lot for sharing those. Can I ask a quick question, please? I don’t really understand what’s going on with this example on slide 22:

function diff_y(a, b)
   a .= @views b[:, 2:end] .- b[:, 1:end-1]
end
# vs
function diff_y(a, b)
   s = size(a)
   for j = 1:s[2]
       @inbounds a[:,j] .= b[:,j+1] - b[:,j]
   end
end

So, in the second case, upon doing @cuda diff_y(a,b), we would effectively generate one GPU kernel for each j, whereas in the first case it’s a single kernel?

More generally: there is nothing wrong per se with splitting tasks on the GPU into functions, right? I mean, I could have 2 kernels, where kernel 1 calls kernel 2, instead of stuffing all tasks into one big function? (Kernel 2 is written as a regular Julia function, i.e. I don’t need @cuda in front of it, correct?)


No, we never call @cuda explicitly in these examples. Instead, we rely on the array abstractions in CuArrays.jl, whose implementations call @cuda for us. And that’s exactly why the first example is better: only a single fused call to broadcast is executed, resulting in a single @cuda, whereas the loop version performs a broadcast in every iteration and consequently calls @cuda once per column.
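To make the difference concrete, here is a minimal sketch of the two variants from the slide. It uses plain CPU arrays so it runs without a GPU; the assumption is that the same functions would be called with CuArrays on the device. The names diff_y_fused! and diff_y_loop! are made up for this illustration, and the loop body is written with views and a dotted subtraction, which differs slightly from the slide.

```julia
# Fused variant: one broadcast expression over views of b.
# With array abstractions like those in CuArrays.jl, this is a single broadcast call.
function diff_y_fused!(a, b)
    a .= @views b[:, 2:end] .- b[:, 1:end-1]
    return a
end

# Loop variant: one broadcast per column, so size(a, 2) separate broadcast calls.
function diff_y_loop!(a, b)
    for j in 1:size(a, 2)
        @views a[:, j] .= b[:, j+1] .- b[:, j]
    end
    return a
end

b  = Float32[1 2 4; 1 3 9]   # 2×3 input
a1 = zeros(Float32, 2, 2)    # output has one column fewer than b
a2 = zeros(Float32, 2, 2)

diff_y_fused!(a1, b)
diff_y_loop!(a2, b)
```

Both variants compute the same result; they differ only in how many broadcast operations (and hence, on the GPU, how many kernel launches) are needed.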

Correct, but then we don’t call kernel 2 a kernel, just an ordinary (device) function. Fusion only makes sense when you launch multiple kernels (i.e. multiple calls to @cuda, either explicitly or as part of the array abstractions from CuArrays.jl) and those kernels are too small to saturate the GPU on their own. This typically happens with short operations, such as the ones you end up with when doing broadcast. When writing your own kernels, it is much easier to saturate the GPU. But again, profile to be sure (see the screenshots at the end of my talk).
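As a rough illustration of splitting work into ordinary functions, the sketch below calls two small scalar helpers from a single fused broadcast. It again uses CPU arrays so it runs anywhere; the helper names square and shifted and the wrapper combine! are hypothetical. The assumption is that, with CuArrays, such helpers would simply be compiled as part of the one broadcast kernel rather than launched separately.

```julia
# Hypothetical scalar helpers: plain Julia functions, no @cuda involved.
square(x) = x * x
shifted(x, c) = x + c

# A single fused broadcast that calls both helpers for every element.
function combine!(out, x, c)
    out .= square.(x) .+ shifted.(x, c)
    return out
end

x   = Float32[1, 2, 3]
out = similar(x)
combine!(out, x, 1f0)
```

The helpers keep the code readable, while the dotted calls still fuse into one pass over the data.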


Does that mean that on the GPU we’re back to trying to make code vectorized?


Pretty much, although the definition of “vectorized” is much broader nowadays (with broadcast fusion). Here’s hoping for similar expressibility improvements for other operators like reduce.
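For reductions, a small sketch of what the lack of fusion looks like in practice, using only Base functions on a CPU array: broadcasting first and reducing afterwards materializes an intermediate array, whereas mapreduce applies the function while reducing. How this maps onto GPU arrays is not shown here and should be checked by profiling.

```julia
x = Float32[1, 2, 3]

# Broadcast, then reduce: abs2.(x) is materialized as a temporary array first.
s_broadcast = sum(abs2.(x))

# Map and reduce in one call: abs2 is applied to each element during the reduction.
s_mapreduce = mapreduce(abs2, +, x)
```

Both expressions give the same sum of squares; only the second avoids the temporary.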