Presentation on effective use of CUDAnative/CuArrays

maleadt · December 22, 2018, 8:13am

Since our GPU stack is pretty ill-documented, I figured it can’t hurt to cross-post any content there is. Here’s a recent presentation of mine that explains how the relevant packages work, with demos, as well as tips/tricks and tools for using them effectively: Julia BeNeLux 2018-12 - GPU Tutorial - Google Slides

mkborregaard · December 23, 2018, 9:56pm

I don’t suppose there’s a video of that? It looks extremely useful, but it is a little hard for me to follow.

maleadt · December 24, 2018, 7:23am

Sadly, no. I did add presenter notes though, and I could add some more if certain parts are unclear.

juliohm · December 26, 2018, 11:22am

This is beautiful! Thanks for sharing! We should have more of those…

floswald · December 30, 2018, 2:26pm

Hi @maleadt, thanks a lot for sharing those. Can I ask a quick question please? I don’t really understand what’s going on with this example here on slide 22:

function diff_y(a, b)
   a .= @views b[:, 2:end] .- b[:, 1:end-1]
end
# vs
function diff_y(a, b)
   s = size(a)
   for j = 1:s[2]
       @inbounds a[:,j] .= b[:,j+1] - b[:,j]
   end
end

so, in the second case, upon doing @cuda diff_y(a,b) we would effectively generate one GPU kernel for each j, whereas in the first case it’s one unique kernel?

More generally: there is nothing wrong per se with splitting tasks on the GPU into functions, right? I mean, I could have 2 kernels, where kernel 1 calls kernel 2, instead of stuffing all tasks into one big function? (kernel 2 is written as a regular Julia function, i.e. I don’t need @cuda in front of it, correct?)
Thanks!

maleadt · January 3, 2019, 6:36am

No, we’re never doing an explicit @cuda in these examples; we rely on the array abstractions in CuArrays.jl, whose implementations call @cuda for us. And that’s exactly why the fused version (the one without the loop) is better: only a single fused call to broadcast is executed, resulting in a single @cuda, whereas the loop version performs a broadcast in every iteration, causing multiple calls to broadcast and consequently to @cuda.
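To make the comparison concrete, here is a minimal sketch of both versions on plain CPU arrays, so it runs without a GPU. The names diff_y_fused and diff_y_loop are made up for illustration, and the claim that each broadcast call maps to its own kernel launch with CuArrays is taken from the explanation above rather than demonstrated by this code.

```julia
# Fused version: the right-hand side and the assignment form one broadcast,
# so the array library sees a single operation.
function diff_y_fused(a, b)
    a .= @views b[:, 2:end] .- b[:, 1:end-1]
    return a
end

# Loop version: one broadcast assignment per column, plus temporaries for
# the slices and for the non-dotted subtraction.
function diff_y_loop(a, b)
    for j in 1:size(a, 2)
        a[:, j] .= b[:, j+1] - b[:, j]
    end
    return a
end

b  = Float32[1 2 4; 3 6 10]     # 2×3 input
a1 = zeros(Float32, 2, 2)
a2 = zeros(Float32, 2, 2)
diff_y_fused(a1, b)
diff_y_loop(a2, b)
```

Both functions compute the same result; the difference is only in how many separate broadcast operations are issued, which is what should matter once the arrays live on the GPU.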

Correct, but we don’t call kernel 2 a kernel then, just an ordinary (device) function. Fusion only makes sense when you launch multiple kernels (i.e. multiple calls to @cuda, either explicitly or as part of array abstractions from CuArrays.jl) and those kernels are too small to saturate the GPU on their own. This typically happens with short operations such as the ones you end up with when doing broadcast. When writing your own kernels, it is much easier to saturate the GPU. But again, profile to be sure (see the screenshots at the end of my talk).
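As a rough illustration of splitting work into ordinary functions, the sketch below applies a hypothetical scalar helper (clamp_diff, not from the talk) inside a single fused broadcast. It uses CPU arrays so it runs anywhere. The assumption is that with CuArrays the same expression would compile the helper into the one kernel that broadcast launches, so no @cuda is needed on the helper.

```julia
# Hypothetical scalar helper: an ordinary Julia function, no @cuda involved.
clamp_diff(x, y) = max(x - y, zero(x))

# The helper is applied element-wise inside one fused, in-place broadcast.
function positive_diff!(out, x, y)
    out .= 2 .* clamp_diff.(x, y)
    return out
end

x   = Float32[1, 5, 3]
y   = Float32[2, 1, 3]
out = similar(x)
positive_diff!(out, x, y)
```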

mkborregaard · January 3, 2019, 7:24am

Does that mean that on the GPU we’re back to trying to make code vectorized?

maleadt · January 3, 2019, 7:28am

Pretty much, although the definition of “vectorized” is much broader nowadays (with broadcast fusion). Here’s hoping for similar expressibility improvements for other operators like reduce.
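A small sketch of what broadcast fusion means in practice, again on CPU arrays so it runs without a GPU; the variable names are illustrative only. The dotted expression is lowered to one in-place broadcast, whereas the explicit broadcast calls each produce an intermediate array. The reduction at the end is a separate step that first materializes the element-wise products.

```julia
x = Float32[1, 2, 3]
y = Float32[10, 20, 30]
alpha = 2f0

# Unfused: each call produces its own intermediate array.
tmp     = broadcast(*, alpha, x)
unfused = broadcast(+, tmp, y)

# Fused: the dotted operators and .= combine into one in-place broadcast.
fused = similar(x)
fused .= alpha .* x .+ y

# The reduction is not part of the fused broadcast: x .* y is computed
# into a temporary array first, then summed.
total = sum(x .* y)
```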
