Yay, I finally managed to define my custom kernel. But now I need it to be fast(er)

Hi! It’s been about a week since I started learning GPGPU in Julia. At first I used ArrayFire, but then I needed to do some more sophisticated computations on the GPU and started trying to define my own custom kernels via GPUArrays/CUDAnative/CuArrays. Boy, if that isn’t hard! I find myself needing to actually read papers and the like, instead of just reverse-engineering code I find on the internet. The scarcity of documentation/material is also a notable hindrance.
Anyway, what I’m trying to do is to implement something like a convolution kernel with one particularity: I want one filter for each pixel (and perhaps to change each filter individually per iteration). It didn’t seem so hard, but I knew close to nothing about GPU architectures or how they work and got stuck multiple times. For reference, I started by tweaking the code I found here: Base function in Cuda kernels
which helped me at least manage to implement a working convolution kernel.
I also found this post: Passing array of pointers to CUDA kernel
and could therefore pass a CuArray of CuDeviceArrays as an argument without much trouble (each CuDeviceArray being a filter).
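
For reference, the gist of that approach (as I understood it) is something like the following sketch, where npixels and the 3×3 filter size are placeholders, not my actual dimensions:

using CUDAnative, CuArrays

npixels = 16                                  # placeholder size
filters = [CuArray(rand(Float32, 3, 3)) for _ in 1:npixels]

# cudaconvert turns each CuArray into an isbits CuDeviceArray, and a
# CuArray of those can then be passed to a kernel as an "array of pointers".
dev_filters = CuArray(cudaconvert.(filters))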
I was about to make a post here because I was having illegal memory access errors, the infamous code 700 (which I still don’t quite understand), but while organizing the code and such I somehow solved it. Now the problem I’m having is somewhat subtler: my code is not very performant and, well, I know I have a long way to go before I fully understand what is happening here, but I believe you might be able to shed some light on it and help me move a bit more fluidly in my studies. Sorry for the long introduction.

TL;DR: I need help (some tips/pointers will do) optimizing the following code (I don’t know if it is good practice to share gists here, but I thought it would be better to share the entire code instead of just the kernel function):

I’m afraid you’ll have to do some more reading about GPU programming, since that kernel doesn’t really use the hardware and will never run fast :frowning: Some elaboration: GPUs are massively parallel devices, and you program them by writing kernel functions which will be executed in parallel. You can think of the function being called thousands of times, and your code has to differentiate its behavior based on which exact invocation is currently happening (by inspecting counters like threadIdx and blockDim).

Your code, however, only invokes the kernel function once (since you don’t specify any threads or blocks arguments to @cuda), and to perform work you have a for loop in your kernel. So you’re effectively performing all the work on a single thread, not using any of the parallel hardware a GPU has to offer.
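
To make that concrete, here is a minimal sketch of a kernel that does use the parallel hardware (a made-up vector add, not your convolution): each invocation computes exactly one element, and the launch configuration covers the whole array.

using CUDAnative, CuArrays

function vadd_kernel(c, a, b)
    # Derive this invocation's global index from the block/thread counters.
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    # Guard: the grid may be slightly larger than the array.
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

a = CuArray(rand(Float32, 1024))
b = CuArray(rand(Float32, 1024))
c = similar(a)

# 4 blocks of 256 threads = 1024 invocations, one per element.
@cuda blocks=4 threads=256 vadd_kernel(c, a, b)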

Did you have a look at the introductory tutorial? It explains, with a simple example, how to write a kernel. For convolutions, there is also this code we recently removed from GPUArrays (as it was unused/untested) that might be interesting as a starting point.


I actually read the tutorial (perhaps not as carefully as I could and should have). Not only that, but I also took a look at this: http://mikeinnes.github.io/2017/08/24/cudanative.html and this: https://nextjournal.com/sdanisch/julia-gpu-programming. I also tried to follow these slides: https://docs.google.com/presentation/d/1l-BuAtyKgoVYakJSijaSqaTL3friESDyTOnU2OLqGoA/edit#slide=id.p. And today I’m about to start reading this paper: https://arxiv.org/abs/1712.03112.

I was actually using those parameters (threads and blocks) and removed them to try to follow a “make it work, then make it fast” approach. Lines 8 to 13 are where I tried to define the number of blocks and threads I could use, and line 100 shows how I was calling the kernel with @cuda (except that the number of threads, defined at line 13 as threads, wasn’t 1). I discovered I could use @cuda without parameters and removed them to reduce the number of possible sources of error, but I didn’t know what that actually meant (like, maybe a parameterless call would infer some “average possible values”? idk, I might be too addicted to high-level languages). The usual idiom, as I understand it, is shown in the sketch below.
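
For reference, this is roughly the pattern I was following to pick those parameters (reusing the vadd_kernel sketch from the reply above; 256 threads is a placeholder, not a recommendation):

threads = 256                        # placeholder; tune per device/kernel
blocks = cld(length(c), threads)     # enough blocks to cover every element
@cuda blocks=blocks threads=threads vadd_kernel(c, a, b)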
Thing is, I still have many doubts, and I don’t think any of the materials I posted (except perhaps the paper, which I’m yet to read) solved all of them. Without them, though, I wouldn’t have come this far. The code might look somewhat noobish, but I guess I’m starting to get the hang of it. For instance, I believe that, besides the @cuda calls, the code is reasonable, isn’t it? I’ll tweak it and find out, but I’m asking anyway so as to keep moving the discussion a little further.
One thing I don’t quite get yet is how loops work within GPU calls. For instance, this block of code (found on the 7th of those slides):

codeblock for original `say` function
function say(num)
    @cuprintf("Thread %ld says: %ld\n",
              threadIdx().x, num)
    return
end

@cuda threads=4 say(42)

calls @cuprintf four times, once in each thread. But what if I changed it to, for instance,

changed `say` function
function say(num)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    for i in index:10
        @cuprintf("Thread %ld says: %ld at iteration %ld with blockIdx: %ld and blockDim: %ld\n",
                  threadIdx().x, num, i, blockIdx().x, blockDim().x)
    end
    return
end

@cuda threads=4 say(42)

then I don’t quite get its behavior. Specifically: it looks (quite obviously, actually), from the output, like this loop runs once in each thread, in parallel. But that’s not what I want; I want this loop to be split across multiple threads, and I’m having trouble getting from what is found in those materials to that. I don’t even know if that’s possible.

That’s not going to happen automatically. CUDAnative works at the same abstraction level as CUDA C, where nothing automatic like that happens: the GPU executes exactly what you write. If you want higher-level abstractions, I suggest you look at CuArrays, but even there we don’t have the abstraction you’re looking for.
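
If you want the iterations split across threads, you have to encode that split yourself. One common pattern is the grid-stride loop; here is a minimal sketch of your say example rewritten that way (each thread starts at its own global index and jumps by the total number of threads in the grid):

using CUDAnative

function say_split(num)
    # Global index of this thread, and total number of threads in the grid.
    index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    # Thread t handles iterations t, t+stride, t+2*stride, ... up to 10,
    # so the 10 iterations are divided among the threads with no overlap.
    for i in index:stride:10
        @cuprintf("Thread %ld says: %ld at iteration %ld\n",
                  threadIdx().x, num, i)
    end
    return
end

@cuda threads=4 say_split(42)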

Finally, the documentation is a little sparse because it assumes knowledge of CUDA C. Since we provide a very similar programming experience, you can also look for CUDA C code and documentation to find the information you are looking for.

Take a look at https://github.com/vchuravy/GPUifyLoops.jl. I was able to get a custom kernel to run after only reading, like, 2 papers :wink: