Yay, I finally managed to define my custom kernel. But now I need it to be fast(er)

Hi! It’s been about a week since I started learning GPGPU in Julia. At first I used ArrayFire, but then I needed to do some more sophisticated computations on the GPU and started trying to define my own custom kernels via GPUArrays/CUDAnative/CuArrays. Boy, if that isn’t hard! I find myself needing to actually read papers, instead of just reverse-engineering code I find on the internet. The scarce documentation/material is also a noteworthy hindrance.
Anyway, what I’m trying to do is implement something like a convolution kernel with one particularity: I want a separate filter for each pixel (and perhaps to change each filter individually per iteration). It didn’t seem so hard, but I knew close to nothing about GPU architectures or how they work, and got stuck multiple times. For reference, I started by tweaking the code I found here: Base function in Cuda kernels
which helped me at least manage to implement a working convolutional kernel.
I also found this post: Passing array of pointers to CUDA kernel
and therefore could at least pass a CuArray with CuDeviceArrays as argument without much trouble (each CuDeviceArray being a filter).
I was about to make a post here because I was having illegal memory access errors, the infamous code 700 (which I still don’t quite understand), but while organizing the code I somehow solved it. Now the problem I’m having is subtler: my code is not very performant. I know I have a long way to go before I fully understand what is happening here, but I believe you might be able to shed some light on it and help me understand and move a bit more fluidly in my studies. Sorry for the long introduction.

TL;DR: I need help (some tips/orientations will do) optimizing the following code (I don’t know if it is good practice to share gists here, but I believed it would be better to share the entire code instead of just the kernel function):

I’m afraid you’ll have to do some more reading about GPU programming, since that kernel doesn’t really use the hardware and will never run fast :frowning: Some elaboration: GPUs are massively parallel devices, and you program them by writing kernel functions which will be executed in parallel. You can think of the function getting called thousands of times, and your code will have to differentiate behavior based on which exact invocation of the function is currently happening (by inspecting counters like threadIdx and blockDim).

Your code however only invokes the kernel function once (since you don’t specify any threads or blocks arguments to @cuda), and to perform work you have a for loop in your kernel. So you’re effectively performing all the work on a single thread, not using any of the parallel hardware a GPU has to offer.
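To make that concrete, here’s a minimal sketch of a kernel that does use the hardware: each invocation computes a single element, selected from the thread and block counters (the name `vadd!` and the sizes are just for illustration, assuming the CUDAnative/CuArrays setup from this thread):

```julia
using CUDAnative, CuArrays

# Each thread computes one element, chosen by its global index.
function vadd!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

n = 1024
a = CuArray(rand(Float32, n))
b = CuArray(rand(Float32, n))
c = similar(a)

# 256 threads per block, and enough blocks to cover all n elements.
@cuda threads=256 blocks=cld(n, 256) vadd!(c, a, b)
```

Note the bounds check `i <= length(c)`: the grid may contain more threads than elements, and the extra ones must do nothing.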

Did you have a look at the introductory tutorial? It explains, with a simple example, how to write a kernel. For convolutions, there’s also this code we recently removed from GPUArrays (as it was unused/untested) that might be interesting as a starting point.


I actually read the tutorial (perhaps not as carefully as I could and should have). Besides it, I also took a look at this: http://mikeinnes.github.io/2017/08/24/cudanative.html and at this: https://nextjournal.com/sdanisch/julia-gpu-programming. I also tried to follow these slides: https://docs.google.com/presentation/d/1l-BuAtyKgoVYakJSijaSqaTL3friESDyTOnU2OLqGoA/edit#slide=id.p. And today I’m about to start reading this paper: https://arxiv.org/abs/1712.03112.

I was actually using those parameters (threads and blocks) and removed them to follow a “make it work, then make it fast” approach. Lines 8 to 13 are where I tried to define the number of blocks and threads I could use, and line 100 shows how I was calling the kernel with @cuda (except that the number of threads was being defined at line 13 as threads, it wasn’t 1). I discovered I could use @cuda without parameters and removed them to reduce the number of possible errors, but I didn’t know what that would actually mean (like, maybe the parameterless call would infer some “average possible values” to use? I don’t know, I might be too used to high-level languages).
Thing is, I still have many doubts, and I don’t think any of the materials I posted (except perhaps the paper, which I’m yet to read) solved all of them. Without them, though, I wouldn’t have come this far. The code might look somewhat noobish, but I guess I’m starting to get the hang of it. For instance, I believe that, besides the @cuda calls, the code is reasonable, isn’t it? I’ll tweak it and find out, but I’m asking anyway so as to move the discussion a little further.
One thing I don’t quite get yet is how loops work within GPU calls. For instance, this block of code (found in the 7th of those slides):

codeblock for original `say` function
function say(num)
    @cuprintf("Thread %ld says: %ld\n",
              threadIdx().x, num)
    return
end

@cuda threads=4 say(42)

calls @cuprintf four times, one in each thread. But what if I changed it to, for instance,

changed `say` function
function say(num)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    for i in index:10
        @cuprintf("Thread %ld says: %ld at iteration %ld with blockIdx: %ld and blockDim: %ld\n",
                  threadIdx().x, num, i, blockIdx().x, blockDim().x)
    end
    return
end

@cuda threads=4 say(42)

then I don’t quite get its behavior. Specifically: it looks (quite obviously, actually), from the output, like this loop is run once per thread, in parallel. But that’s not what I want: I want the loop itself to be split across multiple threads, and I’m having trouble going from what is found in those materials to that. I don’t even know if that’s possible.

That’s not going to happen automatically. CUDAnative works at the same abstraction level as CUDA C, where nothing like that happens automatically: the GPU executes exactly what you write. If you want higher-level abstractions, I suggest you look at CuArrays, but even there we don’t have the abstraction you’re looking for.
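The manual way to split a loop across threads is what CUDA C calls a grid-stride loop: each thread starts at its own global index and then jumps by the total number of threads in the grid. A sketch of your modified `say` rewritten that way (the name `say_split` is made up here):

```julia
using CUDAnative

# Grid-stride loop: each thread starts at its global index and steps by
# the total thread count, so the N iterations are shared among threads.
function say_split(num, N)
    index  = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    for i in index:stride:N
        @cuprintf("Thread %ld says: %ld at iteration %ld\n",
                  threadIdx().x, num, i)
    end
    return
end

# With 4 threads and N = 10: thread 1 handles iterations 1, 5, 9;
# thread 2 handles 2, 6, 10; and so on.
@cuda threads=4 say_split(42, 10)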

Finally, the documentation is a little sparse because it assumes knowledge of CUDA C. Since we provide a very similar programming experience, you can also look for CUDA C code and documentation to find the information you are looking for.


Take a look at https://github.com/vchuravy/GPUifyLoops.jl. I was able to get a custom kernel to run after only reading, like, 2 papers :wink:


Hi! I tried to study a bit more, and at least managed to get my kernel to work. I’m quite sure this one runs appropriately on the GPU. I was inclined to resume my studies in C, since that seemed necessary in order to properly understand, at a lower level, how to work with CUDAnative. I haven’t done so, in fact; I kept studying Julia and tried to understand the CUDA.jl introduction a little better. I still have many doubts and still find GPGPU somewhat confusing, even at the level of abstraction offered by CUDA.jl (or maybe precisely because I only understand it at this higher level).

Still, I guess asking for some optimization tips now isn’t as unreasonable as before. Basically, I’d like to know if there is any evident part of this code that could be improved. It works, and it is quite fast, to be honest, but, as I said, I’m yet to get a solid grasp of CUDA; my comprehension of the lower-level functioning of this code is still pretty shallow, so I might not be seeing obvious changes that could improve the performance. I’m basically trying to learn how to make things as efficient as possible on the GPU, and that, as far as I know, mostly comes with practice, as with programming in general. So asking here is quite the shortcut.

The only issue I’m stumbling upon with some frequency is that my GPU runs out of memory. I could let the GC work a little more by encapsulating some of the last lines within functions, I guess. But I think that passing such a huge number of filters (the schemes argument) to a kernel is inherently memory-intensive.

Looks like an OK kernel. You could probably speed it up a little by adding @inbounds. I also think the OOM comes from allocating and caching output frames, as the kernel itself doesn’t allocate. There are more advanced optimizations possible for a memory-bound kernel like this, e.g. making sure accesses are coalesced, or putting the coefficients in constant memory, but those seem out of scope here.
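As a sketch of where @inbounds would go (a hypothetical per-pixel kernel, not the actual code from the gist):

```julia
using CUDAnative

# Hypothetical kernel: filters holds one coefficient vector per output
# element (one filter per pixel, as in the thread).
function perpixel!(out, img, filters)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)
        acc = 0.0f0
        f = filters[i]
        for k in 1:length(f)
            # @inbounds skips the bounds checks that every array access
            # otherwise performs inside the kernel.
            @inbounds acc += f[k] * img[i]
        end
        @inbounds out[i] = acc
    end
    return
end
```

@inbounds is safe here only because the surrounding `if` and `for` already keep the indices in range; it removes checks, not errors.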
