Hi! It’s been about a week that I started learning GPGPU in Julia. At first I used ArrayFire, but then I needed to do some more sophisticated computations on the GPU and started trying to define my own custom kernels via GPUArrays/CUDAnative/CuArrays. Boy if that isn’t hard! I find myself needing to actually read papers or so, instead of just reverse-engineering codes I find over the internet. The scarce documentation/material is also noteworthily hindering.
Anyway, what I’m trying to do is to implement something like a convolution kernel with a particularity: I want one filter for each pixel (and perhaps to change individually each filter per iteration). It didn’t seem so hard, but I knew close to nothing about GPUs architectures or about how they work and got stuck multiple times. For reference, I started tweaking this code I found here: Base function in Cuda kernels
which helped me at least manage to implement a working convolutional kernel.
I also found this post: Passing array of pointers to CUDA kernel
and therefore could at least pass a
CuDeviceArrays as argument without much trouble (each
CuDeviceArray being a filter).
I was about to make a post here because I was having
illegal memory access errors, infamous
code 700 (which I still don’t quite get what is), but while organizing the code and such I somehow solved it. Now the problem I’m having is somewhat subtler: my code is not quite performant and, well, I know it is a deep way down for me to fully understand what is happening here, but I believe you might be able to shed some light over it and help me understand and move a bit more fluidly in my studies. Sorry the long introduction.
TL;DR: I need help (some tips/orientations will do) optimizing the following code (I don’t know if it is good practice to share gists here but I believed it would be better sharing the entire code instead of just the kernel function):