I have zero experience with GPU programming in Julia. In general, however, if the computation is very simple, transferring the data to GPU memory and running the computation there can be more expensive than running it directly on the CPU. I am not sure that is what is happening here, though.
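One rough way to check whether transfer cost dominates is to time the host-to-device copy separately from the computation. The sketch below is not the original poster's code: the function name `time_transfer_vs_compute` and the reduction `sum(abs2, x)` are placeholders chosen only for illustration, and the numbers will vary by machine.

```julia
using CUDA

# Hypothetical helper: times a simple reduction on the CPU, the host-to-device
# copy, and the same reduction on the GPU. Returns times in seconds.
function time_transfer_vs_compute(n::Int)
    x = rand(Float32, n)

    # Warm-up so that compilation time is not included in the measurements.
    sum(abs2, x)
    d_x = CuArray(x)
    CUDA.@sync sum(abs2, d_x)

    t_cpu      = @elapsed sum(abs2, x)
    t_transfer = @elapsed CUDA.@sync CuArray(x)
    t_gpu      = @elapsed CUDA.@sync sum(abs2, d_x)

    return (cpu = t_cpu, transfer = t_transfer, gpu = t_gpu)
end

# Only run when a working GPU is available.
if CUDA.functional()
    println(time_transfer_vs_compute(1_000_000))
end
```

If `transfer` is much larger than `gpu`, the copy is the bottleneck and keeping the data resident on the device across calls is likely to matter more than tuning the kernel.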
Thanks for the suggestion. I timed the code over multiple runs after all the memory had been copied over, and I still saw a 10x slowdown relative to the CPU. I did notice that if I deliberately introduce a bug in the kernel code and replace images[i+p-1,j+q-1,ch,n] with images[i+p-1,j+q-1,ch] (note the missing 4th index), the GPU code is indeed faster than the CPU code. My guess is that this indicates the memory access pattern inside my kernel is not right. But I have no idea what the right way is!
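One thing that may be relevant to the access pattern: Julia arrays are column-major, so the first index is the contiguous one in memory, and the same layout applies to a `CuArray`. GPU memory accesses generally coalesce best when neighbouring threads touch neighbouring addresses, which here would mean mapping consecutive thread indices to the first index. The sketch below only inspects the layout on the CPU; the array shape is a made-up example, not the sizes from the post, and whether this is the actual cause of the slowdown is unknown.

```julia
# Hypothetical image batch: height, width, channels, batch size.
images = zeros(Float32, 64, 48, 3, 10)

# Distance in elements between consecutive entries along each dimension.
s = strides(images)
println(s)

# Moving one step along the first index moves one element in memory;
# moving one step along the fourth index jumps a whole image (all channels).
li = LinearIndices(images)
step_first  = li[2, 1, 1, 1] - li[1, 1, 1, 1]
step_fourth = li[1, 1, 1, 2] - li[1, 1, 1, 1]
println((step_first, step_fourth))
```

With this layout, a kernel in which neighbouring threads differ in `n` (the last index) reads addresses that are far apart, while one in which they differ in `i` (the first index) reads adjacent addresses.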
I browsed through some other posts here and guessed that much of the slowdown could be because the query tile is shared among all threads. So I thought I should put it in dynamic shared memory (dynamic because I don’t have a static size). The CUDA.jl documentation was not very clear about how to do this, but I tried something and got the following error:
error: :1:16: invalid register name
mov.u32 %edx, %dynamic_smem_size;
I can clean up and post my code if needed, but does anyone have any advice before I do that?
I realized my mistake and fixed the crash. CuDynamicSharedArray has to be called from inside the device code. The examples helped me.
However, the code is still 10x slower. Maybe this is a case where AVX-512 is just better?
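For anyone else who hits the same `%dynamic_smem_size` error, here is a minimal sketch of dynamic shared memory in CUDA.jl, with `CuDynamicSharedArray` called inside the kernel and the size in bytes passed through the `shmem` keyword of `@cuda`. The kernel `scale_by_tile_sum!` and its data are invented for illustration and are not the code from this thread.

```julia
using CUDA

# Hypothetical kernel: each block copies a small `tile` into dynamic shared
# memory, then every thread multiplies one element of `x` by the tile's sum.
function scale_by_tile_sum!(out, x, tile)
    # Must be called from device code; the length comes from a kernel argument.
    shared = CuDynamicSharedArray(Float32, length(tile))

    # Cooperative load: the threads of a block stride over the tile.
    t = threadIdx().x
    while t <= length(tile)
        @inbounds shared[t] = tile[t]
        t += blockDim().x
    end
    sync_threads()  # make the loaded tile visible to all threads in the block

    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        acc = 0f0
        for k in 1:length(tile)
            @inbounds acc += shared[k]
        end
        @inbounds out[i] = x[i] * acc
    end
    return nothing
end

function run_example()
    x = CUDA.ones(Float32, 1024)
    tile = CuArray(Float32[1, 2, 3, 4])
    out = similar(x)
    threads = 256
    blocks = cld(length(x), threads)
    # `shmem` is the number of bytes of dynamic shared memory per block.
    @cuda threads=threads blocks=blocks shmem=sizeof(Float32)*length(tile) scale_by_tile_sum!(out, x, tile)
    return Array(out)
end

# Only run when a working GPU is available.
if CUDA.functional()
    result = run_example()
    println(result[1:4])
end
```

Shared memory mainly pays off when each value is reused many times per block; if the tile is small and already served well by the cache, the gain may be modest, so it is worth profiling before and after.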