Fast tile search

The following code (using Tullio.jl):

@tullio dists[u,v,n] := abs( query[i, j, ch] - images[u+i-1, v+j-1, ch, n] )

is very fast on my AVX512-capable 4-core CPU.
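To pin down what the `@tullio` line computes: indices `i`, `j`, `ch` appear only on the right-hand side, so Tullio sums over them. A plain-loop sketch of the same reduction (with array sizes shrunk so it runs anywhere; the function name is just for illustration):

```julia
# Plain-loop equivalent of the @tullio line above:
# dists[u,v,n] = sum over i,j,ch of abs(query[i,j,ch] - images[u+i-1, v+j-1, ch, n])
function tile_dists(query, images)
    qi, qj, qc = size(query)
    Iu, Iv, Ic, N = size(images)
    @assert qc == Ic
    dists = zeros(eltype(query), Iu - qi + 1, Iv - qj + 1, N)
    for n in 1:N, v in axes(dists, 2), u in axes(dists, 1)
        s = zero(eltype(query))
        for ch in 1:qc, j in 1:qj, i in 1:qi
            s += abs(query[i, j, ch] - images[u+i-1, v+j-1, ch, n])
        end
        dists[u, v, n] = s
    end
    return dists
end

query  = rand(Float32, 3, 3, 3)      # stand-in for the ~10x10x3 query
images = rand(Float32, 8, 8, 3, 4)   # stand-in for the 32x32x3x20000 stack
size(tile_dists(query, images))      # (6, 6, 4)
```

Tullio fuses these loops, multithreads them, and vectorizes the inner reduction, which is where the AVX512 speed comes from.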

I would like to run this even faster on my CUDA-capable GPU. However, all my naive attempts:

  • using Tullio itself
  • using mapreduce in various ways
  • a custom kernel in which thread n handles images[:,:,:,n]

have all been roughly 10x slower.

For reference, query is typically around 10x10x3 and images around 32x32x3x20000.

Before I dig further into GPU programming I am wondering if I can get some sage advice on this forum.


I have zero experience with GPU programming in Julia. Generally, though, if the computation is very simple, transferring the data to GPU memory and computing there can be more expensive than doing the computation directly on the CPU. I am not sure whether that is the case here.
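A back-of-envelope check on the transfer cost, using the sizes from the original post (the PCIe bandwidth is an assumed ballpark figure, not a measurement):

```julia
# Is host-to-device transfer plausibly the bottleneck for these sizes?
elems = 32 * 32 * 3 * 20_000          # elements in `images`
bytes = elems * sizeof(Float32)        # 245_760_000 bytes ≈ 234 MiB
pcie_bps = 12e9                        # ~12 GB/s effective PCIe 3.0 x16 (assumption)
transfer_ms = bytes / pcie_bps * 1e3
println("images: $(round(bytes / 2^20, digits = 1)) MiB, ",
        "one-off transfer ≈ $(round(transfer_ms, digits = 1)) ms")
```

So the whole `images` array is a one-off copy on the order of tens of milliseconds; if the kernel itself is timed after the data is already resident on the GPU, transfer cost alone would not explain a sustained 10x slowdown.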


Thanks for the suggestion. I timed the code over multiple runs after all the memory had been copied over, and I still saw a 10x slowdown relative to the CPU. I did notice that if I deliberately introduce a bug in the kernel code and replace images[i+p-1,j+q-1,ch,n] with images[i+p-1,j+q-1,ch] (note the missing 4th index), then the GPU code is indeed faster than the CPU code. My guess is that this indicates the memory access pattern inside my kernel is wrong. But I have no idea what the right pattern is!

I browsed through some other posts here and guessed that much of the slowdown could be because the query tile is shared among all threads. So I thought I should put it in dynamic shared memory (dynamic because I don't have a static size). The CUDA.jl documentation was not very clear about how to do this, but I tried something and got the following error:

error: :1:16: invalid register name
mov.u32 %edx, %dynamic_smem_size;

I can clean up and post my code if needed, but does anyone have any advice before I do that?

Does anyone know if the dynamic shared memory techniques from this old thread:

are still valid in 1.8.2?

I realized my mistake and fixed the crash: CuDynamicSharedArray has to be called from inside the device code. The examples helped me.
However, the code is still 10x slower. Maybe this is a case where AVX512 is simply better?
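For anyone hitting the same crash, here is a minimal sketch of the pattern described above, assuming CUDA.jl's `CuDynamicSharedArray`: the allocation happens inside the kernel, and the byte count is supplied at launch via the `shmem` keyword. The kernel body and the variables `out` and `query_d` are placeholders, not the poster's actual code, and this requires a CUDA-capable GPU to run:

```julia
using CUDA  # requires a CUDA-capable GPU

# CuDynamicSharedArray must be called *inside* device code; the size in bytes
# is passed at launch time via the `shmem` keyword of @cuda.
function kernel!(out, query, qlen)
    q = CuDynamicSharedArray(Float32, qlen)   # lives in dynamic shared memory
    i = threadIdx().x
    if i <= qlen
        q[i] = query[i]        # cooperatively stage the query tile once per block
    end
    sync_threads()
    # ... per-thread work reading q[...] instead of re-fetching global memory ...
    return nothing
end

qlen = 10 * 10 * 3   # the ~10x10x3 query tile, flattened
@cuda threads=256 shmem=qlen * sizeof(Float32) kernel!(out, query_d, qlen)
```

Calling `CuDynamicSharedArray` outside a kernel is what produces the `%dynamic_smem_size` register error, since that PTX special register only exists in device code.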

Just a few hints, since I had similar tasks in the past:

  • Tullio is crazily fast; for array sizes like these, the speedup with CUDA is sometimes only a factor of 10. (What is your GPU, by the way?)
  • Did you use Float32?
  • Can you post a full MWE? Then we could run your code directly; the lack of one is probably part of why you are not seeing many replies.
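On the Float32 point: consumer GPUs typically run Float64 at a small fraction of their Float32 throughput, so accidentally feeding Float64 arrays to a kernel is a classic source of a 10x-class slowdown. A quick way to check and convert (plain Julia, no GPU needed):

```julia
# rand() defaults to Float64 — easy to do by accident when building test data.
images64 = rand(32, 32, 3, 100)
eltype(images64)                           # Float64

images32 = Float32.(images64)              # convert once, up front
eltype(images32)                           # Float32
sizeof(images32) == sizeof(images64) ÷ 2   # half the memory (and transfer) cost
```

Converting before uploading also halves the host-to-device transfer volume.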