Fast tile search

00shiv · November 5, 2022, 7:19pm

The following code (using Tullio.jl):

@tullio dists[u,v,n] := abs( query[i, j, ch] - images[u+i-1, v+j-1, ch, n] )

is very fast on my AVX512-capable 4-core CPU.
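For readers who do not use Tullio: the expression above should be equivalent to a sum of absolute differences over every window position. The indices i, j and ch appear only on the right, so they are summed over. The plain-loop version below is a sketch for reference, not the code that was benchmarked. The name tile_dists is made up for illustration, and the range of u and v assumes Tullio infers the largest window range that stays inside images.

```julia
# Reference loops for the @tullio expression (illustrative, not benchmarked).
# `tile_dists` is a hypothetical name, not part of Tullio.
function tile_dists(query::AbstractArray{T,3}, images::AbstractArray{T,4}) where {T}
    qh, qw, nch = size(query)
    ih, iw, ich, nimg = size(images)
    @assert nch == ich "query and images must have the same number of channels"

    # One distance per window position (u, v) and per image n.
    dists = zeros(T, ih - qh + 1, iw - qw + 1, nimg)
    for n in 1:nimg, v in axes(dists, 2), u in axes(dists, 1)
        acc = zero(T)
        # i is the innermost loop, matching Julia's column-major layout.
        for ch in 1:nch, j in 1:qw, i in 1:qh
            acc += abs(query[i, j, ch] - images[u+i-1, v+j-1, ch, n])
        end
        dists[u, v, n] = acc
    end
    return dists
end
```

With a 10x10x3 query and 32x32x3x20000 images, dists has size 23x23x20000.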

I would like to run this even faster on my CUDA-capable GPU. However, all my naive attempts:

using Tullio itself
using mapreduce in various ways
kernel code with images[:,:,:,n] being treated on thread n

all ran roughly 10x slower than the CPU version.

For reference, query is typically around 10x10x3 and images is around 32x32x3x20000.

Before I dig further into GPU programming I am wondering if I can get some sage advice on this forum.

Thanks.
–shiv–

barucden · November 7, 2022, 9:39am

I have zero experience with GPU programming in Julia. However, generally, if the computation is very simple, it can be more expensive to transfer the data to the GPU memory and perform the computation on GPU than to perform the computation on CPU directly. I am not sure this is the case though.

00shiv · November 7, 2022, 2:50pm

Thanks for the suggestion. I timed the code over multiple runs after all the memory had been copied over, and I still saw a 10x slowdown relative to the CPU. I did notice that if I deliberately introduce a bug in the kernel code and replace images[i+p-1,j+q-1,ch,n] by images[i+p-1,j+q-1,ch] (notice the missing 4th index), then the GPU code is indeed faster than the CPU code. My guess is that this points to a poor memory access pattern inside my kernel, but I have no idea what the right pattern is!
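One possible reading of that observation, offered as a guess rather than a diagnosis: Julia arrays are column-major, so if thread n handles images[:,:,:,n], neighbouring threads read addresses that are a whole image apart. GPUs generally combine loads only when neighbouring threads touch neighbouring addresses. The snippet below runs on the CPU and just computes the two strides for the array size quoted above. The variable names are illustrative.

```julia
# Size of `images` as given in the thread.
sz = (32, 32, 3, 20_000)
lin = LinearIndices(sz)

# Distance (in elements) between what two neighbouring threads read
# when the thread index is mapped to the image index n.
stride_n = lin[1, 1, 1, 2] - lin[1, 1, 1, 1]

# Same distance when the thread index is mapped to the first index instead.
stride_u = lin[2, 1, 1, 1] - lin[1, 1, 1, 1]

println("stride with thread -> n: ", stride_n)   # 3072 elements
println("stride with thread -> u: ", stride_u)   # 1 element
```

Dropping the 4th index makes every thread read the same addresses, which would be consistent with the speedup seen with the deliberate bug.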

00shiv · November 11, 2022, 6:29pm

I browsed through some other posts here and guessed that much of the slowdown could be because the query tile is shared among all threads. So I thought I should put it in dynamic shared memory (dynamic because I don’t have a static size). The CUDA.jl documentation was not very clear about how to do this, but I tried something and got the following error:

error: :1:16: invalid register name
mov.u32 %edx, %dynamic_smem_size;

I can clean up and post my code if needed, but does anyone have any advice before I do that?
Thanks.
–shiv–

00shiv · November 11, 2022, 7:23pm

Does anyone know if the techniques on dynamic shared memory from this old thread:

are still valid in 1.8.2?
Thanks.
–shiv–

00shiv · November 11, 2022, 7:46pm

I realized my mistake and fixed the crash. CuDynamicSharedArray has to be called from inside the device code. The examples helped me.
However, the code is still 10x slower. Maybe this is a case where AVX512 is just better?
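For anyone who hits the same error, here is a minimal sketch of the pattern described above: CuDynamicSharedArray is called inside the kernel, and the size in bytes is passed at launch through the shmem keyword of @cuda. This is not the kernel that was timed in this thread. The names sad_kernel! and tile_dists_gpu, the 256-thread block size and the one-thread-per-output mapping are all assumptions made for illustration, and no speedup is implied.

```julia
using CUDA

# Hypothetical kernel: one thread per output element dists[u, v, n].
function sad_kernel!(dists, query, images)
    qh, qw, nch = size(query)

    # Dynamic shared memory has to be requested here, in device code.
    q = CuDynamicSharedArray(Float32, (qh, qw, nch))

    # All threads of the block cooperate to copy the query tile.
    t = threadIdx().x
    while t <= length(query)
        @inbounds q[t] = query[t]
        t += blockDim().x
    end
    sync_threads()

    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if idx <= length(dists)
        # Linear index -> (u, v, n); u varies fastest between neighbouring threads.
        u, v, n = Tuple(CartesianIndices(dists)[idx])
        acc = 0f0
        for ch in 1:nch, j in 1:qw, i in 1:qh
            @inbounds acc += abs(q[i, j, ch] - images[u+i-1, v+j-1, ch, n])
        end
        @inbounds dists[idx] = acc
    end
    return nothing
end

# Hypothetical host-side wrapper.
function tile_dists_gpu(query::CuArray{Float32,3}, images::CuArray{Float32,4})
    qh, qw, _ = size(query)
    ih, iw, _, nimg = size(images)
    dists = CUDA.zeros(Float32, ih - qh + 1, iw - qw + 1, nimg)

    threads = 256                              # assumed block size
    blocks = cld(length(dists), threads)
    shmem = sizeof(Float32) * length(query)    # bytes of dynamic shared memory

    CUDA.@sync @cuda threads=threads blocks=blocks shmem=shmem sad_kernel!(dists, query, images)
    return dists
end
```

Wrapping the launch in CUDA.@sync also makes the call safe to time, since kernel launches are otherwise asynchronous.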
–shiv–

roflmaostc · November 11, 2022, 10:56pm

Just a few hints, since I had similar tasks in the past:

Tullio is extremely fast, and for arrays of this size the speedup from CUDA is sometimes only a factor of 10. (What is your GPU, by the way?)
Did you use Float32?
Can you post a full MWE? Then we could run your code directly. That’s probably one reason why you haven’t seen many replies.
