Writing fast stencil computation kernels that work on both CPUs and GPUs

Thanks for the example, @maleadt, and for devoting so much of your time to building up the GPU/CUDA infrastructure in Julia!

I must have been trying so hard to get a shared CPU/GPU kernel working that I forgot @cuda threads=... won’t automatically parallelize code: you still have to partition the work across threads and blocks yourself inside the kernel.

Indeed, I just tried out your kernel and @benchmark gives a median time of 1.614 ms with 128 threads (compared with 178.504 μs using @views). I’m sure it could be made faster, as you say, but I’m interested in seeing how far I can get while sticking to @views and avoiding messy, elaborate thread/block kernels. I don’t mind sacrificing a bit of performance in favor of more readable and generic code. I will try fusing some of the smaller operations and see how @views performs on a much bigger kernel, which should hopefully help bring down the broadcasting overhead.
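For concreteness, here’s a minimal sketch of the kind of fusion I mean (my own toy example with the modern CUDA.jl interface, not code from this thread):

```julia
using CUDA

a, b, c = CUDA.rand(1024), CUDA.rand(1024), CUDA.rand(1024)
out = similar(a)

# Unfused: two separate broadcast kernels plus a temporary array
tmp = a .+ b
out .= tmp .* c

# Fused: the dots combine into a single broadcast kernel, no temporary
out .= (a .+ b) .* c
```

Each unfused broadcast pays its own kernel-launch cost on the GPU, so combining several small operators into one dotted expression should amortize that overhead.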

The big issue is that I unfortunately cannot use threadIdx(), blockIdx(), etc., as the kernel would then no longer run on a CPU, so I feel like the vast majority of research and tutorials out there on implementing efficient stencil computations won’t help much here.
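To make that concrete, this is the style I mean (a toy 1D Laplacian of my own, not the kernel from this thread):

```julia
using CUDA

# 1D Laplacian with explicit thread/block indexing. threadIdx(), blockIdx(),
# and blockDim() only exist inside a CUDA kernel, so this function can never
# be called on a plain Array.
function laplacian_kernel!(out, u)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if 2 <= i <= length(u) - 1
        @inbounds out[i] = u[i-1] - 2u[i] + u[i+1]
    end
    return nothing
end

u = CUDA.rand(1024)
out = similar(u)
@cuda threads=256 blocks=cld(length(u), 256) laplacian_kernel!(out, u)
```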

But if I can efficiently implement all the operators using @views, that would be excellent! Then the same code should run on both CPUs (using regular Arrays) and GPUs (using CuArrays). It shouldn’t be too much work, so I’ll give it a try.
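Roughly what I have in mind, as a minimal sketch (the laplacian! name and the 1D stencil are just placeholders for the real operators):

```julia
using CUDA  # only needed to construct the CuArray below

# Same toy 1D Laplacian, written with @views and fused broadcasting instead.
# The shifted slices are allocation-free views, and the whole expression
# becomes a single fused broadcast (one kernel launch on the GPU).
function laplacian!(out, u)
    n = length(u)
    @views out[2:n-1] .= u[1:n-2] .- 2 .* u[2:n-1] .+ u[3:n]
    return out
end

# The identical function runs on both backends:
u_cpu = rand(1024)
laplacian!(similar(u_cpu), u_cpu)

u_gpu = CUDA.rand(1024)
laplacian!(similar(u_gpu), u_gpu)
```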