Writing fast stencil computation kernels that work on both CPUs and GPUs

Thanks for the example, @maleadt, and for devoting so much of your time to building up the GPU/CUDA infrastructure in Julia!

I must have been trying so hard to get a shared CPU/GPU kernel working that I forgot @cuda threads=... won’t automatically parallelize code: you still have to partition the work across threads and blocks yourself inside the kernel.

Indeed, I just tried out your kernel and @benchmark gives a median time of 1.614 ms with 128 threads (compared with 178.504 μs using @views). I’m sure it could be made faster, as you say, but I’m interested in seeing how far I can get while sticking to @views and avoiding messy, elaborate thread/block kernels. I don’t mind sacrificing a bit of performance in favor of more readable and generic code. I will try fusing some of the smaller operations and see how @views performs on a much bigger kernel, which should hopefully help bring down the broadcasting overhead.
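For concreteness, here’s a minimal sketch of the kind of fusion I mean (my own toy example with the modern CUDA.jl interface, not code from this thread):

```julia
using CUDA

a, b, c = CUDA.rand(1024), CUDA.rand(1024), CUDA.rand(1024)
out = similar(a)

# Unfused: two separate broadcast kernels plus a temporary array
tmp = a .+ b
out .= tmp .* c

# Fused: the dots combine into a single broadcast kernel, no temporary
out .= (a .+ b) .* c
```

Each unfused broadcast pays its own kernel-launch cost on the GPU, so combining several small operators into one dotted expression should amortize that overhead.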

The big issue is that I unfortunately cannot use threadIdx(), blockIdx(), etc., as the kernel would then no longer run on a CPU, so I feel like the vast majority of research and tutorials out there on implementing efficient stencil computations won’t help much here.
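To make that concrete, this is the style I mean (a toy 1D Laplacian of my own, not the kernel from this thread):

```julia
using CUDA

# 1D Laplacian with explicit thread/block indexing. threadIdx(), blockIdx(),
# and blockDim() only exist inside a CUDA kernel, so this function can never
# be called on a plain Array.
function laplacian_kernel!(out, u)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if 2 <= i <= length(u) - 1
        @inbounds out[i] = u[i-1] - 2u[i] + u[i+1]
    end
    return nothing
end

u = CUDA.rand(1024)
out = similar(u)
@cuda threads=256 blocks=cld(length(u), 256) laplacian_kernel!(out, u)
```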

But if I can efficiently implement all the operators using @views, that would be excellent! Then the same code should run on both CPUs (using regular Arrays) and GPUs (using CuArrays). It shouldn’t be too much work, so I’ll give it a try.
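Roughly what I have in mind, as a minimal sketch (the laplacian! name and the 1D stencil are just placeholders for the real operators):

```julia
using CUDA  # only needed to construct the CuArray below

# Same toy 1D Laplacian, written with @views and fused broadcasting instead.
# The shifted slices are allocation-free views, and the whole expression
# becomes a single fused broadcast (one kernel launch on the GPU).
function laplacian!(out, u)
    n = length(u)
    @views out[2:n-1] .= u[1:n-2] .- 2 .* u[2:n-1] .+ u[3:n]
    return out
end

# The identical function runs on both backends:
u_cpu = rand(1024)
laplacian!(similar(u_cpu), u_cpu)

u_gpu = CUDA.rand(1024)
laplacian!(similar(u_gpu), u_gpu)
```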