Choosing between KernelAbstractions, AcceleratedKernels, ParallelStencils, or just CUDA.jl

The issue with ParallelStencil.jl was the default number of blocks. I only know a little about GPU computing, so I didn’t realize it could matter quite so much.
By setting threads=(16,16) and blocks=(Nx,Ny) .÷ threads, I get

22.434988 seconds (13.81 M CPU allocations: 587.483 MiB, 2.22% gc time) (2.00 k GPU allocations: 445.312 KiB, 0.03% memmgmt time)

It is even faster with threads=(32,8).
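
A minimal sketch of the launch-configuration arithmetic described above, in plain Julia so it runs without a GPU. The helper name `launch_config` and the grid sizes are assumptions for illustration; the actual Nx and Ny are not stated here. It uses ceiling division (`cld`) rather than `.÷`, because `.÷` truncates and would leave part of the grid uncovered whenever Nx or Ny is not a multiple of the block size.

```julia
# Hypothetical helper (not part of ParallelStencil.jl or CUDA.jl):
# number of blocks per dimension needed to cover an n-sized grid
# with the given threads per block, rounding up.
launch_config(n::NTuple{2,Int}, threads::NTuple{2,Int}) = cld.(n, threads)

# Assumed grid size, for illustration only.
Nx, Ny = 1024, 1024

for threads in ((16, 16), (32, 8))   # both are 256 threads per block
    blocks = launch_config((Nx, Ny), threads)
    # Every grid point must be covered by some thread.
    @assert all(blocks .* threads .>= (Nx, Ny))
    println("threads = ", threads, "  blocks = ", blocks)
end
```

When the grid size is a multiple of the block size, `cld` and `÷` give the same result. Otherwise the rounded-up launch has extra threads past the edge of the array, so the kernel needs a bounds check.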

Thank you for your help,
Best regards
