Choosing between KernelAbstractions, AcceleratedKernels, ParallelStencils, or just CUDA.jl

Ludovic_Dumoulin · March 5, 2026, 6:02pm

The issue with ParallelStencil.jl was the default number of blocks. I only know a little about GPU computing, so I didn’t realize it could matter quite so much.
By setting threads=(16,16) and blocks=(Nx,Ny) .÷ threads, I get

22.434988 seconds (13.81 M CPU allocations: 587.483 MiB, 2.22% gc time) (2.00 k GPU allocations: 445.312 KiB, 0.03% memmgmt time)

And it is even faster with threads=(32,8).
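
For anyone landing here later, this is a minimal sketch of how that launch configuration can be computed in plain Julia. The grid size Nx, Ny below is a made-up value for illustration (not the one from my run), and blocks_ceil and the *_alt names are just illustrative. Note that .÷ rounds down, so it only covers the whole grid when each dimension is a multiple of the thread count; cld rounds up instead.

```julia
# Hypothetical grid size, for illustration only (not the size used in the post).
Nx, Ny = 1024, 1024

# Launch configuration from the post: 16x16 threads per block.
threads = (16, 16)
blocks  = (Nx, Ny) .÷ threads          # elementwise integer division on tuples

# Assumption: if Nx or Ny is not a multiple of the thread count, .÷ rounds
# down and leaves cells uncovered. cld (ceiling division) rounds up instead.
blocks_ceil = cld.((Nx, Ny), threads)

# The faster variant from the post: still 256 threads per block,
# but 32 of them along the first dimension.
threads_alt = (32, 8)
blocks_alt  = cld.((Nx, Ny), threads_alt)

println("threads = ", threads, ", blocks = ", blocks)
println("threads = ", threads_alt, ", blocks = ", blocks_alt)
```

Both configurations use 256 threads per block. My guess as to why (32,8) is faster is that 32 threads along the first dimension match the warp size on NVIDIA GPUs and Julia arrays are column-major, so neighbouring threads read neighbouring memory, but I have not profiled this.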

Thank you for your help,
Best regards
