KernelAbstractions Autotuning

I’m going through the recent paper comparing batched KernelAbstractions kernels to standard array batching in JAX and PyTorch. In section 5.1.2, I found this interesting tidbit:

KernelAbstractions.jl performs a limited form of auto-tuning by optimizing the launch parameters for occupancy.

I went back to the docs to see if I could find anything describing this, but came up empty-handed.

Perhaps I’m looking in the wrong place. Does anyone have any references that describe this functionality (and how well it works across different hardware platforms)?



It is backend dependent, but if you don’t specify the workgroupsize, the backend makes an educated guess.

As an example, this is what CUDA does: CUDA.jl/src/CUDAKernels.jl at commit 3605167a9ea3aebfc944cc88ea0f86f01723a764 (JuliaGPU/CUDA.jl on GitHub).
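To make the difference concrete, here is a minimal sketch of the two ways to launch a kernel, assuming KernelAbstractions 0.9 or later and using the CPU backend so it runs without a GPU. The kernel name `scale!` and the array sizes are made up for illustration. How the workgroupsize gets picked in the second launch depends on the backend, so treat this as an illustration of the calling convention rather than a description of any particular tuning strategy.

```julia
using KernelAbstractions

# Hypothetical example kernel: multiply every element of `a` by `s`.
@kernel function scale!(a, s)
    i = @index(Global)
    a[i] *= s
end

backend = CPU()
a = ones(Float32, 1024)

# Launch 1: the workgroupsize (64) is fixed by the caller.
scale!(backend, 64)(a, 2f0; ndrange = length(a))

# Launch 2: no workgroupsize is given, so the backend chooses one.
scale!(backend)(a, 2f0; ndrange = length(a))

# Wait for both launches to finish before reading `a`.
KernelAbstractions.synchronize(backend)
```

Swapping `CPU()` for a GPU backend (and `a` for a device array) should leave the launch code unchanged; only the second form gives the backend room to pick the launch parameters itself.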
