Asynchronous kernel scheduling with KernelAbstractions

The KernelAbstractions docs mention that kernels are launched asynchronously. I’d like to use this in a solver I’m working on to hide the communication between GPUs behind computation (a common technique in finite difference codes).

Typically, I would have to map different kernel calls to different SMs myself (if using CUDA). Does KernelAbstractions do this under the hood (for the various backends that support this)? Or are there some “scheduling implications” I should be aware of?


Can you elaborate? With CUDA, you cannot decide which SMs a kernel executes on. That would also only matter if you want to overlap kernel execution, which is separate from their asynchronous nature.

Ah, thanks @maleadt. You’re right, I should clarify.

What I’m really after is overlapping kernels, and not just asynchronous launching.

After a bit more digging, it seems like there’s been some work around this within KernelAbstractions, but I’m not sure what the current status is:

After that change, KA.jl should be compatible with CUDA.jl’s task mechanism, so you should use Julia tasks to have kernels launch on different streams and potentially overlap. See the “CUDA.jl 3.0” post on the JuliaGPU blog.

In particular, with KernelAbstractions 0.9 you would do it “just like” in CUDA.jl.

You can use multiple Julia tasks to represent concurrent work, and an example is here

where I use Julia tasks to do some MPI communication concurrently.
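
For readers who want something to try locally, here is a minimal sketch of the task-based pattern, assuming KernelAbstractions 0.9. The kernel `scale!`, the helper `run_scale!`, and the arrays `interior` and `halo` are hypothetical names made up for illustration, and the second task only stands in for the communication step (no MPI is involved). It uses plain `Array`s so it runs on the CPU backend. With `CuArray`s the same code should launch each task's kernel on that task's own stream, which is what allows the kernels to overlap; whether they actually do is up to the hardware and the driver.

```julia
using KernelAbstractions

# Hypothetical kernel: scale every element of A in place.
@kernel function scale!(A, s)
    I = @index(Global)
    @inbounds A[I] *= s
end

# Launch the kernel on the backend that owns A, then wait for it.
# The wait happens inside the task, so other tasks keep running.
function run_scale!(A, s)
    backend = KernelAbstractions.get_backend(A)
    scale!(backend, 64)(A, s; ndrange = length(A))
    KernelAbstractions.synchronize(backend)
    return A
end

interior = ones(Float32, 1024)  # stand-in for the bulk of the grid
halo     = ones(Float32, 64)    # stand-in for the boundary data

# One Julia task per independent piece of work.
@sync begin
    Threads.@spawn run_scale!(interior, 2f0)
    Threads.@spawn run_scale!(halo, 3f0)
end
```

Synchronizing inside each task rather than once at the end is a deliberate choice: it keeps the wait local to the work it belongs to, so the halo task could go on to start its exchange as soon as its own kernel finishes.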


Thank you @maleadt and @vchuravy! This helps a lot.