The KernelAbstractions docs mention that kernels are launched asynchronously. I’d like to leverage this in a solver I’m working on to hide the communication between GPUs behind computation, a common technique in finite difference codes.
Typically (when using CUDA directly) I would have to put different kernel calls on different streams myself so that they can execute concurrently on separate SMs. Does KernelAbstractions do this under the hood for the backends that support it? Or are there some “scheduling implications” I should be aware of?
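For concreteness, here is a minimal 1-D sketch of the pattern I’m after. `exchange_halos!` is a hypothetical placeholder for the MPI / peer-to-peer copies; the rest is the KernelAbstractions API as I understand it:

```julia
using KernelAbstractions

# Update interior points only; the ghost cells and the cells that read them
# need freshly exchanged halo data and are handled in a second pass.
@kernel function update_interior!(u_new, @Const(u), dt, dx)
    i = @index(Global, Linear)
    # i = 1 and i = length(u) are ghost cells; i = 2 and i = length(u) - 1
    # read them, so skip those too until the exchange has completed.
    if 2 < i < length(u) - 1
        u_new[i] = u[i] + (dt / dx^2) * (u[i-1] - 2u[i] + u[i+1])
    end
end

function step!(u_new, u, dt, dx)
    backend = get_backend(u)
    # The launch returns as soon as the kernel is enqueued, so the host is
    # free to start the communication while the interior update runs.
    update_interior!(backend)(u_new, u, dt, dx; ndrange = length(u))
    exchange_halos!(u)  # hypothetical stand-in for the MPI / peer-to-peer copies
    # Block until the interior kernel has finished; a second, much smaller
    # kernel would then update the cells next to the ghost layer.
    KernelAbstractions.synchronize(backend)
end
```

The part I’m unsure about is the middle of `step!`: whether the exchange (and any small boundary kernel launched during it) actually overlaps with the interior update, or whether everything ends up serialized on one stream.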