The KernelAbstractions docs mention that kernels are launched asynchronously. I’m hoping to leverage this in a solver I’m working on to hide the communication between GPUs behind computation (a common technique in finite difference codes).
Typically, I would have to map different kernel calls to different SMs myself (if using CUDA). Does KernelAbstractions do this under the hood (for the various backends that support this)? Or are there some “scheduling implications” I should be aware of?
Can you elaborate? With CUDA, you cannot decide which SMs a kernel executes on. That would also only matter if you want to overlap kernel execution, which is separate from their asynchronous nature.
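To make the distinction concrete, here is a minimal sketch of the usual overlap pattern, assuming KernelAbstractions ≥ 0.9 (the kernel and function names are placeholders, not from your code): the launch itself returns control to the host, so communication issued from the host can run concurrently with the kernel, and you only block when you synchronize the backend.

using KernelAbstractions

@kernel function inner_kernel!(A)
    i, j = @index(Global, NTuple)
    @inbounds A[i, j] += 1   # stand-in for the actual stencil update
end

function step!(A)
    backend = get_backend(A)
    # On GPU backends the launch returns to the host as soon as the kernel is
    # enqueued, so host-side work (e.g. posting the MPI sends/receives for the
    # halo) can happen here while the kernel runs on the device.
    inner_kernel!(backend, (16, 16))(A; ndrange = size(A))

    # ... host-side communication would go here ...

    # Block the host until all work submitted to this backend has finished.
    KernelAbstractions.synchronize(backend)
end

Whether two kernels submitted to the same backend actually execute concurrently on the device is a separate scheduling question; the asynchrony above is only between the host and the device.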
I am currently also trying to implement an asynchronous MPI halo exchange (on GPUs and CPUs), following the example you posted.
In my case I perform multiple exchanges (2 per dimension), i.e., my starting point is something like:
requests = MPI.Request[]
for dim in 1:ndims
    neg_nbr, pos_nbr = get_nbrs(...)
    push!(requests, MPI.Irecv!(#= from neg_nbr =#))
    push!(requests, MPI.Isend(#= to pos_nbr =#))
    push!(requests, MPI.Irecv!(#= from pos_nbr =#))
    push!(requests, MPI.Isend(#= to neg_nbr =#))
    if do_edges
        # wait for all requests of this dimension to complete before going to the next
        MPI.Waitall(requests)
        empty!(requests)
    end
end
MPI.Waitall(requests) # for async, this should be excluded and the requests/tasks returned instead
My assumption is that I have to split requests into recv_requests and send_requests to make this work similarly to your example, but it isn’t quite clear to me how to handle the sends. Do I have to @spawn a separate task for every send, as seems to be done in the example, or can I bundle them like the receives?
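For concreteness, this is roughly how I imagine the split could look, with one spawned task per dimension bundling that dimension’s two sends (start_exchange!, get_nbrs, recv_bufs, send_bufs, and comm are placeholders; this is only my guess at the pattern, not the code from your example):

using MPI

function start_exchange!(recv_bufs, send_bufs, comm, ndims)
    recv_reqs  = MPI.Request[]
    send_tasks = Task[]
    for dim in 1:ndims
        neg_nbr, pos_nbr = get_nbrs(dim)
        # Receives are bundled: post them all up front, Waitall on them later.
        push!(recv_reqs, MPI.Irecv!(recv_bufs[dim][1], comm; source = neg_nbr, tag = 2dim - 1))
        push!(recv_reqs, MPI.Irecv!(recv_bufs[dim][2], comm; source = pos_nbr, tag = 2dim))
        # Sends go into a task so that any packing / device synchronization
        # they need does not block posting the remaining receives.
        # Note: MPI calls from spawned tasks need MPI.Init(threadlevel = :multiple).
        t = Threads.@spawn begin
            reqs = [MPI.Isend(send_bufs[dim][1], comm; dest = pos_nbr, tag = 2dim - 1),
                    MPI.Isend(send_bufs[dim][2], comm; dest = neg_nbr, tag = 2dim)]
            MPI.Waitall(reqs)
        end
        push!(send_tasks, t)
    end
    # Caller overlaps computation, then waits on recv_reqs and the send tasks.
    return recv_reqs, send_tasks
end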
Also, for the case do_edges=true, is it even possible to do this asynchronously, i.e., respect the fact that the dimensions are interdependent while still overlapping with computation?