Asynchronous kernel scheduling with KernelAbstractions

The KernelAbstractions docs mention that kernels are launched asynchronously. I’m hoping to leverage this within a solver I’m working on to hide the communication between GPUs behind some computation (a common technique with finite difference codes).

Typically, I would have to map different kernel calls to different SMs myself (if using CUDA). Does KernelAbstractions do this under the hood (for the various backends that support this)? Or are there some “scheduling implications” I should be aware of?

Thanks!

Can you elaborate? With CUDA, you cannot decide which SMs a kernel executes on. That would also only matter if you want to overlap kernel execution, which is separate from their asynchronous nature.

Ah, thanks @maleadt. You’re right, I should clarify.

What I’m really after is overlapping kernels, and not just asynchronous launching.

After a bit more digging, it seems like there’s been some work around this within KernelAbstractions, but I’m not sure what the current status is:

After https://github.com/JuliaGPU/KernelAbstractions.jl/pull/317, KA.jl should be compatible with CUDA.jl’s task mechanism. So you should use Julia tasks to get kernels launched on different streams, where they can potentially overlap. See the CUDA.jl 3.0 announcement on the JuliaGPU blog.
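
To make that concrete, here is a minimal sketch (a toy example of mine, using the KernelAbstractions 0.9 API mentioned in the next reply; the scale! kernel and array sizes are made up) of launching two kernels from separate tasks so that, on the CUDA backend, each gets its own stream and the two may overlap:

using KernelAbstractions, CUDA

# Toy kernel used only for illustration.
@kernel function scale!(a, s)
    i = @index(Global)
    @inbounds a[i] *= s
end

backend = CUDABackend()
a = CUDA.rand(Float32, 2^20)
b = CUDA.rand(Float32, 2^20)

# Each task gets its own CUDA stream, so the two kernels can
# execute concurrently if the GPU has resources to spare.
@sync begin
    @async begin
        scale!(backend)(a, 2f0; ndrange = length(a))
        KernelAbstractions.synchronize(backend)
    end
    @async begin
        scale!(backend)(b, 3f0; ndrange = length(b))
        KernelAbstractions.synchronize(backend)
    end
end

Whether the kernels actually execute concurrently still depends on the GPU having free resources (SMs, registers, shared memory) left over for the second kernel.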


In particular, with KernelAbstractions 0.9 you would do it “just like” CUDA.jl.

You can use multiple Julia tasks to represent concurrent work. An example is here, where I use Julia tasks to do some MPI communication concurrently.
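
The overall pattern looks roughly like this (a sketch only, not the code from the linked example: exchange_halo!, compute_interior!, and compute_boundary! are hypothetical helpers, and it assumes Julia is started with multiple threads, e.g. -t 2, since MPI.Waitall blocks the thread it runs on):

# Sketch: overlap the MPI halo exchange with interior computation using a task.
comm_task = Threads.@spawn begin
    reqs = exchange_halo!(field, comm)   # posts MPI.Isend / MPI.Irecv! for all halos
    MPI.Waitall(reqs)                    # blocks this task's thread, not the compute below
end

compute_interior!(field)                 # work that does not touch the halo region
wait(comm_task)                          # make sure the halo data has arrived
compute_boundary!(field)                 # work that needs the freshly received halo

Whether the exchange really proceeds in the background also depends on the MPI implementation making asynchronous progress.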


Thank you @maleadt and @vchuravy! This helps a lot.

I am currently also trying to implement an asynchronous MPI halo exchange (on GPUs and CPUs), following the example you posted.

In my case I perform multiple exchanges (2 per dimension), i.e., my starting point is something like:

requests = MPI.Request[]
for dim in 1:ndims
    neg_nbr, pos_nbr = get_nbrs(...)
    push!(requests, MPI.Irecv!(#= from neg_nbr =#))
    push!(requests, MPI.Isend(#= to pos_nbr =#))
    push!(requests, MPI.Irecv!(#= from pos_nbr =#))
    push!(requests, MPI.Isend(#= to neg_nbr =#))

    if do_edges
        # wait for all requests of this dimension to complete before going to the next
        MPI.Waitall(requests)
        empty!(requests)
    end
end
MPI.Waitall(requests) # for an async version, this wait should be excluded and the requests/tasks returned instead

My assumption is that I have to split requests into recv_requests and send_requests to make this work similarly to your example, but it isn’t quite clear to me how to handle the sends. Do I have to @spawn a separate task for every send, as the example seems to do, or can I bundle them like the receives?
Also, for the case do_edges=true, is it even possible to do this asynchronously, i.e., have it respect the interdependence of the dimensions while still overlapping with computation?
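
To make the first question concrete, the split I have in mind would look roughly like this (my own pseudocode, with buffers and ranks elided as above):

recv_requests = MPI.Request[]
send_requests = MPI.Request[]
for dim in 1:ndims
    neg_nbr, pos_nbr = get_nbrs(...)
    push!(recv_requests, MPI.Irecv!(#= from neg_nbr =#))
    push!(recv_requests, MPI.Irecv!(#= from pos_nbr =#))
    push!(send_requests, MPI.Isend(#= to pos_nbr =#))
    push!(send_requests, MPI.Isend(#= to neg_nbr =#))
end
# ... wait on recv_requests where the halo data is needed, but where do the sends
# get waited on: one @spawn per send, or a single MPI.Waitall(send_requests)?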