Asynchronous kernel scheduling with KernelAbstractions

The KernelAbstractions docs mention that kernels are launched asynchronously. I’m hoping to leverage this within a solver I’m working on to hide the communication between GPUs behind some computation (a common technique with finite difference codes).

Typically, I would have to map different kernel calls to different SMs myself (if using CUDA). Does KernelAbstractions do this under the hood (for the various backends that support this)? Or are there some “scheduling implications” I should be aware of?

Thanks!

Can you elaborate? With CUDA, you cannot decide which SMs a kernel executes on. That would also only matter if you want to overlap kernel execution, which is separate from their asynchronous nature.

Ah, thanks @maleadt. You’re right, I should clarify.

What I’m really after is overlapping kernels, and not just asynchronous launching.

After a bit more digging, it seems like there’s been some work around this within KernelAbstractions, but I’m not sure what the current status is:

After https://github.com/JuliaGPU/KernelAbstractions.jl/pull/317, KA.jl should be compatible with CUDA.jl’s task mechanism. So you should use Julia tasks to get kernels launched on different streams, where they can potentially overlap. See the CUDA.jl 3.0 announcement on the JuliaGPU blog.
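
To make that concrete, here is a minimal sketch (a toy example of mine, using the KernelAbstractions 0.9 API mentioned in the next reply; the scale! kernel and array sizes are made up) of launching two kernels from separate tasks so that, on the CUDA backend, each gets its own stream and the two may overlap:

using KernelAbstractions, CUDA

# Toy kernel used only for illustration.
@kernel function scale!(a, s)
    i = @index(Global)
    @inbounds a[i] *= s
end

backend = CUDABackend()
a = CUDA.rand(Float32, 2^20)
b = CUDA.rand(Float32, 2^20)

# Each task gets its own CUDA stream, so the two kernels can
# execute concurrently if the GPU has resources to spare.
@sync begin
    @async begin
        scale!(backend)(a, 2f0; ndrange = length(a))
        KernelAbstractions.synchronize(backend)
    end
    @async begin
        scale!(backend)(b, 3f0; ndrange = length(b))
        KernelAbstractions.synchronize(backend)
    end
end

Whether the kernels actually execute concurrently still depends on the GPU having free resources (SMs, registers, shared memory) left over for the second kernel.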


In particular, with KernelAbstractions 0.9 you would do it “just like” CUDA.jl.

You can use multiple Julia tasks to represent concurrent work. An example is here, where I use Julia tasks to do some MPI communication concurrently.
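
The overall pattern looks roughly like this (a sketch only, not the code from the linked example: exchange_halo!, compute_interior!, and compute_boundary! are hypothetical helpers, and it assumes Julia is started with multiple threads, e.g. -t 2, since MPI.Waitall blocks the thread it runs on):

# Sketch: overlap the MPI halo exchange with interior computation using a task.
comm_task = Threads.@spawn begin
    reqs = exchange_halo!(field, comm)   # posts MPI.Isend / MPI.Irecv! for all halos
    MPI.Waitall(reqs)                    # blocks this task's thread, not the compute below
end

compute_interior!(field)                 # work that does not touch the halo region
wait(comm_task)                          # make sure the halo data has arrived
compute_boundary!(field)                 # work that needs the freshly received halo

Whether the exchange really proceeds in the background also depends on the MPI implementation making asynchronous progress.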


Thank you @maleadt and @vchuravy! This helps a lot.

I am currently also trying to implement an asynchronous MPI halo exchange (on GPUs and CPUs), following the example you posted.

In my case I perform multiple exchanges (2 per dimension), i.e., my starting point is something like:

requests = MPI.Request[]
for dim in 1:ndims
    neg_nbr, pos_nbr = get_nbrs(...)
    push!(requests, MPI.Irecv!(#= from neg_nbr =#))
    push!(requests, MPI.Isend(#= to pos_nbr =#))
    push!(requests, MPI.Irecv!(#= from pos_nbr =#))
    push!(requests, MPI.Isend(#= to neg_nbr =#))

    if do_edges
        # wait for all requests of this dimension to complete before going to the next
        MPI.Waitall(requests)
        empty!(requests)
    end
end
MPI.Waitall(requests) # for an async version, this wait should be excluded and the requests/tasks returned instead

My assumption is that I have to split requests into recv_requests and send_requests to make this work similarly to your example, but it isn’t quite clear to me how to handle the sends. Do I have to @spawn a separate task for every send, as the example seems to do, or can I bundle them like the receives?
Also, for the case do_edges=true, is it even possible to do this asynchronously, i.e., have it respect the interdependence of the dimensions while still overlapping with computation?
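
To make the first question concrete, the split I have in mind would look roughly like this (my own pseudocode, with buffers and ranks elided as above):

recv_requests = MPI.Request[]
send_requests = MPI.Request[]
for dim in 1:ndims
    neg_nbr, pos_nbr = get_nbrs(...)
    push!(recv_requests, MPI.Irecv!(#= from neg_nbr =#))
    push!(recv_requests, MPI.Irecv!(#= from pos_nbr =#))
    push!(send_requests, MPI.Isend(#= to pos_nbr =#))
    push!(send_requests, MPI.Isend(#= to neg_nbr =#))
end
# ... wait on recv_requests where the halo data is needed, but where do the sends
# get waited on: one @spawn per send, or a single MPI.Waitall(send_requests)?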