The KernelAbstractions docs mention that kernels are launched asynchronously. I’d like to leverage this in a solver I’m working on, where I want to hide the communication between GPUs behind some computation (a common technique with finite difference codes).
Typically, I would have to map different kernel calls to different SMs myself (if using CUDA). Does KernelAbstractions do this under the hood (for the backends that support it)? Or are there some “scheduling implications” I should be aware of?
Thanks!
Can you elaborate? With CUDA, you cannot decide which SMs a kernel executes on. That would also only matter if you want to overlap kernel execution, which is separate from their asynchronous nature.
Ah, thanks @maleadt . You’re right, I should clarify.
What I’m really after is overlapping kernels, and not just asynchronous launching.
After a bit more digging, it seems like there’s been some work around this within KernelAbstractions, but I’m not sure what the current status is:
[GitHub issue on JuliaGPU/KernelAbstractions.jl, opened 2 Sep 2021, closed 23 Jun 2023, label: design]
KA currently uses very verbose and explicit dependency management.
```
event = kernel(CPU())(...)
event = kernel(CPU())(..., dependencies=(event,))
```
This was added since at the time CUDA.jl used one stream, and thus exposing concurrency was harder.
Now @maleadt added a really nice design around task local streams, allowing users to use Julia tasks to express concurrency on the GPU as well.
So, in the interest of reducing the complexity of using KA and aligning it better with CUDA.jl, I would like to remove the dependency management
and move to a stream-based model.
One open question is how to deal with the CPU (but this could mean we simply move to synchronous execution there, reducing latency as well).
An alternative I see is to explore a more implicit dependency model based on the arguments to the kernel; I think that would be similar to SYCL or to what AMDGPU currently does.
This would be the first step towards KA 1.0
CC interested parties: @glwagner @lcw @jpsamaroo @simonbyrne @kpamnany @omlins
After https://github.com/JuliaGPU/KernelAbstractions.jl/pull/317, KA.jl should be compatible with CUDA.jl’s task mechanism. So you should use Julia tasks for kernels to launch on different streams and potentially overlap. See the CUDA.jl 3.0 announcement on the JuliaGPU blog.
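For example, something along these lines (an untested sketch, assuming CUDA.jl ≥ 4.0 and KernelAbstractions 0.9; the `scale!` kernel and the array sizes are just placeholders) launches two kernels from separate Julia tasks, so each lands on its own task-local stream and may overlap if the GPU has spare resources:
```
using CUDA
using KernelAbstractions

# Placeholder kernel: scale every element of `a` by `s`.
@kernel function scale!(a, s)
    i = @index(Global)
    @inbounds a[i] *= s
end

backend = CUDABackend()
a = CUDA.ones(Float32, 1 << 20)
b = CUDA.ones(Float32, 1 << 20)

# Each Julia task gets its own task-local stream in CUDA.jl, so these two
# launches can potentially execute concurrently on the GPU.
@sync begin
    Threads.@spawn begin
        scale!(backend)(a, 2.0f0; ndrange = length(a))
        KernelAbstractions.synchronize(backend)   # waits on this task's stream
    end
    Threads.@spawn begin
        scale!(backend)(b, 3.0f0; ndrange = length(b))
        KernelAbstractions.synchronize(backend)
    end
end
```
Note that `KernelAbstractions.synchronize(backend)` on the CUDA backend waits on the calling task’s stream, so each task only waits for its own kernel.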
In particular, with KernelAbstractions 0.9 you would do it “just like” CUDA.jl: you can use multiple Julia tasks to represent concurrent work. An example is
```
# EXCLUDE FROM TESTING
using KernelAbstractions
using MPI

# TODO: Implement in MPI.jl
# Poll a single MPI request until it completes, yielding to other Julia tasks.
function cooperative_test!(req)
    done = false
    while !done
        done, _ = MPI.Test(req, MPI.Status)
        yield()
    end
end

# Wait for a Julia task to finish while driving MPI progress with Iprobe.
function cooperative_wait(task::Task)
    while !Base.istaskdone(task)
        MPI.Iprobe(MPI.MPI_ANY_SOURCE, MPI.MPI_ANY_TAG, MPI.COMM_WORLD)
        yield()
    end
    wait(task)
end
```
where I use Julia tasks to do some MPI communication concurrently.
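To tie this back to the original question of hiding halo exchange behind computation, the pieces could fit together roughly as follows (my own untested sketch, not from the linked example: `interior_kernel!`, `boundary_kernel!`, `u`, the send/receive buffers, `neighbor`, and the `*_size` ndranges are placeholders; the keyword-style `Isend`/`Irecv!` calls assume MPI.jl ≥ 0.20, and GPU buffers would additionally need a CUDA-aware MPI build or explicit host/device copies):
```
backend = CUDABackend()

# Launch the interior update first; it needs no halo data, and the launch
# is asynchronous on this task's stream.
interior_kernel!(backend)(u; ndrange = interior_size)

# Meanwhile, run the halo exchange on a separate Julia task, reusing the
# cooperative helpers from the example above so it keeps yielding.
comm_task = Threads.@spawn begin
    rreq = MPI.Irecv!(recv_buf, MPI.COMM_WORLD; source = neighbor, tag = 0)
    sreq = MPI.Isend(send_buf, MPI.COMM_WORLD; dest = neighbor, tag = 0)
    cooperative_test!(rreq)
    cooperative_test!(sreq)
end

cooperative_wait(comm_task)                 # halo data has arrived
KernelAbstractions.synchronize(backend)     # interior kernel has finished
boundary_kernel!(backend)(u, recv_buf; ndrange = boundary_size)
KernelAbstractions.synchronize(backend)
```
Whether the communication actually overlaps with the interior kernel then depends on the MPI library making progress in the background, which is what the `Iprobe`/`yield` helpers above are there to encourage.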
Thank you @maleadt and @vchuravy ! This helps a lot.