What is the current status of KernelAbstractions in the Julia GPU ecosystem? Is it seen as an exploratory package for trying out possible directions of heterogeneous CPU/GPU programming, or is it aiming to become a standard for future code development, agnostic to CPU/GPU?
Why are return statements not permitted in kernel functions? There is no such issue with kernels written with CUDA.jl.
Why does one have to be so strict about kernel events ending? I see wait(event) in every example in the documentation. At the same time, as far as I understand, kernels written with CUDA.jl are also asynchronous, but no one is forced to use @sync after every call.
In CUDA.jl there is a launch_configuration function which allows one to determine an optimal number of threads and blocks for launching a kernel. Is there a similar function in KernelAbstractions?
KA (KernelAbstractions) is currently acting as a more minimal, cross-vendor alternative to writing vendor-specific kernel functions. I don’t think it’s going anywhere but up; it already works well with CUDA and AMDGPU (WIP), and it’s kept well maintained and tested by @vchuravy and users in the HPC space.
My guess is that this causes problematic behavior due to thread divergence, but I’m not clear on the exact reasoning. It might also be related to how KA optimizes code: return statements could make that harder if code paths diverge significantly.
It’s not strict, just explicit. You don’t have to call wait(event) right after a kernel is launched; in fact, you never need to call it if you don’t want to. It’s just an indicator that a kernel has finished, and it lets other dependent kernels execute in the order the user expects (see the sketch below). AMDGPU.jl also does this, and it has worked out well.
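To make that concrete, here is a minimal sketch of event-based ordering, assuming the CUDAKernels backend; the add_one! kernel and the sizes are hypothetical, just for illustration:

```julia
using KernelAbstractions, CUDA, CUDAKernels  # CUDAKernels provides CUDADevice

# Hypothetical kernel: increment every element.
@kernel function add_one!(a)
    i = @index(Global)
    a[i] += 1
end

a = CUDA.zeros(Int, 1024)
kernel = add_one!(CUDADevice(), 256)  # static workgroup size of 256

ev1 = kernel(a; ndrange=length(a))                    # launch is asynchronous
ev2 = kernel(a; ndrange=length(a), dependencies=ev1)  # ordered after ev1
wait(ev2)  # block the host only when the result is actually needed
```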
For CUDAKernels, launch configuration will be calculated automatically if workgroupsize is set to nothing.
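For example (a sketch reusing the hypothetical add_one! kernel from above), omitting the workgroup size leaves it as nothing, so the backend picks the launch configuration itself:

```julia
using KernelAbstractions, CUDA, CUDAKernels

@kernel function add_one!(a)  # same hypothetical kernel as above
    i = @index(Global)
    a[i] += 1
end

a = CUDA.zeros(Int, 1024)
kernel = add_one!(CUDADevice())        # workgroupsize defaults to nothing
event  = kernel(a; ndrange=length(a))  # threads/blocks chosen automatically
wait(event)
```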
Statements that return a value are also not permitted in CUDA.jl (only return or return nothing; I assume the same applies to KA.jl). The reason is that there’s no clear meaning – what if different threads return different values? – and it could make the kernel launch synchronous. CUDA C also does not allow returning values from kernels.
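To illustrate with CUDA.jl (a hypothetical kernel, just as a sketch): a bare return for an early exit is fine, but returning a value makes the launch fail, since kernels must return nothing:

```julia
using CUDA

function clamp_negative!(a)
    i = threadIdx().x
    i > length(a) && return  # bare `return` is allowed: this thread just exits
    if a[i] < 0f0
        a[i] = 0f0
    end
    return nothing  # `return a[i]` here would error at launch time
end

a = CUDA.rand(Float32, 32) .- 0.5f0
@cuda threads=32 clamp_negative!(a)
```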