I’ve been making some progress there too, e.g. there’s now a compiled run-time library as a counterpart to some of the C functions: https://github.com/JuliaGPU/CUDAnative.jl/blob/95fbf9356eaa6c3da3c3321ff35c3ffa5d41f77a/src/device/runtime_intrinsics.jl
It currently supports boxing, allocations, and some exception handling, with more to come as soon as I find time to work on it.
Basically, the approach is to reconfigure the existing compiler to emit GPU-compatible LLVM IR (mainly through regular dispatch, though Cassette would be great for this once it generates better code), combined with a custom back-end and run-time to compile and execute that IR.
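To illustrate the "regular dispatch" part, here's a minimal sketch of the idea: device-side types get their own methods, so Julia's dispatch picks GPU-friendly implementations without touching user code. All names here (`DeviceArray`, etc.) are hypothetical for illustration, not CUDAnative's actual internals:

```julia
# Hypothetical sketch: dispatch-based substitution of device-friendly code.
# None of these names come from CUDAnative itself.
struct DeviceArray{T}
    ptr::Ptr{T}
    len::Int
end

# This method lowers to a plain pointer load, the kind of operation that
# produces GPU-compatible LLVM IR (no CPU run-time library calls).
Base.getindex(A::DeviceArray, i::Integer) = unsafe_load(A.ptr, i)
Base.length(A::DeviceArray) = A.len

# Host-side demonstration: dispatch selects the device method
# whenever the argument is a DeviceArray.
host = [1.0, 2.0, 3.0]
GC.@preserve host begin
    dev = DeviceArray{Float64}(pointer(host), length(host))
    @show dev[2]  # loads through the raw pointer
end
```

In the real compiler the interesting part is that such methods are then compiled with a custom back-end instead of being run on the host, but the dispatch mechanism that selects them is just ordinary Julia.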
Wrt. the lack of documentation on the CUDAnative internals, I’m considering creating a package that isolates and demonstrates the approach, and submitting that as a talk for the next JuliaCon.