This is essentially what I discussed in [ANN] Folds.jl: threaded, distributed, and GPU-based high-level data-parallel interface for Julia and
KernelAbstractions and Dagger are great in that they are very flexible and generic. But a higher-level, more structured representation makes it easier to lower code to target-specific programs. As always, adding constraints lets you encode the structure of your code, which the underlying framework can then exploit for performance and composability.
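To make the point about constraints concrete, here is a minimal sketch using only Base Julia. It is not how Folds.jl is implemented; the `fold` function and the `Sequential` and `Threaded` executor types are hypothetical names made up for illustration. The only constraint the caller accepts is that `op` is associative, and that alone is enough for the "framework" to pick a different lowering per target.

```julia
using Base.Threads: @spawn

# Hypothetical executor types, for illustration only (not the Folds.jl API).
struct Sequential end
struct Threaded
    basesize::Int  # assumed to be >= 1
end

# Sequential lowering: a plain map-reduce over the input.
fold(op, f, xs, ::Sequential) = mapreduce(f, op, xs)

# Threaded lowering: valid only because `op` is assumed to be associative,
# so the input can be split and the partial results regrouped freely.
function fold(op, f, xs, ex::Threaded)
    if length(xs) <= ex.basesize
        return mapreduce(f, op, xs)
    end
    mid = (firstindex(xs) + lastindex(xs)) ÷ 2
    left = @spawn fold(op, f, view(xs, firstindex(xs):mid), ex)
    right = fold(op, f, view(xs, (mid + 1):lastindex(xs)), ex)
    return op(fetch(left), right)
end

xs = collect(1:1000)
seq = fold(+, x -> x^2, xs, Sequential())
par = fold(+, x -> x^2, xs, Threaded(64))
```

The calling code stays the same for both executors; only the last argument changes. A hand-written loop that mutates an accumulator carries no such guarantee, so a framework would have to be conservative about how it runs it.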