GPUArrays.jl is actually only used by CuArrays/CLArrays. It’s also not fully abstract: one of @sdanisch’s original design goals was for a GPUArray itself to be instantiable. I’ve been moving away from that design a little, and I hope to clean the package up some more so that it’s more reliable for other GPU back-ends to build on (especially the one for AMD).
Availability of hardware / software. In theory, each of these packages should just implement the Base array interface, and you should be able to write generic array code and move between implementations. Of course, we’re not quite there yet.
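For example (a minimal sketch, assuming CuArrays and a working CUDA setup), the same generic function can run on a plain Array or a CuArray without any back-end specific code:

```julia
using CuArrays  # provides `cu` to move data to the GPU

# Works on any AbstractMatrix: plain Arrays, CuArrays, ...
function normalize_columns!(A::AbstractMatrix)
    A ./= sqrt.(sum(abs2.(A); dims=1))  # fused broadcast + reduction
    return A
end

normalize_columns!(rand(Float32, 4, 4))        # CPU
normalize_columns!(cu(rand(Float32, 4, 4)))    # GPU, same code
```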
The current broadcast implementation has some overheads because of how it pads dimensions (run-time checks on `keeps` and `defaults`). That works out fine for CPUs, but is somewhat costly on GPUs. We could probably customize broadcast to specialize on the container shapes, and that would fit great in GPUArrays.jl (though after reading #32051, I’m not sure).
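To illustrate what that padding means (a plain CPU illustration, nothing CuArrays-specific): when arguments have missing or size-1 dimensions, every element of the fused operation has to re-derive the indices into those arguments at run time via the `keeps`/`defaults` machinery in Base’s broadcast code:

```julia
A = rand(Float32, 1024, 1024)
b = rand(Float32, 1024)      # missing second dimension: gets padded
c = rand(Float32, 1, 1024)   # size-1 first dimension: not "kept"

# One fused broadcast; for every output element the indices into `b` and `c`
# are adjusted based on their (run-time) shapes.
out = A .+ b .* c
```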
Other than that, at least in the case of CuArrays, you end up with regular CuDeviceArrays on the device regardless of whether you write your own kernel or broadcast over a CuArray.
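A minimal sketch of both paths, assuming the CuArrays/CUDAnative-era API; inside the kernel, `x` and `y` are CuDeviceArrays either way:

```julia
using CuArrays, CUDAnative

function axpy_kernel!(y, a, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] += a * x[i]   # x and y are CuDeviceArrays here
    end
    return nothing
end

x = cu(rand(Float32, 1024))
y = cu(rand(Float32, 1024))

# hand-written kernel
@cuda threads=256 blocks=cld(length(y), 256) axpy_kernel!(y, 2f0, x)

# broadcasting over the CuArrays generates an equivalent kernel for you
y .+= 2f0 .* x
```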
That’s an underdeveloped part of CuArrays. IIRC, it just defaults to 256 threads and pads the blocks accordingly (we really need to do an occupancy analysis there). For some calls where we have expensive kernels that might use lots of registers (e.g. reduce), we do something more sophisticated, but still no occupancy analysis.
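The heuristic boils down to something like the following sketch (not the actual CuArrays code; `launch_config` is a hypothetical helper): cap the thread count at 256 and round the block count up so the grid covers all elements, masking the excess threads inside the kernel. Reusing the `axpy_kernel!` sketch from above:

```julia
# Hypothetical helper mirroring the "256 threads, pad the blocks" default.
function launch_config(n::Integer; max_threads = 256)
    threads = min(n, max_threads)
    blocks  = cld(n, threads)   # round up so threads * blocks >= n
    return threads, blocks
end

threads, blocks = launch_config(length(y))
@cuda threads=threads blocks=blocks axpy_kernel!(y, 2f0, x)
```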
When you need the flexibility of kernel programming. If you can rephrase a problem in terms of array expressions, do so: it’s much easier, you get kernel fusion out of the box, dispatch to CUBLAS and friends whenever possible, and so on.
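For instance (a rough sketch assuming CuArrays), the multiplications below dispatch to CUBLAS and the element-wise expression fuses into a single kernel, with no hand-written kernel in sight:

```julia
using CuArrays

A = cu(rand(Float32, 512, 512))
B = cu(rand(Float32, 512, 512))
x = cu(rand(Float32, 512))

y = A * x                   # dispatches to CUBLAS gemv
C = A * B                   # dispatches to CUBLAS gemm
D = max.(C .+ y, 0f0)       # one fused element-wise kernel
```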
In terms of the underlying software stack: not compatible at all. Concerning the implementations, @jpsamaroo has expressed interest in sharing code between CUDA/AMDnative (and presumably the array infrastructure that would be built on top of that). I haven’t had the time to look into that, but I’m all for sharing code between back ends whenever possible.