What do I get when I subtype GPUArray?

As the main developer of ROCArrays.jl, I’m trying to understand what benefits I get when I make ROCArray <: GPUArray. GPUArrays.jl is a rather large package (likely due to it also implementing JLArray), so it’s hard for me to tease out what the differences are between making ROCArray a subtype of GPUArray over DenseArray or AbstractArray. Having that info at hand would make it easier for me to debug why things break as I work towards making this a reality, and also help me contribute to GPUArrays.jl in the future.

If there’s an overview somewhere of the differences between <: GPUArray and <: DenseArray/<: AbstractArray, or just a file I should look at that has most of this functionality isolated out, I would greatly appreciate being pointed to it :slightly_smiling_face:

Implementing the GPUArrays interface gives you two things: a bunch of methods that implement Base array functionality that would otherwise fall back to element-by-element iteration (think constructors, copy methods, similar, etc., each of which has a pretty extensive surface area with lots of corner cases), and a couple of GPU kernels expressed in terms of a vendor-agnostic GPU programming interface. There are other features, like a vendor-agnostic interface to inspect devices, but those have bitrotted and should be removed (GPUArrays was originally intended to be used by users, whereas it is now purely a support package for implementing GPU types, i.e. AbstractGPUArrays). Finally, there’s also a pretty extensive test suite that keeps your package well tested.
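To make the first point concrete, here is a minimal sketch in plain Julia that does not use GPUArrays at all. `CountingArray` is a hypothetical stand-in for a device array (its "device memory" is just a `Vector`), and it counts scalar reads to show how a generic Base fallback touches the array one element at a time, while a specialized method can do the same job in one bulk copy. This only illustrates the idea, it is not how GPUArrays.jl is implemented.

```julia
# Hypothetical stand-in for a device array: every scalar read is counted so that
# element-by-element fallbacks become visible.
mutable struct CountingArray{T,N} <: AbstractArray{T,N}
    data::Vector{T}
    dims::NTuple{N,Int}
    scalar_reads::Int
end

CountingArray(a::AbstractArray{T,N}) where {T,N} =
    CountingArray{T,N}(vec(collect(a)), size(a), 0)

# Minimal read-only AbstractArray interface.
Base.size(a::CountingArray) = a.dims
Base.IndexStyle(::Type{<:CountingArray}) = IndexLinear()

function Base.getindex(a::CountingArray, i::Int)
    a.scalar_reads += 1   # on a real GPU this would be a tiny device-to-host transfer
    return a.data[i]
end

a = CountingArray(reshape(collect(1.0:6.0), 2, 3))

# Generic fallback: Base's `sum` only knows `getindex`, so it reads element by element.
total = sum(a)
reads_after_sum = a.scalar_reads
a.scalar_reads = 0

# Specialized method: one bulk copy instead of one scalar read per element.
function Base.copyto!(dest::Array{T,N}, src::CountingArray{T,N}) where {T,N}
    size(dest) == size(src) || throw(DimensionMismatch("sizes must match"))
    copyto!(dest, 1, src.data, 1, length(src.data))
    return dest
end

host = Array{Float64}(undef, 2, 3)
copyto!(host, a)          # dispatches to the specialized method; no scalar reads

println("scalar reads during sum: ", reads_after_sum)
println("scalar reads during copyto!: ", a.scalar_reads)
```

Multiply that one `copyto!` method by all the constructors, conversions and copy variants in Base, and you get a sense of the surface area a shared package can cover.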

I think it’s worth implementing this interface since there’s quite some functionality that is potentially common to GPU array types. I’m not so sure about the vendor-agnostic kernels, but on the other hand there are a couple that have been used by CuArrays.jl forever (like the random number generator), so there is definitely value there. But the package needs some love: the required interface is ill-defined and ill-documented, some functionality has bitrotted or is unused and should get stripped, etc. If there is an actual second user, e.g. ROCArrays, it might be worth cleaning it up and making more use of it.

I’ve been meaning to reduce its footprint. The latest version should have far fewer dependencies, and if it matters we could also move JLArray to the test suite.

Ok great, that makes sense to me! I just realized that the JLArray implementation is totally contained in src/array.jl, so that makes it easier to know what is specific to just that backend. I’m making a FakeGPUArray locally using the same code as for JLArray so that I can get a handle on what a simpler backend looks like (and how it’s implemented), and then use that as a basis for ROCArrays until I can better understand everything else. It’s already going really well (I’ve almost got the test suite passing), so I’m pretty confident I can figure this out now :smile:

JLArray was intended for exactly that purpose: to give people a reference implementation to copy when implementing their own GPUArray…
Btw, you should also get a fairly reasonable broadcast and mapreduce implementation for free!

Can I use the GPUArray API to define fast dual-use kernels that run on both CUDA and multithreaded CPUs, without undue runtime cost compared to using CuArrays directly?

How does the GPUArray API interact with the recent CUDA memory management improvements? Will these not carry over?

Do you plan on supporting OSX?

Yes, I do intend to support OSX, although I’ll need to do a bit of work to work around the fact that ROCm doesn’t currently have an OSX driver. But it should be totally possible through the power of OpenCL.jl.

Just to add some more to this: lots of functionality that has been added to CuArrays.jl could reasonably move to GPUArrays for reuse by other packages. For example, the whole memory allocator currently just keeps track of a pointer and size, given underlying alloc and free functions that could just as well come from ROCm. Of course, if we ever track the ownership of a pointer (device, stream), which we’ll have to do if we ever tackle single-process multi-GPU, we’ll need to think about how to properly abstract this.
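As a rough illustration of why such an allocator is backend-agnostic, here is a minimal sketch of a caching pool that only sees (pointer, size) pairs and two user-supplied functions. All names here (`PooledAllocator`, `acquire!`, `release!`, `reclaim!`) are hypothetical, and this is not the CuArrays.jl allocator; `Libc.malloc` and `Libc.free` simply stand in for whatever a CUDA or ROCm backend would provide.

```julia
# Hypothetical pooling allocator: it tracks only pointers and sizes, so the
# functions that actually allocate and free memory are interchangeable.
struct PooledAllocator{A,F}
    alloc::A                              # alloc(nbytes) -> Ptr{Cvoid}
    free::F                               # free(ptr) -> nothing
    live::Dict{Ptr{Cvoid},Int}            # blocks currently handed out
    cache::Dict{Int,Vector{Ptr{Cvoid}}}   # released blocks, keyed by size
end

PooledAllocator(alloc, free) =
    PooledAllocator(alloc, free, Dict{Ptr{Cvoid},Int}(), Dict{Int,Vector{Ptr{Cvoid}}}())

# Hand out a cached block of the requested size if one exists, else allocate.
function acquire!(p::PooledAllocator, nbytes::Int)
    bucket = get(p.cache, nbytes, nothing)
    ptr = (bucket === nothing || isempty(bucket)) ? p.alloc(nbytes) : pop!(bucket)
    p.live[ptr] = nbytes
    return ptr
end

# Return a block to the cache instead of freeing it right away.
function release!(p::PooledAllocator, ptr::Ptr{Cvoid})
    nbytes = pop!(p.live, ptr)            # throws KeyError for unknown pointers
    push!(get!(() -> Ptr{Cvoid}[], p.cache, nbytes), ptr)
    return nothing
end

# Actually free everything that is sitting in the cache.
function reclaim!(p::PooledAllocator)
    for bucket in values(p.cache), ptr in bucket
        p.free(ptr)
    end
    empty!(p.cache)
    return nothing
end

# Host memory stands in for device memory; count how often the backend is hit.
nallocs = Ref(0)
pool = PooledAllocator(n -> (nallocs[] += 1; Libc.malloc(n)), Libc.free)

p1 = acquire!(pool, 1024)
release!(pool, p1)
p2 = acquire!(pool, 1024)                 # served from the cache
same_block = p1 == p2
release!(pool, p2)
reclaim!(pool)
```

Tracking ownership (device, stream) would mean keying `live` and `cache` on more than just the size, which is where the abstraction gets harder.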
