Collect for CuArray composite types

RainerHeintzmann · May 18, 2023, 7:27pm

In CUDA.jl, scalar (single-index) access is discouraged for good reasons. However, if I want a composite type (e.g. from ShiftedArrays.jl) to work well with CuArray, it needs to support collect. Yet the collect mechanism is usually based on calling getindex for each index. The question is how to implement collect correctly for such composite types. It makes sense to do what the type requires in a broadcasting way, but since these composite types tend to be nested, each such type would itself need to call collect on its parent. This leads to two problems:

1. If collect needs a copy or similar so that some of the results can be modified, the chain of collect calls leads to unnecessary allocations compared with the chain of getindex calls, which only shuffle some indices and result values.
2. We need to stop the chain of collect calls before reaching the base type, such as CuArray, to avoid collecting the array from the GPU to the CPU.

Is there a generally accepted mechanism to check whether an AbstractArray is a composite type? Something like has_parent(arr)? Or is there a better way to handle all of this? Maybe overloading collect is generally not the right way? Or is there a smart way of packaging the whole chain of getindex calls into a CUDA kernel automatically?
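To make the two problems concrete, here is a minimal sketch that runs on a plain Array without CUDA.jl. CircShifted and materialize are hypothetical names invented for this illustration; they are not part of ShiftedArrays.jl or CUDA.jl. The sketch assumes that the bulk operation (circshift here) is available for the parent array type, which would have to be checked for a CuArray.

```julia
# Hypothetical wrapper for illustration only: a lazily circularly shifted array.
struct CircShifted{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::A
    shifts::NTuple{N,Int}
end

Base.size(a::CircShifted) = size(a.parent)
Base.parent(a::CircShifted) = a.parent

# Scalar path: only index arithmetic, no allocation, but one call per element.
function Base.getindex(a::CircShifted{T,N}, I::Vararg{Int,N}) where {T,N}
    J = map((i, s, n) -> mod1(i - s, n), I, a.shifts, size(a.parent))
    return a.parent[J...]
end

# Bulk path (hypothetical `materialize`, used instead of overloading collect).
# Leaf case: return the storage untouched, so nothing is copied or moved to the CPU.
materialize(a::AbstractArray) = a
# Wrapper case: materialize the parent first, then apply one bulk operation.
# Every wrapper layer allocates a new array here (problem 1 above).
materialize(a::CircShifted) = circshift(materialize(a.parent), a.shifts)

x = reshape(collect(1:12), 3, 4)
a = CircShifted(CircShifted(x, (1, 0)), (0, 2))  # two nested wrappers
y = materialize(a)                               # two intermediate allocations
```

The leaf method is what stops the chain before the base type (problem 2): the recursion ends at whatever array is not a known wrapper and hands it back as-is. The cost is visible in the nested case, where two arrays are allocated although a single fused pass over the indices would need only one.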

Benny · May 18, 2023, 7:49pm

isstructtype(T) tells you if a type is composite (struct/mutable struct). But that’s probably not exactly what you’re looking for. It sounds like you want to know whether a type has a parent you could collect directly, to avoid scalar indexing a possible CuArray. If an AbstractArray doesn’t really have a parent, it just falls back to parent(a::AbstractArray) = a, so you could just do collect(parent(a)). But maybe this should be reserved for the case where the parent is a CuArray, not used as a general approach?
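A small sketch of how the parent fallback could be turned into a check. The names has_parent and root are hypothetical helpers, not Base functions. The sketch uses === (identity) rather than ==, since identity does not touch the elements, and it assumes that every wrapper in the chain defines parent, which not all wrapper types do.

```julia
# Hypothetical helpers built on the fallback parent(a::AbstractArray) = a.

# An array counts as a wrapper if parent returns a different object.
has_parent(a::AbstractArray) = parent(a) !== a

# Walk down the chain of parents until the underlying storage is reached.
function root(a::AbstractArray)
    p = parent(a)
    return p === a ? a : root(p)
end

m = reshape(collect(1.0:6.0), 2, 3)   # plain 2x3 Array
v = view(transpose(m), 1:2, 1:2)      # SubArray wrapping a Transpose wrapping m
```

With these, typeof(root(v)) could be used to decide whether the underlying storage is a GPU array before choosing between a scalar and a bulk code path.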

I wonder whether anyone else has implemented generic code for wrappers of CuArray, or more generally for arrays with a trait for parallel computation instead of scalar indexing. It sounds like something people would have considered before, but honestly I’ve only ever seen people resort to broadcasting on the parent array in that case and then rewrapping.

RainerHeintzmann · May 19, 2023, 11:46am

Thanks for the hint about isstructtype. I guess testing for parent(a)==a may be a way. Using collect(parent(a)) is not wanted for a CuArray, as indicated, since this would collect it by converting to Array, i.e. from GPU to CPU memory.
Maybe @maleadt can comment on the above issue and give ideas about possible general ways to automatically package chains of getindex calls into CUDA kernels?
