Collect for CuArray composite types

With CUDA.jl, single-index (scalar) access is discouraged for good reasons. However, if I want a composite array type (e.g. from ShiftedArrays.jl) to work well with CuArray, it needs to support collect. Yet the collect mechanism is usually based on calling getindex for each index. The question is how to implement collect correctly for such composite types. It makes sense to do what the type requires in a broadcasting way, but since these composite types tend to be nested, each such type would itself need to call collect on its parent. This leads to two problems:

  1. If collect needs a copy or similar, so that some of the results can be modified, the chain of collect calls leads to unnecessary allocations compared with the chain of getindex calls, which just shuffles some indices and result values.
  2. We need to stop the chain of collect calls before reaching the base type such as CuArray to avoid collecting the array from the GPU to the CPU.
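
To make the two problems concrete, here is a minimal sketch on the CPU. `LazyCircShift` and `materialize` are hypothetical names made up for illustration; the wrapper is only loosely modeled on the idea of a lazily shifted array and is not the ShiftedArrays.jl implementation. Whether `circshift` runs on a CuArray without scalar indexing is an assumption that is not checked here.

```julia
# Hypothetical lazy circular-shift wrapper (illustration only).
struct LazyCircShift{T,N,P<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::P
    shifts::NTuple{N,Int}
end

Base.parent(a::LazyCircShift) = a.parent
Base.size(a::LazyCircShift) = size(a.parent)

# Scalar path: only reshuffles indices, one getindex call per element.
function Base.getindex(a::LazyCircShift{T,N}, I::Vararg{Int,N}) where {T,N}
    J = map((i, s, n) -> mod1(i - s, n), I, a.shifts, size(a.parent))
    return a.parent[J...]
end

# Hypothetical helper: collect nested wrappers, but return a base array
# (one with parent(p) === p) untouched, so it is never copied to the CPU.
materialize(p::AbstractArray) = parent(p) === p ? p : collect(p)

# Whole-array path: one array-level operation per wrapper level.
Base.collect(a::LazyCircShift) = circshift(materialize(a.parent), a.shifts)

x = collect(1:5)
inner = LazyCircShift(x, (1,))
outer = LazyCircShift(inner, (2,))
result = collect(outer)   # same values as circshift(x, 3)
```

Problem 1 shows up here as one freshly allocated array per wrapper level (two for `outer`), whereas the getindex chain allocates nothing. Problem 2 is handled by `materialize`, which stops as soon as it reaches an array that is its own parent.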

Is there a generally accepted mechanism to check whether an AbstractArray is a composite type? Something like has_parent(arr)? Or is there a better way to handle all of this? Maybe overloading collect is generally not the right way? Or is there a smart way of packaging all the chains of getindex calls into a CUDA kernel automatically?

isstructtype(T) tells you if a type is composite (struct/mutable struct). But that’s probably not exactly what you’re looking for; it sounds like you want to know whether a type has a parent you could collect directly, to avoid scalar indexing into a possible CuArray. If an AbstractArray doesn’t really have a parent, it just falls back to parent(a::AbstractArray) = a, so you could just do collect(parent(a)). But maybe this should be reserved for the case where the parent is a CuArray, rather than used as a general approach?
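
A minimal sketch of how that fallback could be turned into a wrapper check. `has_parent` and `rootparent` are hypothetical helpers, not part of Base or CUDA.jl, and the check assumes that every wrapper type overloads `parent`, which a given package may not do.

```julia
# Hypothetical helper: true if `a` wraps another array.
# Relies on Base's fallback parent(a::AbstractArray) = a for non-wrappers.
has_parent(a::AbstractArray) = parent(a) !== a

# Hypothetical helper: peel off wrappers until the underlying storage is reached.
function rootparent(a::AbstractArray)
    while has_parent(a)
        a = parent(a)
    end
    return a
end

A = zeros(3, 4)
P = PermutedDimsArray(A, (2, 1))   # lazy wrapper around A
v = view(P, 1:2, :)                # SubArray wrapping the PermutedDimsArray

has_parent(A)         # false: a plain Array is its own parent
has_parent(v)         # true
rootparent(v) === A   # true: both wrappers are removed
```

The identity comparison `!==` is used on purpose: it does not look at the elements, so it should not touch GPU memory at all.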

Wonder if there has been anyone else who has implemented generic code for wrappers of CuArray, or more generally arrays with a trait for parallel computation instead of scalar indexing? Sounds like something people would have considered before, but honestly I’ve only ever seen people resort to doing broadcasting on the parent array in that case and rewrapping.

Thanks for the hint with isstructtype. I guess testing for parent(a) === a may be a way (an identity check, so no elements are compared). Using collect(parent(a)) is not wanted for a CuArray, as indicated, since this would collect it by converting to Array, i.e. from GPU to CPU memory.
Maybe @maleadt can comment on the above issue and give ideas about possible general ways to automatically package chains of getindex calls into CUDA kernels?