With CUDA.jl, single-index access is discouraged for good reasons. However, if I want a composite type (e.g. one from ShiftedArrays.jl) to work well with `CuArray`, it needs to support `collect`. Yet the `collect` mechanism is usually based on calling `getindex` for each index. The question is how to implement `collect` correctly for such composite types. It makes sense to do what the type needs in a broadcasting way, but since these composite types tend to be nested, each such type would itself need to call `collect` on its `parent`. This leads to two problems:
- If `collect` needs a `copy` or `similar` to allow modifying some of the results, the chain of `collect` calls leads to unnecessary allocations compared to the chain of `getindex` calls, which just shuffle some indices and result values.
- We need to stop the chain of `collect` calls before reaching the base type, such as `CuArray`, to avoid collecting the array from the GPU to the CPU.
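
To make the two problems concrete, here is a minimal CPU-only sketch of the recursive approach. `Shifted` is a hypothetical wrapper (loosely modelled on a circularly shifted array, not the actual ShiftedArrays.jl type), and `materialize` is an assumed helper name, not an existing API. The recursion stops at the base array through dispatch, so the base array is never collected itself. I am assuming, without having checked it here, that `circshift` runs on a `CuArray` without falling back to scalar indexing.

```julia
# Hypothetical wrapper type, for illustration only.
struct Shifted{T,N,P<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::P
    shifts::NTuple{N,Int}
end

Base.parent(s::Shifted) = s.parent
Base.size(s::Shifted) = size(s.parent)

# Scalar fallback: fine on the CPU, slow or disallowed on a GPU array.
function Base.getindex(s::Shifted{T,N}, I::Vararg{Int,N}) where {T,N}
    J = ntuple(d -> mod1(I[d] - s.shifts[d], size(s.parent, d)), N)
    return s.parent[J...]
end

# Leaf case: any array without a specialised method is returned untouched,
# so a base array (Array, or presumably CuArray) stays where it lives.
materialize(A::AbstractArray) = A

# Wrapper case: materialize the parent first, then apply this wrapper's
# transformation as one whole-array operation.
materialize(s::Shifted) = circshift(materialize(parent(s)), s.shifts)

Base.collect(s::Shifted) = materialize(s)

a = reshape(collect(1:12), 3, 4)
s = Shifted(Shifted(a, (1, 0)), (0, 2))   # two nested wrappers
c = collect(s)                            # no per-element getindex calls
```

Note that this only addresses the second problem. Each nesting level still allocates its own intermediate array inside `circshift`, which is exactly the first problem described above.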
Is there a generally accepted mechanism to check whether an `AbstractArray` is a composite type? Something like `has_parent(arr)`? Or is there a better way to handle all of this? Maybe overloading `collect` is generally not the right way? Or is there a smart way of packaging all the chains of `getindex` calls into a CUDA kernel automatically?
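
For the `has_parent` idea, one possible starting point is that Base's generic `parent` method returns the array itself when it is not a wrapper. The sketch below relies on that convention. `has_parent` and `base_array` are hypothetical names, and a wrapper type that does not define `Base.parent` would be misclassified as a base array.

```julia
using LinearAlgebra  # only needed for `transpose` in the usage below

# Hypothetical helper: true when `parent` returns something other than the array itself.
has_parent(A::AbstractArray) = parent(A) !== A

# Hypothetical helper: peel off wrappers until `parent` returns its argument.
function base_array(A::AbstractArray)
    p = parent(A)
    return p === A ? A : base_array(p)
end

a = rand(3, 4)
v = view(a, 1:2, :)     # SubArray wrapping `a`
t = transpose(v)        # Transpose wrapping the SubArray
```

With something like this, a wrapper's `collect` could check `base_array(arr) isa CuArray` (after `using CUDA`) to decide whether to take a broadcasting path instead of the scalar `getindex` path.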