With CUDA.jl, single-index (scalar) access is discouraged for good reasons. However, if I want a composite array type (e.g. one from ShiftedArrays.jl) to work well with a CuArray, it needs to support collect. Yet the collect mechanism is usually based on calling getindex for each index. The question is how to implement collect correctly for such composite types. It makes sense to do what the type needs in a broadcasting way, but since these composite types tend to be nested, each such type would itself need to call collect on its parent. This leads to two problems:
- If collect needs a copy or similar so that some of the results can be modified, the chain of collect calls leads to unnecessary allocations compared with the chain of getindex calls, which just shuffle some indices and result values.
- We need to stop the chain of collect calls before reaching the base type, such as CuArray, to avoid collecting the array from the GPU to the CPU.
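To make the nesting concrete, here is a minimal CPU-only sketch. The `Shifted` type and the `materialize` function are hypothetical names invented for illustration (this is not the ShiftedArrays.jl API), and I have not checked whether `circshift` stays on the device for a CuArray. Each wrapper level materializes its parent first and then applies its own operation to the whole array, and the recursion ends at the storage array, which is returned as-is.

```julia
# Hypothetical wrapper: a circularly shifted view of `data`.
struct Shifted{T,N,P<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::P
    shift::NTuple{N,Int}
end

Base.size(A::Shifted) = size(A.data)
Base.parent(A::Shifted) = A.data

# Scalar fallback: only shuffles indices, no allocation.
function Base.getindex(A::Shifted{T,N}, I::Vararg{Int,N}) where {T,N}
    J = ntuple(d -> mod1(I[d] - A.shift[d], size(A.data, d)), Val(N))
    return A.data[J...]
end

# Hypothetical collect-like function.
# Base case: anything that is not a known wrapper is returned unchanged,
# so a storage array is never copied to another memory space.
materialize(A::AbstractArray) = A
# Wrapper case: materialize the parent, then apply this level's shift
# to the whole array at once. This allocates one array per level.
materialize(A::Shifted) = circshift(materialize(A.data), A.shift)

x = reshape(collect(1:12), 3, 4)
nested = Shifted(Shifted(x, (1, 0)), (0, 2))
result = materialize(nested)
```

The sketch also shows the first problem directly: two nested `Shifted` levels produce two `circshift` allocations, whereas the `getindex` chain only composes the index arithmetic.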
Is there a generally accepted mechanism to check whether an AbstractArray is a composite type? Something like has_parent(arr)? Or is there a better way to handle all of this? Maybe overloading collect is generally not the right way? Or is there a smart way of automatically packaging the whole chain of getindex calls into a CUDA kernel?
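One possible approximation of has_parent, as a sketch only: `Base.parent` is documented to return its input when the input is not a wrapped object, so comparing the result with the argument distinguishes wrappers from storage arrays. The names `has_parent` and `root_storage` below are hypothetical helpers, not existing functions, and the check gives a false negative for any wrapper type that does not define a `parent` method.

```julia
# Hypothetical helper: true when `parent` returns a different object.
has_parent(A::AbstractArray) = parent(A) !== A

# Hypothetical helper: walk down the wrapper chain to the storage array.
root_storage(A::AbstractArray) = has_parent(A) ? root_storage(parent(A)) : A

x = rand(4, 4)
v = view(x, 2:3, :)                # SubArray wrapping x
p = PermutedDimsArray(v, (2, 1))   # wrapper around a wrapper

checks = (has_parent(x), has_parent(v), has_parent(p))
root = root_storage(p)
```

With something like this, code could decide what to do based on the type of `root_storage(arr)` (for example, whether it is a CuArray) without ever touching its elements.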