In CUDA.jl, scalar (single-index) access to a `CuArray` is discouraged for good reasons. However, if I want a composite type (e.g. ShiftedArrays.jl) to work well with `CuArray`, it needs to support `collect`. Yet the `collect` mechanism is usually based on calling `getindex` for each index. The question is how to implement `collect` correctly for such composite types. It makes sense to implement whatever the type itself does in a broadcasting way, but since these composite types tend to be nested, each such type would then itself need to call `collect` on its parent. Yet this leads to two problems:
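To make the per-index mechanism concrete, here is a minimal sketch with a hypothetical wrapper `MyShifted` (not the actual ShiftedArrays.jl type, whose internals may differ). Nesting two of them shows how the generic `collect` resolves every element through a chain of `getindex` calls, each of which only transforms an index or substitutes a default value. It runs on the CPU; with a `CuArray` at the bottom of the chain, each of those calls would end in a scalar access.

```julia
# Hypothetical wrapper for illustration only; NOT the ShiftedArrays.jl type.
# It shifts indices into its parent and returns `default` outside the parent.
struct MyShifted{T,N,P<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::P
    shifts::NTuple{N,Int}
    default::T
end

Base.size(a::MyShifted) = size(a.parent)
Base.parent(a::MyShifted) = a.parent  # lets generic code find the wrapped array

function Base.getindex(a::MyShifted{T,N}, I::Vararg{Int,N}) where {T,N}
    J = I .- a.shifts  # shift every index
    # Forward to the parent's getindex, or use the default when out of range.
    checkbounds(Bool, a.parent, J...) ? a.parent[J...] : a.default
end

x = [1, 2, 3, 4]
inner = MyShifted(x, (1,), 0)      # shift by one
outer = MyShifted(inner, (1,), 0)  # nested: shift the shifted array again

# The generic `collect` fills a new Array by calling `getindex` once per index,
# so each element walks the whole chain outer -> inner -> x.
collect(outer)  # [0, 0, 1, 2]
```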
- If `collect` needs a `similar` array so that some of the results can be modified, the chain of `collect` calls leads to unnecessary allocations compared to the chain of `getindex` calls, which just shuffle some indices and result values.
- We need to stop the chain of `collect` calls before reaching the base type, such as `CuArray`, to avoid copying the array from the GPU to the CPU.
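For the stop condition in the second point, one possible check (a sketch, not an established API) builds on Base's documented fallback that `parent(A)` returns `A` itself when `A` is not a wrapped object. The names `has_parent` and `unwrap_base` are hypothetical helpers, and the check only recognizes wrappers that overload `Base.parent`.

```julia
# Hypothetical helper: `Base.parent` returns its argument for arrays that are
# not wrappers, so `!==` tells whether there is another layer underneath.
has_parent(a::AbstractArray) = parent(a) !== a

# Hypothetical helper: follow `parent` until the innermost (base) array.
unwrap_base(a::AbstractArray) = has_parent(a) ? unwrap_base(parent(a)) : a

m = [1 2 3; 4 5 6]
v = view(m, :, 1:2)               # SubArray wrapping m
p = PermutedDimsArray(v, (2, 1))  # PermutedDimsArray wrapping the view

has_parent(m)         # false
has_parent(p)         # true
unwrap_base(p) === m  # true
```

Applied to a wrapped `CuArray`, a test such as `unwrap_base(a) isa CuArray` could decide where the chain has to stop, assuming `CuArray` itself does not overload `parent`.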
Is there a generally accepted mechanism to check whether an `AbstractArray` is a composite (wrapper) type? Something like `has_parent(arr)`? Or is there a better way to handle all of this? Maybe overloading `collect` is generally not the right way? Or is there a smart way of packaging the whole chain of `getindex` calls into a CUDA kernel automatically?
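Regarding a kernel built from the whole chain of `getindex` calls: one option, sketched here under assumptions, is to skip `collect` and materialize through broadcasting. The hypothetical `MyShifted` wrapper forwards its parent's `BroadcastStyle`, so a chain of wrappers takes the style of the innermost array, and `identity.(a)` allocates the destination once while evaluating the nested `getindex` calls per element. The example only runs on the CPU; for a `CuArray` parent the wrapper would, as an untested assumption, also need an `Adapt.adapt_structure` method so that the GPU broadcast kernel receives a device-side version of it.

```julia
# Hypothetical wrapper (NOT the ShiftedArrays.jl type), used as a stand-in.
struct MyShifted{T,N,P<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::P
    shifts::NTuple{N,Int}
    default::T
end

Base.size(a::MyShifted) = size(a.parent)
Base.parent(a::MyShifted) = a.parent

function Base.getindex(a::MyShifted{T,N}, I::Vararg{Int,N}) where {T,N}
    J = I .- a.shifts
    checkbounds(Bool, a.parent, J...) ? a.parent[J...] : a.default
end

# Forward the parent's broadcast style. Through recursion, a chain of wrappers
# ends up with the style of the innermost array (e.g. a GPU style for a
# CuArray) instead of the CPU default style.
Base.BroadcastStyle(::Type{MyShifted{T,N,P}}) where {T,N,P} =
    Base.BroadcastStyle(P)

# Hypothetical helper: materialize the whole chain in one broadcast. The output
# is allocated once via `similar` for the chosen style, and every element is
# computed by the nested `getindex` calls.
materialize_chain(a::AbstractArray) = identity.(a)

x = [1, 2, 3, 4]
outer = MyShifted(MyShifted(x, (1,), 0), (1,), 0)
materialize_chain(outer)  # [0, 0, 1, 2], a plain Vector on the CPU
```

Compared with calling `collect` on every level, this allocates only the output array and never copies the base array to the host; the price is that every wrapper type has to forward its broadcast style (and be adaptable for kernels) itself.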