Adapt BroadcastStyle for CUDA

I am trying to get CUDA.jl to work seamlessly with a MutableShiftedArray by writing a package extension, CUDASupportExt.jl:

```julia
module CUDASupportExt

using CUDA
using Adapt
using MutableShiftedArrays

# Adapt the wrapped parent array (e.g. Array -> CuArray) while keeping the shifts
Adapt.adapt_structure(to, x::MutableShiftedArray) =
    MutableShiftedArray(adapt(to, parent(x)), shifts(x), size(x); default = MutableShiftedArrays.default(x))

# Let broadcasts over a MutableShiftedArray wrapping a CuArray use the CUDA broadcast style
function Base.Broadcast.BroadcastStyle(::Type{T}) where {CT, N, CD, T <: MutableShiftedArray{<:Any, <:Any, <:Any, <:CuArray{CT, N, CD}}}
    CUDA.CuArrayStyle{N, CD}()
end

# The same for a SubArray of a MutableShiftedArray wrapping a CuArray
function Base.Broadcast.BroadcastStyle(::Type{T}) where {CT, N, CD, T <: SubArray{<:Any, <:Any, <:MutableShiftedArray{<:Any, <:Any, <:Any, <:CuArray{CT, N, CD}}}}
    CUDA.CuArrayStyle{N, CD}()
end

end # module
```
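For reference, this is the kind of usage the extension is meant to enable; a minimal sketch (assuming the MutableShiftedArray constructor signature used above):

```julia
using CUDA, MutableShiftedArrays

ca = CuArray(ones(Float32, 8))
ma = MutableShiftedArray(ca, (2,))  # shift the view of ca by 2

# With the BroadcastStyle definitions in place, this broadcast
# dispatches to CuArrayStyle and runs as a GPU kernel:
q = ma .+ 1
```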

In general this works fine: for a MutableShiftedArray ma wrapping a CuArray you can broadcast, q = ma .+ 1, or even use views, (@view ma[1:4]) .+ 1, since the broadcasting mechanism compiles the single-element getindex access correctly.
However, this does not seem to be the case for copy(ma) or for indexing like ma[1:4] without the view. One could start specializing copy and getindex for Union{Int, AbstractRange} to handle these cases, but that sounds wrong.
Is there no other way, such that this is not necessary and the broadcasting system takes care of the individual element accesses as it does in the other broadcast cases?
In the end, single-element accesses are generated anyway, but one gets the warning that scalar indexing is not handled by CUDA and is therefore slow.
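To see exactly which operations hit the slow path, scalar indexing can be disallowed so that the warning becomes an error; a sketch:

```julia
using CUDA, MutableShiftedArrays

CUDA.allowscalar(false)  # turn the scalar-indexing warning into an error

ma = MutableShiftedArray(CUDA.ones(8), (2,))

ma .+ 1       # fine: handled by the broadcast machinery
# copy(ma)    # would error: falls back to one-element getindex
# ma[1:4]     # likewise errors without a broadcast-based method
```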


I ended up needing to implement a number of such functions, for example:
copy
collect
Array
==
Each of them performs the wanted operation in a broadcasting way instead. I assume that more such functions are still missing.
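One way to cover these without writing kernels is to route each operation through broadcasting, since the broadcast path already works. A hedged sketch (the type alias mirrors the constraint used in the BroadcastStyle definitions above; identity.(x) materializes x into a plain CuArray via the broadcast machinery):

```julia
# Hypothetical alias for a MutableShiftedArray backed by a CuArray
const CuMSA = MutableShiftedArray{<:Any, <:Any, <:Any, <:CuArray}

# Materialize through broadcast instead of per-element getindex
Base.copy(x::CuMSA) = identity.(x)
Base.collect(x::CuMSA) = Array(identity.(x))
Base.Array(x::CuMSA) = Array(identity.(x))

# Elementwise comparison as a GPU reduction over a broadcast
Base.:(==)(a::CuMSA, b::AbstractArray) = axes(a) == axes(b) && all(a .== b)
```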