Dealing with views and CUDA array wrappers

I’m working with an array wrapper that uses a CuArray under the hood. When I try to @view into it, I get a complicated SubArray that dispatches to the generic Base.fill!() method rather than the CUDA one (i.e. it falls back to scalar indexing). Here’s the type:

my_array::SubArray{Float32, 2, MyWrapper{Float32, 2, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, 2}, Tuple{UnitRange{Int64}, UnitRange{Int64}}, false}, x::Int64)

(the view is not contiguous)

I’m seeing that the compiler sometimes has trouble dispatching if there are “too many” wrapper types.

What’s the canonical way to get around this (other than to abandon my wrapper)? I’ve got Adapt.jl set up with my wrapper (i.e. I overloaded adapt_structure), and it seems to work fine. Are there additional methods (e.g. from Base) that I could overload to help the compiler out?

Thanks!
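One pattern that may help with the question above is to overload `Base.view` on the wrapper so that the view is taken on the wrapped array and then rewrapped, keeping the wrapper outermost. The sketch below is only an illustration: `DemoWrapper` is a hypothetical stand-in for the real wrapper, and it uses a plain `Array` so it runs without a GPU. The assumption is that with a `CuArray` inside, the inner `SubArray` of a `CuArray` is a type CUDA.jl already knows how to handle, so a forwarded `fill!` would not hit the generic scalar fallback.

```julia
# Hypothetical stand-in for the wrapper in the question (the real one has its own fields).
struct DemoWrapper{T,N,P<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::P
end

# Minimal AbstractArray interface, forwarding to the wrapped array.
Base.parent(w::DemoWrapper) = w.data
Base.size(w::DemoWrapper) = size(w.data)
Base.getindex(w::DemoWrapper{T,N}, I::Vararg{Int,N}) where {T,N} = w.data[I...]
Base.setindex!(w::DemoWrapper{T,N}, v, I::Vararg{Int,N}) where {T,N} = (w.data[I...] = v; w)

# Forward bulk operations to the wrapped array instead of looping element by element.
Base.fill!(w::DemoWrapper, x) = (fill!(parent(w), x); w)

# Take the view on the parent and rewrap, instead of building SubArray{..., DemoWrapper}.
Base.view(w::DemoWrapper, I...) = DemoWrapper(view(parent(w), I...))

A = zeros(Float32, 4, 4)      # assumption: this would be a CuArray in the real code
w = DemoWrapper(A)
v = view(w, 2:3, 2:3)         # a DemoWrapper around a SubArray of A
fill!(v, 1f0)                 # dispatches to the forwarding method above
```

The design choice here is to keep the nesting order as wrapper-of-view rather than view-of-wrapper, so every bulk operation only needs one forwarding method on the wrapper.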

Unfortunately, unless the view is contiguous you won’t be able to easily avoid scalar indexing.

Ideally, you would want to be operating on contiguous arrays on the GPU: it’s simpler and will take much more advantage of the power of the GPU. Can you rewrite your algorithm to perhaps materialise the view on the GPU as a contiguous array?
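A minimal sketch of the "materialise, operate, write back" idea follows. The helper name `with_contiguous!` is made up for illustration, and the example runs on a plain `Array`; the assumption is that passing a `CuArray` instead would keep both the range-indexed copy and the write-back on the GPU.

```julia
# Apply an in-place operation to a rectangular region through a dense temporary.
# `with_contiguous!` is a hypothetical helper name, not part of any library.
function with_contiguous!(op!, A::AbstractMatrix, rows, cols)
    buf = A[rows, cols]      # indexing with ranges allocates a dense copy
    op!(buf)                 # operate on the contiguous buffer
    A[rows, cols] = buf      # copy the result back into the region
    return A
end

A = zeros(Float32, 4, 5)     # assumption: a CuArray in the real code
with_contiguous!(A, 2:3, 2:4) do buf
    fill!(buf, 2f0)
end
```

This trades an extra allocation and two copies for operating on dense memory, which is usually only worth it when the work done on the buffer is substantial.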

One hacky alternative: if the view is defined by a bitmask, you could send both the original array and the bitmask to the GPU and perform operations over both, e.g. something like this (not tested):

values = CUDA.rand(1000)
bitmask = CUDA.rand(Bool, 1000)

# Perform some op only on valid entries
map!(values, values, bitmask) do val, flag
    return flag ? f(val) : val
end

Thank you, @torrance!

Can you rewrite your algorithm to perhaps materialise the view on the GPU as a contiguous array?

Good question. Ultimately I’d like to be doing some halo exchanges. These halos, as you probably know, consist of “boundary elements” surrounding the array. So they aren’t contiguous, but they are strided. Perhaps there’s a clever way I can leverage that structure?
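One way the strided structure could be used is to pack each boundary strip into its own dense buffer before exchanging it. The sketch below is an assumption about how that might look, not an established recipe: `pack_halo` and its field names are hypothetical, the halo width `w` is a made-up parameter, and a plain `Array` stands in for the `CuArray`.

```julia
# Copy the four edge strips of a matrix into dense buffers.
# `pack_halo` and the field names are hypothetical, chosen for illustration.
function pack_halo(A::AbstractMatrix, w::Integer=1)
    m, n = size(A)
    return (
        first_cols = A[:, 1:w],          # whole columns: contiguous in column-major storage
        last_cols  = A[:, n-w+1:n],
        first_rows = A[1:w, :],          # whole rows: strided in column-major storage
        last_rows  = A[m-w+1:m, :],
    )
end

A = Float32.(reshape(1:12, 3, 4))        # assumption: a CuArray in the real code
h = pack_halo(A, 1)
```

Each field of `h` is an independent dense array, so it can be handed to a communication routine or operated on without touching the non-contiguous view type.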

if the view is defined by a bitmask, you could send both the original array and the bit mask to the GPU and perform operations over both

Interesting idea! Thank you!