If I use multiple dispatch to execute a function on the CPU or GPU depending on the type of the input, then passing it a view of a CuArray produces an unexpected result: the view no longer dispatches to the GPU method. Is there a better design paradigm to deal with this case?
I think you’re using CUDA@v2, where a view of a CuArray is a Base SubArray rather than a CuArray. (On previous versions you would get the result you expect, and I like this change. Kudos to @maleadt.)
You may need to change the signature of your function here. Previously you may have had functions like this:
function f(x::AbstractArray)
# do sth on CPU by default
end
function f(x::CuArray)
# do sth on GPU
end
# change the latter into the following
function f(x::Union{CuArray, SubArray{<:Any,<:Any,<:CuArray}})
# do sth on GPU
end
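To see why (and what the union buys you), here is a hypothetical call pattern, assuming CUDA@v2 where a view is a Base SubArray:
using CUDA
c = CUDA.rand(4, 4)
v = view(c, :, 1)  # a SubArray wrapping a CuArray, not a CuArray
f(v)               # with only f(::CuArray) this fell back to the CPU method;
                   # with the union signature above it hits the GPU method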
However, a view of a view of a CuArray will not match this signature, and the same goes for ReshapedArray. So you may end up needing something like this:
struct OnCPU end
struct OnGPU end
f(x::AbstractArray) = f(device_of(x), x)
function f(::OnGPU, x) #= do sth on GPU =# end
function f(::OnCPU, x) #= do sth on CPU =# end
device_of(x::AbstractArray) = OnCPU()
device_of(x::CuArray) = OnGPU()
device_of(x::Union{SubArray, Base.ReshapedArray}) = device_of(parent(x))
# add some more dispatches if needed.
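With that in place, nested wrappers unwrap recursively down to the underlying CuArray; a quick sketch with hypothetical data:
using CUDA
c = CUDA.rand(10, 10)
v = view(c, :, 1:3)    # SubArray wrapping the CuArray
r = reshape(v, 15, 2)  # ReshapedArray wrapping that SubArray
f(v)  # device_of unwraps the SubArray → OnGPU method
f(r)  # device_of unwraps the ReshapedArray, then the SubArray → OnGPU method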
Alternatively, you can dispatch on the storage type, with a catch-all CPU method:
f(x::AbstractArray) = _f(storage_type(x), x)
function _f(::Type{T}, x) where {T<:CuArray} #= do sth on GPU =# end
function _f(::Type, x) #= do sth on CPU =# end
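storage_type isn’t defined above; a minimal sketch (my assumption, not necessarily the definition the author had in mind) is to unwrap wrappers recursively until you reach the array that owns the data:
# Unwrap SubArray/ReshapedArray/etc. until parent(x) is x itself,
# then report that array's type (e.g. Array or CuArray).
storage_type(x::AbstractArray) = x === parent(x) ? typeof(x) : storage_type(parent(x))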
I changed view/reshape/reinterpret to reuse the Base wrappers with appropriate type unions to catch cases where we can get a contiguous or strided pointer (DenseCuArray, StridedCuArray, or AnyCuArray for anything that contains a CuArray). Although this works beautifully, as it turns out the precompilation and load-time cost of these type unions is pretty hefty… so I might have to revisit/revert this in the future
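In other words, you can usually dispatch on those unions directly; a minimal sketch, assuming the aliases are reachable as CUDA.AnyCuArray (they may not be exported):
using CUDA
# AnyCuArray is meant to match a CuArray or any Base wrapper
# (SubArray, ReshapedArray, ReinterpretArray, ...) that contains one.
g(x::AbstractArray)   = "CPU fallback"
g(x::CUDA.AnyCuArray) = "GPU path"
g(view(CUDA.rand(10, 10), :, 1:3))  # "GPU path" via the union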
Other than the compilation cost, how does this make sure that things get dispatched to the CuArray methods? I understand that forcing all wrappers of a CuArray to also be a CuArray is not the way to go, but double wrappers always end up confusing me.
This is something that I would really like to understand, because I’ve encountered similar issues a few times with StructArrays.
For example, if I do
julia> c = CUDA.rand(10, 10);
julia> v = view(c, :, 1:3);
julia> v .+ v;
can Julia manage to get an optimized version of broadcast here? (Sorry, unfortunately I’m working on a machine without CUDA support at the moment and can’t verify.) If that’s the case, is there a simple intuition as to how that works?
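For what it’s worth, one way to check once you’re on a CUDA-capable machine is to ask which broadcast style the view gets, since that is what selects the kernel machinery (this is just a diagnostic suggestion; the exact style type is a CUDA.jl internal):
julia> Base.BroadcastStyle(typeof(v))
# If this reports CUDA's array style rather than Base.Broadcast.DefaultArrayStyle,
# then v .+ v goes through the GPU broadcast implementation.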