Is it possible to index a CuArray with a CuArray?

My data set consists of indices (rather than vectors), so I frequently need to index into a multi-dimensional array (a lookup table), e.g.

A = [rand(3, 3, 2) for i in 1:8]
data = rand(1:2, 8, 10) # last dimension is the length of data set

my_task(A, data) = [A[k][:, :, data[k, :]] for k in 1:8]

This batched indexing works with Array but not with CuArray. I'm wondering whether I could parallelize the indexing on the GPU without copying the data back to the CPU, since the batch of indices would be fairly large and easy to parallelize.

You can write a simple implementation with CUDAnative.jl, e.g. for indexing a vector with an array of integer indices:

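using CUDAnative, CuArrays

# Gather on the GPU: res[i] = A[B[i]] for every element of B.
# Extends Base.getindex so that A[B] dispatches to this method.
function Base.getindex(A::CuVector{T}, B::CuArray{<:Integer}) where T
    res = similar(A, T, size(B))  # every element is overwritten by the kernel
    isempty(B) && return res      # nothing to launch for an empty index array

    @inline function kernel(res, A, B)
        # global 1-based index of this thread
        state = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        state <= length(res) && (@inbounds res[state] = A[B[state]])
        return
    end

    max_threads = 256
    threads = min(max_threads, length(B))  # threads per block
    blocks = cld(length(B), threads)       # enough blocks to cover all of B
    @cuda threads=threads blocks=blocks kernel(res, A, B)
    res
end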
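
The method above only covers a vector indexed by an integer array, while the question indexes the last dimension of a 3-dimensional array. One way to bridge the two, sketched here on the CPU only, is to precompute linear indices into vec(A) so that A[:, :, idx] becomes a flat gather. The helper name `linear_slice_indices` is hypothetical (it is not part of any package), and the sketch assumes the usual column-major layout of Julia arrays.

```julia
# CPU-only sketch: express A[:, :, idx] as a gather over vec(A).
# `linear_slice_indices` is a hypothetical helper, not a library function.
function linear_slice_indices(dims::NTuple{3,Int}, idx::AbstractVector{<:Integer})
    m, n, _ = dims
    slice = m * n                      # number of elements in one A[:, :, k] slice
    offsets = (idx .- 1) .* slice      # linear offset of each requested slice
    # Broadcast within-slice positions (column) against slice offsets (row).
    lin = reshape(1:slice, slice, 1) .+ reshape(offsets, 1, :)
    return reshape(lin, m, n, length(idx))
end

A = rand(3, 3, 2)
idx = rand(1:2, 10)

lin = linear_slice_indices(size(A), idx)
gathered = vec(A)[lin]                 # result takes the shape of lin

@assert gathered == A[:, :, idx]
```

On the GPU, the same idea would presumably amount to moving `lin` to the device (e.g. `CuArray(lin)`) and indexing the flattened table with it through the method above, though I have not benchmarked that against a dedicated 3-dimensional kernel.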