My data set consists of indices rather than vectors, so I frequently need to index into a multi-dimensional array (a lookup table), e.g.

```
A = [rand(3, 3, 2) for i in 1:8]  # 8 tables, each of size 3×3×2
data = rand(1:2, 8, 10)           # indices into the 3rd dimension; last dimension is the length of the data set
my_task(A, data) = [A[k][:, :, data[k, :]] for k in 1:8]  # each result is 3×3×10
```
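To make the intended result concrete, here is a small CPU-only check of what the batched indexing produces. It is only a sketch with plain `Array`s; the name `out` and the loop variable `n` are introduced for illustration and are not part of my actual code.

```julia
# CPU-only illustration of the batched indexing above (plain Arrays, no GPU).
A = [rand(3, 3, 2) for i in 1:8]
data = rand(1:2, 8, 10)
my_task(A, data) = [A[k][:, :, data[k, :]] for k in 1:8]

out = my_task(A, data)   # `out` is an illustrative name

# One result per table; each result stacks the 10 selected 3×3 slices.
@assert length(out) == 8
@assert all(size(o) == (3, 3, 10) for o in out)

# Slice n of result k is the slice of table k selected by data[k, n].
@assert all(out[k][:, :, n] == A[k][:, :, data[k, n]] for k in 1:8, n in 1:10)
```

So every sample `n` just picks one 3×3 slice from table `k`, and the picks are independent of each other.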

This batched indexing works with `Array`, but not with `CuArray`. I'm wondering whether I could parallelize the indexing on the GPU without copying the data back to the CPU, since the batch of indices would be pretty large and the lookups are easy to parallelize.