# Question

I have an N × N matrix `A` on the GPU (e.g. `CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}` from `CUDA.jl`), and two N-element “index vectors” `iVector` and `jVector`, also on the GPU. I want to add the elements of `A` located at the indices {(i_1, j_1), (i_2, j_2), …, (i_N, j_N)}. With everything on the CPU, this is of course trivial with something like

```
sum(A[iVector[k], jVector[k]] for k in 1:N)
```

A similar approach works with everything on the GPU:

```
sum(@inbounds view(A, view(iVector, index), view(jVector, index)) for index in eachindex(iVector))
```

but my gut says this isn’t the best way to do it and is more of a workaround. **Is there a better or more correct way to do this? What about doing the summation in parallel on the GPU as well, possibly by writing a custom CUDA kernel?**
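For context, one alternative I’ve been considering is collapsing the two index vectors into a single vector of column-major linear indices, so that the gather and the reduction are each a single array operation. This is a hypothetical sketch (untimed; `sum_at_indices` is just a name I made up), written generically so it runs on plain `Array`s too, and should dispatch to GPU kernels when the arguments are `CuArray`s:

```julia
# Hypothetical sketch: gather the requested elements via linear indices,
# then reduce. For column-major storage, A[i, j] lives at linear index
# (j - 1) * size(A, 1) + i.
function sum_at_indices(A::AbstractMatrix{T},
                        iVector::AbstractVector{<:Integer},
                        jVector::AbstractVector{<:Integer}) where T <: Real
    linear = (jVector .- 1) .* size(A, 1) .+ iVector  # broadcast kernel on GPU
    return sum(A[linear])                             # gather + parallel reduction
end

# CPU usage example:
A = [1.0 2.0; 3.0 4.0]
sum_at_indices(A, [1, 2], [2, 1])  # A[1, 2] + A[2, 1] = 2.0 + 3.0
```

As far as I can tell, with `CuArray` inputs the broadcast, the gather `A[linear]`, and `sum` all stay on the device, so no scalar indexing should be triggered, but I haven’t verified that this beats the `view`-based generator above.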

# Example

```
using Random: randperm
using CUDA
using BenchmarkTools
const dimension = 10^4
const adjacencyMatrix = CUDA.rand(dimension, dimension)
const tour = convert.(Int32, randperm(dimension)) |> cu
function fitness(
        adjacencyMatrix :: Union{Matrix{T}, CuArray},
        position :: Union{Vector, CuArray}
    ) where T <: Real
    head = view(position, 1:length(position) - 1)
    tail = view(position, 2:length(position))
    return sum(
        @inbounds view(
            adjacencyMatrix,
            view(head, index),
            view(tail, index)
        ) for index in eachindex(head)
    )
end
display(@btime fitness(adjacencyMatrix, tour))
```

Output:

```
119.928 ms (240377 allocations: 8.36 MiB)
```
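As for the custom-kernel route, the rough shape I have in mind is one thread per index pair, with each thread atomically accumulating its element into a one-element device buffer. This is an untested, hypothetical sketch (`gather_sum_kernel!` and `gather_sum` are names I made up; a real implementation would likely want a shared-memory tree reduction instead of a single atomic):

```julia
using CUDA

# One thread per index pair; each thread reads A[is[k], js[k]] and
# atomically adds it into out[1].
function gather_sum_kernel!(out, A, is, js)
    k = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if k <= length(is)
        @inbounds CUDA.@atomic out[1] += A[is[k], js[k]]
    end
    return nothing
end

function gather_sum(A::CuMatrix{T}, is::CuVector, js::CuVector) where T <: Real
    out = CUDA.zeros(T, 1)
    threads = 256
    blocks = cld(length(is), threads)
    @cuda threads=threads blocks=blocks gather_sum_kernel!(out, A, is, js)
    return CUDA.@allowscalar out[1]
end
```

A single global atomic will serialize under contention, so I’d expect CUDA.jl’s built-in `mapreduce`/`sum` machinery to be faster in practice; the kernel is mainly here to make the question concrete.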