# Question

I have an N × N matrix `A` on the GPU (e.g. `CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}` from `CUDA.jl`), and two N-element “index vectors” `iVector` and `jVector`, also on the GPU. I want to add the elements of `A` located at the indices {(i_1, j_1), (i_2, j_2), …, (i_N, j_N)}. With everything on the CPU, this is of course trivial with something like

```
sum(A[iVector[k], jVector[k]] for k in 1:N)
```

A similar approach works with everything on the GPU:

```
sum(@inbounds view(A, view(iVector, index), view(jVector, index)) for index in eachindex(iVector))
```

but my gut says this isn’t the best way to do it and is more of a workaround. **Is there a better or more correct way to do this? What about doing the summation in parallel on the GPU as well, possibly by writing a custom CUDA kernel?**
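For context, one alternative I’ve been considering is collapsing the two index vectors into a single vector of column-major linear indices, so that the gather and the reduction are each a single array operation. This is a hypothetical sketch (untimed; `sum_at_indices` is just a name I made up), written generically so it runs on plain `Array`s too, and should dispatch to GPU kernels when the arguments are `CuArray`s:

```julia
# Hypothetical sketch: gather the requested elements via linear indices,
# then reduce. For column-major storage, A[i, j] lives at linear index
# (j - 1) * size(A, 1) + i.
function sum_at_indices(A::AbstractMatrix{T},
                        iVector::AbstractVector{<:Integer},
                        jVector::AbstractVector{<:Integer}) where T <: Real
    linear = (jVector .- 1) .* size(A, 1) .+ iVector  # broadcast kernel on GPU
    return sum(A[linear])                             # gather + parallel reduction
end

# CPU usage example:
A = [1.0 2.0; 3.0 4.0]
sum_at_indices(A, [1, 2], [2, 1])  # A[1, 2] + A[2, 1] = 2.0 + 3.0
```

As far as I can tell, with `CuArray` inputs the broadcast, the gather `A[linear]`, and `sum` all stay on the device, so no scalar indexing should be triggered, but I haven’t verified that this beats the `view`-based generator above.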

# Example

```
using Random: randperm
using CUDA
using BenchmarkTools
const dimension = 10^4
const adjacencyMatrix = CUDA.rand(dimension, dimension)
const tour = convert.(Int32, randperm(dimension)) |> cu
function fitness(
        adjacencyMatrix :: Union{Matrix{T}, CuArray},
        position :: Union{Vector, CuArray}
    ) where T <: Real
    head = view(position, 1:length(position) - 1)
    tail = view(position, 2:length(position))
    return sum(
        @inbounds view(
            adjacencyMatrix,
            view(head, index),
            view(tail, index)
        ) for index in eachindex(head)
    )
end
display(@btime fitness(adjacencyMatrix, tour))
```

Output:

```
119.928 ms (240377 allocations: 8.36 MiB)
```
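As for the custom-kernel route, the rough shape I have in mind is one thread per index pair, with each thread atomically accumulating its element into a one-element device buffer. This is an untested, hypothetical sketch (`gather_sum_kernel!` and `gather_sum` are names I made up; a real implementation would likely want a shared-memory tree reduction instead of a single atomic):

```julia
using CUDA

# One thread per index pair; each thread reads A[is[k], js[k]] and
# atomically adds it into out[1].
function gather_sum_kernel!(out, A, is, js)
    k = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if k <= length(is)
        @inbounds CUDA.@atomic out[1] += A[is[k], js[k]]
    end
    return nothing
end

function gather_sum(A::CuMatrix{T}, is::CuVector, js::CuVector) where T <: Real
    out = CUDA.zeros(T, 1)
    threads = 256
    blocks = cld(length(is), threads)
    @cuda threads=threads blocks=blocks gather_sum_kernel!(out, A, is, js)
    return CUDA.@allowscalar out[1]
end
```

A single global atomic will serialize under contention, so I’d expect CUDA.jl’s built-in `mapreduce`/`sum` machinery to be faster in practice; the kernel is mainly here to make the question concrete.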