# Question

I have an N \times N matrix A on the GPU (e.g. CuArray{Float32, 2, CUDA.Mem.DeviceBuffer} from CUDA.jl), and two N-element “index vectors” iVector and jVector also on the GPU. I want to add the elements of A located at indices \{(i_1, j_1), (i_2, j_2), \ldots, (i_N, j_N)\}. With everything on the CPU, this is of course trivial with something like

sum(A[iVector[k], jVector[k]] for k in 1:N)


There’s a similar method with everything on the GPU with

sum(@inbounds view(A, view(iVector, index), view(jVector, index)) for index in eachindex(iVector))


but my gut says this isn’t the best way to do it and is more just a workaround. Is there a better or more correct way to do this? What about doing the summation in parallel on the GPU as well? Possibly writing a custom CUDA kernal?

# Example

using Random: randperm
using CUDA
using BenchmarkTools

const dimension = 10^4
const adjacencyMatrix = CUDA.rand(dimension, dimension)
const tour = convert.(Int32, randperm(dimension)) |> cu

function fitness(
adjacencyMatrix :: Union{Matrix{T}, CuArray},
position :: Union{Vector, CuArray}
) where T <: Real
head = view(position, 1:length(position)-1)
tail = view(position, 2:length(position))
return sum(
@inbounds view(
adjacencyMatrix,
view(head, index),
view(tail, index)
) for index in eachindex(head)
)
end

display(@btime fitness(adjacencyMatrix, tour);)


Output:

119.928 ms (240377 allocations: 8.36 MiB)

1 Like

You can take the sum of a view with indices:

function fitness(adjacencyMatrix::Union{Matrix{T}, CuArray},
position::Union{Vector, CuArray}) where T <: Real
head = view(position, 1:length(position)-1)
tail = view(position, 2:length(position))
idx = CartesianIndex.(head, tail)
sum(view(adjacencyMatrix, idx))
end


That’s slightly faster 100us instead of 270ms on my system.

4 Likes