Question
I have an N \times N matrix A on the GPU (e.g. a CuArray{Float32, 2, CUDA.Mem.DeviceBuffer} from CUDA.jl), and two N-element “index vectors” iVector and jVector, also on the GPU. I want to add up the elements of A located at the indices \{(i_1, j_1), (i_2, j_2), \ldots, (i_N, j_N)\}. With everything on the CPU, this is of course trivial with something like
sum(A[iVector[k], jVector[k]] for k in 1:N)
There’s a similar method with everything on the GPU with
sum(@inbounds view(A, view(iVector, index), view(jVector, index)) for index in eachindex(iVector))
but my gut says this isn’t the best way to do it and is more of a workaround than a proper solution. Is there a better or more correct way to do this? What about doing the summation in parallel on the GPU as well, perhaps by writing a custom CUDA kernel?
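For concreteness, this is roughly the kind of kernel I’m imagining (an untested sketch: the names gatherSumKernel!/gatherSum, the 256-thread launch configuration, and the single-slot atomic accumulation are just placeholders, and a real solution would presumably use a proper parallel reduction instead):
using CUDA

# Each thread reads one (i, j) pair and atomically adds A[i, j] into a
# one-element accumulator living on the GPU.
function gatherSumKernel!(accumulator, A, iVector, jVector)
    k = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if k <= length(iVector)
        @inbounds val = A[iVector[k], jVector[k]]
        CUDA.@atomic accumulator[1] += val
    end
    return nothing
end

function gatherSum(A, iVector, jVector)
    accumulator = CUDA.zeros(eltype(A), 1)
    threads = 256
    blocks = cld(length(iVector), threads)
    @cuda threads=threads blocks=blocks gatherSumKernel!(accumulator, A, iVector, jVector)
    return CUDA.@allowscalar accumulator[1]  # copy the scalar result back to the CPU
end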
Example
using Random: randperm
using CUDA
using BenchmarkTools
const dimension = 10^4
const adjacencyMatrix = CUDA.rand(dimension, dimension)
const tour = convert.(Int32, randperm(dimension)) |> cu
function fitness(
    adjacencyMatrix :: Union{Matrix{T}, CuArray},
    position :: Union{Vector, CuArray}
) where T <: Real
    # Consecutive pairs along the tour: (position[k], position[k+1])
    head = view(position, 1:length(position)-1)
    tail = view(position, 2:length(position))
    # Sum the weights adjacencyMatrix[head[k], tail[k]], one scalar view at a time
    return sum(
        @inbounds view(
            adjacencyMatrix,
            view(head, index),
            view(tail, index)
        ) for index in eachindex(head)
    )
end
display(@btime fitness(adjacencyMatrix, tour))
Output:
119.928 ms (240377 allocations: 8.36 MiB)