I have a scenario in which I need to repeatedly update a CuArray along certain diagonals, based on results computed on the CPU. The number of diagonals involved is small and varies across problems (typically from just 1-3 diagonals up to somewhat below 100). All entries on a given diagonal take the same value, which is computed on the CPU.
To be concrete, if the matrix to be updated is M, what I am doing on the CPU is something like the following:
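using LinearAlgebra  # provides diagind

N = 3
diags = rand(-5:5, N)  # diagonal offsets (0 = main, >0 above, <0 below)
vals = rand(N)         # one value per diagonal
M = zeros(10, 10);
for (d, v) in zip(diags, vals)
    M[diagind(M, d)] .= v  # fill the whole d-th diagonal with v
end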
In the actual problem, M needs to be a CuArray and will have a size ranging from 300x300 to 5000x5000. Since diags and vals are generated on the CPU with some simple calculations, one possibility is to first copy their contents to a CuArray and then write a kernel function to update the diagonal values. But it seems quite expensive to copy just a dozen or so numbers to the GPU, and this has to be done repeatedly in a sequential loop.
Is there a more performant way to do this? Is it possible to combine the copying step for diags and vals with the step that updates the matrix contents in a single kernel launch, to reduce the overhead?
Thank you!
------------------------------------------Update------------------------------------------
It seems that one possibility is to pass diags and vals as SVectors when launching the kernel.
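A minimal sketch of that idea, for illustration only. It assumes CUDA.jl and a working GPU, and it uses plain tuples (NTuple) instead of SVector to avoid the StaticArrays dependency; both are isbits, so as far as I understand they travel with the kernel launch rather than through a separate host-to-device copy. The kernel name set_diagonals! and the launch configuration are my own choices, not anything prescribed by CUDA.jl. One caveat: the tuple length is part of the argument type, so each distinct number of diagonals should trigger a new kernel compilation the first time it is seen.

```julia
using CUDA

# Hypothetical kernel: thread k handles the k-th entry of every requested diagonal.
# diags and vals are small isbits tuples passed directly as kernel arguments.
function set_diagonals!(M, diags::NTuple{N,Int32}, vals::NTuple{N,T}) where {N,T}
    k = (blockIdx().x - 1) * blockDim().x + threadIdx().x  # position along a diagonal
    m, n = size(M)
    for idx in 1:N
        @inbounds d = diags[idx]
        # Same convention as diagind(M, d): d > 0 lies above the main diagonal.
        i = d >= 0 ? k : k - d
        j = d >= 0 ? k + d : k
        if i <= m && j <= n
            @inbounds M[i, j] = vals[idx]
        end
    end
    return nothing
end

M = CUDA.zeros(Float32, 10, 10)

# Values computed on the CPU, converted to tuples before the launch.
diags = (Int32(-2), Int32(0), Int32(3))
vals = (1f0, 2f0, 3f0)

threads = 256
blocks = cld(min(size(M)...), threads)  # no diagonal is longer than min(m, n)
@cuda threads=threads blocks=blocks set_diagonals!(M, diags, vals)
```

If diags and vals start out as ordinary Vectors, something like Tuple(Int32.(diags)) and Tuple(Float32.(vals)) should convert them, at the cost of a dynamic dispatch on the host. Because a single thread writes all diagonals at its position in order, a repeated offset in diags ends up with the later value, which matches the CPU loop.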