Fast ways of updating a `CuArray` along certain diagonals based on results from the CPU

I have a scenario in which I need to repeatedly update a CuArray along certain diagonals based on results computed on the CPU. The number of diagonals involved is small and varies across problems (typically from just 1-3 diagonals to somewhere below 100). All entries on a given diagonal take the same value, which is computed on the CPU.

To be concrete, if the matrix to be updated is M, what I am doing would look something like the following on the CPU:

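using LinearAlgebra  # provides diagind

N = 3
diags = rand(-5:5, N)  # diagonal offsets (0 = main diagonal)
vals = rand(N)         # one value per diagonal
M = zeros(10, 10)
for (d, v) in zip(diags, vals)
    M[diagind(M, d)] .= v  # fill the whole d-th diagonal with v
end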

In the actual problem, M needs to be a CuArray with a size ranging from 300x300 to 5000x5000. Since diags and vals are generated on the CPU with some simple calculations, one possibility is to first copy their contents to a CuArray and then write a kernel function to update the diagonal values. But it seems quite expensive to copy just a dozen numbers to the GPU, and this has to be done repeatedly inside a sequential loop.

Is there a more performant way to do this? Is it possible to combine the copying step for diags and vals and the step that updates the matrix content in a single kernel launch to reduce the overhead?

Thank you!


It seems that one possibility is to pass diags and vals as SVectors when launching the kernel.
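
A minimal sketch of that idea, assuming CUDA.jl and StaticArrays.jl are installed. The names `set_diags_kernel!` and `set_diags!` are made up for illustration and are not part of either package. Because an SVector is an isbits value, it travels with the kernel arguments at launch, so no separate host-to-device copy is needed. Each thread handles one position on one diagonal:

```julia
using CUDA, StaticArrays

# Hypothetical kernel: thread k in grid row n writes the k-th entry of diagonal diags[n].
function set_diags_kernel!(M, diags, vals)
    k = (blockIdx().x - 1) * blockDim().x + threadIdx().x  # position along the diagonal
    n = blockIdx().y                                       # which diagonal
    @inbounds d = diags[n]
    i = k + max(-d, 0)  # row: shifted down for sub-diagonals (d < 0)
    j = k + max(d, 0)   # column: shifted right for super-diagonals (d > 0)
    if i <= size(M, 1) && j <= size(M, 2)
        @inbounds M[i, j] = vals[n]
    end
    return nothing
end

# Hypothetical host wrapper: diags and vals are passed by value with the launch.
function set_diags!(M::CuMatrix{T}, diags::SVector{N}, vals::SVector{N}) where {T,N}
    threads = 256
    # No diagonal is longer than the smaller matrix dimension.
    blocks = (cld(minimum(size(M)), threads), N)
    @cuda threads=threads blocks=blocks set_diags_kernel!(M, diags, SVector{N,T}(vals))
    return M
end

# Usage
M = CUDA.zeros(Float32, 10, 10)
set_diags!(M, SVector(-2, 0, 3), SVector(1f0, 2f0, 3f0))
```

Two caveats with this sketch. The length N is part of the SVector type, so every distinct number of diagonals compiles its own kernel specialization the first time it is used. And if the same offset appears twice in diags, which value ends up on that diagonal is not defined, unlike the CPU loop where the last one wins.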