I have a scenario in which I need to repeatedly update a `CuArray` along certain diagonals based on results computed on the CPU. The number of diagonals involved is small and varies between problems (typically from just 1-3 diagonals to somewhere below 100). All entries on the same diagonal take the same value, which is computed on the CPU.

To be concrete, if the matrix to be updated is `M`, what I am doing would be something like the following on the CPU:

```
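using LinearAlgebra  # provides diagind

N = 3                   # number of diagonals to update
diags = rand(-5:5, N)   # diagonal offsets (0 = main, >0 above, <0 below)
vals = rand(N)          # one value per diagonal
M = zeros(10, 10)
for (d, v) in zip(diags, vals)
    M[diagind(M, d)] .= v   # fill every entry of diagonal d with v
end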
```

In the actual problem, `M` needs to be a `CuArray` and will have a size ranging from `300x300` to `5000x5000`. Since `diags` and `vals` are generated on the CPU with some simple calculations, one possibility is to first copy their contents to a `CuArray` and then write a kernel function to update the diagonal values. But it seems quite expensive to copy just a dozen numbers to the GPU, and this has to be done repeatedly in a sequential loop.

Is there a more performant way to do this? Is it possible to combine the copying step for `diags` and `vals` and the step that updates the matrix content in a single kernel launch to reduce the overhead?

Thank you!

------------------------------------------Update------------------------------------------

It seems that one possibility is to pass `diags` and `vals` as `SVector`s when launching the kernel.
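
For reference, here is a minimal sketch of that idea, which I have not benchmarked. It uses plain tuples instead of `SVector`s to avoid the StaticArrays dependency; both are `isbits`, so my understanding is that they are passed by value as kernel arguments and no separate host-to-device copy is needed. The kernel name `set_diags_kernel!`, the one-thread-per-row layout, and the launch configuration are my own choices for illustration. Note that the kernel is specialized on the tuple length, so each new number of diagonals triggers a recompilation.

```julia
using CUDA, LinearAlgebra

# Hypothetical kernel: one thread per matrix row; each thread loops over
# all requested diagonals and writes the entry of that diagonal in its row.
function set_diags_kernel!(M, diags::NTuple{N,Int}, vals::NTuple{N,T}) where {N,T}
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    m, n = size(M)
    if i <= m
        for k in 1:N
            @inbounds j = i + diags[k]      # column of diagonal diags[k] in row i
            if 1 <= j <= n
                @inbounds M[i, j] = vals[k]
            end
        end
    end
    return nothing
end

M = CUDA.zeros(Float32, 10, 10)
diags = (-2, 0, 3)            # example offsets, passed by value
vals = (1f0, 2f0, 3f0)        # one value per diagonal

threads = 256
blocks = cld(size(M, 1), threads)
@cuda threads=threads blocks=blocks set_diags_kernel!(M, diags, vals)

# CPU reference using the same loop as in the question
M_ref = zeros(Float32, 10, 10)
for (d, v) in zip(diags, vals)
    M_ref[diagind(M_ref, d)] .= v
end
```

If `diags` and `vals` start out as ordinary vectors, `Tuple(diags)` and `Tuple(vals)` should convert them before the launch. Comparing `Array(M)` with `M_ref` is a quick way to check that the kernel matches the CPU loop.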