Note that some level of broadcasting does work inside kernels if you use StaticArrays, e.g.
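using CUDA, StaticArrays

Y = CUDA.rand(10, 2)
function kernel(Y)
    X = @SVector zeros(10)  # statically sized vector, no heap allocation
    Y[:, 1] .= X .+ 1       # broadcast over the static vector into the first column of Y
    nothing                 # kernels must not return a value
end
@cuda kernel(Y)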
although I couldn’t get exactly your example working. Based on:
maybe it’s possible? Perhaps @mateuszbaran can comment further; I’m curious about this myself.