Oh and try to use a scratch array to store the intermediate result:
julia> function main(N)
           x = CuArray(DGP(N))
           V0 = CUDA.ones(Float64, N); idx = ()
           a = 0.5
           max_iter = 100
           iter = 0
           tmp = x .+ a * V0'   # scratch array, allocated once before the loop
           while iter < max_iter
               V1 = V0
               tmp .= x .+ a * V1'   # overwrite the scratch array in place
               V0, idx = findmax(tmp, dims=2)
               iter += 1
           end
           return V0, idx, iter
       end
That should get rid of most of the memory management time, since tmp is no longer reallocated on every iteration.
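If you want to convince yourself that the scratch-array version gives the same answer, here is a rough CPU-only sketch using plain Arrays. The names `solve_alloc` and `solve_scratch` are made up for illustration, and `rand(4, 4)` is just a stand-in for `DGP(N)`, which I'm assuming returns a square matrix. I also wrote `a .* V0'` so the whole right-hand side fuses into one broadcast; that is a small tweak on top of the code above, not something it requires.

```julia
# Allocating version: builds a fresh `tmp` on every iteration.
function solve_alloc(x, a, max_iter)
    V0 = ones(size(x, 1))
    idx = ()
    for _ in 1:max_iter
        tmp = x .+ a .* V0'              # new array each time
        V0, idx = findmax(tmp, dims=2)   # row-wise maximum and its index
    end
    return V0, idx
end

# Scratch-array version: `tmp` is allocated once and overwritten in place.
function solve_scratch(x, a, max_iter)
    V0 = ones(size(x, 1))
    idx = ()
    tmp = similar(x)                     # scratch array, same size and eltype as x
    for _ in 1:max_iter
        tmp .= x .+ a .* V0'             # fused broadcast writes into tmp
        V0, idx = findmax(tmp, dims=2)
    end
    return V0, idx
end

x = rand(4, 4)                           # hypothetical stand-in for DGP(N)
V_alloc, idx_alloc = solve_alloc(x, 0.5, 100)
V_scratch, idx_scratch = solve_scratch(x, 0.5, 100)
```

Both versions do exactly the same arithmetic, so the results should match element for element. Note that `findmax` still allocates its outputs on each iteration in either version; the scratch array only removes the big intermediate.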