Hi Tim,
Thanks for the reply.
I understand that timing a single run in global scope is not accurate; it's just that the difference is too large to be normal.
Here are the new results, following your code:
using BenchmarkTools, CuArrays, Random
ac = Array{Float64}(undef, 2^20)
ag = cu(ac)
@btime randn!(ac)
@btime CuArrays.@sync randn!(ag)
6.788 ms (0 allocations: 0 bytes)
7.626 s (5242883 allocations: 288.00 MiB)
So the allocations are clearly the problem, but I can't figure out what causes them.
Even with rand! there are some extra allocations:
@btime rand!(ac)
@btime CuArrays.@sync rand!(ag)
957.006 μs (0 allocations: 0 bytes)
5.205 μs (38 allocations: 1.48 KiB)
Do you have any idea where the problem might be?
I'm using a GeForce 840M with 2 GB of memory in my laptop; however, I don't think the array size is the problem.