I’m trying to generate random normals directly on the GPU, and I ran into the following result:
using CuArrays
using Random

ac = Array{Float64}(undef, 2^20)
ag = cu(ac)          # note: cu() converts to a Float32 CuArray by default
@time rand!(ac)      # uniform random numbers on the CPU
@time rand!(ag)      # uniform random numbers on the GPU
ac = Array{Float64}(undef, 2^20)
ag = cu(ac)
@time randn!(ac)     # normal random numbers on the CPU
@time randn!(ag)     # normal random numbers on the GPU
You shouldn’t benchmark a single call, in global scope, and so on. Using BenchmarkTools.jl and synchronizing the GPU:
julia> ac = Array{Float64}(undef, 2^20);
julia> ag = cu(ac);
julia> @btime randn!(ac);
5.507 ms (0 allocations: 0 bytes)
julia> @btime CuArrays.@sync randn!(ag);
83.943 μs (1 allocation: 16 bytes)
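For completeness, a minimal sketch of such a benchmark as a standalone script (assuming BenchmarkTools.jl is installed; the $ interpolation avoids the global-variable penalty described in its manual):

using CuArrays, Random, BenchmarkTools
ac = Array{Float64}(undef, 2^20)
ag = cu(ac)
@btime randn!($ac)                  # CPU baseline
@btime CuArrays.@sync randn!($ag)   # GPU, waiting for the kernel to finish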
I’m also not sure your experiment makes sense: you’re benchmarking the same 2^20 elements twice, with vastly different performance characteristics. Did you bump the array size? But even then:
julia> ac = Array{Float64}(undef, 2^30);
julia> ag = cu(ac);
julia> @btime randn!(ac);
5.843 s (0 allocations: 0 bytes)
julia> @btime CuArrays.@sync randn!(ag);
72.834 ms (1 allocation: 16 bytes)
Ah right, you’re probably using CuArrays v0.8.1, where the CURAND improvements haven’t been tagged yet (awaiting fixes to Flux.jl by @MikeInnes). Try the master branch.
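If it helps, one way to switch to the development version (a sketch using the standard Pkg workflow; package name as registered):

using Pkg
Pkg.add(PackageSpec(name="CuArrays", rev="master"))
# or, equivalently, from the Pkg REPL: pkg> add CuArrays#master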
@sync is the CuArrays equivalent of calling CUDAdrv.synchronize(), which synchronizes the GPU. That is indeed necessary for timing measurements, but you typically won’t need it in normal application code, since certain operations (like memory copies) synchronize already.
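To illustrate the equivalence (a sketch, with ag as above; kernel launches are asynchronous, so without synchronization a timing only measures the launch):

using CUDAdrv
randn!(ag)              # returns almost immediately (asynchronous launch)
CUDAdrv.synchronize()   # block until the GPU has actually finished
# which is what CuArrays.@sync randn!(ag) expresses in one line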