I was trying to generate random normals directly on the GPU when I found the following result:
using CuArrays
using Random

ac = Array{Float64}(undef, 2^20)  # host array
ag = cu(ac)                       # device copy (note: cu converts to Float32)
@time rand!(ac)
@time rand!(ag)

ac = Array{Float64}(undef, 2^20)
ag = cu(ac)
@time randn!(ac)
@time randn!(ag)
0.003868 seconds (4 allocations: 160 bytes)
0.000115 seconds (39 allocations: 1.469 KiB)
0.014382 seconds (4 allocations: 160 bytes)
8.251600 seconds (5.24 M allocations: 288.000 MiB, 0.69% gc time)
This makes me wonder why it happens. Is there a way to improve the randn! performance?
I’m using CuArrays; here’s the reference to CURAND in CuArrays: https://github.com/JuliaGPU/CuArrays.jl/blob/master/src/rand/highlevel.jl
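In other words, I’d like the standard Random API to run on the device through those wrappers, something like this (a minimal sketch of the intended usage):
using CuArrays
using Random

ag = CuArray{Float32}(undef, 2^20)
rand!(ag)   # uniform samples, generated on the device via CURAND
randn!(ag)  # normal samples; this is the slow call above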
You shouldn’t benchmark just once, in global scope, etc. Using BenchmarkTools.jl and synchronizing the GPU:
julia> ac = Array{Float64}(undef, 2^20);
julia> ag = cu(ac);
julia> @btime randn!(ac);
5.507 ms (0 allocations: 0 bytes)
julia> @btime CuArrays.@sync randn!(ag);
83.943 μs (1 allocation: 16 bytes)
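Since ac and ag are globals, it’s also good practice to interpolate them into @btime so that untyped-global access isn’t part of the measurement; it doesn’t change the picture here:
@btime randn!($ac);
@btime CuArrays.@sync randn!($ag);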
I’m also not sure your experiment makes sense: you’re benchmarking 2^20 elements twice, with vastly different performance characteristics. Did you bump the array size? But even then:
julia> ac = Array{Float64}(undef, 2^30);
julia> ag = cu(ac);
julia> @btime randn!(ac);
5.843 s (0 allocations: 0 bytes)
julia> @btime CuArrays.@sync randn!(ag);
72.834 ms (1 allocation: 16 bytes)
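For scale, that bump is substantial; using the (internal) Base.format_bytes helper just for display:
julia> Base.format_bytes(2^30 * sizeof(Float64))  # host array
"8.000 GiB"

julia> Base.format_bytes(2^30 * sizeof(Float32))  # the cu copy is Float32
"4.000 GiB"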
Hi Tim,
Thanks for the reply.
I understand that testing once in global scope is not accurate; it’s just that the difference is too large to be normal.
Here is the new result following your code:
ac = Array{Float64}(undef, 2^20)
ag = cu(ac)
@btime randn!(ac)
@btime CuArrays.@sync randn!(ag)
6.788 ms (0 allocations: 0 bytes)
7.626 s (5242883 allocations: 288.00 MiB)
So the allocations are clearly the problem, but I can’t really figure out the reason.
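The count does work out to almost exactly five allocations per element, as if the samples were generated one at a time instead of in a single batched call:
julia> 5242883 / 2^20
5.000002861022949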
Even with rand! there are some extra allocations:
@btime rand!(ac)
@btime CuArrays.@sync rand!(ag)
957.006 μs (0 allocations: 0 bytes)
5.205 μs (38 allocations: 1.48 KiB)
Do you have any idea where the problem might be?
I’m using an 840M on my laptop with 2 GB of memory; however, I don’t think the array size is the problem either (2^20 elements is only a few MiB).
Ah right, you’re probably using CuArrays v0.8.1, while the CURAND improvements haven’t been tagged yet (awaiting fixes to Flux.jl by @MikeInnes). Try the master branch.
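Checking out master is a one-liner in the Pkg REPL (or the equivalent Pkg.add(PackageSpec(name = "CuArrays", rev = "master")) call):
pkg> add CuArrays#master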
Hi Tim,
Thanks, checking out master actually fixed the problem.
A small question: I assume the use of @sync here is only for @btime purposes, am I right?
@sync is the CuArrays equivalent of calling CUDAdrv.synchronize(), which synchronizes the GPU. That’s indeed necessary for timing measurements; you typically won’t need it in normal application code, since certain operations (like memory copies) are already synchronizing.
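To make that concrete: without synchronization, the timing mostly captures the asynchronous kernel launch, not the work itself. Something like:
using CuArrays, Random, BenchmarkTools

ag = cu(zeros(Float32, 2^20))

@btime randn!($ag)                 # returns once the kernel is launched
@btime CuArrays.@sync randn!($ag)  # waits until the GPU has finished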