GPU randn way slower than rand?

gpu
cuda

#1

I’m trying to generate random normals directly on the GPU, and I noticed the following result:

using CuArrays
using Random
ac = Array{Float64}(undef, 2^20)
ag = cu(ac)
@time rand!(ac)
@time rand!(ag)
ac = Array{Float64}(undef, 2^20)
ag = cu(ac)
@time randn!(ac)
@time randn!(ag)

0.003868 seconds (4 allocations: 160 bytes)
0.000115 seconds (39 allocations: 1.469 KiB)
0.014382 seconds (4 allocations: 160 bytes)
8.251600 seconds (5.24 M allocations: 288.000 MiB, 0.69% gc time)

Why does this happen, and is there a way to improve the randn performance?
I’m using CuArrays; here’s the reference to curand in CuArrays: https://github.com/JuliaGPU/CuArrays.jl/blob/master/src/rand/highlevel.jl


#2

You shouldn’t benchmark a single call in global scope. Using BenchmarkTools.jl, and synchronizing the GPU:

julia> ac = Array{Float64}(undef, 2^20);

julia> ag = cu(ac);

julia> @btime randn!(ac);
  5.507 ms (0 allocations: 0 bytes)

julia> @btime CuArrays.@sync randn!(ag);
  83.943 μs (1 allocation: 16 bytes)

I’m also not sure your experiment makes sense; you’re benchmarking the same 2^20 elements twice with vastly different performance characteristics. Did you mean to bump the array size? But even then:

julia> ac = Array{Float64}(undef, 2^30);

julia> ag = cu(ac);

julia> @btime randn!(ac);
  5.843 s (0 allocations: 0 bytes)

julia> @btime CuArrays.@sync randn!(ag);
  72.834 ms (1 allocation: 16 bytes)

#3

Hi Tim,

Thanks for the reply.
I understand that timing once in global scope is not accurate; it’s just that the difference seemed too large to be normal.

Here are the new results following your code:

ac = Array{Float64}(undef, 2^20)
ag = cu(ac)
@btime randn!(ac)
@btime CuArrays.@sync randn!(ag)

6.788 ms (0 allocations: 0 bytes)
7.626 s (5242883 allocations: 288.00 MiB)

So the allocations are clearly the problem, but I can’t really figure out the reason.

Even with rand there are some extra allocations:

@btime rand!(ac)
@btime CuArrays.@sync rand!(ag)

957.006 μs (0 allocations: 0 bytes)
5.205 μs (38 allocations: 1.48 KiB)

Do you have any idea where the problem might be?

I’m using a GeForce 840M on my laptop with 2 GB of memory; however, I don’t think array size is the problem either.


#4

Ah right, you’re probably using CuArrays v0.8.1, while the CURAND improvements haven’t been tagged yet (awaiting fixes to Flux.jl by @MikeInnes). Try the master branch.
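In case it helps, one way to track the master branch is via the Pkg API (a sketch; adjust the package manager call to your Julia version):

```julia
using Pkg

# Track the development branch instead of the latest tagged release.
Pkg.add(PackageSpec(name="CuArrays", rev="master"))

# Once a new version is tagged, go back to registered releases with:
# Pkg.free("CuArrays")
```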


#5

Hi Tim,

Thanks, checking out master actually fixed the problem.

A small question: I would assume the use of @sync here is only for @btime purposes, am I right?


#6

@sync is the CuArrays equivalent of calling CUDAdrv.synchronize(), which synchronizes the GPU. That is indeed necessary for timing measurements, but you typically won’t need it in normal application code, since certain operations (like memory copies) are already synchronizing.
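To illustrate why synchronization matters for timing (a minimal sketch, assuming the same CuArrays setup as above): kernel launches are asynchronous, so without @sync you measure only the launch, not the actual GPU work.

```julia
using CuArrays, Random

ag = cu(zeros(Float32, 2^20))

# Asynchronous: returns as soon as the kernel is queued,
# so the reported time is misleadingly small.
@time randn!(ag)

# Synchronized: blocks until the GPU has actually finished,
# giving the real execution time.
@time CuArrays.@sync randn!(ag)
```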


#7

Great. Thanks again.