GPU randn way slower than rand?

gpu
cuda

#1

I’m trying to generate random normals directly on the GPU, and I noticed the following result:

using CuArrays
using Random
ac = Array{Float64}(undef, 2^20)
ag = cu(ac)
@time rand!(ac)
@time rand!(ag)
ac = Array{Float64}(undef, 2^20)
ag = cu(ac)
@time randn!(ac)
@time randn!(ag)

0.003868 seconds (4 allocations: 160 bytes)
0.000115 seconds (39 allocations: 1.469 KiB)
0.014382 seconds (4 allocations: 160 bytes)
8.251600 seconds (5.24 M allocations: 288.000 MiB, 0.69% gc time)

Why does this happen, and is there a way to improve the randn performance?
I’m using CuArrays; here’s the reference to curand in CuArrays: https://github.com/JuliaGPU/CuArrays.jl/blob/master/src/rand/highlevel.jl


#2

You shouldn’t benchmark a single call in global scope. Using BenchmarkTools.jl, and synchronizing the GPU:

julia> ac = Array{Float64}(undef, 2^20);

julia> ag = cu(ac);

julia> @btime randn!(ac);
  5.507 ms (0 allocations: 0 bytes)

julia> @btime CuArrays.@sync randn!(ag);
  83.943 μs (1 allocation: 16 bytes)

I’m also not sure your experiment makes sense; you’re benchmarking the same 2^20 elements twice with vastly different performance characteristics. Did you mean to bump the array size? But even then:

julia> ac = Array{Float64}(undef, 2^30);

julia> ag = cu(ac);

julia> @btime randn!(ac);
  5.843 s (0 allocations: 0 bytes)

julia> @btime CuArrays.@sync randn!(ag);
  72.834 ms (1 allocation: 16 bytes)

#3

Hi Tim,

Thanks for the reply.
I understand that timing once in global scope is not accurate; it’s just that the difference seemed too large to be normal.

Here are the new results following your code:

ac = Array{Float64}(undef, 2^20)
ag = cu(ac)
@btime randn!(ac)
@btime CuArrays.@sync randn!(ag)

6.788 ms (0 allocations: 0 bytes)
7.626 s (5242883 allocations: 288.00 MiB)

So the allocations are clearly the problem, but I can’t really figure out the reason.

Even with rand there are some extra allocations:

@btime rand!(ac)
@btime CuArrays.@sync rand!(ag)

957.006 μs (0 allocations: 0 bytes)
5.205 μs (38 allocations: 1.48 KiB)

Do you have any idea where the problem might be?

I’m using a GeForce 840M on my laptop with 2 GB of memory; however, I don’t think array size is the problem either.


#4

Ah right, you’re probably using CuArrays v0.8.1, while the CURAND improvements haven’t been tagged yet (awaiting fixes to Flux.jl by @MikeInnes). Try the master branch.
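In case it helps, one way to track the master branch is via the Pkg API (a sketch; adjust the package manager call to your Julia version):

```julia
using Pkg

# Track the development branch instead of the latest tagged release.
Pkg.add(PackageSpec(name="CuArrays", rev="master"))

# Once a new version is tagged, go back to registered releases with:
# Pkg.free("CuArrays")
```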


#5

Hi Tim,

Thanks, checking out master actually fixed the problem.

A small question: I would assume the use of @sync here is only for @btime purposes, am I right?


#6

@sync is the CuArrays equivalent of calling CUDAdrv.synchronize(), which synchronizes the GPU. That is indeed necessary for timing measurements, but you typically won’t need it in normal application code, since certain operations (like memory copies) are already synchronizing.
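To illustrate why synchronization matters for timing (a minimal sketch, assuming the same CuArrays setup as above): kernel launches are asynchronous, so without @sync you measure only the launch, not the actual GPU work.

```julia
using CuArrays, Random

ag = cu(zeros(Float32, 2^20))

# Asynchronous: returns as soon as the kernel is queued,
# so the reported time is misleadingly small.
@time randn!(ag)

# Synchronized: blocks until the GPU has actually finished,
# giving the real execution time.
@time CuArrays.@sync randn!(ag)
```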


#7

Great. Thanks again.