Why is GPU kernel rand() not as "random" as CPU rand()?

Hi all,

I’m implementing some optimization algorithms on the GPU (CUDA) that rely on, among other things, random numbers to drive the exploration of the search space. I’m also comparing the results with the CPU implementation of the same algorithm in order to check the algorithm’s convergence and speedup.

The thing is that I was getting a huge difference in the convergence (to 0.0) between the CPU and GPU implementations. On the CPU it hovers around 1.4e-19 to 1.4e-20, whereas on the GPU it’s around 2357.4 to 3462.1 on my GTX 970M, and about 20.1 to 30.2 on a friend’s RTX 3090…

I’ve tested for synchronization issues (inside and outside the kernels), race conditions, variable types, variations on how the Julia functions are implemented in the GPU kernels, large and small datasets to check for sync issues between GPU blocks, etc., and found nothing…

After a long comparison and debugging session I found that the issue lay in the rand() function in the GPU kernel, which for some reason was not as “random” as on the CPU! Here are some code snippets of the main instruction (the one that drives the algorithm’s convergence) to try to make sense of things:

## CPU code ##
# pop is a vector of vectors (main vector is of n_pop size and each sub vector is of prob.num_var size)
if LES
    xNew = pop[p] + rand( Float64, prob.num_var ) .* ( Pu - pop[p] ) - rand( Float64, prob.num_var ) .* ( Pl - pop[p] )
else
    xNew = pop[p] + randn( Float64 ) .* ( oldPop[p] - pop[p] )
end
## GPU kernel code ##
# pop is a matrix of n_pop x prob.num_var
if i <= size( data.pop, 1 ) && j <= size( data.pop, 2 )
    @inbounds if LES[i] == 1
        @inbounds xNew[i,j] = data.pop[i,j] + rand( Float64 ) * ( Pu[j] - data.pop[i,j] ) - rand( Float64 ) * ( Pl[j] - data.pop[i,j] )
    else
        @inbounds xNew[i,j] = data.pop[i,j] + randn( Float64 ) * ( oldPop[i,j] - data.pop[i,j] )
    end
end
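
(Here i and j are the usual 2D global thread indices inside the kernel, computed roughly like this — the surrounding kernel code is omitted from the snippet above:)

# typical 2D index computation inside a CUDA.jl kernel (illustrative sketch)
i = ( blockIdx().x - 1 ) * blockDim().x + threadIdx().x
j = ( blockIdx().y - 1 ) * blockDim().y + threadIdx().y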

The solution I found to get both implementations (CPU and GPU) to converge similarly was to first generate the necessary random numbers on the CPU, transfer them to the GPU, and then use them inside the kernel. Like this:

# outside the GPU kernel
_rand = CuArray( rand( Float64, n_pop, prob.num_var, 2 ) )
_randn = CuArray( randn( Float64, n_pop ) )

###########

# inside the GPU kernel
if i <= size( data.pop, 1 ) && j <= size( data.pop, 2 )
    @inbounds if LES[i] == 1
        @inbounds xNew[i,j] = data.pop[i,j] + _rand[i,j,1] * ( Pu[j] - data.pop[i,j] ) - _rand[i,j,2] * ( Pl[j] - data.pop[i,j] )
    else
        @inbounds xNew[i,j] = data.pop[i,j] + _randn[i] * ( oldPop[i,j] - data.pop[i,j] )
    end
end
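
(For completeness, the pre-generated arrays are passed to the kernel as extra arguments. The kernel name and launch configuration below are just placeholders to show the idea, not my actual code.)

# illustrative launch; my_kernel! is a placeholder name for the actual kernel
threads = ( 16, 16 )
blocks  = ( cld( n_pop, threads[1] ), cld( prob.num_var, threads[2] ) )
@cuda threads=threads blocks=blocks my_kernel!( xNew, data, oldPop, Pu, Pl, LES, _rand, _randn )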

Note that I’m not using random number seeding.

My question is whether this is expected behavior (as I’m not an experienced Julia or CUDA programmer) or some other type of issue (something related to GPU parallelization, how random numbers are generated in the kernels, a bug, etc.)?

Thank you in advance.

What do you mean by “less random”?

Sorry for my poor explanation; I’ll try to make it clearer.

The type of algorithm I’m using is a stochastic optimization algorithm, meaning it uses random values in its formulation in order to optimize (maximize or minimize) a problem (a function). It is a population-based method that repeatedly modifies the population members (the candidate solutions) in order to solve the optimization problem.

Although these algorithms use random values in the optimization process, when using the same parameters, such as the same population size (n_pop), number of decision variables (num_var), number of iterations, etc., the end results should on average be similar (at least close to each other).

The “less random” statement comes from the fact that each time I run the CPU implementation, the result varies somewhere between 1.4e-19 and 1.4e-20, while the GPU implementation varies from 2357.4 to 3462.1 on a GTX 970M, and from 20.1 to 30.2 on an RTX 3090 (running the same code, with the same parameters, on both GPUs).

The way I found to solve this issue was to not use the rand() (and randn()) functions inside the kernel; instead, in each iteration, I first generate the required random numbers on the CPU (see my last code snippet) and then use them in the computation inside the GPU kernel.

After this change, I started getting the same results on both GPUs, in line with the results from the CPU (somewhere around 1.4e-19 to 1.4e-20). So it looks like the rand() function in the kernel (at least in the way I’m using it) is somehow not providing the same level of “randomness” as in the CPU code.

Could it be some artifact of having many GPU cores doing (two) rand() calls at the same time? My current test uses a 500×30 matrix and I’m parallelizing over both dimensions in the GPU kernel.

Theories:

  • Could there be some type of race condition problem (is there something like a random number iterator)?
  • Maybe the random number generator uses time as a variable to generate the output and there could be some type of overlapping?
  • Or (most likely…) could this just be the result of another bad CUDA kernel implementation?

There does seem to be a problem; it’s just not clear where the problem is coming from. You’re not saying what is converging or showing how you’re calculating it, let alone how that implies the GPU’s RNG is less random. A more straightforward demonstration of that hypothesis would omit the unexplained variables like LES and pop, and would instead put random sequences from CUDA.rand and Base.rand through spectral tests or other kinds of randomness tests.
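
For example, a very rough first check (nowhere near a proper spectral test) is to draw large samples from both generators and compare simple statistics. The sketch below assumes CUDA.jl is installed and uses the array-level CUDA.rand rather than the in-kernel rand():

using CUDA, Statistics

n = 10^7
gpu_sample = Array( CUDA.rand( Float64, n ) )   # generated on the GPU, copied to the host
cpu_sample = rand( Float64, n )                 # base Julia RNG on the CPU

# a uniform distribution on [0, 1) has mean 0.5 and std 1/sqrt(12) ≈ 0.2887
for (name, x) in ( ("GPU", gpu_sample), ("CPU", cpu_sample) )
    println( name, ": mean = ", mean(x), ", std = ", std(x) )
end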

1 Like

Side note, something that could be of relevance that I did not mention: in the CPU code I calculate the xNew variable (a vector of length prob.num_var) while looping over each population member, whereas in the GPU code the xNew variable is a matrix holding all of the population’s “xNew” values, meaning it is computed all at once. I don’t think this has any influence on the issue, but I thought it could be important to mention.

About your reply, I did test both the CPU and GPU algorithms extensively, even removing any possible variation in the flow of the computation and simplifying the calculation.

Basically, I can remove every possible variation in the computation (like the LES variable), use a fixed population for both the CPU and GPU computations (instead of a randomly generated one), drop every other conditional statement, and leave just the main algorithm line. In the CPU case:

xNew = pop[p] + rand( Float64, prob.num_var ) .* ( Pu - pop[p] ) - rand( Float64, prob.num_var ) .* ( Pl - pop[p] )

and in the case of the GPU kernel:

@inbounds xNew[i,j] = data.pop[i,j] + rand( Float64 ) * ( Pu[j] - data.pop[i,j] ) - rand( Float64 ) * ( Pl[j] - data.pop[i,j] )

and I still get the same difference in convergence between the CPU and GPU. Again, pre-generating the random numbers on the CPU:

_rand = CuArray( rand( Float64, n_pop, prob.num_var, 2 ) )
_randn = CuArray( randn( Float64, n_pop ) )

and using them inside the GPU kernel:

@inbounds xNew[i,j] = data.pop[i,j] + _rand[i,j,1] * ( Pu[j] - data.pop[i,j] ) - _rand[i,j,2] * ( Pl[j] - data.pop[i,j] )

resolves the issue!

BTW, I’ve changed the pre-generated random values from the CPU:

_rand = CuArray( rand( Float64, n_pop, prob.num_var, 2 ) )
_randn = CuArray( randn( Float64, n_pop ) )

to the GPU:

_rand = CUDA.rand( Float64, n_pop, prob.num_var, 2 )
_randn = CUDA.randn( Float64, n_pop )

and got the same result, meaning it works fine: about the same convergence in the GPU code as in the CPU implementation.

Again, the issue only appears when using the rand() function inside the CUDA kernel…

1 Like

It’s hard to compare the CPU and GPU code because of the different kinds of values assigned to synonymous variables, e.g. the CPU’s xNew is a freshly constructed vector while the GPU’s xNew is a mutated matrix. Without the surrounding code, it’s not possible to say whether they’re really equivalent.

However, your GPU’s original code and the version using the CPU’s rand are actually different. Let me highlight the changes:

# version 1
    @inbounds xNew[i,j] = ... rand( Float64 ) ... rand( Float64 ) ...
  else
    @inbounds xNew[i,j] = ... randn( Float64 ) ... # every [i,j]!!
# version 2
_rand = CuArray( rand( Float64, n_pop, prob.num_var, 2 ) )
_randn = CuArray( randn( Float64, n_pop ) ) # NOT n_pop x prob.num_var??
...
    @inbounds xNew[i,j] = ... _rand[i,j,1] ... _rand[i,j,2] ...
  else
    @inbounds xNew[i,j] = ... _randn[i] ... # NOT _randn[i,j]??

The _rand part seems fine: you have a 3rd dimension to accommodate the 2 different calls, which is clever. The problem is the _randn part. In the first version, the one that gives you problems, you’re sampling randn for every i and j. In the second version, you’re only sampling randn for every i. The second version seems more similar to your CPU version, where you sample a single randn( Float64 ) versus rand( Float64, prob.num_var ) along xNew of length prob.num_var. Is it possible that sampling randn more often, along the additional axis j, is hindering the convergence?
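
If that turns out to be the culprit, one possible middle ground is to keep the per-element rand() calls in the kernel and only pre-generate the per-row randn values. This is just a sketch under assumptions about your kernel signature and launch (which you haven’t shown):

# illustrative sketch, not your actual code: the kernel keeps per-element uniforms
# but takes one pre-generated normal draw per population member (per row)
function step_kernel!( xNew, pop, oldPop, Pu, Pl, LES, row_randn )
    i = ( blockIdx().x - 1 ) * blockDim().x + threadIdx().x
    j = ( blockIdx().y - 1 ) * blockDim().y + threadIdx().y
    if i <= size( pop, 1 ) && j <= size( pop, 2 )
        @inbounds if LES[i] == 1
            # fresh uniforms for every (i, j) are fine here
            xNew[i,j] = pop[i,j] + rand( Float64 ) * ( Pu[j] - pop[i,j] ) - rand( Float64 ) * ( Pl[j] - pop[i,j] )
        else
            # one normal draw shared by the whole row i
            xNew[i,j] = pop[i,j] + row_randn[i] * ( oldPop[i,j] - pop[i,j] )
        end
    end
    return nothing
end

# per iteration, one normal draw per population member:
# row_randn = CUDA.randn( Float64, n_pop )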

The relatively lower entropy of the CUDA RNG was discussed here: Overflow in `randn` using CUDA.jl's native RNG · Issue #1464 · JuliaGPU/CUDA.jl · GitHub. The difference from the base Julia RNG is only one bit, but one takeaway was that the CURAND RNG has more entropy than both. What happens if you replace rand(Float64, ...) with rand(CUDA.CURAND.default_rng(), Float64, ...) and similarly for randn?

Edit: To be clear, the difference I’m talking about here is how much entropy is discarded in the map from random bits to random floats. It’s not a statement about the joint distribution of the raw random bits emitted by the different RNGs.

1 Like

FWIW, the kernel rand() passes BigCrush last I checked, so normally the quality of generated numbers should be pretty good.

3 Likes

Touché…

Indeed, the randn value should stay the same for every j (across the prob.num_var dimension), meaning it should only be drawn once per population member (once per i)…

In my defense, what threw me off was the fact that I was getting two different, but consistent, results when running the same code on two different GPUs!

Guess that in the end, my 3rd possible cause for this issue was the correct one: “just the result of another bad CUDA kernel implementation”… :man_facepalming:

I’m still seeing some variation between the convergence of the CPU and GPU results, but it’s not as significant; it must be something related to the relatively lower entropy of the CUDA RNG mentioned by @danielwe.

Thanks everyone for your inputs!

Just a quick follow-up to set things straight: the missing bit of entropy in CUDA.jl’s random floats is due to an unfortunate fallback in base Julia; it has nothing to do with CUDA.jl’s own code. The same issue appears with, for example, StableRNGs.jl. See Fallback `rand(Float32/64)` discards more entropy than necessary · Issue #44887 · JuliaLang/julia · GitHub.

If you still think that random float entropy affects your convergence, it would be really interesting to try the CURAND rng: rand(CUDA.CURAND.default_rng(), Float64, ...).
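
Something along these lines (just a sketch; it assumes you keep generating the arrays once per iteration outside the kernel, as in your current workaround):

using CUDA

# use the CURAND generator instead of the default one for the per-iteration arrays
rng = CUDA.CURAND.default_rng()
_rand  = rand( rng, Float64, n_pop, prob.num_var, 2 )   # uniform draws via CURAND
_randn = randn( rng, Float64, n_pop )                    # normal draws via CURAND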