Hi all!
I’m implementing some algorithms on the GPU (CUDA) that are stochastic in nature, meaning that they rely on random numbers (namely uniformly distributed random numbers in [0,1]) to achieve their results.
When performing controlled execution tests, by replacing the random numbers with fixed values in both the CPU and GPU implementations of the algorithm, the obtained results are basically the same (I only found a rounding variation at the 12th decimal place). But when using random numbers, there is a difference of several orders of magnitude between the results of the GPU and CPU algorithms, especially when the result is a very small number. With these kinds of algorithms some variation in the results is expected, but on average the values should be close.
I’ve traced this issue down to the random number generation inside GPU kernels, which seems to have some entropy/randomness problems.
In fact, the issue disappears when using pre-generated random numbers, created outside the kernel in a multidimensional array and then passed to the kernel for the computation. Additionally, it makes no difference whether the pre-generation of the random numbers is done on the CPU (e.g. _rand = CuArray( rand( Float64, n_pop, n_var, 2 ) )) or directly on the GPU (e.g. _rand = CUDA.rand( Float64, n_pop, n_var, 2 )); as long as it is done outside the kernel, it works fine.
Here is some example code:
Kernel with random number generation:
function kernel_update_pop!( pop, xB, xW, xN )
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size( pop, 1 ) && j <= size( pop, 2 )
        @inbounds xN[i, j] = pop[i, j] + rand( Float64 ) * ( xB[j] - abs( pop[i, j] ) ) - rand( Float64 ) * ( xW[j] - abs( pop[i, j] ) )
    end
    return nothing
end
Kernel using pre-generated random numbers (without random number generation):
_rand = CUDA.rand( Float64, n_pop, n_var, 2 )
# (...)
function kernel_update_pop2!( pop, xB, xW, xN, _rand )
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size( pop, 1 ) && j <= size( pop, 2 )
        @inbounds xN[i, j] = pop[i, j] + _rand[i, j, 1] * ( xB[j] - abs( pop[i, j] ) ) - _rand[i, j, 2] * ( xW[j] - abs( pop[i, j] ) )
    end
    return nothing
end
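For completeness, both kernels are launched the same way; here is a minimal launch sketch (the array sizes, the 16x16 thread-block configuration and the use of CUDA.@sync are just placeholders for this example, not my actual setup):

using CUDA

n_pop, n_var = 256, 8                                   # placeholder sizes
pop   = CUDA.rand( Float64, n_pop, n_var )
xB    = CUDA.rand( Float64, n_var )
xW    = CUDA.rand( Float64, n_var )
xN    = CUDA.zeros( Float64, n_pop, n_var )
_rand = CUDA.rand( Float64, n_pop, n_var, 2 )           # pre-generated random numbers

threads = (16, 16)
blocks  = ( cld( n_pop, threads[1] ), cld( n_var, threads[2] ) )
CUDA.@sync @cuda threads=threads blocks=blocks kernel_update_pop2!( pop, xB, xW, xN, _rand )

The first kernel (kernel_update_pop!) is launched with the same configuration, just without the _rand argument.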
I’ve found a post on the NVIDIA developer forum about issues with random numbers in a massively parallel algorithm (i.e. inside kernels), which notes that “it would be quite likely that large groups of threads would get the same initial seed, and therefore use an identical random number sequence”. A possible solution offered was to “try to fix this problem by seeding each thread differently”, although it is mentioned that “not all random number generators give independent sequences when initialized with different seeds”. (Source)
So my questions are:
- Is this an expected behavior of the rand() function when running in parallel on many threads (additionally, is this a Julia issue or a CUDA issue)?
- Can this be fixed without pre-generating the random numbers (since pre-generation has a performance penalty)?
- How can I make a kernel run with a different seed per thread? (I’ve tried using CUDA.Random.seed!, but noticed that the entropy/randomness got worse; code below.)
function kernel_update_pop!( pop, xB, xW, xN )
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size( pop, 1 ) && j <= size( pop, 2 )
        CUDA.Random.seed!(i+j)
        @inbounds xN[i, j] = pop[i, j] + rand( Float64 ) * ( xB[j] - abs( pop[i, j] ) ) - rand( Float64 ) * ( xW[j] - abs( pop[i, j] ) )
    end
    return nothing
end
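One thing I noticed afterwards: i+j is not unique per thread (e.g. threads (1,2) and (2,1) both get seed 3), so many threads still end up seeding with the same value. A minimal variation that at least gives each thread a distinct seed, using a linear index, would look like the sketch below (the kernel name and the base_seed argument are just made up for this example, and I’m not sure that seeding per thread like this is even supported/correct in CUDA.jl):

function kernel_update_pop_seeded!( pop, xB, xW, xN, base_seed )
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size( pop, 1 ) && j <= size( pop, 2 )
        # unique linear index per (i, j) pair, combined with a host-provided base seed
        lin = (j - 1) * size( pop, 1 ) + i
        CUDA.Random.seed!( base_seed + lin )
        @inbounds xN[i, j] = pop[i, j] + rand( Float64 ) * ( xB[j] - abs( pop[i, j] ) ) - rand( Float64 ) * ( xW[j] - abs( pop[i, j] ) )
    end
    return nothing
end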
As this behavior is difficult to replicate without this type of algorithm, I put together some test code to better exemplify the issue. Here is an output sample:
GPU -> Random num.: kernel   -> rand()          Mean result (10x): 27.395704634853313
GPU -> Random num.: kernel   -> rand() + seed   Mean result (10x): 86.46606525668491
GPU -> Random num.: pre-gen. -> CUDA.rand()     Mean result (10x): 3.5951129883451625e-16
CPU -> Random num.: 'normal' -> rand()          Mean result (10x): 3.4445293121154986e-16
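For reference, the CPU version in the last line of the table does essentially the same update in a plain nested loop, roughly like this (a sketch of the idea, not the attached cpu.jl verbatim; the function name is just for illustration):

function update_pop_cpu!( pop, xB, xW, xN )
    for j in 1:size( pop, 2 ), i in 1:size( pop, 1 )
        # same update rule as the GPU kernels, using Julia's default (CPU) RNG
        xN[i, j] = pop[i, j] + rand( Float64 ) * ( xB[j] - abs( pop[i, j] ) ) - rand( Float64 ) * ( xW[j] - abs( pop[i, j] ) )
    end
    return nothing
end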
Here is the source code:
cpu.jl (735 Bytes)
gpu.jl (3.0 KB)
main.jl (730 Bytes)