Hi all,
I’m implementing some optimization algorithms on the GPU (CUDA) that rely on, among other things, random numbers to drive the exploration of the search space. I’m also comparing the results against the CPU implementation of the same algorithm to check the algorithm’s convergence and the speedup.
The thing is that I was getting a huge difference in the convergence (towards 0.0) between the CPU and GPU implementations. On the CPU it hovers around 1.4e-19 to 1.4e-20, whereas on the GPU it’s around 2357.4 to 3462.1 on my GTX 970M, and about 20.1 to 30.2 on a friend’s RTX 3090…
I’ve tested for synchronization issues (inside and outside the kernels), race conditions, variable types, variations on the implementations of the Julia functions in the GPU kernels, large and small datasets to check for sync issues between GPU blocks, etc., and found nothing…
After a long comparison and debugging session I found out that the issue lay in the rand() function in the GPU kernel, which for some reason was not as “random” as on the CPU! Here are some code snippets of the main instruction (the one that drives the algorithm’s convergence) to try to make sense of things:
## CPU code ##
# pop is a vector of vectors (main vector is of n_pop size and each sub vector is of prob.num_var size)
if LES
    xNew = pop[p] + rand( Float64, prob.num_var ) .* ( Pu - pop[p] ) - rand( Float64, prob.num_var ) .* ( Pl - pop[p] )
else
    xNew = pop[p] + randn( Float64 ) .* ( oldPop[p] - pop[p] )
end
## GPU kernel code ##
# pop is a matrix of n_pop x prob.num_var
if i <= size( data.pop, 1 ) && j <= size( data.pop, 2 )
    @inbounds if LES[i] == 1
        @inbounds xNew[i,j] = data.pop[i,j] + rand( Float64 ) * ( Pu[j] - data.pop[i,j] ) - rand( Float64 ) * ( Pl[j] - data.pop[i,j] )
    else
        @inbounds xNew[i,j] = data.pop[i,j] + randn( Float64 ) * ( oldPop[i,j] - data.pop[i,j] )
    end
end
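To isolate the device-side rand(), a minimal stand-alone check like the sketch below could help (just showing the idea; fill_rand!, the launch configuration and the sizes are placeholders): fill a buffer from inside a kernel and compare its sample statistics against CPU rand(). For uniform numbers on [0, 1) both should give a mean of about 0.5 and a variance of about 1/12.

using CUDA, Statistics

function fill_rand!( out )
    # 1-based global thread index
    i = ( blockIdx().x - 1 ) * blockDim().x + threadIdx().x
    if i <= length( out )
        @inbounds out[i] = rand( Float64 )   # same device-side call as in the kernel above
    end
    return nothing
end

n = 1_000_000
d_out = CUDA.zeros( Float64, n )
@cuda threads=256 blocks=cld( n, 256 ) fill_rand!( d_out )
h_out = Array( d_out )

println( "GPU mean = ", mean( h_out ), ", var = ", var( h_out ) )
println( "CPU mean = ", mean( rand( Float64, n ) ), ", var = ", var( rand( Float64, n ) ) )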
The solution I found to get both versions (CPU and GPU) to converge similarly was to first create the necessary random numbers on the CPU, move them to the GPU, and then use them inside the kernel. Like this:
# outside the GPU kernel
_rand = CuArray( rand( Float64, n_pop, prob.num_var, 2 ) )
_randn = CuArray( randn( Float64, n_pop ) )
###########
# inside the GPU kernel
if i <= size( data.pop, 1 ) && j <= size( data.pop, 2 )
    @inbounds if LES[i] == 1
        @inbounds xNew[i,j] = data.pop[i,j] + _rand[i,j,1] * ( Pu[j] - data.pop[i,j] ) - _rand[i,j,2] * ( Pl[j] - data.pop[i,j] )
    else
        @inbounds xNew[i,j] = data.pop[i,j] + _randn[i] * ( oldPop[i,j] - data.pop[i,j] )
    end
end
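For reference, the same workaround can probably also be done without the CPU-to-GPU copy by generating the arrays directly on the device with CUDA.rand / CUDA.randn (a sketch, assuming n_pop and prob.num_var are defined as above; I haven’t checked whether it changes the convergence):

# outside the GPU kernel
_rand = CUDA.rand( Float64, n_pop, prob.num_var, 2 )   # uniform numbers generated on the device
_randn = CUDA.randn( Float64, n_pop )                  # standard-normal numbers generated on the device
# the kernel body stays exactly as above; only the host-to-device copy goes away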
Note that I’m not using random number seeding.
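(If it matters, seeding both generators for a reproducible comparison would look roughly like this; I’m not doing it in the runs above:)

using Random, CUDA
Random.seed!( 1234 )   # seeds the CPU default RNG
CUDA.seed!( 1234 )     # seeds the CUDA.jl GPU RNG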
My question is whether this is expected behavior (I’m not an experienced Julia or CUDA programmer) or some other kind of issue (something related to GPU parallelization, how random numbers are generated inside kernels, a bug, etc.)?
Thank you in advance.