Kernel random number generation entropy / randomness issues

Edit: My theory below is wrong; I misunderstood how CUDA.jl works. I’m leaving it up so as not to mess up the conversation, but feel free to skip reading.

I think it would be helpful to know how this is supposed to work at all, @maleadt ?

I may have failed to understand how CUDA.jl works on the language level, but my very naive view would be:

In the two bad cases (the two rand_type values that misbehave), you’re calling rand(Float64).

That presumably uses the default random number generator (i.e. the task-local, CPU-based Xoshiro), since as far as I know there is no mechanism for a function like the stdlib’s Random.default_rng() to return something different depending on whether it is called on the host/CPU or on the device/GPU?
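
For context, this is what I mean on the host side (on recent Julia versions, 1.7 and later, if I recall correctly):

julia> using Random

julia> Random.default_rng()
TaskLocalRNG()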

In that case I would expect all the GPU threads to try to grab the task-local RNG state from the host/CPU, generate a random number, and write back the updated state. That would be a complete race-fest!

If that theory is wrong, it would be super helpful if you, @maleadt, could explain how this cool behavior is achieved! It is a useful technique that people might want to copy.

If my theory is right, then:

  1. You should see that in performance / device utilization (lots of tiny transfers between host and device)?
  2. Every unqualified rand() call in device context would be a bug, because it dispatches to the base/stdlib Random.default_rng() instead of CUDA.default_rng(). That is, any call that is not passed either the right RNG or a CuArray to tell the compiler/runtime that “device semantics” rather than “host semantics” should apply.
  3. Your problem should go away with something like the following:
function kernel_update_pop3!(pop, ub, lb, xB, xW, xN, rng = CUDA.default_rng())
    # 2D thread index: one thread per (individual, dimension) entry of the population.
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size(pop, 1) && j <= size(pop, 2)
        # Draw from the explicitly passed RNG instead of an unqualified rand().
        xN[i, j] = pop[i, j] + rand(rng, Float64) * (xB[j] - abs(pop[i, j])) - rand(rng, Float64) * (xW[j] - abs(pop[i, j]))

        # Clamp to the box constraints.
        if xN[i, j] > ub
            xN[i, j] = ub
        end
        if xN[i, j] < lb
            xN[i, j] = lb
        end
    end
    return nothing
end
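
For completeness, here is roughly how I would launch it. The (16, 16) tile and the explicit CUDA.default_rng() argument are my own assumptions, and whether an RNG object can be passed through to the kernel like this is exactly the part I’m unsure about:

using CUDA

threads = (16, 16)  # example tile size, not tuned
blocks  = (cld(size(pop, 1), 16), cld(size(pop, 2), 16))
@cuda threads=threads blocks=blocks kernel_update_pop3!(pop, ub, lb, xB, xW, xN, CUDA.default_rng())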

PS. Apart from that, the output distribution of Julia’s rand() is statistically different from the output distribution of the CURAND library: IIRC Julia draws Float64 values from [0, 1), while curand_uniform is documented to return values in (0.0, 1.0], i.e. excluding 0.0 and including 1.0. (I think AMD’s rocRAND follows the same convention as CURAND?)

I’m not sure what exactly CUDA.jl does. There is a very lengthy, still-active thread about that mess (it boils down to the question: how do you transform a random bit pattern, e.g. a UInt64 or UInt32, into a random float?).
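
To make concrete what I mean by that transformation, here is the textbook bits-to-float recipe in plain Julia. This is just a sketch of the two conventions, not a claim about what CUDA.jl, CURAND or rocRAND actually implement:

using Random

# Keep the top 53 bits of a random UInt64 and scale by 2^-53:
# a Float64 uniform on [0, 1), which matches Julia's rand() convention.
bits_to_float_incl_zero(x::UInt64) = Float64(x >>> 11) * 2.0^-53

# Adding one unit before scaling gives (0, 1] instead,
# which is (IIRC) the CURAND convention.
bits_to_float_excl_zero(x::UInt64) = Float64((x >>> 11) + 1) * 2.0^-53

x = rand(UInt64)
bits_to_float_incl_zero(x), bits_to_float_excl_zero(x)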

I think these statistical differences should be irrelevant for your code, even if they exist for CUDA.jl? But I don’t understand the numerical / statistical properties of your code well enough to be 100% sure.