Kernel random number generation entropy / randomness issues

Hi all!

I’m implementing some algorithms on the GPU (CUDA) that are stochastic in nature, meaning that they rely on random numbers (namely uniformly distributed random numbers in [0,1]) to produce results.

When performing controlled execution tests, replacing the random numbers with fixed values in both the CPU and GPU implementations of the algorithm, the obtained results are basically the same (I only found a rounding variation at the 12th decimal place). But when using random numbers, there is a difference of several orders of magnitude between the results of the GPU and CPU algorithms, especially when the result is a very small number. When running these types of algorithms it is expected to get different results, but on average the values should be close.

I’ve traced this issue down to the random number generation inside GPU kernels, which seems to have some entropy/randomness problems.

In fact, the issue disappears when using pre-generated random numbers, created outside the kernel in a multidimensional array and then passed to the kernel for the computation. Additionally, it makes no difference whether the pre-generation of the random numbers is done by the CPU (e.g. _rand = CuArray( rand( Float64, n_pop, n_var, 2 ) )) or directly on the GPU (e.g. _rand = CUDA.rand( Float64, n_pop, n_var, 2 )); as long as it is done outside the kernel, it works fine.

Here is some example code:

Kernel with random number generation:

function kernel_update_pop!( pop, xB, xW, xN )
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y

    if i <= size( pop, 1 ) && j <= size( pop, 2 )
        @inbounds xN[i, j] = pop[i, j] + rand( Float64 ) * ( xB[j] - abs( pop[i, j] ) ) - rand( Float64 ) * ( xW[j] - abs( pop[i, j] ) )
    end

    return nothing
end


Kernel using pre-generated random numbers (without random number generation):

_rand = CUDA.rand( Float64, n_pop, n_var, 2 )
# (...)
function kernel_update_pop2!( pop, xB, xW, xN, _rand )
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y

    if i <= size( pop, 1 ) && j <= size( pop, 2 )
        @inbounds xN[i, j] = pop[i, j] + _rand[i, j, 1] * ( xB[j] - abs( pop[i, j] ) ) - _rand[i, j, 2] * ( xW[j] - abs( pop[i, j] ) )
    end

    return nothing
end
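
For reference, a launch of this kernel might look roughly like the following sketch (illustrative sizes and a 16×16 thread block; not the exact code from the attached files):

using CUDA

n_pop, n_var = 100, 10                           # illustrative problem sizes
pop   = CUDA.rand( Float64, n_pop, n_var )
xB    = CUDA.rand( Float64, n_var )
xW    = CUDA.rand( Float64, n_var )
xN    = CUDA.zeros( Float64, n_pop, n_var )
_rand = CUDA.rand( Float64, n_pop, n_var, 2 )    # pre-generated random numbers

threads = (16, 16)
blocks  = ( cld( n_pop, 16 ), cld( n_var, 16 ) )
@cuda threads=threads blocks=blocks kernel_update_pop2!( pop, xB, xW, xN, _rand )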


I’ve found a post on the NVIDIA developer forums talking about issues with random numbers in a massively parallel algorithm (i.e. inside kernels), which says that “it would be quite likely that large groups of threads would get the same initial seed, and therefore use an identical random number sequence”. A possible solution offered was to “try to fix this problem by seeding each thread differently”, although it is mentioned that “not all random number generators give independent sequences when initialized with different seeds”. (Source)

So my questions are:

  • Is this expected behavior of the rand() function when running in parallel on many threads (additionally, is this a Julia issue or a CUDA issue)?
  • Can this be fixed without pre-generating the random numbers (as this has a performance penalty)?
  • How can I make a kernel run with a different seed per thread? (I’ve tried using CUDA.Random.seed!, but noticed that the entropy/randomness got worse - code below)
function kernel_update_pop!( pop, xB, xW, xN )
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y

    if i <= size( pop, 1 ) && j <= size( pop, 2 )
        CUDA.Random.seed!(i+j)
        @inbounds xN[i, j] = pop[i, j] + rand( Float64 ) * ( xB[j] - abs( pop[i, j] ) ) - rand( Float64 ) * ( xW[j] - abs( pop[i, j] ) )
    end

    return nothing
end

As this behavior is difficult to replicate (without this type of algorithm), I produced some test code to better illustrate the issue. Here is an output sample:

GPU -> Random num.: kernel   -> rand()           Mean result (10x): 27.395704634853313
GPU -> Random num.: kernel   -> rand() + seed    Mean result (10x): 86.46606525668491
GPU -> Random num.: pre-gen. -> CUDA.rand()      Mean result (10x): 3.5951129883451625e-16
CPU -> Random num.: 'normal' -> rand()           Mean result (10x): 3.4445293121154986e-16


Here is the source code:

cpu.jl (735 Bytes)
gpu.jl (3.0 KB)
main.jl (730 Bytes)

We have our own device-side RNG, so any comments by NVIDIA developers (which likely refer to the CURAND device-side library) are probably not relevant.

Depends on what the exact issue is. The device-side RNG is expected to be a well-behaved RNG, which should provide each thread with high-quality random numbers. I’ve verified this by passing BigCrush, but it’s possible that there are still some issues, as I’m not an expert in random number generation. So if you can reduce this to an MWE, please file an issue and I’ll have a look.

Well, apparently pre-generating the random numbers does fix your issue, so I guess yes it does? The performance depends on the characteristics of your application. Pre-generating them outside of the kernel using CUDA.rand will use CURAND, so it will be fast (in fact, faster than the rand function inside a kernel, but without the opportunity for overlapping random number generation with other work, of course).

By calling Random.seed!, as you did.

Edit: The theory below is wrong; I misunderstood how CUDA.jl works. I’m leaving it up in order to not mess up the conversation, but feel free to skip reading.

I think it would be helpful to know how this is supposed to work at all, @maleadt ?

I may have failed to understand how CUDA.jl works on the language level, but my very naive view would be:

In the two bad cases (the rand_type values that generate the random numbers inside the kernel), you’re calling rand(Float64).

This uses the default random number generator (i.e. the task-local CPU-based Xoshiro), since there is no real support for functions like the stdlib’s Random.default_rng() to produce different results depending on whether we’re on the host/CPU or the device/GPU?

In that case I would expect all the GPU threads to try to grab the local RNG state from the host/CPU, generate random numbers, and write back the updated state. But that would be a complete race-fest!

If that theory is wrong, it would be super helpful if you, @maleadt, could explain how this cool behavior is achieved! This is a useful technique that people might want to copy.

If my theory is right, then:

  1. You should see that in performance / device utilization? (lots of tiny transfers between host and device)
  2. Every unqualified rand() call in device context is a bug, because it dispatches to the base/stdlib Random.default_rng() instead of CUDA.default_rng(). I.e. a call that is not passed either the right RNG or a CuArray to tell the compiler/runtime that “device semantics” instead of “host semantics” are supposed to apply.
  3. Your problem should go away with the following:
function kernel_update_pop3!(pop, ub, lb, xB, xW, xN, rng = CUDA.default_rng())
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if i <= size(pop, 1) && j <= size(pop, 2)
        xN[i, j] = pop[i, j] + rand(rng, Float64) * (xB[j] - abs(pop[i, j])) - rand(rng, Float64) * (xW[j] - abs(pop[i, j]))

        if xN[i, j] > ub
            xN[i, j] = ub
        end
        if xN[i, j] < lb
            xN[i, j] = lb
        end
    end
    return nothing
end

PS. Apart from that, the output distribution of Julia’s rand() is statistically different from the output distribution of the cuRAND library (I think AMD’s rocRAND does the same as cuRAND?).

I’m not sure about what CUDA.jl does. There is a very lengthy active thread about that mess (it’s related to the question: how do you transform a random bit pattern (e.g. UInt64 or UInt32) into a random float).

I think these statistical differences should be irrelevant for your code, even if they exist for CUDA.jl? But I don’t understand the numerical / statistical properties of your code well enough to be 100% sure.

We use overlay method tables during GPU compilation to replace Random.default_rng() with a custom, GPU-friendly RNG: https://github.com/JuliaGPU/CUDA.jl/blob/2ae53761a6a254b98a6689ed0d39781176b245cf/src/device/random.jl#L97. Similarly, simply calling rand() in a kernel just works and uses the correct RNG.

Specifically, we use Philox2x32, a counter-based PRNG (Switch to Philox2x32 for device-side RNG by maleadt · Pull Request #882 · JuliaGPU/CUDA.jl · GitHub). The seed is passed from the host, and the counters are maintained per warp and initialized at the start of each kernel that uses the RNG (rand: seed kernels from the host. by maleadt · Pull Request #2035 · JuliaGPU/CUDA.jl · GitHub). The implementation isn’t fully generic, e.g. you can’t have multiple RNG objects, but it’s pretty close to how Random.jl works.
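
To make the counter-based idea concrete, here is a toy sketch in plain Julia (not the actual Philox2x32 implementation): every output is a pure function of a seed and a counter, so there is no shared mutable state and replaying the same (seed, counter) reproduces the same values.

# Toy counter-based generator (NOT the real Philox2x32 code): every value is a
# pure function of (seed, counter), so there is no mutable shared state.
toy_mix( seed::UInt64, ctr::UInt64 ) = hash( ctr, seed )   # stand-in for the Philox rounds

toy_stream( seed, ctr, n ) = [ toy_mix( UInt64(seed), UInt64(ctr) + UInt64(i) ) / typemax(UInt64) for i in 0:n-1 ]

# Replaying from the same (seed, counter) reproduces the same numbers, which is
# why the host only needs to track how far the counter has advanced.
@assert toy_stream( 42, 0, 4 ) == toy_stream( 42, 0, 4 )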


That is a super cool mechanism! It is so cool that I believe it deserves very prominent documentation / tutorials!

Thank you all for the valuable input, it helped me to make some progress.

I started by simplifying the algorithm even more, to use only one random number (instead of the two of the first version), but did not notice any change in behavior.

Then I conducted tests on a second system (with a different GPU), and the values obtained with the GPU code were very different from those of the first system. This should not happen!

As this could be system related, I started doing some sanity checks, namely more code debugging and different variations of the algorithm, and finally decided to do a clean install, on the first system, of the latest version of Julia (version 1.9.3, I was running 1.9.0) and reinstall all packages. Now the results obtained with the kernel using the rand() function were similar to the ones obtained with the CPU code!!!

On the second system I used a different strategy. I opted to update Julia from 1.9.0 to 1.9.3 (instead of a reinstall) and then performed a package update. I noticed that CUDA was at its latest version but some related packages were not (not sure which, but I believe that CUDA_Runtime, for instance, was not). After that, the second system started working fine as well!

So, this problem was related to package / Julia version mismatch issues?!

Although one question still remains.

  • Why does the kernel using CUDA.Random.seed! continue to produce an output that is several orders of magnitude different from the expected result (tested on both systems)?

Here is the latest output (BTW, the “baseline value” is the one from the CPU):

GPU -> Random num.: kernel   -> rand()           Mean result (10x): 4.0253631686955504e-16
GPU -> Random num.: kernel   -> rand() + seed    Mean result (10x): 84.80211895716347
GPU -> Random num.: pre-gen. -> CUDA.rand()      Mean result (10x): 3.0520335084204817e-16
CPU -> Random num.: 'normal' -> rand()           Mean result (10x): 4.1633792593771655e-16

Maybe you were just not using the latest version of CUDA.jl? There have been recent fixes to the device RNG; rand: seed kernels from the host. by maleadt · Pull Request #2035 · JuliaGPU/CUDA.jl · GitHub, for example, is only part of CUDA.jl 5.0.

You’ll have to be more specific here: what do you mean by ‘expected result’? The device RNG is a different RNG, so you should not expect it to generate the same random numbers as a CPU RNG, or as the CPU-side CUDA RNG. It should only be deterministic, and high quality.

What I mean by the expected result is an output mean value consistent with the one obtained from the CPU code. In the case of the test shown above, this would fall within the range of x.xxxxxe-16 (±2).

I believe that rephrasing my question will make it clearer what I mean.

Basically, it boils down to whether the entropy of a set of numbers generated by a single RNG should be the same as that of a set of the same size generated using different seeds (i.e., different RNG streams)?

Perhaps this is an expected result (I’m not an expert on RNGs), but the test code I’ve provided utilizes a stochastic algorithm that serves as a good test subject for this scenario. Although it depends on random numbers to function, it is built in such a way that it tends to produce a somewhat predictable result, on average. In this case, something in the range of x.xxxxxe-16 (±2).


PS
While writing this, I also tested reseeding the RNG for every calculation in the CPU code, and I observed the same behavior as in the GPU, meaning that the result was not the expected x.xxxxxe-16 (±2).

So, perhaps this is the correct behavior after all?

GPU -> Random num.: kernel   -> rand()           Mean result (10x): 3.109238670442507e-16
GPU -> Random num.: kernel   -> rand() + seed    Mean result (10x): 91.88475583745569
GPU -> Random num.: pre-gen. -> CUDA.rand()      Mean result (10x): 3.964858687751397e-16

CPU -> Random num.: 'normal' -> rand()           Mean result (10x): 3.1654218159320164e-16
CPU -> Random num.: 'normal' -> rand() + seed    Mean result (10x): 86.88936142259378

This looks like a misunderstanding of how to seed an RNG. You are setting the seed to the same index-dependent value every time you invoke the kernel, which means you’re not generating random numbers, you’re producing the same numbers over and over (or you’re doing something less predictable, it’s not entirely clear to me if setting a different seed per thread like you’re attempting here is well-defined for the device RNG). I think this explains the unexpected results you’re seeing in your rand() + seed experiments.
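
As a toy CPU analogue of that effect (just a sketch to illustrate the point, not code from the attached files):

using Random

# Toy illustration: if each "thread" i re-seeds with seed i before drawing, then
# every outer iteration produces exactly the same numbers again.
draws_one_iteration() = [ ( Random.seed!(i); rand() ) for i in 1:8 ]

@assert draws_one_iteration() == draws_one_iteration()   # identical on every call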

If you want to seed the RNG manually, you should do it once at the start of your computation. Perhaps @maleadt could provide more clarity about the right way to do this? Does CUDA.default_rng() take a single global seed or are there independent streams with independent seeds, say, one per thread in a warp?


Just a brief comment on these tangents:

  • cuRAND and rocRAND both return samples in (0, 1], but do slightly different things: rocRAND returns (1 + v) / 2^{32} where v is a random UInt32, implicitly rounded, while cuRAND returns something like (\frac{1}{2} + v) / 2^{32} (it’s not open source, but you can verify the sample space with a quick experiment; see the small numerical sketch after this list).
  • As I tried to emphasize in the other thread, CUDA.jl’s RNG relies on the conversion functions in Base to map UInts to floats. This is one reason GPU compatibility could be worth having in mind when discussing whether to use UInt32 or UInt64 as raw data for a random Float32. But let’s not continue that discussion here.
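
Here is the small numerical sketch mentioned above; it just mirrors those two formulas for a raw UInt32 value v and is not the vendor code:

v = rand( UInt32 )
rocrand_style = Float32( (1 + UInt64(v)) / 2^32 )   # lands in (0, 1], rounded to Float32
curand_style  = Float32( (0.5 + v) / 2^32 )         # also in (0, 1]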

OK, this may sound silly, but on some Linux/driver/GPU combos, the GPU/driver state gets messed up by suspend-to-RAM, in a way that affects CUDA but doesn’t affect most graphics, and may or may not be solved by reloading the kernel module.

Some people (ahem) don’t reboot their machines very often and instead suspend. This issue is super annoying for me. If more people than myself suffer from that, maybe that should go into the FAQ.

So… have you tried turning it off and on again?


Ah yes, nicely spotted; I hadn’t paid enough attention to the code. This is invalid indeed; if you want to seed the RNG, you need to do so once at the start of your kernel, at least once per warp: https://github.com/JuliaGPU/CUDA.jl/blob/db854fc1f2b5d1e86fdb1d12ad0fe6ae806c9afd/src/device/random.jl#L99-L104. In this case, you’re effectively overwriting the warp-shared state with different seeds per lane, which is undefined behavior.


We actually used to see something similar with the CUDA.jl device-side RNG, because we were relying on the fact that shared memory is zero when uninitialized (i.e. at the start of a kernel). That apparently didn’t hold on all platforms, and seemed to depend on the state of the hardware. We are not doing that any longer, and the latest version of CUDA.jl injects code that initializes the RNG at the start of the kernel: rand: seed kernels from the host. by maleadt · Pull Request #2035 · JuliaGPU/CUDA.jl · GitHub


I don’t think it would be correct to do any kind of seeding within the kernel that executes the steps of the algorithm. If you look at the files attached in the OP, you’ll see that the kernel is invoked repeatedly in a loop. Resetting the seed for every iteration makes the sampling meaningless. Any manual seeding should be done outside the loop.

Are you supposed to use the same seed for every warp or can they be different? Is there a simple way to seed the RNG from the host side?

In that case, inspiration can be drawn from CUDA.jl’s host-side RNG (which we use when we can’t use CURAND): https://github.com/JuliaGPU/CUDA.jl/blob/db854fc1f2b5d1e86fdb1d12ad0fe6ae806c9afd/src/random.jl#L19-L81. It’s built on top of a kernel that invokes the device-side RNG, passing the seed and counter from the host into the kernel, making it possible (and deterministic) to invoke the kernel multiple times. It’s based on the notion that the device-side RNG is counter-based, so the host-side counter just needs to be incremented by the number of values that were requested during execution of the kernel.
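
As a rough host-side sketch of that pattern (assuming CUDA.default_rng() returns the native CUDA.RNG from the linked source and that it supports the standard Random.seed! / rand! interface):

using CUDA, Random

# Sketch: seed the native host-side RNG once; its internal counter advances with
# every request, so successive fills continue one deterministic stream.
rng = CUDA.default_rng()
Random.seed!( rng, 1234 )

a = CuArray{Float32}( undef, 1024 )
rand!( rng, a )   # first batch of the stream
b = CuArray{Float32}( undef, 1024 )
rand!( rng, b )   # continues the same stream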

I think it can be different per warp, but IIRC (it’s been a while since I wrote that code) the idea was to use a single seed for all warps, as we offset it using a counter that’s based on the global ID of the thread. That’s also what happens by default: a single seed is passed from the host and applied from every thread.

Do I understand correctly that if I invoke multiple kernels that call rand but don’t touch the seed/counter, there will be no seed/stream “coherence” across kernel invocations? So if I want a deterministic stream of numbers, I have to set the seed and update the counter at every kernel invocation, like the native host-side RNG is doing?

Yeah, I guess that’s what rand: seed kernels from the host. by maleadt · Pull Request #2035 · JuliaGPU/CUDA.jl · GitHub implies

That’s correct, or at least that’s the intended design.

Yeah. Having two numbers like that isn’t particularly satisfying; maybe we should pack a UInt32 seed and a UInt32 counter so that it’s possible to extract a UInt64 “seed” that behaves like the CPU RNGs do. But I didn’t imagine people wanting multiple kernel invocations to “resume” the stream of random numbers; I only imagined that a single invocation would have to be deterministic wrt. previous invocations (hence the counter defaults to 0 when calling Random.seed!).
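
A tiny sketch of that packing idea (purely hypothetical helpers, not an existing CUDA.jl API):

# Hypothetical packing of a UInt32 seed and a UInt32 counter into one UInt64
# "seed", as floated above; not part of CUDA.jl.
pack( seed::UInt32, counter::UInt32 ) = (UInt64(seed) << 32) | counter
unpack( x::UInt64 ) = ( UInt32(x >> 32), UInt32(x & 0xffffffff) )

@assert unpack( pack( UInt32(0xdead), UInt32(7) ) ) == ( UInt32(0xdead), UInt32(7) )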