Device RNG - Passes BigCrush

MrBelette · October 6, 2024, 3:41pm

I resurrected the SharedTauswortheGenerator device RNG from CUDA.jl v3.0.0 and, after some adjustment, it appears to pass the TestU01 BigCrush battery of tests!
It would be nice if this can be independently tested.

The RNG is the LFRS113 generator from L’Ecuyer and is known to only fail the linear complexity tests of BigCrush on a CPU, so it should be ok for most purposes except cryptographic and Quasi-MC work (if I understand things correctly). It is theoretically superior to the LFRS88 generator used in the Combined Tausworthe generator shown in CPU Gems 3, due to a higher proportion of non-zero coefficients of its characteristic polynomial.

The CUDA.jl implementation unfortunately fails even the SmallCrush test and I guess it’s due to not holding the full state for every thread.
I thought I’d take a leaf out of the CPU Gems implementation and try to mix in some additional state using LCG/MCGs and, surprisingly, that seems to be enough to get the RNG to pass all of BigCrush. The test took just over 6.5 hours on an RTX 4090.

Changes:

Add additional per warp state using the CPU Gems LCG

        # mix-in the warp id to ensure unique values across blocks.
        # we have max 1024 threads per block, so can safely shift by 16 bits.
        # XXX: see comment in `initial_state`  
        z = xorshift(z ⊻ (warpId << UInt32(16)))
        for i=1:warpId
            z = UInt32(166425) * z + UInt32(1013904223)
        end
        sync_threads()

Use 3 good MCG multipliers from another L’Ecuyer paper (Tables of Linear Congruential Generators of Different Sizes and Good Lattice Structure) to mix up state during the warp shuffle phase. If I’ve understood the code correctly, by using 3 multipliers I believe that different threads will get different state.

# generate based on 4 bytes of state
        mask = active_mask()
        i1 = pow2_mod1(threadId+UInt32(1), UInt32(32))
        i2 = pow2_mod1(threadId+UInt32(2), UInt32(32))
        i3 = pow2_mod1(threadId+UInt32(3), UInt32(32))
        if active_mask() == typemax(UInt32)
            # we have a full warp, so we can safely shuffle.
            # get the warp-local states from neighbouring threads.
            s1 = shfl_sync(FULL_MASK, state, i1) * UInt32(741103597)
            s2 = shfl_sync(FULL_MASK, state, i2) * UInt32(1597334677)
            s3 = shfl_sync(FULL_MASK, state, i3) * UInt32(747796405)
        else
            # can't shuffle, so fall back to fetching block-global state.
            # this results is less entropy, as it reuses data from the first warp.
            s1 = rng.state[i1] * UInt32(741103597)
            s2 = rng.state[i2] * UInt32(1597334677)
            s3 = rng.state[i3] * UInt32(747796405)
        end
        state ⊻ s1 ⊻ s2 ⊻ s3

Compared to Philox2x32, the amended SharedTausworthe generator uses half the amount of shared memory (128 bytes) but twice the number of registers (32). Generation speed is about the same.

I’m not an expert on RNGs, so I can’t say that there aren’t potential flaws with this approach.
I thought I’d bring it to the attention of this community because of the desirability of having an on-the-fly, reduced state device RNG that can seemingly pass the BigCrush test.

maleadt · October 7, 2024, 12:02pm

I’m not too familiar with RNG design or implementation, so maybe I’m missing something, but what’s the issue with the current Philox-based device RNG? That one also passes BigCrush, last I tested.

MrBelette · October 8, 2024, 1:16pm

Well, I’m a bit embarrassed here. I thought I saw a clear fail using philox2x32, but it was probably due to my hacking around and not resetting my Julia environment before doing the testing. I also misled myself into thinking that it only passed SmallCrush, due to posts I saw and the comment in device\random.jl explaining why the 7 round version was chosen. Anyway, sorry about that!

I re-ran philox2x32 through BigCrush and I didn’t see and fails or suspicious values.
However, I believe that the rng produces 64 bit values (I’m not sure if the rand values always get converted to UInt32) and TestU01 only checks the lower 32 bits.
I was wondering if you had checked the upper 32 bits as well?
I have observed a suspicious p-value (0.99959) for a Knuth Permutation test when I swap ctr1 and ctr2 in rand.
I also observed another in the Gap test when I reversed the lower 32 bits.
I don’t think that this is something to be particularly concerned about, but I’m wondering if it would be worthwhile to test the 10 round version similarly and, if it passes, offer that as an option for users requiring 64-bit precision?
I’ll take a look when I get time.

maleadt · October 9, 2024, 8:38am

Ah interesting; because RNGTest.jl documented that the RNG needs to return Float64, I figured it tested the whole 64 bits. If that isn’t the case, and we fail the test on the upper 32-bits, it’s probably worth bumping the number of rounds.

MrBelette · October 9, 2024, 7:57pm

Seeing “suspicious” p-values doesn’t imply failure - only that further testing is warranted. If a greater number of p-values fall in the “suspicious” range than expected, then the quality of the rng should be questioned. If it was just a statistical fluke, then we can have more faith in the generator. For example, I re-ran the rng with reversed bits again, for both upper and lower 32 bits and didn’t observe any suspicious p-values this time. So, the 7 round version of the rng could be ok. Ideally, we’d run the BigCrush battery of tests multiple times, but this takes time. What would be good is for multiple people to perform the tests and report the output.

I’ve skim-read the original paper that introduced philox and implementations in other languages. The paper does show that 7 rounds is the minimum needed to pass BigCrush (for philox4x32) but suggest that using 10 provides a substantial safety margin without sacrificing much speed. They also say that they tested multiple streams through the BigCrush tests, corresponding to multiple keys. But I didn’t see any mention of testing the upper and lower 32 bits, or whether reversed bits were tested.
The Random123 github repo from DE Shaw Research uses 10 as the default number of rounds (also for philox2x32), and I’ve seen other implementations that also use 10.

Personally, I’d like to be able to use both versions. I can see the rationale for playing it safe with the number of rounds when you’re doing physical simulations but I’d be happy using 7 rounds in the financial domain.
Perhaps you can change the default_rng to the 10 round version but also expose the 7 round version?

I’ll run BigCrush on the 10 round version for both halves of the quadword, not bit-reversed, tonight, and will report the results back to you. Thanks.

MrBelette · October 15, 2024, 11:35pm

TLDR: Although the Philox2x32 device RNG with either 7 or 10 rounds can produce suspicious p-values in the BigCrush battery of tests, no failures were observed during my testing. Using the Philox2x32{7} generator in Random123.jl, suspicious p-values for Knuth’s Gap test were found experimentally by varying the seed. Following the advice from the authors of TestU01, the number of samples was increased and the p-values reverted back to the pass range. So, in this case, one would have confidence in the quality of the random numbers being generated. We can never fully test the quality of random numbers from a prng so it’s perhaps not a great idea to rely on other people’s testing. But, on the basis of my own testing, I have more confidence in the quality of the philox device rng. The comment in the following code should at least be changed to say “enough to pass BigCrush”:

# default to 7 rounds; enough to pass SmallCrush
@inline Philox2x32() = Philox2x32{7}()

Additional Info:
I’ve done a bunch of testing on the philox device rng, it’s cpu version in Random123.jl, as well as Julia’s standard cpu rng’s.
It was pretty easy to generate suspicious p-values using the SmallCrush battery of tests for any rng I looked at. It’s also possible to generate them in the Crush and BigCrush tests. For example, in the following code snippet, I use the 1998 version of Mersenne Twister directly from the TestU01 library, which shows a suspicious p-value in a random walk test. It also shows the expected failures in the linear complexity tests.

seed = UInt32(166425) * UInt32(41) + UInt32(1013904223)
unif01 =  ccall((:ugfsr_CreateMT19937_98, RNGTest.libtestu01), Ptr{Cvoid}, (Culong,), seed)
try 
    ccall(("bbattery_Crush", RNGTest.libtestu01), Cvoid, (Ptr{Cvoid},), unif01)
    Libc.flush_cstdio()
finally
    Libc.free(unif01)
end

Result:

========= Summary results of Crush =========

 Version:          TestU01 1.2.3
 Generator:        ugfsr_CreateMT19937_98
 Number of statistics:  144
 Total CPU time:   00:20:21.76
 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-015):

       Test                          p-value
 ----------------------------------------------
 66  RandomWalk1 H (L = 90)         4.5e-05
 71  LinearComp, r = 0              1 - eps1
 72  LinearComp, r = 29             1 - eps1
 ----------------------------------------------
 All other tests were passed

The advice from the TestU01 paper is to increase the number of samples whenever a suspicious p-value occurs. But, for this to happen we’d have to know the rng state just before the suspicious test is run.
For counter-based prngs such as philox this is relatively simple - we just need to know the values of the counters and key before the start of the test that produced the suspicious p-values.
If we set the seed=90, counter1=0x89f82f00, and counter2=0x00000003 then a version of Knuth’s Gap test in BigCrush gives a suspicious p-value:

seed=90
n = 10_000_000
gen = Random123.Philox2x(UInt32, seed, 7)
gen.ctr1 = 0x89f82f00
gen.ctr2 = 0x00000003
rng = RNGTest.wrap(gen, UInt32)
pval = RNGTest.sknuth_Gap(rng,1,n,20,0,1/1024)
println("Philox2x32{7} - Knuth's Gap Test. Seed = ", seed, ", n = ", n, ": pval = ", pval)

Philox2x32{7} - Knuth's Gap Test. Seed = 90, n = 10000000: pval = 0.0008201313480390438

But the p-value reverts to the “pass” range when we increase n to, say, 20000000.

Further testing along these lines didn’t expose any failures in the 7 round version of Philox2x32. The authors of the prng suggest defaulting to 10 rounds, as a “safety margin”, but I didn’t observe any noticeable improvement in the Crush tests. For example, I could still generate suspicious p-values from the tests using this version.
I would still think about whether it would be better for CUDA.jl to default to the 10 round version, to be more in line with other implementations.

I’ve also had a chance to probe the rng code in CUDA.jl and I think I understand it better now. At first I was confused as to why you had set up shared memory for 32 warps when rand! only using a single warp (and hence there is only 1 warpId). Or why counter 1 is incremented in rand when counter 2 is being used to set the parallel substreams. Incrementing counter 1 for this version of rand! is superfluous. But, of course, a user could create their own kernel with up to 32 warps and rand could be called multiple times in a thread, so both items are needed. The user would need to increment counter 1 correctly after a kernel call, as you’ve done in rand! (although, in this case where only one call to rand per thread is being made, I think you could safely increment it by 1 rather than length(A)), to ensure that the same counter state isn’t used in subsequent kernel calls. I couldn’t spot any situation where state was being re-used.
Anyway, I think I’ve learned a lot from this exercise, so thank you very much Tim!

Palli · October 16, 2024, 12:55pm

I think you mean GPU Gems 3, from 2007, then it’s outdated on best RNG, and likely LFRS113 is also outdated.

Do you mean 128 bytes or bits? If actually bytes, then likely explained by vectorization, see last quote.

github.com

JuliaGPU/CUDA.jl/blob/a8fecea4b730c153dd79c8245dc1dc71fe8e095b/src/random.jl#L19-L21


      
          mutable struct RNG <: AbstractRNG
              seed::UInt32
              counter::UInt32

Is the state only the seed, i.e. 32 bits, or 64, including the counter? I’m not sure what 2x in philox2x32 refers to, maybe that. Was it chosen since some GPUs do not have 64-bit integers? That’s likely a very old no longer valid assumption, though might 64-bit operations be slower? Is 64-bit multiply single-cycle? Float64 capability is being sacrificed in GPUs by now, I’m not sure if there are also fewer functional units for 64-bit integer vs 32-bit multiply (and addition) units.

Julia’s default RNG (non-CUDA) state is (or was?) 256 bits, then I’ve seen the s4 added, so all of this is stored in task-local, but I still think only 256 bits used when generating?

github.com

JuliaLang/julia/blob/8a79822cbab84d8fe05e55cfb515b73c18db754c/stdlib/Random/src/Xoshiro.jl#L52-L57


      
          mutable struct Xoshiro <: AbstractRNG
              s0::UInt64
              s1::UInt64
              s2::UInt64
              s3::UInt64
              s4::UInt64 # internal splitmix state

Julia uses xoshiro256++ while Fortran (and Lua) chose xoshiro256** Rust: “The SmallRng from the rand crate is xoshiro256++ or xoshiro128++, depending on the platform.” Not to be confused with for “xorshift128+ is presently used in the JavaScript engines”.

Note:
https://prng.di.unimi.it/

If, however, one has to generate only 64-bit floating-point numbers (by extracting the upper 53 bits) xoshiro256+ is a slightly (≈15%) faster generator with analogous statistical properties. … low linear complexity of the lowest bits can have hardly any impact in practice, and certainly has no impact at all if you generate floating-point numbers using the upper bits

For GPUs, is it mainly important to generate floating point, and even then probably only important to generate Float32 or smaller, then the RNG could take that into account? Julia wants to generate Int64 too, I think why xoshiro256++ was chosen, but for GPUs fast [U]Int64 generation might not at all be important, Int32 max might be ok, and then xoshiro256+ is ok (you could generate twice for UInt64, or and once and then with slightly worse statistical properties).

https://prng.di.unimi.it/xoroshiro128plus.c

/* This is xoroshiro128+ 1.0, our best and fastest small-state generator
for floating-point numbers, but its state space is large enough only
for mild parallelism. We suggest to use its upper bits for
floating-point generation, as it is slightly faster than
xoroshiro128++/xoroshiro128**. It passes all tests we are aware of
except for the four lower bits, which might fail linearity tests (and
just those), so if low linear complexity is not considered an issue (as
it is usually the case) it can be used to generate 64-bit outputs, too;
moreover, this generator has a very mild Hamming-weight dependency
making our test (Testing Hamming-weight dependencies) fail after 5 TB of
output; we believe this slight bias cannot affect any application. If
you are concerned, use xoroshiro128++, xoroshiro128** or xoshiro256+.

xoshiro256+ is ≈8% slower than the dSFMT, but it has a doubled range of output values, does not need any extra SSE instruction (can be programmed in Java, etc.), has a much smaller footprint, and its upper bits do not fail any test.

https://prng.di.unimi.it/#intro

Their fastest option though is MWC192 at 0.42 ns/64 bits, 44% faster than Julia’s RNG, Sebastiano Vigna has outdone himself in 2023, and it would be interesting for someone to implement in Julia, but I’m not sure it’s appropriate for GPUs.

The even faster SplitMix64, seemingly also appropriate for GPUs, is the fastest option with simplest code (showing here all of it):

...

#include <stdint.h>

/* This is a fixed-increment version of Java 8's SplittableRandom generator
   See http://dx.doi.org/10.1145/2714064.2660195 and 
   http://docs.oracle.com/javase/8/docs/api/java/util/SplittableRandom.html

   It is a very fast generator passing BigCrush, and it can be useful if
   for some reason you absolutely want 64 bits of state. */

static uint64_t x; /* The state can be seeded with any value. */

uint64_t next() {
	uint64_t z = (x += 0x9e3779b97f4a7c15);
	z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
	z = (z ^ (z >> 27)) * 0x94d049bb133111eb;
	return z ^ (z >> 31);
}

Splitmix64 is not recommended for demanding random number requirements, but is often used to calculate initial states for other more complex pseudo-random number generators.

Why did you choose Philox2x32? I mean it probably was considered good/best at the time, and is still good, but it predates xoroshiro, and I suggested it for Julia and

Why NumPy didn’t update to xoroshiro, but rather to slower 48% PCG64-DXSM with version 1.17 I’m not sure. Maybe because of half the state-space, and it’s still a factor, or can be e.g. on GPUs?

Permuted congruential generator (64-bit, PCG64 DXSM) — NumPy v2.1 Manual Permuted congruential generator (64-bit, PCG64) — NumPy v2.1 Manual

In case you are interested in 64-bit PRNGs based on congruential arithmetic, I provide three instances of a Marsaglia’s Multiply-With-Carry generators, MWC128, MWC192, and MWC256, for which I computed good constants. They are some of the fastest generator available, but they need 128-bit operations.

https://prng.di.unimi.it/MWC192.c

/* This is a Marsaglia multiply-with-carry generator with period
approximately 2^191. It is faster than a scrambled linear
generator, as its only 128-bit operations are a multiplication and sum;
it is an excellent generator based on congruential arithmetic.

…

/* The state must be initialized so that 0 < c < MWC_A2 - 1.
For simplicity, we suggest to set c = 1 and x, y to a 128-bit seed. */
uint64_t x, y, c;

…

/* The following jump functions use a minimal multiprecision library. */

#define MP_SIZE 4
#include “mp.c”

static uint64_t mod[MP_SIZE] = { 0xffffffffffffffff, 0xffffffffffffffff, MWC_A2 - 1 };
…
Equivalent C++ Boost multiprecision code:
…
cpp_int(“0xdc2be36e4bd21a2afc217e3b9edf985d94fb8d87c7c6437”);

Some of the generators can be very easily vectorized, so that multiple instances can be run in parallel to provide fast bulk generation. Thanks to an interesting discussion with the Julia developers, I’ve become aware that AVX2 vectorizations

MrBelette · October 16, 2024, 11:11pm

The state consists of 2 32-bit counters and a 32 bit key. In the CUDA.jl implementation, the 2nd counter is implied, rather than stored. It is always equal to the thread ID. Moreover, counter 1 is only unique for each warp ID (there are 32 threads per warp). The combination, with appropriate increments, should however always produce a unique random number. The idea here, if I’ve understood correctly, is to used some shared state per warp to reduce the amount of registers being used so that more is available for user applications. Avoiding register “spilling” is something to be careful of in GPUs for performance reasons.
I think the CUDA.jl implementation is pretty neat!
If I understand correctly, 64 bit integer math in a gpu is still emulated using 2x 32 bit integers. For example, my RTX 4090 has 64 INT32 cores for each streaming multiprocessor (SM), compared to 128 FP32 cores and 2 FP64 cores. I see the same spec for the H100, except that it has 64 FP64 cores per SM.
The 64 bit versions of philox could be used, but performance would of course be worse.

MrBelette · October 16, 2024, 11:31pm

Is that 256 bits per state per thread?
Wouldn’t 1024 threads then require 32KB? Yes, GPU global memory can be large and supports this, but performance is unlikely to be great for other reasons.
Counter-based PRNGs such as philox require much less memory for holding state. If I’ve understood the CUDA.jl implementation correctly, 256 bytes only would be required for 1024 threads.

MrBelette · October 16, 2024, 11:49pm

I can’t speak for Tim on why this was chosen, but it’s characteristics in terms of minimal required state, high speed, and quality of randomness make it a sound choice for the GPU, I think. CPU implementations are a completely different matter.
Anyway, this thread started because I misinterpreted a comment in the code (philox2x32 7 rounds is sufficient to pass SmallCrush) and thought that it only passed SmallCrush and not the more stringent tests. I’ve spent days of GPU time producing and testing random numbers using this generator and I haven’t seen any flaws. That doesn’t prove a thing of course but it does provide some level of confidence. I have seen a Master’s thesis from Florida State University where a version of philox failed the Knuth Gap tests. I tried to produce the same results but without success.

MrBelette · October 16, 2024, 11:57pm

For the sake of completeness, here’s my simplistic summary of philox:
Here’s what I’ve understood from the philox paper (you should read it - it’s available online) and investigating the code of multiple implementations: Philox2x32 takes 2 32 bit unsigned integer counters and a 32 bit key (seed), and produces 2 transformed 32 bit unsigned integers. For 32-bit applications, these 2 transformed integers are the first 2 numbers in a random sequence. In CUDA,jl, the 2 integers are instead used to form 1 64 bit integer. The same counters and key will always produce the same numbers. In single-threaded calculations the counter is simply incremented by 1 enabling 2^64 different random numbers per key. There are different methods for producing random numbers in parallel. CUDA.jl uses the substream approach whereby counter 2 is set to be the thread index and counter 1 is increments every time a number is produced. The key remains the same throughout but is updated during the transformation process (i.e. every “round” updates the key). The alternative is to vary the key (the multistream approach) or a combination of both. The transformation uses cryptographic techniques. I don’t understand all the details, but I think that the idea is to use potentially complex mathematical operations to scramble the inputs and in a way that makes it difficult for an “attacker” to determine the sequence. The philox authors looked for transformations that used very simple mathematical operations. This reduces/removes cryptographic strength but was done for the purposes of speed. They tested multiple round/counter/key combinations against the BigCrush battery of tests. The paper didn’t report results for philox2x32, but the DEShaw github repository reports that they didn’t find any statistical flaws in the 7 and 10 round versions of philox2x32.

Oscar_Smith · October 17, 2024, 3:47am

I do think 64 bits of randomness is probably smaller than ideal. That is small enough that 4 billion random numbers will show birthday paradox problems (and 4 billion is a pretty small number on a GPU).

maleadt · October 17, 2024, 9:35am

Yeah, the whole RNG was initially designed and implemented only for device-side use, where we have to support any launch configuration. The host-side RNG wrapper around it only materialized after that.

Maybe I’m misunderstanding, but the user shouldn’t have to do anything. By default, when not explicitly setting the counter as the RNG kernel does, the generated code will seed the device-side RNG using a random seed that’s fed from the CPU side, and is unique for every kernel invocation.

It was suggested to me, in CUDA.jl v3.0 - #6 by oschulz. Being counter-based helped fix the issues we were having with the previous Tausworthe generator. Also, having an existing Julia implementation in Random123.jl helped implementing it.

MrBelette · October 17, 2024, 2:13pm

What I was trying to say here is that, if someone wanted to create their own kernel that calls rand (not rand!), perhaps multiple times within the kernel, then they just need to be careful to increment counter 1 appropriately after the kernel has run, in order to have unique state in any subsequent kernel calls (since shared memory doesn’t persist between kernel calls). I’m not sure that changing the key (seed) is strictly necessary here, though it may be prudent (and I could be wrong!). Also, for reproducibility purposes I might want to keep the key fixed and rely only on the counters to produce unique values.
As an example, let’s say that my cuda array is of length 1, but within a kernel I call rand twice, for whatever reason. The device counter state (ctr1, ctr2) will be set first to (0,1) then (1,1) for the second rand call. The host must then ensure that the state before the next kernel call is, say, (2,1) to avoid re-using already generated numbers. I think you’re saying that in the current rand! implementation, in addition to incrementing ctr1, you also change the key. I think that’s also valid, but I might also wish to have more control over the state i.e. (ctr1, ctr2, key)…

maleadt · October 17, 2024, 2:43pm

That’s not true though:

julia> using CUDA

julia> function kernel()
         @cushow rand()
         @cushow rand()
       return
       end

julia> @cuda kernel();
rand() = 0.529151
rand() = 0.889831

julia> @cuda kernel();
rand() = 0.446190
rand() = 0.624525

You’re right in that shared memory doesn’t persist between kernel calls, but that’s why I linked to the code that passes a seed to the kernel, unique for each kernel call, that’s used to initialize the shared memory at the start of the kernel (by calling Random.seed!). This only initializes the key, not the counter, but that’s sufficient in order to get new random numbers in each call of a kernel.

MrBelette · October 17, 2024, 4:02pm

Ah yes, I had forgotten about your call to kernel_state().random_seed in intialize_rng_state. That’s probably because I don’t understand kernel_state()!

But I was thinking more along the lines of the following incorrect calculation:

function kernel(seed::UInt32, counter::UInt32)
    @inbounds Random.seed!(Random.default_rng(), seed, counter)
    @cushow rand()
    @cushow rand()
    return
end

seed = UInt32(12345)
ctr1 = UInt32(0)
rng = CUDA.RNG(seed, ctr1)
@cuda kernel(rng.seed, rng.counter)
@cuda kernel(rng.seed, rng.counter)

CUDA.HostKernel for kernel(UInt32, UInt32)
rand() = 0.271762
rand() = 0.674062
rand() = 0.271762
rand() = 0.674062

One would need to do:

function kernel(seed::UInt32, counter::UInt32)
    @inbounds Random.seed!(Random.default_rng(), seed, counter)
    @cushow rand()
    @cushow rand()
    return
end

seed = UInt32(12345)
ctr1 = UInt32(0)
rng = CUDA.RNG(seed, ctr1)
@cuda kernel(rng.seed, rng.counter)
rng.counter += 2
@cuda kernel(rng.seed, rng.counter)

CUDA.HostKernel for kernel(UInt32, UInt32)
rand() = 0.271762
rand() = 0.674062
rand() = 0.684108
rand() = 0.201789

It may be a silly example, but it does highlight a potential pitfall.

MrBelette · October 17, 2024, 4:31pm

Interesting. Do you have a reference for this so I can maybe take a look? (I guess I could always test myself when time permits).
I have read a comment on Daniel Lemire’s blog that expecting only sqrt(period=22^(232)) values isn’t the case for counter-based rng’s. But, I’m not an expert. The comment also talked about it being problematic producing 64 bit values from a prng with a period of 2^64.
I’m more interested in fp32 calculations at the moment so I’m happy using the 2x32 version, but producing fp32 rather than fp64. I also feel that I’ve done my due diligence on this rng by running multiple tests.
One can always move to the 4x32 version if necessary for their application.

maleadt · October 17, 2024, 5:36pm

I don’t think this is a very realistic example. All I’m seeing here is a kernel that calls seed! with identical seed and counter inputs on every execution, so why wouldn’t you expect that to return the same numbers? Also, CUDA.RNG is purely a host-side abstraction, not meant to be passed to kernels (even though you’re only taking it apart and passing the seed and counter fields).

MrBelette · October 17, 2024, 7:50pm

Fair enough.
I want reproducibility of the generated random numbers - even across different kernel calls. I don’t think that’s an unreasonable thing to ask for, but I could be wrong. For example, if one of the TestU01 tests did in fact fail, I’d want to know the initial state of the generator so that the result could be reproduced. I thought the only way to do that was by controlling the seed and counter and that is what I was highlighting. Perhaps there is another mechanism in the current implementation that supports this, that I may have missed? Thanks.

maleadt · October 18, 2024, 8:10am

There isn’t an official mechanism because AFAICT the Random stdlib doesn’t have one either. I could imagine a way to extract the current state and being able to pass that to Random.seed!, but presumably that doesn’t work for all types of generators. For a relatively niche case like yours where you want reproducibility offset by the amount of numbers requested before (as opposed to simply calling Random.seed! with the same seed every time) I think it’s fair to expect to have to reach into internals like and have to manage the counter value too.

Topic		Replies	Views
Kernel random numbers generation entropy / randomness issues GPU question , kernel	25	1240	November 3, 2023
True random numbers Community random	14	1351	March 18, 2024
Parallel Mersenne Twister Internals & Design question	18	2415	January 19, 2020
CUDA.jl v3.0 Package Announcements	10	1367	April 16, 2021
Linear relationship between Xoshiro tasks Internals & Design rng	29	1573	March 6, 2024

Device RNG - Passes BigCrush

Related topics