I resurrected the SharedTauswortheGenerator device RNG from CUDA.jl v3.0.0 and, after some adjustment, it appears to pass the TestU01 BigCrush battery of tests!
It would be nice if this can be independently tested.
The RNG is the LFRS113 generator from L’Ecuyer and is known to only fail the linear complexity tests of BigCrush on a CPU, so it should be ok for most purposes except cryptographic and Quasi-MC work (if I understand things correctly). It is theoretically superior to the LFRS88 generator used in the Combined Tausworthe generator shown in CPU Gems 3, due to a higher proportion of non-zero coefficients of its characteristic polynomial.
The CUDA.jl implementation unfortunately fails even the SmallCrush test and I guess it’s due to not holding the full state for every thread.
I thought I’d take a leaf out of the CPU Gems implementation and try to mix in some additional state using LCG/MCGs and, surprisingly, that seems to be enough to get the RNG to pass all of BigCrush. The test took just over 6.5 hours on an RTX 4090.
Changes:
- Add additional per warp state using the CPU Gems LCG
# mix-in the warp id to ensure unique values across blocks.
# we have max 1024 threads per block, so can safely shift by 16 bits.
# XXX: see comment in `initial_state`
z = xorshift(z ⊻ (warpId << UInt32(16)))
for i=1:warpId
z = UInt32(166425) * z + UInt32(1013904223)
end
sync_threads()
- Use 3 good MCG multipliers from another L’Ecuyer paper (Tables of Linear Congruential Generators of Different Sizes and Good Lattice Structure) to mix up state during the warp shuffle phase. If I’ve understood the code correctly, by using 3 multipliers I believe that different threads will get different state.
# generate based on 4 bytes of state
mask = active_mask()
i1 = pow2_mod1(threadId+UInt32(1), UInt32(32))
i2 = pow2_mod1(threadId+UInt32(2), UInt32(32))
i3 = pow2_mod1(threadId+UInt32(3), UInt32(32))
if active_mask() == typemax(UInt32)
# we have a full warp, so we can safely shuffle.
# get the warp-local states from neighbouring threads.
s1 = shfl_sync(FULL_MASK, state, i1) * UInt32(741103597)
s2 = shfl_sync(FULL_MASK, state, i2) * UInt32(1597334677)
s3 = shfl_sync(FULL_MASK, state, i3) * UInt32(747796405)
else
# can't shuffle, so fall back to fetching block-global state.
# this results is less entropy, as it reuses data from the first warp.
s1 = rng.state[i1] * UInt32(741103597)
s2 = rng.state[i2] * UInt32(1597334677)
s3 = rng.state[i3] * UInt32(747796405)
end
state ⊻ s1 ⊻ s2 ⊻ s3
Compared to Philox2x32, the amended SharedTausworthe generator uses half the amount of shared memory (128 bytes) but twice the number of registers (32). Generation speed is about the same.
I’m not an expert on RNGs, so I can’t say that there aren’t potential flaws with this approach.
I thought I’d bring it to the attention of this community because of the desirability of having an on-the-fly, reduced state device RNG that can seemingly pass the BigCrush test.