No, it’s hand-rolled (hence the quality issues) because we want a fully-generic fallback.
We do have a pure-Julia implementation of Philox in Random123.jl. Could we maybe run (a modified version of) that? Then we could also offer the same RNG on GPU and CPU. It’s a counter-based RNG, so it’s fully parallel.
The tricky part, though, is that not every thread can afford to have its own RNG state.
But with a counter-based RNG, the state is effectively just the seed and the current counter value, so 2+4 64-bit UInts in the case of Philox4x. And one can partition the counter space between the threads, so everything can be initialized from one common seed. The threads then just start at different points in the vast (4x 64-bit) counter space (by block/thread/…-id, so that scheduling order won’t matter).
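To make the partitioning idea concrete, here is a minimal Python sketch. It is not Philox and not the Random123.jl API: `cbrng` is a stand-in that uses keyed BLAKE2b from the standard library only because it is also a stateless function of (seed, counter). The counter layout in `thread_counter` and all function names are assumptions made for illustration.

```python
import hashlib
import struct

def cbrng(seed, counter):
    """Stand-in for a counter-based generator such as Philox (NOT Philox itself):
    a keyed, stateless map from a 4x 64-bit counter to one 64-bit output."""
    key = struct.pack("<Q", seed)
    data = struct.pack("<4Q", *counter)
    digest = hashlib.blake2b(data, key=key, digest_size=8).digest()
    return struct.unpack("<Q", digest)[0]

def thread_counter(block_id, thread_id, n):
    # Hypothetical layout: one counter word each for the block id, the thread id
    # and the per-thread draw index; the fourth word is left unused.
    return (block_id, thread_id, n, 0)

def draw(seed, block_id, thread_id, n):
    """The n-th random number of one thread, computed without any stored state."""
    return cbrng(seed, thread_counter(block_id, thread_id, n))

def simulate(seed, schedule, draws_per_thread=3):
    """Run the 'threads' in the given order and collect each thread's stream."""
    return {
        (b, t): [draw(seed, b, t, n) for n in range(draws_per_thread)]
        for (b, t) in schedule
    }

threads = [(b, t) for b in range(2) for t in range(4)]
forward = simulate(42, threads)
backward = simulate(42, list(reversed(threads)))
print(forward == backward)  # each thread's stream does not depend on scheduling order
```

Because every thread owns a disjoint slice of the counter space, a single common seed is enough and no thread has to coordinate with any other.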
That’s still too much state for each thread (if you store it globally you get synchronizing memory accesses, and with block-shared memory we can’t afford to store that much). But since it’s counter-based, the thread ID within the block could be used to skip ahead, so we’d only need to store a single thread’s worth of state, right? So then it should be possible to use this. I’ll have a look!
But since it’s counter-based, the thread ID within the block could be used to skip ahead, so we’d only need to store a single thread’s worth of state, right?
Yes, indeed: knowing only the seed, its own ID, and the number of random numbers it has already produced (so a thread-specific counter), a thread could reconstruct the state of the counter-based RNG (e.g. Philox) on the fly. From what I’ve seen in the commit you linked, this is similar to what you do now.
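As a rough illustration of rebuilding the state on the fly, the Python sketch below keeps nothing between calls except a count of numbers already produced. The key/counter layout in `rebuild_state` is hypothetical, and `generate` is again a keyed BLAKE2b stand-in rather than Philox; none of this is taken from the linked commit.

```python
import hashlib
import struct

def rebuild_state(seed, thread_id, produced):
    """Recreate the full generator state (2-word key, 4-word counter) from three
    integers. Hypothetical layout, chosen only for this sketch."""
    key = (seed, 0)
    counter = (thread_id, produced, 0, 0)
    return key, counter

def generate(key, counter):
    """Stand-in for one call of a counter-based RNG (keyed BLAKE2b, NOT Philox)."""
    digest = hashlib.blake2b(
        struct.pack("<4Q", *counter),
        key=struct.pack("<2Q", *key),
        digest_size=8,
    ).digest()
    return struct.unpack("<Q", digest)[0]

def next_values(seed, thread_id, produced, how_many):
    """Produce `how_many` numbers and return them with the updated count,
    which is the only thing the caller has to remember."""
    values = []
    for i in range(how_many):
        key, counter = rebuild_state(seed, thread_id, produced + i)
        values.append(generate(key, counter))
    return values, produced + how_many

# Five numbers in one go ...
all_at_once, _ = next_values(7, 3, 0, 5)
# ... or two now and three later, remembering only the count in between.
first, count = next_values(7, 3, 0, 2)
rest, count = next_values(7, 3, count, 3)
resumed = first + rest
print(resumed == all_at_once)
```

Both paths give the same stream, which is why a single small counter per thread (or per block, if all threads advance in lockstep) can stand in for a stored generator state.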
Philox & friends are well tested (nice overview in https://arxiv.org/pdf/1204.6193.pdf) and provide very high-quality random numbers (they actually perform better than Mersenne Twister in some testing categories).