So, just to give some inspiration.
The AMD library for GPU random uses this.
It generates floats excluding 0 and including 1.
The nvidia curand library documents
The curandGenerateUniform() function is used to generate uniformly distributed floating point values between 0.0 and 1.0, where 0.0 is excluded and 1.0 is included.
As far as I can tell, Cuda.jl
uses curand if available. I guess they use the same algorithm as rocRand.
As such, I don’t think we should sweat too much about GPU – it already produces a quite different distribution.
========================
In some sense this is an API problem: Generating random floats including 0 and excluding 1 is bloody stupid, considering how floating point numbers work.
========================
I guess I can comment on that, given that I wrote the original implementation (under the chethega account on github; Jeff had to clean up my PR because I dropped out for personal reasons; apologies for that).
Xoshiro is of block design, not counter design. As such it is fundamentally impossible to SIMD via auto-vectorizer (there’s a reason that counter-based designs are becoming popular!).
So we don’t need to worry about SIMD for rand
. Not possible anyways.
We do need to worry about SIMD for rand!
on large arrays. The way we currently do this is to take our single RNG and use it to seed a vector of 8 independent Xoshiro RNGs. And now we go cranking out numbers into our large array! Once the rand!
request is served, we discard out vector of xoshiro RNGs. (I am quite proud of that design for serving bulk requests of random numbers!)
So we can use entirely different strategies for the SIMD and non-SIMD cases. But it would be very desirable if both produced the exact same distribution.
========================
If we want to use 64 bits, this is 100% free in the sequential (rand
) case. For the bulk (simd / rand!
) case it is not free, but probably affordable.
Temporarily changing the rounding mode to round-to-zero is probably OK for the SIMD case.
Touching the rounding is absolutely not OK for the sequential case.
I really don’t know how to switch rounding modes across different CPU architectures (various arm? risc-V? powerPC? Is there mips support? I could drop some inline assembly for x86 and maybe arm, but I really don’t know power or risc-V arch, and I have no idea how to do this portably).
========================
PS, regarding branching. We don’t care about rare branches in the sequential case.
If the branch is rare enough, we can maybe do that in the SIMD case – we’d need to stop all 8 lanes at the same time, do our rare fix-up, and then continue. We would need to look very carefully at the generated code, to make sure that the branch induces no spilling to the stack on machines with 256 bit vectors and arm 128 bit vectors.