PhiloxRNG.jl is a package for generating random numbers on both the CPU and the GPU. It generates matching streams across devices (results are not bit-identical when sampling floating-point distributions, due to fast-math differences).
The underlying algorithm is 10-round Philox4x32 combined with a fast Box-Muller transform for sampling the normal distribution.
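For reference, the Philox4x32 block function is small enough to sketch in full. The following is an illustrative Python sketch, not the package's actual Julia implementation; the round multipliers and Weyl key increments are the standard published Philox constants:

```python
M0, M1 = 0xD2511F53, 0xCD9E8D57  # standard Philox4x32 round multipliers
W0, W1 = 0x9E3779B9, 0xBB67AE85  # Weyl sequence key increments
MASK = 0xFFFFFFFF

def philox4x32(counter, key, rounds=10):
    """One Philox4x32 block: 4 x 32-bit counter + 2 x 32-bit key -> 4 x 32-bit words."""
    c0, c1, c2, c3 = counter
    k0, k1 = key
    for _ in range(rounds):
        # 32x32 -> 64-bit multiplies, split into hi/lo halves
        p0 = M0 * c0
        p1 = M1 * c2
        hi0, lo0 = p0 >> 32, p0 & MASK
        hi1, lo1 = p1 >> 32, p1 & MASK
        # permute-and-xor step of the Philox round
        c0, c1, c2, c3 = (hi1 ^ c1 ^ k0, lo1, hi0 ^ c3 ^ k1, lo0)
        # bump the key between rounds
        k0, k1 = (k0 + W0) & MASK, (k1 + W1) & MASK
    return (c0, c1, c2, c3)
```

Because the output is a pure function of the counter and key, any point in the stream can be computed independently, which is what makes the generator embarrassingly parallel on a GPU.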
Performance is currently significantly better than Random.jl's `randn!` with `Float32`, but slower in the other cases benchmarked below.
## Benchmarks

Julia 1.12.5, CUDA 5.11.0, AMD Ryzen 7 9800X3D, NVIDIA GeForce RTX 3080.

**CPU (ns/value, N = 100,000,000)**

| Function | PhiloxRNG.jl | Random.jl |
| --- | --- | --- |
| `rand` Float32 | 0.791 | 0.522 |
| `rand` Float64 | 1.997 | 1.052 |
| `randn` Float32 | 1.009 | 2.114 |
| `randn` Float64 | 3.098 | 1.795 |
**GPU (ns/value, N = 100,000,000)**

| Function | PhiloxRNG.jl | CUDA.jl |
| --- | --- | --- |
| `rand` Float32 | 0.006 | 0.006 |
| `randn` Float32 | 0.007 | 0.032 |
Random123.jl also implements the Philox family of RNGs. The main difference is the API: Random123.jl uses a mutable-struct `AbstractRNG` interface, while PhiloxRNG.jl uses pure functions. This can make PhiloxRNG.jl easier to use in some situations.
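The practical difference can be shown with a toy sketch in Python. Here `value_at` is a hypothetical stand-in for a counter-based generator, not PhiloxRNG.jl's actual API; the point is that a pure function of (key, counter) makes any stream position directly addressable:

```python
import random

# Stateful style (Random.jl / Random123.jl): drawing mutates hidden state,
# so reproducing value i means replaying i draws from the seed.
rng = random.Random(42)
first = rng.random()  # advances internal state

# Counter-based style (PhiloxRNG.jl): output is a pure function of
# (key, counter). This uses a simple 64-bit mixing hash purely for
# illustration; the real package uses Philox4x32.
def value_at(key, i):
    h = (i * 0x9E3779B97F4A7C15 ^ key) & (2**64 - 1)
    h = (h ^ (h >> 33)) * 0xFF51AFD7ED558CCD & (2**64 - 1)
    return (h >> 11) / 2**53  # uniform in [0, 1)

# No state to replay: position 7 is directly computable, on any thread.
assert value_at(42, 7) == value_at(42, 7)
```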
This is arguably a more correct way of doing things and probably has more accurate tails. PhiloxRNG.jl uses exactly 32 bits of RNG output per normal Float32, which means the absolute value of the sampled distribution is truncated to between roughly 1e-14 and 7.
For the vectorized use case, it does look like we could gain some performance by using Box-Muller like PhiloxRNG.jl does. The lack of branch divergence is quite nice.
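A minimal Box-Muller sketch (Python, for illustration) shows why it vectorizes well: the transform is the same straight-line arithmetic for every pair of uniforms, with no rejection loop and therefore no branch divergence across GPU lanes:

```python
import math

def box_muller(u1, u2):
    """Map two uniforms (u1 in (0, 1], u2 in [0, 1)) to two independent
    standard normals. Branch-free: every input takes the same code path."""
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)
```

By contrast, rejection-based samplers (e.g. the ziggurat) occasionally fall through to a slow path, which serializes a whole GPU warp when even one lane takes it.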
Something else to try is Box-Muller with UInt64 RNG inputs and Float32 math, which would increase the maximum output magnitude to about 9.5; this might be overkill for Float32.
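The tail bounds quoted above follow directly from the transform: with b bits of uniform input, the smallest nonzero u1 is 2^-b, so the largest attainable |z| is sqrt(-2 ln 2^-b). A quick check in Python:

```python
import math

def max_abs_normal(bits):
    """Largest |z| Box-Muller can emit when u1 is the smallest nonzero
    uniform representable with `bits` bits of RNG input (u1 = 2**-bits)."""
    return math.sqrt(-2.0 * math.log(2.0 ** -bits))

print(max_abs_normal(32))  # ~6.66, consistent with the "truncated to ... 7" bound
print(max_abs_normal(64))  # ~9.42, consistent with the ~9.5 figure for UInt64 inputs
```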