[ANN] PhiloxRNG.jl: Generate random numbers on CPU and GPU using the Philox4x32 counter-based RNG

PhiloxRNG.jl is a package for generating random numbers on both the CPU and GPU. It generates matching streams on all devices (results are not bit-identical when sampling floating-point distributions, due to fast-math differences).

The underlying algorithm is a 10-round Philox4x32 combined with a fast Box-Muller transformation for sampling the normal distribution.
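For readers unfamiliar with the pieces, here is a hedged sketch (not the package's actual code) of one Philox4x32 round, using the multiplier constants from the original Philox paper, plus a basic Box-Muller step:

```julia
# Sketch only -- not PhiloxRNG.jl's implementation.
# One Philox4x32 round: two 32x32 -> 64-bit multiplies, then xor with the
# untouched counter words and the key. Constants from Salmon et al. (2011).
function philox_round(ctr::NTuple{4,UInt32}, key::NTuple{2,UInt32})
    p0 = widemul(0xD2511F53 % UInt32, ctr[1])  # 64-bit product
    p1 = widemul(0xCD9E8D57 % UInt32, ctr[3])
    hi0 = (p0 >> 32) % UInt32; lo0 = p0 % UInt32
    hi1 = (p1 >> 32) % UInt32; lo1 = p1 % UInt32
    (hi1 ⊻ ctr[2] ⊻ key[1], lo1, hi0 ⊻ ctr[4] ⊻ key[2], lo0)
end

# Basic Box-Muller: two uniform draws in (0, 1] -> two standard normals.
function boxmuller(u1::Float32, u2::Float32)
    r = sqrt(-2f0 * log(u1))
    θ = 2f0 * Float32(pi) * u2
    (r * cos(θ), r * sin(θ))
end
```

The full generator would run ten such rounds, bumping the key by the Weyl constants between rounds.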

Performance is currently significantly better than Random.jl's `randn!` for Float32, but slower in the other cases.

Benchmarks

Julia 1.12.5, CUDA 5.11.0, AMD Ryzen 7 9800X3D, NVIDIA GeForce RTX 3080.

CPU (ns/value, N = 100,000,000)

| Function | PhiloxRNG.jl | Random.jl |
| --- | --- | --- |
| rand F32 | 0.791 | 0.522 |
| rand F64 | 1.997 | 1.052 |
| randn F32 | 1.009 | 2.114 |
| randn F64 | 3.098 | 1.795 |

GPU (ns/value, N = 100,000,000)

| Function | PhiloxRNG.jl | CUDA.jl |
| --- | --- | --- |
| rand F32 | 0.006 | 0.006 |
| randn F32 | 0.007 | 0.032 |

Random123.jl also implements the Philox family of RNGs. The main difference is the API: Random123.jl uses a mutable-struct `AbstractRNG` interface, while PhiloxRNG.jl uses pure functions. This can make PhiloxRNG.jl easier to use in some situations.
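To illustrate the pure-function style (names here are illustrative only, not PhiloxRNG.jl's actual API): a counter-based RNG can be exposed as a pure function of a key and a counter, so the same inputs always yield the same value and no mutable RNG state needs to be threaded through code.

```julia
# Illustrative sketch only -- not PhiloxRNG.jl's actual API.
function pure_rand(key::UInt64, counter::UInt64)
    # Stand-in bit mixer; a real implementation would run Philox4x32 here.
    x = (key ⊻ counter) * 0x9e3779b97f4a7c15
    x ⊻= x >> 31
    # Take the top 53 bits to form a Float64 in [0, 1).
    Float64(x >> 11) * 0x1p-53
end

# Reproducible and trivially parallel: index i is drawn with counter i.
vals = [pure_rand(UInt64(42), UInt64(i)) for i in 1:4]
```

This style maps naturally onto GPU kernels, where each thread can derive its counter from its index with no shared state.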


By replacing a `sqrt_fast` with a `sqrt_llvm` and manually unrolling a loop, the RNG functions now have all clean effects! See Improve effects by nhz2 · Pull Request #1 · medyan-dev/PhiloxRNG.jl · GitHub

Performance is now reasonably close to Random.jl on the CPU when generating large batches.

CPU (ns/value, N = 100,000,000)

| Function | PhiloxRNG.jl | Random.jl |
| --- | --- | --- |
| rand F32 | 0.679 | 0.528 |
| rand F64 | 1.371 | 1.074 |
| randn F32 | 0.898 | 2.103 |
| randn F64 | 2.009 | 1.801 |

It's very odd that the Random.jl stdlib is slower for `randn(::Float32)` than for `randn(::Float64)`. Seems worth investigating.

From what I can tell, Random.jl uses the Float64 method and converts to Float32 at the end: julia/stdlib/Random/src/normal.jl at 942c262fa4002fad29e36e19ae1e052401680f2f · JuliaLang/julia · GitHub
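Simplified, the pattern being described is roughly this (see the linked source for the real method chain):

```julia
using Random

# Generate the normal variate in Float64, then narrow to Float32.
# This is a simplification of what the stdlib does, for illustration.
narrowed_randn(rng::AbstractRNG) = Float32(randn(rng))

z = narrowed_randn(Random.MersenneTwister(1))
```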

This is a more correct way of doing things and probably has more accurate tails. PhiloxRNG.jl uses exactly 32 bits of RNG output per normal Float32, which means the absolute value of the distribution gets truncated to between roughly 1e-14 and 7.
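A back-of-envelope check of the upper bound (my own arithmetic, not from the package): with Box-Muller, the radius is `sqrt(-2 log(u))`, and the smallest nonzero uniform representable from 32 bits of output is 2^-32, which caps the magnitude of the normal draw:

```julia
# Largest achievable |z| when the uniform input has 32-bit granularity:
# r_max = sqrt(-2 * log(2^-32)) = sqrt(64 * log(2)) ≈ 6.66, i.e. about 7.
max32 = sqrt(-2 * log(2.0^-32))
```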


For the vectorized use case, it does look like we could gain some performance by using Box-Muller like PhiloxRNG.jl does. The lack of branch divergence is quite nice.

Something else to try is using Box-Muller with UInt64 RNG inputs and Float32 math to increase the maximum output to about 9.5, though this might be overkill for Float32.
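A quick check of that 9.5 figure (my own arithmetic): with a 64-bit uniform input, the smallest nonzero uniform is 2^-64, so the Box-Muller radius cap becomes:

```julia
# r_max = sqrt(-2 * log(2^-64)) = sqrt(128 * log(2)) ≈ 9.42, i.e. about 9.5.
max64 = sqrt(-2 * log(2.0^-64))
```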