[ANN] PhiloxRNG.jl: Generate random numbers on CPU and GPU using the Philox4x32 counter-based RNG

PhiloxRNG.jl is a package for generating random numbers on both the CPU and GPU. It generates matching streams on all devices (results are not bit-identical when sampling floating-point distributions, due to fast-math differences).

The underlying algorithm is a 10-round Philox4x32 combined with a fast Box-Muller transformation for sampling the normal distribution.
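For readers unfamiliar with the pieces, here is a hedged sketch (not the package's actual code) of one Philox4x32 round, using the multiplier constants from the original Philox paper, plus a basic Box-Muller step:

```julia
# Sketch only -- not PhiloxRNG.jl's implementation.
# One Philox4x32 round: two 32x32 -> 64-bit multiplies, then xor with the
# untouched counter words and the key. Constants from Salmon et al. (2011).
function philox_round(ctr::NTuple{4,UInt32}, key::NTuple{2,UInt32})
    p0 = widemul(0xD2511F53 % UInt32, ctr[1])  # 64-bit product
    p1 = widemul(0xCD9E8D57 % UInt32, ctr[3])
    hi0 = (p0 >> 32) % UInt32; lo0 = p0 % UInt32
    hi1 = (p1 >> 32) % UInt32; lo1 = p1 % UInt32
    (hi1 ⊻ ctr[2] ⊻ key[1], lo1, hi0 ⊻ ctr[4] ⊻ key[2], lo0)
end

# Basic Box-Muller: two uniform draws in (0, 1] -> two standard normals.
function boxmuller(u1::Float32, u2::Float32)
    r = sqrt(-2f0 * log(u1))
    θ = 2f0 * Float32(pi) * u2
    (r * cos(θ), r * sin(θ))
end
```

The full generator would run ten such rounds, bumping the key by the Weyl constants between rounds.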

Performance is currently significantly better than Random.jl's `randn!` for Float32, but slower in the other cases.

Benchmarks

Julia 1.12.5, CUDA 5.11.0, AMD Ryzen 7 9800X3D, NVIDIA GeForce RTX 3080.

CPU (ns/value, N = 100,000,000)

| Function | PhiloxRNG.jl | Random.jl |
| --- | --- | --- |
| rand F32 | 0.791 | 0.522 |
| rand F64 | 1.997 | 1.052 |
| randn F32 | 1.009 | 2.114 |
| randn F64 | 3.098 | 1.795 |

GPU (ns/value, N = 100,000,000)

| Function | PhiloxRNG.jl | CUDA.jl |
| --- | --- | --- |
| rand F32 | 0.006 | 0.006 |
| randn F32 | 0.007 | 0.032 |

Random123.jl also implements the Philox family of RNGs. The main difference is the API: Random123.jl uses a mutable-struct `AbstractRNG` interface, while PhiloxRNG.jl uses pure functions. This can make PhiloxRNG.jl easier to use in some situations.
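To illustrate the pure-function style (names here are illustrative only, not PhiloxRNG.jl's actual API): a counter-based RNG can be exposed as a pure function of a key and a counter, so the same inputs always yield the same value and no mutable RNG state needs to be threaded through code.

```julia
# Illustrative sketch only -- not PhiloxRNG.jl's actual API.
function pure_rand(key::UInt64, counter::UInt64)
    # Stand-in bit mixer; a real implementation would run Philox4x32 here.
    x = (key ⊻ counter) * 0x9e3779b97f4a7c15
    x ⊻= x >> 31
    # Take the top 53 bits to form a Float64 in [0, 1).
    Float64(x >> 11) * 0x1p-53
end

# Reproducible and trivially parallel: index i is drawn with counter i.
vals = [pure_rand(UInt64(42), UInt64(i)) for i in 1:4]
```

This style maps naturally onto GPU kernels, where each thread can derive its counter from its index with no shared state.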


By replacing a `sqrt_fast` with a `sqrt_llvm` and manually unrolling a loop, the RNG functions now have all clean effects! See Improve effects by nhz2 · Pull Request #1 · medyan-dev/PhiloxRNG.jl · GitHub

Performance is now reasonably close to Random.jl on the CPU when generating large batches.

CPU (ns/value, N = 100,000,000)

| Function | PhiloxRNG.jl | Random.jl |
| --- | --- | --- |
| rand F32 | 0.679 | 0.528 |
| rand F64 | 1.371 | 1.074 |
| randn F32 | 0.898 | 2.103 |
| randn F64 | 2.009 | 1.801 |

It's very odd that the Random.jl stdlib is slower for `randn(::Float32)` than for `randn(::Float64)`. Seems worth investigating.

From what I can tell, Random.jl uses the Float64 method and converts to Float32 at the end: julia/stdlib/Random/src/normal.jl at 942c262fa4002fad29e36e19ae1e052401680f2f · JuliaLang/julia · GitHub
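Simplified, the pattern being described is roughly this (see the linked source for the real method chain):

```julia
using Random

# Generate the normal variate in Float64, then narrow to Float32.
# This is a simplification of what the stdlib does, for illustration.
narrowed_randn(rng::AbstractRNG) = Float32(randn(rng))

z = narrowed_randn(Random.MersenneTwister(1))
```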

This is a more correct way of doing things and probably has more accurate tails. PhiloxRNG.jl uses exactly 32 bits of RNG output per normal Float32, which means the absolute value of the distribution gets truncated to between roughly 1e-14 and 7.
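A back-of-envelope check of the upper bound (my own arithmetic, not from the package): with Box-Muller, the radius is `sqrt(-2 log(u))`, and the smallest nonzero uniform representable from 32 bits of output is 2^-32, which caps the magnitude of the normal draw:

```julia
# Largest achievable |z| when the uniform input has 32-bit granularity:
# r_max = sqrt(-2 * log(2^-32)) = sqrt(64 * log(2)) ≈ 6.66, i.e. about 7.
max32 = sqrt(-2 * log(2.0^-32))
```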


For the vectorized use case, it does look like we could gain some performance by using Box-Muller like PhiloxRNG.jl does. The lack of branch divergence is quite nice.

Something else to try is using Box-Muller with UInt64 RNG inputs and Float32 math to increase the maximum output to about 9.5, though this might be overkill for Float32.
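A quick check of that 9.5 figure (my own arithmetic): with a 64-bit uniform input, the smallest nonzero uniform is 2^-64, so the Box-Muller radius cap becomes:

```julia
# r_max = sqrt(-2 * log(2^-64)) = sqrt(128 * log(2)) ≈ 9.42, i.e. about 9.5.
max64 = sqrt(-2 * log(2.0^-64))
```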