Making RNGs fast

I am playing around with RNGs, specifically this one:

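# Counter-based generator: four rounds of squaring/rotating mix a (ctr, key) pair into a UInt32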
function squares(ctr::UInt64, key::UInt64)::UInt32

    y = x = ctr * key; z = y + key
    x = x*x + y; x = (x>>32) | (x<<32)       # round 1
    x = x*x + z; x = (x>>32) | (x<<32)       # round 2
    x = x*x + y; x = (x>>32) | (x<<32)       # round 3

    return (x*x + z) >> 32                   # round 4

end

I know about the general Julia performance tips, but I wonder whether there are tips specific to the operations an RNG typically performs.

I imagine there must be, because a quick benchmark comparing squares with Julia's rand shows that, despite rand using a more complex algorithm (Mersenne Twister, I believe), it is still considerably faster than squares. See below.

using BenchmarkTools

key = 0xc8e4fd154ce32f6d
ctr = UInt64.(1:10^6);

@benchmark squares.(ctr, key)

BenchmarkTools.Trial: 2069 samples with 1 evaluation.
 Range (min … max):  2.139 ms …   4.657 ms  ┊ GC (min … max): 0.00% … 10.45%
 Time  (median):     2.327 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.398 ms ± 259.592 μs  ┊ GC (mean ± σ):  1.76% ±  4.84%

  █  ▂   ▁▁                                                    
  █▄▄█▆██████▇▆▆▅▅▄▄▃▃▃▃▃▃▃▃▃▃▃▄▃▃▃▃▃▃▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂ ▃
  2.14 ms         Histogram: frequency by time        3.33 ms <

 Memory estimate: 3.81 MiB, allocs estimate: 4.
      


@benchmark rand(10^6)

BenchmarkTools.Trial: 2766 samples with 1 evaluation.
 Range (min … max):  1.212 ms …   4.856 ms  ┊ GC (min … max): 0.00% …  0.00%
 Time  (median):     1.305 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.791 ms ± 869.906 μs  ┊ GC (mean ± σ):  7.98% ± 13.41%

  █▇▆▄▃▂▁▁▁               ▂▃▃▁          ▁▂▂▁       ▁▂▂▂▁      ▁
  ██████████▇▇▃▃▄▃▃▃▃▃▃▁▃▃█████▆▆▆▆▅▃▄▁▁█████▇▅▆▄▆▄█████▆█▇▆▅ █
  1.21 ms      Histogram: log(frequency) by time       4.70 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.

The built-in RNG might be SIMD-optimized for generating multiple random numbers at once, while your implementation might not compile down to machine code that uses SIMD instructions.

See also https://github.com/JuliaSIMD/VectorizedRNG.jl for even faster RNGs.

To see whether your algorithm compiles to SIMD instructions, check the output of @code_llvm; you could also try marking squares with @inline, which might enable SIMD when it's broadcast.
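
For example, something along these lines (a minimal sketch; squares_fill! and out are made-up names, and whether the loop actually vectorizes depends on your CPU and Julia/LLVM version). Look for LLVM vector types such as <4 x i64> in the printed IR:

using InteractiveUtils            # provides @code_llvm outside the REPL

function squares_fill!(out::AbstractVector{UInt32}, ctr::AbstractVector{UInt64}, key::UInt64)
    @inbounds @simd for i in eachindex(out, ctr)
        out[i] = squares(ctr[i], key)   # relies on squares being inlined (@inline may help)
    end
    return out
end

out = Vector{UInt32}(undef, length(ctr))
@code_llvm debuginfo=:none squares_fill!(out, ctr, key)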


The simple trick to making them SIMD is to place each state in its own SIMD vector lane.
Here, everything is 64 bits wide (UInt64), so a system with AVX2 could update and sample from 4 states in the same number of operations it takes to do 1, and AVX512 would let it do 8.
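
As a rough sketch of that idea (squares4 is a made-up name, and the compiler is not guaranteed to actually emit AVX2/AVX512 instructions for it), four counters can be pushed through the same rounds in lock-step using tuple broadcasting:

# four independent counters, intended to map onto one SIMD lane each
function squares4(ctrs::NTuple{4,UInt64}, key::UInt64)::NTuple{4,UInt32}
    y = x = ctrs .* key
    z = y .+ key
    x = x .* x .+ y; x = (x .>> 32) .| (x .<< 32)   # round 1
    x = x .* x .+ z; x = (x .>> 32) .| (x .<< 32)   # round 2
    x = x .* x .+ y; x = (x .>> 32) .| (x .<< 32)   # round 3
    return UInt32.((x .* x .+ z) .>> 32)            # round 4
end

squares4(UInt64.((1, 2, 3, 4)), key)                # e.g. counters 1:4 with the key from above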

Also, note that squares.(ctr, key) can probably SIMD very well, but it has no dependency chain in which each draw depends on the state left by the previous one, so it isn't being used like a real (stateful) RNG in that benchmark.
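
For contrast, stateful use would look something like this sketch (SquaresRNG and next! are made-up names), where each draw depends on the counter left behind by the previous one:

mutable struct SquaresRNG
    ctr::UInt64
    key::UInt64
end

function next!(rng::SquaresRNG)::UInt32
    rng.ctr += 1                     # each draw advances the shared state
    return squares(rng.ctr, rng.key)
end

rng = SquaresRNG(0, key)
[next!(rng) for _ in 1:4]            # four sequential draws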


That's how the author of the RNG uses it; however, I went for the smallest possible example in order to focus on speed.

Thank you @baggepinnen! I’ll check these options and see how much faster I can make it.