Making RNG's fast

viraltux · July 22, 2021, 6:04pm

I am playing around with RNG’s, specifically this one:

function squares(ctr::UInt64, key::UInt64)::UInt32

    y = x = ctr * key; z = y + key
    x = x*x + y; x = (x>>32) | (x<<32)       # round 1
    x = x*x + z; x = (x>>32) | (x<<32)       # round 2
    x = x*x + y; x = (x>>32) | (x<<32)       # round 3

    return (x*x + z) >> 32                   # round 4

end

I know about julialang generic tips for performance but I wonder if there are specific tips for RNG’s typical operations.

I imagine there must be some because I did a quick benchmark comparing squares with Julia rand and, despite rand being more complex (Mersenne Twister I believe), it is still considerably faster than squares. See below.

using BenchmarkTools

key = 0xc8e4fd154ce32f6d
ctr = UInt64.(1:10^6);

@benchmark squares.(ctr, key)

BechmarkTools.Trial: 2069 samples with 1 evaluations.
 Range (min … max):  2.139 ms …   4.657 ms  ┊ GC (min … max): 0.00% … 10.45%
 Time  (median):     2.327 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.398 ms ± 259.592 μs  ┊ GC (mean ± σ):  1.76% ±  4.84%

  █  ▂   ▁▁                                                    
  █▄▄█▆██████▇▆▆▅▅▄▄▃▃▃▃▃▃▃▃▃▃▃▄▃▃▃▃▃▃▃▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂ ▃
  2.14 ms         Histogram: frequency by time        3.33 ms <

 Memory estimate: 3.81 MiB, allocs estimate: 4.
      


@benchmark rand(10^6)

BechmarkTools.Trial: 2766 samples with 1 evaluations.
 Range (min … max):  1.212 ms …   4.856 ms  ┊ GC (min … max): 0.00% …  0.00%
 Time  (median):     1.305 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.791 ms ± 869.906 μs  ┊ GC (mean ± σ):  7.98% ± 13.41%

  █▇▆▄▃▂▁▁▁               ▂▃▃▁          ▁▂▂▁       ▁▂▂▂▁      ▁
  ██████████▇▇▃▃▄▃▃▃▃▃▃▁▃▃█████▆▆▆▆▅▃▄▁▁█████▇▅▆▄▆▄█████▆█▇▆▅ █
  1.21 ms      Histogram: log(frequency) by time       470 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.

baggepinnen · July 22, 2021, 7:49pm

The built in rng might be SIMD optimized for generating multiple random numbers at once, while your implementation might not result in machine code using SIMD instructions.

See also https://github.com/JuliaSIMD/VectorizedRNG.jl
for even faster rngs.

To see if your algorithm results in SIMD instructions, check out @code_llvm, or maybe try to mark it @inline which might enable SIMD when it’s broadcasted.

Elrod · July 22, 2021, 8:39pm

The simple trick to making them SIMD is to place each state in their own SIMD vector lane.
Here, everything is Int64, so a system with AVX2 could update and sample from 4 states in the same number of operations as they do 1. AVX512 would let them sample 8.

Also, note here that squares.(ctr, key) can probably SIMD very well, but it also doesn’t have the dependency chain of subsequent draws depending on the state of the former, so it doesn’t seem like a real RNG in that application.

viraltux · July 22, 2021, 9:26pm

That’s how the author of the RNG is using it, however I went for the smallest possible example to focus on speed.

viraltux · July 22, 2021, 9:28pm

Thank you @baggepinnen! I’ll check these options and see how much faster I can make it.

Topic		Replies	Views
Possible inefficiency in randn! Performance benchmark	5	854	April 4, 2019
Infrastructure for SIMD random number generator General Usage	1	554	November 13, 2023
Fast random number generator for RGBs Performance	11	1856	October 9, 2021
Rand() vs rand(rng) Performance	4	483	September 12, 2020
Effective simulation of putting n-1 balls in n boxes uniformly at random Performance	12	1168	May 4, 2019

Making RNG's fast

Related topics