Infrastructure for SIMD random number generator

Imagine you want to find the average distance of two points in a disk with SIMD and without allocation, how would you do this with the current API?
Let’s assume a vector length of 8 for simplicity.
First, you start generating three batches of value, r1, r2, and angle. This could be done with inverse_transform sampling. You can then use the cosine law to find the distance between r1 and r2. You then finally add these numbers to the accumulators and move to a new batch.

However, the problem arises when you need to sample the random numbers in batches of 8. Pre-generating the random numbers would require memory allocation. Maintaining an RNG that can generate random numbers in the batch of 8 would require maintaining some extra states, meaning rand(rng, Float64,8) doesn’t work either, for example, with a linear congruential generator you could more efficiently do this…
[s1, s1*r, s*r^2, s1*r^3, s1*r^4, s1*r^5, s1*r^6, s*r^7] * (r^8) mod m.

This example illustrates the limitation in the current API of the rng.
So, for the sake of this silly problem, let’s optimize the hell out of it!

It’s worth looking at how Random generates values using SIMD. For the default RNG, at least, this is done by forking the RNG into (currently) 8 parallel streams that can be run via SIMD parallelization.

The challenge isn’t to do this without allocation. Just generate values as NTuples and off you go. There’s overhead in forking the RNGs, although that is amortized in the above by only using this for sufficiently large requests. For more interactive use, one could retain this forked state.

If I recall correctly (which is unlikely), back when the default was a Mersenne Twister, Julia carried a separate scalar and vector RNG stream per task with all the state initialized. However, with the swap to Xoshiro (which is much smaller and seems to be super nice) maintaining the vector stream was a lot of overhead for small tasks and was dropped for this reason. Thus, in the current implementation it is re-forked at every use.

Like you, I’ve come across an instance where I’ve wanted finer access to batch generation (as part of a discussion on rand for floats). The code I linked is only written to fill memory with the values, but I would like them returned as NTuples so I can sample them incrementally and do further processing without going through RAM.

An adaptation I was experimenting with

Since we’re talking about it, here’s something I was toying with by lightly modifying code from XoshiroSimd.jl to generates NTuples of UIntXX values using SIMD. It appears to be slower than the builtin for actual repeated generation, however, despite using the same code. I suspect this is because it’s writing-back the mutable RNG state to RAM every call, but I haven’t dug into it properly. For comparison, in XoshiroSimd.jl the state is only kept as immutables which might help.

using Random
using Random.XoshiroSimd: Xoshiro, forkRand, _plus, _rotl23, _shl17, _xor, _rotl45

mutable struct XoshiroBlock{N}

XoshiroBlock{N}(rng::Union{TaskLocalRNG, Xoshiro}) where N = XoshiroBlock{N}(forkRand(rng, Val(N))...)

@inline function Base.rand(rng::XoshiroBlock{N}, ::Type{T}) where {N,T}
    # based on Random.XoshiroSimd.xoshiro_bulk_simd

    s0 = rng.s0 # read
    s1 = rng.s1 # read
    s2 = rng.s2 # read
    s3 = rng.s3 # read
    res = _plus(_rotl23(_plus(s0,s3)),s0)
    t = _shl17(s1)
    s2 = _xor(s2, s0)
    s3 = _xor(s3, s1)
    s1 = _xor(s1, s2)
    s0 = _xor(s0, s3)
    s2 = _xor(s2, t)
    s3 = _rotl45(s3)
    rng.s0 = s0 # write
    rng.s1 = s1 # write
    rng.s2 = s2 # write
    rng.s3 = s3 # write
    restup = map(x->x.value, res) # result as a NTuple{N,UInt64}
    return @inline tuplecast(T, restup)

for m in 0:4
    u = 8*sizeof(UInt64)
    M = 2^m
    MVU = NTuple{M,Core.VecElement{UInt64}}
    for T in Base.uniontypes(Base.BitUnsigned)
        t = 8*sizeof(T)
        N = fld(M*u, t)
        if N > 0
            NVT = NTuple{N,Core.VecElement{T}}
            code = """
                %res = bitcast <$M x i$u> %0 to <$N x i$t>
                ret <$N x i$t> %res
            @eval @inline _tuplecast(::Type{$T}, x::$MVU) = Core.Intrinsics.llvmcall($code, $NVT, Tuple{$MVU}, x)

    tuplecast(T::Type, x::NTuple{N,U})

Reinterpret the bits of `x` as a NTuple{M, T} of matching size.
Currently only supports `U == UInt64` and `T` a member of `$(Base.BitUnsigned)`.
tuplecast(::Type{T}, x::NTuple) where {T} = map(x->x.value, _tuplecast(T, map(Core.VecElement, x)))

Some assessments:

julia> r4 = XoshiroBlock{4}(Random.default_rng());

julia> rand(r4,UInt32)
(0x22af0c6b, 0x9f95088b, 0x522c2b96, 0x0b381ff3, 0x1e9b3903, 0x390e3a44, 0x86dd33a3, 0xdcd069a1)

julia> @code_native debuginfo=:none rand(r4,UInt32)

julia> using BenchmarkTools

julia> @btime [rand($(Random.default_rng()),UInt32) for _ in 1:2^15];  # scalarized generation
  27.900 μs (2 allocations: 128.05 KiB)

julia> @btime rand($(Random.default_rng()),UInt32,2^15); # vectorized generation
  6.150 μs (2 allocations: 128.05 KiB)

julia> @btime [rand($r4,UInt32) for _ in 1:2^12]; # faster than scalar, slower than vector
  16.400 μs (2 allocations: 128.05 KiB)