Infrastructure for SIMD random number generator

It’s worth looking at how Random generates values using SIMD. For the default RNG, at least, this is done by forking the RNG into (currently) 8 parallel streams that can be run via SIMD parallelization.
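For a concrete picture, the forking can be observed directly via `Random.XoshiroSimd.forkRand`, which is also used later in this post. Caveat: this is internal, undocumented API, so the exact return shape may vary across Julia versions.

```julia
using Random

# Peek at the internal forking step. forkRand derives N child states from
# fresh draws of the parent RNG, returned as four SIMD-friendly tuples of
# VecElements (one lane per parallel stream).
s0, s1, s2, s3 = Random.XoshiroSimd.forkRand(Random.default_rng(), Val(8))

typeof(s0)  # NTuple{8, Core.VecElement{UInt64}}
```

Each of the four returned tuples holds one word of Xoshiro256 state for all 8 streams, so the state-update arithmetic can run across lanes with ordinary vector instructions.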

The challenge isn't avoiding allocation: just generate the values as NTuples and off you go. The real overhead is in forking the RNGs, which the standard library amortizes by only taking this path for sufficiently large requests. For more interactive use, one could retain the forked state.

If I recall correctly (which is unlikely), back when the default was a Mersenne Twister, Julia carried separate scalar and vector RNG streams per task, with all the state initialized up front. However, with the swap to Xoshiro (which is much smaller and seems to be super nice), maintaining the vector stream was a lot of overhead for small tasks and was dropped for that reason. Thus, in the current implementation the RNG is re-forked at every bulk use.

Like you, I've come across an instance where I wanted finer access to batch generation (as part of a discussion on rand for floats). The code I linked only fills memory with the values, but I would like them returned as NTuples so I can consume them incrementally and do further processing without a round trip through RAM.

An adaptation I was experimenting with

Since we’re talking about it, here’s something I was toying with: a light modification of code from XoshiroSimd.jl that generates NTuples of UIntXX values using SIMD. Despite using essentially the same kernel, it appears to be slower than the builtin for repeated generation. I suspect this is because it writes the mutable RNG state back to RAM on every call, but I haven’t dug into it properly. For comparison, XoshiroSimd.jl keeps the state only in immutables, which might help.

using Random
using Random.XoshiroSimd: Xoshiro, forkRand, _plus, _rotl23, _shl17, _xor, _rotl45

# Mutable holder for N parallel Xoshiro256 states, stored as VecElement
# tuples so the state update compiles to SIMD instructions.
mutable struct XoshiroBlock{N}
    s0::NTuple{N,Core.VecElement{UInt64}}
    s1::NTuple{N,Core.VecElement{UInt64}}
    s2::NTuple{N,Core.VecElement{UInt64}}
    s3::NTuple{N,Core.VecElement{UInt64}}
end

XoshiroBlock{N}(rng::Union{TaskLocalRNG, Xoshiro}) where N = XoshiroBlock{N}(forkRand(rng, Val(N))...)

@inline function Base.rand(rng::XoshiroBlock{N}, ::Type{T}) where {N,T}
    # based on Random.XoshiroSimd.xoshiro_bulk_simd

    s0 = rng.s0 # read
    s1 = rng.s1 # read
    s2 = rng.s2 # read
    s3 = rng.s3 # read
    res = _plus(_rotl23(_plus(s0,s3)),s0)
    t = _shl17(s1)
    s2 = _xor(s2, s0)
    s3 = _xor(s3, s1)
    s1 = _xor(s1, s2)
    s0 = _xor(s0, s3)
    s2 = _xor(s2, t)
    s3 = _rotl45(s3)
    rng.s0 = s0 # write
    rng.s1 = s1 # write
    rng.s2 = s2 # write
    rng.s3 = s3 # write
    
    restup = map(x->x.value, res) # result as an NTuple{N,UInt64}
    return @inline tuplecast(T, restup)
end

# Define bitcasts from NTuple{M,UInt64} (as VecElements) to the matching
# NTuple{N,T} for each unsigned bit type T, via an LLVM vector bitcast.
for m in 0:4
    u = 8*sizeof(UInt64)
    M = 2^m
    MVU = NTuple{M,Core.VecElement{UInt64}}
    for T in Base.uniontypes(Base.BitUnsigned)
        t = 8*sizeof(T)
        N = fld(M*u, t)
        if N > 0
            NVT = NTuple{N,Core.VecElement{T}}
            code = """
                %res = bitcast <$M x i$u> %0 to <$N x i$t>
                ret <$N x i$t> %res
                """
            @eval @inline _tuplecast(::Type{$T}, x::$MVU) = Core.Intrinsics.llvmcall($code, $NVT, Tuple{$MVU}, x)
        end
    end
end

"""
    tuplecast(T::Type, x::NTuple{N,U})

Reinterpret the bits of `x` as an `NTuple{M,T}` of matching total bit size.
Currently only supports `U == UInt64` and `T` a member of `$(Base.BitUnsigned)`.
"""
tuplecast(::Type{T}, x::NTuple) where {T} = map(x->x.value, _tuplecast(T, map(Core.VecElement, x)))
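As a self-contained illustration of the bitcast trick (a standalone sketch of the same technique, not the code above — `cast32`/`_cast32` are names invented here), this is the single case of two UInt64s reinterpreted as four UInt32s:

```julia
const V2U64 = NTuple{2,Core.VecElement{UInt64}}
const V4U32 = NTuple{4,Core.VecElement{UInt32}}

# LLVM vector bitcast: reinterprets the 128 bits in place, as if stored to
# memory and read back with the new element width.
@inline _cast32(x::V2U64) = Core.Intrinsics.llvmcall(
    """
    %res = bitcast <2 x i64> %0 to <4 x i32>
    ret <4 x i32> %res
    """, V4U32, Tuple{V2U64}, x)

# Wrap/unwrap the VecElements so callers see plain NTuples.
cast32(x::NTuple{2,UInt64}) = map(v -> v.value, _cast32(map(Core.VecElement, x)))

t = cast32((0x0123456789abcdef, 0xfedcba9876543210))
```

Since the bitcast is defined to match memory layout, the result agrees with `reinterpret(UInt32, collect(x))` regardless of endianness.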

Some benchmarks:

julia> r4 = XoshiroBlock{4}(Random.default_rng());

julia> rand(r4,UInt32)
(0x22af0c6b, 0x9f95088b, 0x522c2b96, 0x0b381ff3, 0x1e9b3903, 0x390e3a44, 0x86dd33a3, 0xdcd069a1)

julia> @code_native debuginfo=:none rand(r4,UInt32)

julia> using BenchmarkTools

julia> @btime [rand($(Random.default_rng()),UInt32) for _ in 1:2^15];  # scalarized generation
  27.900 μs (2 allocations: 128.05 KiB)

julia> @btime rand($(Random.default_rng()),UInt32,2^15); # vectorized generation
  6.150 μs (2 allocations: 128.05 KiB)

julia> @btime [rand($r4,UInt32) for _ in 1:2^12]; # 2^12 calls × 8 values = 2^15 values; faster than scalar, slower than vectorized
  16.400 μs (2 allocations: 128.05 KiB)