It’s worth looking at how `Random` generates values using SIMD. For the default RNG, at least, this is done by forking the RNG into (currently) 8 parallel streams that can be run via SIMD parallelization.
It isn’t a challenge to do this without allocation: just generate values as `NTuple`s and off you go. There’s overhead in forking the RNGs, although the built-in code amortizes that by only taking this path for sufficiently large requests. For more interactive use, one could retain the forked state.
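To get a sense of what that forking looks like, here is a minimal sketch poking at the internals. Note that `forkRand` is not public API, so this may break across Julia versions:

```julia
using Random

# Fork the task-local RNG into 8 interleaved xoshiro256 states, one per SIMD
# lane. Each of the four state words comes back as an
# NTuple{8, Core.VecElement{UInt64}}. (Internal API; subject to change.)
s0, s1, s2, s3 = Random.XoshiroSimd.forkRand(Random.default_rng(), Val(8))
```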
If I recall correctly (which is unlikely), back when the default was a Mersenne Twister, Julia carried separate scalar and vector RNG streams per task, with all the state initialized up front. However, with the swap to Xoshiro (which is much smaller and seems to be super nice), maintaining the vector stream was a lot of overhead for small tasks, so it was dropped. Thus, in the current implementation the RNG is re-forked at every bulk use.
Like you, I’ve come across an instance where I wanted finer access to batch generation (as part of a discussion on `rand` for floats). The code I linked is only written to fill memory with values, but I would like them returned as `NTuple`s so I can sample them incrementally and do further processing without going through RAM.
An adaptation I was experimenting with
Since we’re talking about it, here’s something I was toying with by lightly modifying code from `XoshiroSimd.jl` to generate `NTuple`s of `UIntXX` values using SIMD. It appears to be slower than the built-in for actual repeated generation, however, despite using essentially the same kernel. I suspect this is because it writes the `mutable struct` RNG state back to RAM on every call, but I haven’t dug into it properly. For comparison, in `XoshiroSimd.jl` the state is only ever kept in immutable locals, which might help.
```julia
using Random
using Random.XoshiroSimd: Xoshiro, forkRand, _plus, _rotl23, _shl17, _xor, _rotl45

mutable struct XoshiroBlock{N}
    s0::NTuple{N,Core.VecElement{UInt64}}
    s1::NTuple{N,Core.VecElement{UInt64}}
    s2::NTuple{N,Core.VecElement{UInt64}}
    s3::NTuple{N,Core.VecElement{UInt64}}
end

XoshiroBlock{N}(rng::Union{TaskLocalRNG, Xoshiro}) where N = XoshiroBlock{N}(forkRand(rng, Val(N))...)

@inline function Base.rand(rng::XoshiroBlock{N}, ::Type{T}) where {N,T}
    # based on Random.XoshiroSimd.xoshiro_bulk_simd
    s0 = rng.s0 # read
    s1 = rng.s1 # read
    s2 = rng.s2 # read
    s3 = rng.s3 # read
    res = _plus(_rotl23(_plus(s0,s3)),s0)
    # advance the N interleaved xoshiro256 states
    t = _shl17(s1)
    s2 = _xor(s2, s0)
    s3 = _xor(s3, s1)
    s1 = _xor(s1, s2)
    s0 = _xor(s0, s3)
    s2 = _xor(s2, t)
    s3 = _rotl45(s3)
    rng.s0 = s0 # write
    rng.s1 = s1 # write
    rng.s2 = s2 # write
    rng.s3 = s3 # write
    restup = map(x->x.value, res) # result as an NTuple{N,UInt64}
    return @inline tuplecast(T, restup)
end

# generate bitcast kernels from M-lane UInt64 vectors to N-lane vectors of T
for m in 0:4
    u = 8*sizeof(UInt64)
    M = 2^m
    MVU = NTuple{M,Core.VecElement{UInt64}}
    for T in Base.uniontypes(Base.BitUnsigned)
        t = 8*sizeof(T)
        N = fld(M*u, t)
        if N > 0
            NVT = NTuple{N,Core.VecElement{T}}
            code = """
                %res = bitcast <$M x i$u> %0 to <$N x i$t>
                ret <$N x i$t> %res
                """
            @eval @inline _tuplecast(::Type{$T}, x::$MVU) = Core.Intrinsics.llvmcall($code, $NVT, Tuple{$MVU}, x)
        end
    end
end

"""
    tuplecast(T::Type, x::NTuple{N,U})

Reinterpret the bits of `x` as an `NTuple{M,T}` of matching total size.
Currently only supports `U == UInt64` and `T` a member of `$(Base.BitUnsigned)`.
"""
tuplecast(::Type{T}, x::NTuple) where {T} = map(x->x.value, _tuplecast(T, map(Core.VecElement, x)))
```
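For example, on a little-endian machine `tuplecast` splits a single `UInt64` into its low and high 32-bit halves, in that order:

```julia
julia> tuplecast(UInt32, (0x0123456789abcdef,))
(0x89abcdef, 0x01234567)
```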
Some assessments:
```julia
julia> r4 = XoshiroBlock{4}(Random.default_rng());

julia> rand(r4, UInt32) # 4 lanes of 64 bits = 8 UInt32s per call
(0x22af0c6b, 0x9f95088b, 0x522c2b96, 0x0b381ff3, 0x1e9b3903, 0x390e3a44, 0x86dd33a3, 0xdcd069a1)

julia> @code_native debuginfo=:none rand(r4, UInt32) # output elided

julia> using BenchmarkTools

julia> @btime [rand($(Random.default_rng()), UInt32) for _ in 1:2^15]; # scalarized generation
  27.900 μs (2 allocations: 128.05 KiB)

julia> @btime rand($(Random.default_rng()), UInt32, 2^15); # vectorized generation
  6.150 μs (2 allocations: 128.05 KiB)

julia> @btime [rand($r4, UInt32) for _ in 1:2^12]; # 2^12 calls × 8 values = 2^15 values: faster than scalar, slower than vector
  16.400 μs (2 allocations: 128.05 KiB)
```
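To test the write-back hypothesis, a natural experiment would be an immutable version that returns the advanced state alongside the values, so nothing has to be stored back to RAM between calls. Just a sketch along those lines (the names `XoshiroChunk`, `nextrand`, and `xorall` are hypothetical, it reuses the imports and `tuplecast` from above, and I haven’t benchmarked it):

```julia
# Hypothetical immutable variant of XoshiroBlock: instead of mutating, each
# call returns (values, new state), so the state can live in registers.
struct XoshiroChunk{N}
    s0::NTuple{N,Core.VecElement{UInt64}}
    s1::NTuple{N,Core.VecElement{UInt64}}
    s2::NTuple{N,Core.VecElement{UInt64}}
    s3::NTuple{N,Core.VecElement{UInt64}}
end

XoshiroChunk{N}(rng::Union{TaskLocalRNG, Xoshiro}) where N =
    XoshiroChunk{N}(forkRand(rng, Val(N))...)

@inline function nextrand(rng::XoshiroChunk{N}, ::Type{T}) where {N,T}
    s0, s1, s2, s3 = rng.s0, rng.s1, rng.s2, rng.s3
    res = _plus(_rotl23(_plus(s0, s3)), s0)
    # same xoshiro256 state advance as above, on locals only
    t = _shl17(s1)
    s2 = _xor(s2, s0)
    s3 = _xor(s3, s1)
    s1 = _xor(s1, s2)
    s0 = _xor(s0, s3)
    s2 = _xor(s2, t)
    s3 = _rotl45(s3)
    vals = tuplecast(T, map(x -> x.value, res))
    return vals, XoshiroChunk{N}(s0, s1, s2, s3)
end

# Usage: thread the state through a loop inside a function.
function xorall(rng::XoshiroChunk, ::Type{T}, n) where {T}
    acc = zero(T)
    for _ in 1:n
        vals, rng = nextrand(rng, T)
        acc ⊻= reduce(⊻, vals)
    end
    return acc
end

xorall(XoshiroChunk{4}(Random.default_rng()), UInt32, 2^12)
```

If the mutable write-back really is the bottleneck, this version should let the compiler keep the whole state in registers across iterations.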