I think I’m almost there.
Here is some SIMDable code for generating infinite-precision round-down floats. It uses a “significand-exponent” algorithm like I described far above, generating the significand and exponent separately (but with appropriate distributions) and then combining them to produce the result. By using batched rejection sampling, it manages to SIMD. I have verified that it produces the correct distribution on Float16, so if there are mistakes at other sizes they’re fixable.
float sampling
using Random

@inline randtup(rng::AbstractRNG, ::Type{T}, N::Val) where T = ntuple(_->rand(rng, T), N)

function block_randf_sigexp(rng::Random.AbstractRNG, ::Type{T}) where T<:Base.IEEEFloat
    N = Val(32 ÷ sizeof(T)) # number of values to generate
    U = Base.uinttype(T) # same-width Unsigned
    sigmask = Base.significand_mask(T)
    estep = reinterpret(U, floatmin(T)) # least significant exponent bit
    eincr1 = U(leading_zeros(sigmask) * estep) # first time we fail to produce a valid exponent, decrement by this
    eincr = U(trailing_ones(sigmask) * estep) # additional times we fail to produce a valid exponent, decrement by this
    ebase = reinterpret(U, T(1//2)) - eincr1 # exponent base that we start with - we'll offset upward from here
    numscales, remscale = fldmod(ebase, eincr) # number of eincr steps to reach the bottom
    rembits = fld(remscale, estep) # how many leftover bits to reach floatmin
    remmask = U(2^rembits - 1) # mask including bottom rembits
    # expoffset(u::U) where U = U(8sizeof(U) - leading_zeros(u)) * estep # valid for any input, but SIMDs poorly
    expoffset(u::U) where U = reinterpret(U, convert(Base.floattype(U), u)) & ~sigmask - reinterpret(U, convert(Base.floattype(U), 1//2)) # good SIMD but only correct for 1 <= u <= maxintfloat
    u = randtup(rng, U, N)
    sig = u .& sigmask # we'll use this for the significand
    uhi = u .>>> trailing_ones(sigmask) # get our remaining bits
    ebits = ebase .+ expoffset.(uhi)
    good = .!iszero.(uhi)
    all(good) && return reinterpret.(T, ebits .| sig)
    for _ in 1:numscales
        ebase -= eincr
        u = randtup(rng, U, N) .& sigmask # use to generate exponent
        ebits = ifelse.(good, ebits, ebase .+ expoffset.(u))
        good = good .| .!iszero.(u)
        all(good) && return reinterpret.(T, ebits .| sig)
    end
    if iszero(rembits)
        # move directly to subnormals
        ebits = ifelse.(good, ebits, zero(U))
    else
        # still have space to fill down to subnormals
        u = randtup(rng, U, N) .& remmask # use to generate exponent
        ebits = ifelse.(good, ebits, ifelse.(iszero.(u), zero(U), expoffset.(u))) # finally need to handle iszero(u) correctly
    end
    return reinterpret.(T, ebits .| sig)
end
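As a sanity check on the distribution claim, here's a sketch of the kind of test I mean (the check_rounddown helper below is just for illustration, not the exact verification I ran): for an ideal round-down sampler, P(X < x) equals x exactly for every representable x in [0, 1), so the empirical fraction of samples below a few probe points should converge to those probe values.

# Illustration only: empirical spot-check that P(X < x) ≈ x for a round-down sampler.
function check_rounddown(rng, ::Type{T}, nbatches) where T
    samples = T[]
    for _ in 1:nbatches
        append!(samples, block_randf_sigexp(rng, T))   # each call yields a small tuple of T
    end
    for x in (T(0.001), T(0.25), T(0.5), prevfloat(one(T)))
        emp = count(<(x), samples) / length(samples)   # empirical P(X < x)
        println(x, " => empirical ", emp, ", exact ", Float64(x))
    end
end

check_rounddown(Random.default_rng(), Float16, 10^5)

For Float16 the value set is small enough that one can also tabulate exact per-value probabilities (each x in [0,1) should occur with probability nextfloat(x) - x), but the spot-check above conveys the idea.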
Upthread, we found that leading_zeros did not SIMD well on all but the most recent x86s, so this implementation doesn’t use it. It’s possible that we could do better still on machines with native SIMD leading_zeros. In any case, this one uses int->float conversions to do that part, at the cost of restricting the input range (such that we can only use significand_bits bits per conversion instead of the full width). But the chance of a redraw even at this rate is so vanishingly small that the waste is negligible.
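To make that trick concrete, here's a standalone illustration (not part of the sampler; U and F are just example choices): for 1 <= u <= maxintfloat, converting u to a float and reading the exponent field off its bit pattern gives the same answer as the leading_zeros formulation, namely (bit width - leading_zeros(u)) exponent steps.

# Standalone check that the two expoffset formulations agree on the restricted range.
let U = UInt32, F = Float32
    sigmask = Base.significand_mask(F)
    estep = reinterpret(U, floatmin(F))                 # one exponent step
    expoffset_lz(u) = U(8sizeof(U) - leading_zeros(u)) * estep
    expoffset_fp(u) = (reinterpret(U, convert(F, u)) & ~sigmask) - reinterpret(U, F(1//2))
    for u in (one(U), U(2), U(3), U(1000), U(maxintfloat(F)))
        @assert expoffset_lz(u) == expoffset_fp(u)
    end
    println("expoffset formulations agree at the tested points")
end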
I also took a crack at a SIMDing random number generator to feed this.
block RNG
using Random
using Random.XoshiroSimd: Xoshiro, forkRand, _plus, _rotl23, _shl17, _xor, _rotl45

mutable struct XoshiroBlock{N} <: Random.AbstractRNG
    s0::NTuple{N,Core.VecElement{UInt64}}
    s1::NTuple{N,Core.VecElement{UInt64}}
    s2::NTuple{N,Core.VecElement{UInt64}}
    s3::NTuple{N,Core.VecElement{UInt64}}
end

XoshiroBlock{N}(rng::Union{TaskLocalRNG, Xoshiro}) where N = XoshiroBlock{N}(forkRand(rng, Val(N))...)

@inline function Base.rand(rng::XoshiroBlock{N}) where {N}
    # based on Random.XoshiroSimd.xoshiro_bulk_simd
    s0 = rng.s0 # read
    s1 = rng.s1 # read
    s2 = rng.s2 # read
    s3 = rng.s3 # read
    res = _plus(_rotl23(_plus(s0,s3)),s0)
    t = _shl17(s1)
    s2 = _xor(s2, s0)
    s3 = _xor(s3, s1)
    s1 = _xor(s1, s2)
    s0 = _xor(s0, s3)
    s2 = _xor(s2, t)
    s3 = _rotl45(s3)
    rng.s0 = s0 # write
    rng.s1 = s1 # write
    rng.s2 = s2 # write
    rng.s3 = s3 # write
    return map(x->x.value, res) # result as a NTuple{N,UInt64}
end

@inline Base.rand(rng::XoshiroBlock, ::Type{T}) where T = @inline tuplecast(T, rand(rng))

@inline function randtup(rng::XoshiroBlock{N}, ::Type{T}, ::Val{M}) where {M,N,T}
    M*sizeof(T) == N*sizeof(UInt64) || error("XoshiroBlock{$N} will only generate NTuple{$(N*sizeof(UInt64)÷sizeof(T))}")
    return rand(rng, T)
end

for m in 0:4
    u = 8*sizeof(UInt64)
    M = 2^m
    MVU = NTuple{M,Core.VecElement{UInt64}}
    for T in Base.uniontypes(Base.BitUnsigned)
        t = 8*sizeof(T)
        N = fld(M*u, t)
        if N > 0
            NVT = NTuple{N,Core.VecElement{T}}
            code = """
                %res = bitcast <$M x i$u> %0 to <$N x i$t>
                ret <$N x i$t> %res
                """
            @eval @inline _tuplecast(::Type{$T}, x::$MVU) = Core.Intrinsics.llvmcall($code, $NVT, Tuple{$MVU}, x)
        end
    end
end

"""
    tuplecast(T::Type, x::NTuple{N,U})

Reinterpret the bits of `x` as a `NTuple{M,T}` of matching size.
Currently only supports `U == UInt64` and `T isa $(Base.BitUnsigned)`.
"""
tuplecast(::Type{T}, x::NTuple) where {T} = map(x->x.value, _tuplecast(T, map(Core.VecElement, x)))
For UIntXX, this block RNG is considerably slower than the builtin implementation and I haven’t fully investigated why. My guess is that the difference comes from constantly updating the mutable struct that holds the RNG state, whereas the builtin keeps the state in local immutables, which probably avoids the memory traffic.
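If the mutable struct is indeed the culprit, a bulk method that reads the state into locals once, advances it in registers for the whole batch, and writes it back a single time might close most of that gap. Here's an untested sketch along those lines (bulk_rand! is a name I made up; it reuses the XoshiroSimd helpers imported above):

# Untested sketch: keep the Xoshiro state in locals for the whole batch instead of
# round-tripping through the mutable struct on every call.
function bulk_rand!(rng::XoshiroBlock{N}, dst::Vector{UInt64}) where N
    s0, s1, s2, s3 = rng.s0, rng.s1, rng.s2, rng.s3    # single read of the state
    i = 1
    @inbounds while i + N - 1 <= length(dst)
        res = _plus(_rotl23(_plus(s0, s3)), s0)        # same update as Base.rand above
        t = _shl17(s1)
        s2 = _xor(s2, s0)
        s3 = _xor(s3, s1)
        s1 = _xor(s1, s2)
        s0 = _xor(s0, s3)
        s2 = _xor(s2, t)
        s3 = _rotl45(s3)
        for j in 1:N
            dst[i + j - 1] = res[j].value
        end
        i += N
    end
    rng.s0, rng.s1, rng.s2, rng.s3 = s0, s1, s2, s3    # single write back
    return dst
end

I haven't benchmarked that variant, so the comparison below uses the mutable-struct version exactly as posted.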
Here’s a performance comparison between these and the current RNG:
julia> using Random, BenchmarkTools
julia> r = Random.default_rng();
julia> r4 = XoshiroBlock{4}(r); # generate 4*64 bits at a time
julia> @btime rand($r, UInt64, 2^17); # raw RNG performance
42.500 μs (2 allocations: 1.00 MiB)
julia> @btime [randtup($r4, UInt64) for _ in 1:2^15]; # raw RNG performance
120.400 μs (2 allocations: 1.00 MiB)
julia> @btime rand($r, Float32, 2^18);
49.500 μs (2 allocations: 1.00 MiB)
julia> @btime [block_randf_sigexp($r4, Float32) for _ in 1:2^15];
130.000 μs (2 allocations: 1.00 MiB)
julia> @btime rand($r, Float64, 2^17);
94.000 μs (2 allocations: 1.00 MiB)
julia> @btime [block_randf_sigexp($r4, Float64) for _ in 1:2^15];
135.700 μs (2 allocations: 1.00 MiB)
One thing that stands out to me is that the current rand is 1.8x slower for Float64 than for Float32 at the same number of bytes. With proper SIMD, one would expect them to be comparable (which is what we see in my block_randf_sigexp). I don’t know why this is, but it suggests that Float64 could be improved.

In any case, this in-hand implementation is 2.5x slower for Float32 and 1.5x slower for Float64. I’ll point out that the time to produce floats is only slightly longer than the cost of UInt64 generation for all schemes (except the aforementioned rand(Float64, N) slowness), so a better block RNG than the one I provided might shrink this gap considerably. Very optimistically, if an improved block RNG matched the builtin UInt64 rate, adding only the roughly 10us extra my clumsy block generator needs to produce 1 MiB of Float32 (or roughly 15us extra for Float64) rather than UInt64 suggests that we might produce infinite-precision Float32 in 1.1x the current time (55 vs 50us) and Float64 in 0.6x the current time (60 vs 95us), although I suspect the actual numbers will be somewhat worse.
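For the record, the back-of-envelope arithmetic behind those last figures, using only the benchmark numbers above (all times per 1 MiB of output, in microseconds):

# extra time block_randf_sigexp adds on top of the block RNG's raw UInt64 time
float32_overhead = 130.0 - 120.4   # ≈ 10 μs
float64_overhead = 135.7 - 120.4   # ≈ 15 μs
# assume a better block RNG could match the builtin's raw UInt64 rate of 42.5 μs
42.5 + float32_overhead            # ≈ 52 μs, call it 55, vs the current ~50 μs for Float32
42.5 + float64_overhead            # ≈ 58 μs, call it 60, vs the current ~95 μs for Float64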