Below I offer a corrected version of my previous SIMD-able full-precision implementation.
Here is scalar pseudocode
scale = 2^-32
while scale > 0
u = rand(UInt32)
f = float(u, RoundDown) * scale
u >= 2^24 && return f
scale *= 2^-8
end
f = float(u >>> dropbits) * eps(0f0)
return f
and the actual code
using Random
function trunctofloat(u::U) where U<:Base.BitUnsigned
F = Base.floattype(U)
lomask = typemax(U) >>> (8sizeof(F) - Base.significand_bits(F) - 1) # masking with this number ensures exact representability
hi = F(u & ~lomask)
lo = F(u & lomask)
f0 = hi + lo # combine the parts
roundup = f0 - hi > lo # was the roundoff error of f0 upward?
f = reinterpret(F, reinterpret(U, f0) - roundup) # decrement if rounding was upward
return f
end
block_randf(t::Type,n::Val) = block_randf(Random.default_rng(),t,n)
function block_randf(rng::Random.AbstractRNG, ::Type{T}, N::Val) where {T<:Base.IEEEFloat}
U = Base.uinttype(T) # same-width Unsigned
mineps = eps(zero(T)) # minimum spacing between values that we want to achieve
umax = U(trunctofloat(typemax(U)))
sparebits = trailing_zeros(umax) # how many "extra" bits we have per draw
usebits = 8sizeof(U) - sparebits # how many bits we can possibly consume per draw
redrawbelow = U(exp2(usebits)) # values smaller than this lose precision and must be redrawn
redrawscale = T(exp2(-sparebits)) # if a value fails, scale by this and redraw
scale = T(inv(redrawbelow))
todrop = Int(log2(scale) - log2(mineps)) # how many octaves we need to drop via `redrawscale`
numscales,leftover_bottom = fldmod(todrop, sparebits) # how many times we must apply `redrawscale` to get within reach of `mineps`
remainingbits = leftover_bottom + usebits # bits remaining beyond the lowest usable scale
f = ntuple(Returns(zero(T)), N) # the values we plan to return
good = ntuple(Returns(false), N) # track which lanes are valid
for _ in 1:numscales
scale *= redrawscale # prepare to evaluate this scale
u = ntuple(_->rand(rng, U), N) # TODO: need a SIMD version of this
willbegood = u .>= redrawbelow # values that will be accepted at this scale
fnew = trunctofloat.(u) .* scale # compute the values
f = ifelse.(good, f, fnew) # replace bad values from previous iterations
good = good .| willbegood # mark values to keep
all(good) && return f # all values are valid
end
# if we made `scale` smaller it would underflow to 0
u = ntuple(_->rand(rng, U), N) # TODO: need a SIMD version of this
u = u .>>> (8sizeof(U) - remainingbits) # we can't use all the bits at this level
fnew = trunctofloat.(u) .* mineps # compute the values
f = ifelse.(good, f, fnew) # replace bad values from previous iterations
return f
end
I verified this on Float16
(that it produces all values and with appropriate frequencies), although Float16
should maybe be done slightly differently due to the small amount of entropy and high redraw probability of 2^{-5} (which is exacerbated by blockwise computations). Hopefully I have everything set to generalize to longer floats correctly, although they require too many samples to verify emprically.
That said, this algorithm is somewhat poor in its worst-case use of entropy. It discards values with probabilities 2^{-8} for Float32
, but when it does it utilizes only 8 bits of entropy from that draw (2^{-11} and 11 bits for Float64
). This is probably a tradeoff, in that it streamlines the logic at the final level (which any level is likely to be) while having a larger-than-necessary worst-case number of iterations. I tried a “sig-exp”-style algorithm but it was slower and the frequencies of the very least-likely values were incorrect.
There’s likely still room for appreciable optimization, but here are some very crude benchmarks. Note that this is still missing SIMD rand(UIntXX)
calls so I also compare it against our current scalar rand(FloatXX)
. But how this performs with actual SIMD RNG is the actual question.
julia> using BenchmarkTools
julia> @btime [block_randf(Float32,Val(16)) for _ in 1:2^14];
440.900 μs (2 allocations: 1.00 MiB)
julia> @btime [rand(Float32) for _ in 1:16*2^14];
294.200 μs (2 allocations: 1.00 MiB)
julia> @btime rand(Float32, 16*2^14);
61.900 μs (2 allocations: 1.00 MiB)