Output distribution of rand(Float32) and rand(Float64), thread 2

Below I offer a corrected version of my previous SIMD-able full-precision implementation.

Here is scalar pseudocode

scale = 2^-32
while scale > 0
	u = rand(UInt32)
	f = float(u, RoundDown) * scale
	u >= 2^24 && return f
	scale *= 2^-8
end
f = float(u >>> dropbits) * eps(0f0)
return f

and the actual code

using Random

function trunctofloat(u::U) where U<:Base.BitUnsigned
	F = Base.floattype(U)
	lomask = typemax(U) >>> (8sizeof(F) - Base.significand_bits(F) - 1) # masking with this number ensures exact representability
	hi = F(u & ~lomask)
	lo = F(u &  lomask)
	f0 = hi + lo # combine the parts
	roundup = f0 - hi > lo # was the roundoff error of f0 upward?
	f = reinterpret(F, reinterpret(U, f0) - roundup) # decrement if rounding was upward
	return f
end

block_randf(t::Type,n::Val) = block_randf(Random.default_rng(),t,n)

function block_randf(rng::Random.AbstractRNG, ::Type{T}, N::Val) where {T<:Base.IEEEFloat}
	U = Base.uinttype(T) # same-width Unsigned
	mineps = eps(zero(T)) # minimum spacing between values that we want to achieve

	umax = U(trunctofloat(typemax(U)))
	sparebits = trailing_zeros(umax) # how many "extra" bits we have per draw
	usebits = 8sizeof(U) - sparebits # how many bits we can possibly consume per draw
	redrawbelow = U(exp2(usebits)) # values smaller than this lose precision and must be redrawn
	redrawscale = T(exp2(-sparebits)) # if a value fails, scale by this and redraw
	scale = T(inv(redrawbelow))

	todrop = Int(log2(scale) - log2(mineps)) # how many octaves we need to drop via `redrawscale`
	numscales,leftover_bottom = fldmod(todrop, sparebits) # how many times we must apply `redrawscale` to get within reach of `mineps`
	remainingbits = leftover_bottom + usebits # bits remaining beyond the lowest usable scale

	f = ntuple(Returns(zero(T)), N) # the values we plan to return
	good = ntuple(Returns(false), N) # track which lanes are valid
	for _ in 1:numscales
		scale *= redrawscale # prepare to evaluate this scale
		u = ntuple(_->rand(rng, U), N) # TODO: need a SIMD version of this
		willbegood = u .>= redrawbelow # values that will be accepted at this scale
		fnew = trunctofloat.(u) .* scale # compute the values
		f = ifelse.(good, f, fnew) # replace bad values from previous iterations
		good = good .| willbegood # mark values to keep
		all(good) && return f # all values are valid
	end
	# if we made `scale` smaller it would underflow to 0
	u = ntuple(_->rand(rng, U), N) # TODO: need a SIMD version of this
	u = u .>>> (8sizeof(U) - remainingbits) # we can't use all the bits at this level
	fnew = trunctofloat.(u) .* mineps # compute the values
	f = ifelse.(good, f, fnew) # replace bad values from previous iterations
	return f
end

I verified this on Float16 (that it produces all values and with appropriate frequencies), although Float16 should maybe be done slightly differently due to the small amount of entropy and high redraw probability of 2^{-5} (which is exacerbated by blockwise computations). Hopefully I have everything set to generalize to longer floats correctly, although they require too many samples to verify emprically.

That said, this algorithm is somewhat poor in its worst-case use of entropy. It discards values with probabilities 2^{-8} for Float32, but when it does it utilizes only 8 bits of entropy from that draw (2^{-11} and 11 bits for Float64). This is probably a tradeoff, in that it streamlines the logic at the final level (which any level is likely to be) while having a larger-than-necessary worst-case number of iterations. I tried a “sig-exp”-style algorithm but it was slower and the frequencies of the very least-likely values were incorrect.

There’s likely still room for appreciable optimization, but here are some very crude benchmarks. Note that this is still missing SIMD rand(UIntXX) calls so I also compare it against our current scalar rand(FloatXX). But how this performs with actual SIMD RNG is the actual question.

julia> using BenchmarkTools

julia> @btime [block_randf(Float32,Val(16)) for _ in 1:2^14];
  440.900 μs (2 allocations: 1.00 MiB)

julia> @btime [rand(Float32) for _ in 1:16*2^14];
  294.200 μs (2 allocations: 1.00 MiB)

julia> @btime rand(Float32, 16*2^14);
  61.900 μs (2 allocations: 1.00 MiB)
1 Like