Output distribution of rand(Float32) and rand(Float64), thread 2

Since I keep advocating performance comparisons, here’s one. I already made a preliminary demonstration of SIMDable rejection sampling, so I’ll set that aside for a moment. Below is a series of scalar implementations of sampling schemes.

randf1 produces a float using a full word of entropy (32 bits for Float32, 64 bits for Float64). randf1_badrounding does the same but uses native round-nearest-ties-even rounding, so it is not correct, although it is indicative of the performance we could get if we could change the rounding mode or a wizard could give us a better implementation. randf produces a float with infinite precision, and randf_badrounding does the same with the same flawed rounding as the finite-precision version.
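
To make the rounding problem concrete, here is a quick REPL illustration (my own, not from the implementations below): converting typemax(UInt32) to Float32 rounds it up to 2^32, so the scaled result is exactly 1.0f0, which is outside [0, 1). Rounding the integer toward zero first, as trunctofloat below does, keeps the result below 1.

julia> Float32(typemax(UInt32)) * Float32(exp2(-32)) # 2^32-1 rounds up to 2^32, landing exactly on 1
1.0f0

julia> Float32(typemax(UInt32) & ~UInt32(0xff)) * Float32(exp2(-32)) # clearing the low 8 bits rounds the integer down to an exactly representable value
0.99999994f0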

Scalar implementations
using Random

function trunctofloat(u::U) where U<:Base.BitUnsigned
	F = Base.floattype(U)
	lomask = typemax(U) >>> (8sizeof(F) - Base.significand_bits(F) - 1) # masking with this number ensures exact representability
	count_ones(lomask) == trailing_ones(lomask) >= 4sizeof(U) || error("invalid mask") # mask should be dense on the low bits and at least half the bitwidth
	hi = F(u & ~lomask)
	lo = F(u &  lomask)
	f0 = hi + lo # combine the parts
	roundup = f0 - hi > lo # was the roundoff error of f0 upward?
	f = reinterpret(F, reinterpret(U, f0) - roundup) # decrement if rounding was upward
	return f
end

randf1(t::Type) = randf1(Random.default_rng(),t)
function randf1(rng::Random.AbstractRNG, ::Type{T}) where T<:Base.IEEEFloat
	U = Base.uinttype(T) # same-width Unsigned
	scale = T(exp2(-8sizeof(U)))
	u = rand(rng, U)
	return trunctofloat(u) * scale # compute the value
end

randf1_badrounding(t::Type) = randf1_badrounding(Random.default_rng(),t)
function randf1_badrounding(rng::Random.AbstractRNG, ::Type{T}) where T<:Base.IEEEFloat
	U = Base.uinttype(T) # same-width Unsigned
	scale = T(exp2(-8sizeof(U)))
	u = rand(rng, U)
	return T(u) * scale # compute the value
end

randf(t::Type) = randf(Random.default_rng(),t)
function randf(rng::Random.AbstractRNG, ::Type{T}) where T<:Base.IEEEFloat
	U = Base.uinttype(T) # same-width Unsigned
	mineps = eps(zero(T)) # minimum spacing between values that we want to achieve

	umax = U(trunctofloat(typemax(U)))
	sparebits = trailing_zeros(umax) # how many "extra" bits we have per draw
	usebits = 8sizeof(U) - sparebits # how many bits we can possibly consume per draw
	redrawbelow = U(exp2(usebits)) # values smaller than this lose precision and must be redrawn
	redrawscale = T(exp2(-sparebits)) # if a value fails, scale by this and redraw
	scale = T(inv(redrawbelow))

	todrop = Int(log2(scale) - log2(mineps)) # how many octaves we need to drop via `redrawscale`
	numscales,leftover_bottom = fldmod(todrop, sparebits) # how many times we must apply `redrawscale` to get within reach of `mineps`
	remainingbits = leftover_bottom + usebits # bits remaining beyond the lowest usable scale
	lastlevelmask = typemax(U) >>> (8sizeof(U) - remainingbits) # mask for values at lowest scale

	for _ in 1:numscales
		scale *= redrawscale # prepare to evaluate next scale
		u = rand(rng, U)
		if u >= redrawbelow
			return trunctofloat(u) * scale # compute the value
		end
	end
	# if we made `scale` smaller it would underflow to 0
	u = rand(rng, U)
	u = u & lastlevelmask # we can't use all the bits at this level
	return trunctofloat(u) * mineps # compute the value
end

randf_badrounding(t::Type) = randf_badrounding(Random.default_rng(),t)
function randf_badrounding(rng::Random.AbstractRNG, ::Type{T}) where T<:Base.IEEEFloat
	U = Base.uinttype(T) # same-width Unsigned
	mineps = eps(zero(T)) # minimum spacing between values that we want to achieve

	umax = U(trunctofloat(typemax(U)))
	sparebits = trailing_zeros(umax) # how many "extra" bits we have per draw
	usebits = 8sizeof(U) - sparebits # how many bits we can possibly consume per draw
	redrawbelow = U(exp2(usebits)) # values smaller than this lose precision and must be redrawn
	redrawscale = T(exp2(-sparebits)) # if a value fails, scale by this and redraw
	scale = T(inv(redrawbelow))

	todrop = Int(log2(scale) - log2(mineps)) # how many octaves we need to drop via `redrawscale`
	numscales,leftover_bottom = fldmod(todrop, sparebits) # how many times we must apply `redrawscale` to get within reach of `mineps`
	remainingbits = leftover_bottom + usebits # bits remaining beyond the lowest usable scale
	lastlevelmask = typemax(U) >>> (8sizeof(U) - remainingbits) # mask for values at lowest scale

	for _ in 1:numscales
		scale *= redrawscale # prepare to evaluate this scale
		u = rand(rng, U)
		if u >= redrawbelow
			return T(u) * scale # compute the value
		end
	end
	# if we made `scale` smaller it would underflow to 0
	u = rand(rng, U)
	u = u & lastlevelmask # we can't use all the bits at this level
	return T(u) * mineps # compute the value
end
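
Before benchmarking, a tiny sanity check I like to run on the infinite-precision sampler (my own sketch, not part of the implementations above): every draw should lie in [0, 1), and with 10^6 draws the redraw branch should all but certainly have produced at least one value below 2^-8.

using Random
rng = Random.Xoshiro(0)
samples = [randf(rng, Float32) for _ in 1:10^6]
@assert all(x -> zero(x) <= x < one(x), samples) # every draw lies in [0, 1)
@assert minimum(samples) < exp2(-8)              # the redraw/rescale branch was exercised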

And here is a benchmark comparing them all on my 2020ish x86 laptop running Julia v1.9.1. I include vectorized rand for reference, but really the comparison should be made to the scalarized versions (those using a comprehension).

julia> using BenchmarkTools

julia> @btime rand(Float32,2^16); # vectorized performance, for reference
  15.700 μs (2 allocations: 256.05 KiB)

julia> @btime [rand(Float32) for _ in 1:2^16]; # current (24 bits)
  77.400 μs (2 allocations: 256.05 KiB)

julia> @btime [randf1_badrounding(Float32) for _ in 1:2^16]; # invalid rounding (32 bits)
  79.200 μs (2 allocations: 256.05 KiB)

julia> @btime [randf1(Float32) for _ in 1:2^16]; # valid (32 bits)
  137.800 μs (2 allocations: 256.05 KiB)

julia> @btime [randf_badrounding(Float32) for _ in 1:2^16]; # invalid (infinite bits, 149 in practice)
  192.200 μs (2 allocations: 256.05 KiB)

julia> @btime [randf(Float32) for _ in 1:2^16]; # valid (infinite bits, 149 in practice)
  265.800 μs (2 allocations: 256.05 KiB)

julia> @btime rand(Float64,2^16); # vectorized performance, for reference
  49.300 μs (2 allocations: 512.05 KiB)

julia> @btime [rand(Float64) for _ in 1:2^16]; # current (53 bits)
  74.300 μs (2 allocations: 512.05 KiB)

julia> @btime [randf1_badrounding(Float64) for _ in 1:2^16]; # invalid rounding (64 bits)
  95.100 μs (2 allocations: 512.05 KiB)

julia> @btime [randf1(Float64) for _ in 1:2^16]; # valid (64 bits)
  151.800 μs (2 allocations: 512.05 KiB)

julia> @btime [randf_badrounding(Float64) for _ in 1:2^16]; # invalid rounding (infinite bits, 1074 in practice)
  206.100 μs (2 allocations: 512.05 KiB)

julia> @btime [randf(Float64) for _ in 1:2^16]; # valid (infinite bits, 1074 in practice)
  305.200 μs (2 allocations: 512.05 KiB)

Note that these runtimes were quite variable, so we should only compare the numbers roughly. Benchmark more on your own, if you like.
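
If you want to reproduce the whole table in one shot, a loop along these lines should work (a sketch using BenchmarkTools’ @belapsed; timings printed in microseconds):

using BenchmarkTools
for T in (Float32, Float64), f in (rand, randf1_badrounding, randf1, randf_badrounding, randf)
    t = @belapsed [$f($T) for _ in 1:2^16] # minimum elapsed time, in seconds
    println(rpad("$f($T)", 40), round(t * 1e6; digits=1), " μs")
end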

Since the vectorized implementations are largely just parallelizing this entire function at the register level (and I have shown that this is possible), I expect the scalar comparison to be somewhat indicative of a comparison of vectorized algorithms.

The _badrounding variants are only here to show a limit on how things might perform if a faster rounding scheme were used (either by changing the rounding mode or by a better implementation). They are not suitable for use as-is, but they show that a significant amount of performance is being lost to rounding.

I will call the scalarized rand calls (24 or 53 bits) the baseline. Relative to the baseline, the “full-width” versions carry a small-to-1.5x (if fast rounding is achieved) to 2x (with my current rounding) performance penalty. Relative to the baseline, the “infinite precision” variants carry a 2.5x (fast rounding) to 4x (my current rounding) performance penalty.

These gaps could shrink as people make superior implementations.

A 4x regression is getting to be a lot. But I could be convinced to eat a 2x or maybe 3x penalty to end this discussion once and for all. Or, if that’s too much, we could make this functionality a separate function so that a user can choose whether they want speed or precision.

This also shows that there is very little room (roughly a 2x performance gap to bridge) for implementations with precision wider than native but narrower than infinite.