Since I keep advocating performance comparisons, here's one. I already made a preliminary demonstration of SIMDable rejection sampling, so I'll put that aside for the moment. Below is a series of scalar implementations of sampling schemes.
randf1 produces a float from a full word of entropy (32 bits for Float32, 64 bits for Float64). randf1_badrounding does the same but uses the native round-nearest-ties-even conversion, so it is not correct; it is, however, indicative of the performance we could reach if we could change the rounding mode or a wizard gave us a better rounding implementation. randf produces a float with infinite precision, and randf_badrounding does the same with the same flawed rounding as the finite-precision version.
Scalar implementations
using Random
function trunctofloat(u::U) where U<:Base.BitUnsigned
    F = Base.floattype(U)
    lomask = typemax(U) >>> (8sizeof(F) - Base.significand_bits(F) - 1) # masking with this number ensures exact representability
    count_ones(lomask) == trailing_ones(lomask) >= 4sizeof(U) || error("invalid mask") # mask should be dense on the low bits and at least half the bitwidth
    hi = F(u & ~lomask)
    lo = F(u & lomask)
    f0 = hi + lo # combine the parts
    roundup = f0 - hi > lo # was the roundoff error of f0 upward?
    f = reinterpret(F, reinterpret(U, f0) - roundup) # decrement if rounding was upward
    return f
end
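To make the truncation semantics concrete, here is a quick check of my own (not from the original benchmark) on the one input where round-nearest misbehaves:

u = 0xffffffff                           # the worst-case draw: all bits set
Float32(u) == 2.0f0^32                   # true: native round-nearest rounds up, past typemax(UInt32)
trunctofloat(u) == prevfloat(2.0f0^32)   # true: truncation rounds down, staying below 2^32 as required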
randf1(t::Type) = randf1(Random.default_rng(), t)
function randf1(rng::Random.AbstractRNG, ::Type{T}) where T<:Base.IEEEFloat
    U = Base.uinttype(T) # same-width Unsigned
    scale = T(exp2(-8sizeof(U)))
    u = rand(rng, U)
    return trunctofloat(u) * scale # compute the value
end
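A note on exactness: since scale is a power of two, the final multiply only shifts the exponent and cannot introduce a second rounding. A small sketch of the resulting guarantees:

x = randf1(Float32)
0.0f0 <= x < 1.0f0   # always true
# every output is a multiple of 2.0f0^-32, and below 2.0f0^-8 every such
# multiple is reachable -- finer spacing near 0 than rand(Float32)'s 2.0f0^-24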
randf1_badrounding(t::Type) = randf1_badrounding(Random.default_rng(), t)
function randf1_badrounding(rng::Random.AbstractRNG, ::Type{T}) where T<:Base.IEEEFloat
    U = Base.uinttype(T) # same-width Unsigned
    scale = T(exp2(-8sizeof(U)))
    u = rand(rng, U)
    return T(u) * scale # compute the value
end
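The flaw in the native conversion is most visible at the endpoint: round-nearest can push the draw up to the next power of two, so the result can land exactly on 1.0, outside the intended [0, 1). Continuing the example above:

u = typemax(UInt32)                    # drawn once every 2^32 calls, on average
Float32(u) * Float32(exp2(-32))        # == 1.0f0 -- outside [0, 1)
trunctofloat(u) * Float32(exp2(-32))   # == prevfloat(1.0f0) -- correct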
randf(t::Type) = randf(Random.default_rng(), t)
function randf(rng::Random.AbstractRNG, ::Type{T}) where T<:Base.IEEEFloat
    U = Base.uinttype(T) # same-width Unsigned
    mineps = eps(zero(T)) # minimum spacing between values that we want to achieve
    umax = U(trunctofloat(typemax(U)))
    sparebits = trailing_zeros(umax) # how many "extra" bits we have per draw
    usebits = 8sizeof(U) - sparebits # how many bits we can possibly consume per draw
    redrawbelow = U(exp2(usebits)) # values smaller than this lose precision and must be redrawn
    redrawscale = T(exp2(-sparebits)) # if a value fails, scale by this and redraw
    scale = T(inv(redrawbelow))
    todrop = Int(log2(scale) - log2(mineps)) # how many octaves we need to drop via `redrawscale`
    numscales, leftover_bottom = fldmod(todrop, sparebits) # how many times we must apply `redrawscale` to get within reach of `mineps`
    remainingbits = leftover_bottom + usebits # bits remaining beyond the lowest usable scale
    lastlevelmask = typemax(U) >>> (8sizeof(U) - remainingbits) # mask for values at the lowest scale
    for _ in 1:numscales
        scale *= redrawscale # prepare to evaluate the next scale
        u = rand(rng, U)
        if u >= redrawbelow
            return trunctofloat(u) * scale # compute the value
        end
    end
    # if we made `scale` smaller it would underflow to 0
    u = rand(rng, U)
    u = u & lastlevelmask # we can't use all the bits at this level
    return trunctofloat(u) * mineps # compute the value
end
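To make the control flow concrete, here is a worked trace of the setup for Float32 (my own arithmetic, following the code above). Each loop level rejects with probability 2^-8, so the expected number of draws per call is barely more than one; after 15 levels the remainder is handled at the fixed floor eps(0.0f0) = 2^-149 with a 29-bit mask.

let U = UInt32, T = Float32
    umax = U(trunctofloat(typemax(U)))  # 0xffffff00
    sparebits = trailing_zeros(umax)    # 8
    usebits = 8sizeof(U) - sparebits    # 24 bits consumed by an accepted draw
    todrop = Int(log2(inv(T(exp2(usebits)))) - log2(eps(zero(T))))  # 125 octaves, from 2^-24 down to 2^-149
    fldmod(todrop, sparebits)           # (15, 5): 15 redraw levels, then a 29-bit floor level
end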
randf_badrounding(t::Type) = randf_badrounding(Random.default_rng(), t)
function randf_badrounding(rng::Random.AbstractRNG, ::Type{T}) where T<:Base.IEEEFloat
    U = Base.uinttype(T) # same-width Unsigned
    mineps = eps(zero(T)) # minimum spacing between values that we want to achieve
    umax = U(trunctofloat(typemax(U)))
    sparebits = trailing_zeros(umax) # how many "extra" bits we have per draw
    usebits = 8sizeof(U) - sparebits # how many bits we can possibly consume per draw
    redrawbelow = U(exp2(usebits)) # values smaller than this lose precision and must be redrawn
    redrawscale = T(exp2(-sparebits)) # if a value fails, scale by this and redraw
    scale = T(inv(redrawbelow))
    todrop = Int(log2(scale) - log2(mineps)) # how many octaves we need to drop via `redrawscale`
    numscales, leftover_bottom = fldmod(todrop, sparebits) # how many times we must apply `redrawscale` to get within reach of `mineps`
    remainingbits = leftover_bottom + usebits # bits remaining beyond the lowest usable scale
    lastlevelmask = typemax(U) >>> (8sizeof(U) - remainingbits) # mask for values at the lowest scale
    for _ in 1:numscales
        scale *= redrawscale # prepare to evaluate the next scale
        u = rand(rng, U)
        if u >= redrawbelow
            return T(u) * scale # compute the value
        end
    end
    # if we made `scale` smaller it would underflow to 0
    u = rand(rng, U)
    u = u & lastlevelmask # we can't use all the bits at this level
    return T(u) * mineps # compute the value
end
And here is a benchmark comparing them all on my 2020ish x86 laptop running Julia v1.9.1. I include the vectorized rand for reference, but really the comparison should be made against the scalarized versions (those using a comprehension).
julia> using BenchmarkTools

julia> @btime rand(Float32,2^16); # vectorized performance, for reference
  15.700 μs (2 allocations: 256.05 KiB)

julia> @btime [rand(Float32) for _ in 1:2^16]; # current (24 bits)
  77.400 μs (2 allocations: 256.05 KiB)

julia> @btime [randf1_badrounding(Float32) for _ in 1:2^16]; # invalid rounding (32 bits)
  79.200 μs (2 allocations: 256.05 KiB)

julia> @btime [randf1(Float32) for _ in 1:2^16]; # valid (32 bits)
  137.800 μs (2 allocations: 256.05 KiB)

julia> @btime [randf_badrounding(Float32) for _ in 1:2^16]; # invalid rounding (infinite bits, 149 in practice)
  192.200 μs (2 allocations: 256.05 KiB)

julia> @btime [randf(Float32) for _ in 1:2^16]; # valid (infinite bits, 149 in practice)
  265.800 μs (2 allocations: 256.05 KiB)

julia> @btime rand(Float64,2^16); # vectorized performance, for reference
  49.300 μs (2 allocations: 512.05 KiB)

julia> @btime [rand(Float64) for _ in 1:2^16]; # current (53 bits)
  74.300 μs (2 allocations: 512.05 KiB)

julia> @btime [randf1_badrounding(Float64) for _ in 1:2^16]; # invalid rounding (64 bits)
  95.100 μs (2 allocations: 512.05 KiB)

julia> @btime [randf1(Float64) for _ in 1:2^16]; # valid (64 bits)
  151.800 μs (2 allocations: 512.05 KiB)

julia> @btime [randf_badrounding(Float64) for _ in 1:2^16]; # invalid rounding (infinite bits, 1074 in practice)
  206.100 μs (2 allocations: 512.05 KiB)

julia> @btime [randf(Float64) for _ in 1:2^16]; # valid (infinite bits, 1074 in practice)
  305.200 μs (2 allocations: 512.05 KiB)
Note that these runtimes were quite variable, so the numbers should only be compared roughly. Benchmark more on your own, if you like.
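If you do, BenchmarkTools' @benchmark macro is handy here: it reports a full timing distribution (minimum, median, mean, and a histogram) rather than the single time printed by @btime. For example:

julia> @benchmark [randf(Float32) for _ in 1:2^16]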
Since the vectorized implementations largely just parallelize this entire function at the register level (and I have shown that this is possible), I expect the scalar comparison to be somewhat indicative of a comparison of vectorized algorithms.
The _badrounding variants are only here to show a limit on how things might perform if a faster rounding scheme were used (either by changing the rounding mode or by a better implementation). They are not suitable for use as-is, but they show that significant performance is currently being lost to rounding.
I will call the scalarized rand calls (24 or 53 bits) the baseline. Relative to the baseline, the "full-width" versions carry a small-to-1.5x performance penalty (if fast rounding is achieved) or about 2x (with my current rounding); for example, 137.8 μs / 77.4 μs ≈ 1.8x for Float32. Relative to the baseline, the "infinite precision" variants carry a 2.5x (fast rounding) to 4x (my current rounding) performance penalty.
These gaps could shrink as people make superior implementations.
A 4x regression is getting to be a lot. But I could be convinced to eat a 2x or maybe 3x penalty to end this discussion once and for all. Or, if that's too much, we could make this functionality a separate function and let users choose between speed and precision.
This also shows that there is very little space (roughly a 2x performance gap to attempt to bridge) for implementations with precision wider than native but less than infinite.