This is a fun problem. Here’s my crack at it:
function randf(
    ::Type{T},
    u::UInt64 = rand(UInt64),
) where {T<:Union{Float16,Float32,Float64}}
    # only depends on T
    s = 8*sizeof(T)
    p = Base._precision(T) - 1          # stored significand bits (excluding the implicit one)
    U = T === Float16 ? UInt16 :
        T === Float32 ? UInt32 :
        T === Float64 ? UInt64 : error()
    m = T === Float16 ? one(T) :        # scale factor: inverse of the pre-scaling upper bound
        inv(reinterpret(T, U(65 << p)))
    # real computation
    z = leading_zeros(u)
    if T !== Float16
        b = (u << (z + 1) >>> (64 - p)) % U   # bits after the leading one, left-aligned in the significand
        e = 64 - z                            # geometric exponent in 0:64
    else
        # b = (u << min(14, z + 1) >>> (64 - p)) % U
        b = (u % U) & 0x03ff                  # low 10 bits as the significand
        e = max(0, 14 - z)                    # geometric exponent in 0:14
    end
    b |= (e % U) << p                         # assemble exponent and significand fields
    x = reinterpret(T, b)
    x *= m                                    # rescale into [0, 1); exact since m is a power of two
end
randf(T::Type, u::Integer) = randf(T, UInt64(u))
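Before digging into the details, here’s a trivial smoke test of randf (not part of the implementation, just a sanity check one can run against the definition above):

# Quick sanity check: outputs should land in [0, 1) with mean near 1/2.
xs = [randf(Float64) for _ in 1:10^6]
@assert all(x -> 0.0 <= x < 1.0, xs)
@show sum(xs) / length(xs)   # should be close to 0.5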
It looks like a lot of code but it boils down to not so much:
julia> @code_native debuginfo=:none randf(Float64, rand(UInt64))
clz x8, x0
add x9, x8, #1
lsl x9, x0, x9
lsr x9, x9, #12
cmp x0, #2
csel x9, xzr, x9, lo
sub x8, x9, x8, lsl #52
mov x9, #288230376151711744
add x8, x8, x9
fmov d0, x8
mov x8, #8921630861820952576
fmov d1, x8
fmul d0, d0, d1
ret
julia> @code_native debuginfo=:none randf(Float16, rand(UInt64))
clz x8, x0
mov w9, #14336
sub w8, w9, w8, lsl #10
tst x0, #0xfffe000000000000
csel w8, w8, wzr, ne
bfxil w8, w0, #0, #10
fmov s0, w8
ret
I edited out some junk asm comments before and after the actual asm code. The code for Float32 is very similar to Float64. This implementation is also not much of a slowdown compared to our current rand for floats:
julia> using BenchmarkTools
julia> @btime rand(Float64);
2.708 ns (0 allocations: 0 bytes)
julia> @btime randf(Float64);
2.833 ns (0 allocations: 0 bytes)
julia> @btime rand(Float32);
2.791 ns (0 allocations: 0 bytes)
julia> @btime randf(Float32);
2.917 ns (0 allocations: 0 bytes)
julia> @btime rand(Float16);
2.666 ns (0 allocations: 0 bytes)
julia> @btime randf(Float16);
2.916 ns (0 allocations: 0 bytes)
So about 4% slower for Float64 and Float32 and 9% slower for Float16. I didn’t benchmark vectorized versions because plugging this into our Sampler system was more complicated than I managed to work through, but I can’t see any reason why this wouldn’t vectorize as long as there are vectorized clz instructions, which there are on ARM and on x86 with AVX512 or newer. Older x86 machines can get by without this vectorizing.
What about behavior? I’ll address the Float64 and Float32 versions first and then Float16 separately since it’s a bit different.
At the highest level, it uses the leading zeros to pick an exponent from 0-64 for Float64 and Float32 or from 0-14 for Float16, with a geometric distribution where the highest exponent has probability 1/2. It uses the bits after the leading one to fill the significand. For Float64 and Float32 it takes the bits right after the leading bit and fills the significand from left to right; for Float16 it just uses the last 10 bits of the number. For the higher precisions, where there aren’t always enough trailing bits to fill the significand, the trailing bits in the significand are zeros.

It combines the exponent and significand to get a float in the range [0, w) where w = reinterpret(T, U(65 << p)) for Float64 and Float32 and w = 1 for Float16. Finally, it scales that value up to [0, 1) via multiplication by m = inv(w), which is exact because w is a power of two, so m is also a power of two and multiplication by it only changes the exponent. For Float16 this is a no-op, which the compiler recognizes since m is a compile-time constant depending only on the type.
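To make that concrete, here are a few seeds whose outputs are easy to verify by hand (a quick spot check of the description above, not part of the analysis):

# For Float64: the leading zeros pick the exponent, the bits after the leading one fill the
# significand, and the final multiply by m rescales into [0, 1).
@assert randf(Float64, UInt64(1) << 63) == 0.5    # z = 0, so e = 64; no significand bits set
@assert randf(Float64, UInt64(3) << 62) == 0.75   # z = 0, so e = 64; leading significand bit set
@assert randf(Float64, UInt64(1) << 62) == 0.25   # z = 1, so e = 63; no significand bits set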
One thing worth noting about this is that, as a function from UInt64 --> Float{32,64}, randf is monotonically non-decreasing: i.e. u ≤ v ⟹ randf(T, u) ≤ randf(T, v). So we can explore the smallest values it can possibly generate by looking at small “seeds”:
julia> randf(Float64, 0)
0.0
julia> randf(Float64, 1)
5.421010862427522e-20
julia> randf(Float64, 2)
1.0842021724855044e-19
julia> randf(Float64, 3)
1.6263032587282567e-19
The probability of getting exactly zero is 1/2^64 and you’ll note that exp2(-64) == 5.421010862427522e-20, which is exactly the value of randf(Float64, 1). So if we do the Bernoulli thing with rand(Float64) < exp2(-64) we get the exact right probability. At the low end of the outputs, we don’t have a lot of “random” bits to go in the significand:
- 0 and 1: there are no significand bits to use (they have different exponents)
- 2 and 3: have the same exponent and differ by only the leading significand bit
- 4-7: have the same exponent but 2 bits of significand to distinguish them

And so on. By the time you get up to u = 2^52 you have enough random bits after the leading bit to fill the entire significand.
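These low-end claims are easy to spot check against the definition above: for small seeds the output is exactly u/2^64, and the map really is monotone (a quick, non-exhaustive check):

# Small seeds map to exactly u/2^64, and randf is monotone in u.
@assert all(randf(Float64, u) == u * exp2(-64) for u in UInt64(0):UInt64(10_000))
us = sort!(rand(UInt64, 1_000))
@assert issorted([randf(Float64, u) for u in us])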
We can also explore the largest values it can generate by looking at large seeds:
julia> randf(Float64, typemax(UInt64))
0.9999999999999999
julia> randf(Float64, typemax(UInt64)-1)
0.9999999999999999
julia> randf(Float64, (typemax(UInt64) - 0) << 11)
0.9999999999999999
julia> randf(Float64, (typemax(UInt64) - 1) << 11)
0.9999999999999998
julia> randf(Float64, (typemax(UInt64) - 2) << 11)
0.9999999999999997
At the low end, every change of the input produces a different output but at the high end, many contiguous inputs map to the same output: 2^11 contiguous values at the top. This is necessary in order to produce the correct uniform distribution on [0, 1) given the uneven quantization of floats. The probability of getting 0.9999999999999999 is 2^11/2^64, which is 1/2^53, and you’ll note that 1/2^53 == 1 - 0.9999999999999999, so the probabilities look right up at the top too.
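Here’s a quick way to confirm the preimage count at the top (a spot check, not a proof):

# prevfloat(1.0) is the 0.9999999999999999 shown above; exactly the 2^11 largest seeds map to it.
@assert randf(Float64, typemax(UInt64)) == prevfloat(1.0)
@assert randf(Float64, typemax(UInt64) - 0x7ff) == prevfloat(1.0)            # 2^11 - 1 below the top seed
@assert randf(Float64, typemax(UInt64) - 0x800) == prevfloat(prevfloat(1.0)) # 2^11 below: next float down
@assert 1 - prevfloat(1.0) == exp2(-53)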
Things are very similar for Float32. Low end:
julia> randf(Float32, 0)
0.0f0
julia> randf(Float32, 1)
5.421011f-20
julia> randf(Float32, 2)
1.0842022f-19
julia> randf(Float32, 3)
1.6263033f-19
High end:
julia> randf(Float32, typemax(UInt64))
0.99999994f0
julia> randf(Float32, (typemax(UInt64) - 0) << 40)
0.99999994f0
julia> randf(Float32, (typemax(UInt64) - 1) << 40)
0.9999999f0
julia> randf(Float32, (typemax(UInt64) - 2) << 40)
0.9999998f0
The probability of getting exactly zero is still 1/2^64, just like Float64, and moreover randf(Float32, 1) == Float32(randf(Float64, 1)), so the Bernoulli probabilities work out the same. It is not generally the case that randf(Float32, u) == Float32(randf(Float64, u)) for all u, but they are always within an ULP of each other and the deviation is due to rounding when converting Float64 to Float32. We’ve already seen that this is a problem with trying to generate Float32 random values from Float64. Generating the Float32 values directly avoids the last-bit bias caused by rounding.
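For a concrete illustration of that rounding problem (my example, not taken from the measurements above): the largest Float64 in [0, 1) rounds up to 1.0f0 when converted, whereas the direct Float32 path tops out strictly below 1:

@assert Float32(prevfloat(1.0)) == 1.0f0          # rounding the top Float64 output overshoots to 1
@assert randf(Float32, typemax(UInt64)) < 1.0f0   # the direct Float32 construction never does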
The Float16 implementation is a little different, so let’s look at that as well. For Float16 I didn’t make randf(Float16, u) a monotonic function of u for a few reasons:
- In higher precisions, you might not have enough random bits after the leading one, so you want to shift the trailing bits to the front of the significand; this naturally makes randf monotonic.
- For Float16 we don’t need to do this since there are always enough bits: 14 bits to pick the exponent, 10 bits for the significand; heck, you’ve got 40 bits to spare.
- In Float16 you generate all values, including subnormals, which behave a bit differently and break the approach used for the higher precisions, so we do something simpler, which doesn’t end up being monotonic (a concrete non-monotone pair is shown right after this list).
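Here’s that non-monotone pair. Both seeds land in the clamped-exponent (subnormal) branch, but the significand comes from the low 10 bits, so the smaller seed produces the larger output:

@assert randf(Float16, UInt64(1023)) > randf(Float16, UInt64(1024))   # 1023 -> largest subnormal, 1024 -> 0
@assert randf(Float16, UInt64(1024)) === Float16(0.0)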
Let’s look at some low inputs for Float16:
julia> randf(Float16, 0)
Float16(0.0)
julia> randf(Float16, 1)
Float16(6.0e-8)
julia> randf(Float16, 2)
Float16(1.0e-7)
julia> randf(Float16, 3)
Float16(2.0e-7)
If randf were monotonic in this case, this would be bad since Float16(6.0e-8) is way bigger than 1/2^64 (it’s about 1/2^24). Fortunately, there are a lot of seed values that produce Float16(0.0): we get zero as long as the first 14 and last 10 bits of u are zeros; there are 2^(64-14-10) = 2^40 such values. Lo and behold, 2^40/2^64 is exactly Float16(6.0e-8), so the chance of randf(Float16) < Float16(6.0e-8) is Float16(6.0e-8) as desired.
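A quick spot check of that preimage argument (illustrative only):

# Any seed with its top 14 and bottom 10 bits zero maps to Float16(0.0), regardless of the middle 40 bits.
u = (rand(UInt64) >> 24) << 10        # random middle 40 bits, zeros elsewhere
@assert randf(Float16, u) === Float16(0.0)
@assert Float16(exp2(-24)) === Float16(6.0e-8)   # 2^40/2^64 is exactly the smallest positive Float16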
At the high end, we have this:
julia> randf(Float16, typemax(UInt64))
Float16(0.9995)
julia> randf(Float16, typemax(UInt64)-1)
Float16(0.999)
julia> randf(Float16, typemax(UInt64)-2)
Float16(0.9985)
julia> randf(Float16, typemax(UInt64)-3)
Float16(0.998)
Similar to the low end, non-monotonicity means we get many duplicates of these values. We want the probability of getting Float16(0.9995) to be equal to 1 - Float16(0.9995) == Float16(0.0004883). The output is going to be Float16(0.9995) as long as the first bit and the last 10 bits of u are all ones; there are 2^(64-1-10) = 2^53 such values, and 2^53/2^64 is exactly Float16(0.0004883) as desired.
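And the corresponding spot check at the top (again illustrative, not exhaustive):

u = rand(UInt64) | (UInt64(1) << 63) | UInt64(0x3ff)   # leading bit and low 10 bits forced to one
@assert randf(Float16, u) === prevfloat(Float16(1.0))               # i.e. Float16(0.9995)
@assert Float16(1) - prevfloat(Float16(1.0)) === Float16(exp2(-11)) # == Float16(0.0004883) == 2^53/2^64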
Interlude: Are we checking the uniformity of the distribution the right way?
I’ve been checking for P(rand() < p) = p. This is a viable condition for uniformity with real numbers, but then again, so is this: P(rand() ≤ p) = p. For real numbers these are the same condition because the probability of getting exactly p is zero. But for floats, that’s not the case, and P(rand() < p) = p is a different condition than P(rand() ≤ p) = p. This doesn’t matter much for moderately sized p but for very small p it can make a big difference. As we saw above, we have P(rand(T) < 1/2^64) = 1/2^64 for Float32 and Float64, which is what we want, but we also have P(rand(T) ≤ 1/2^64) = 1/2^63, which is twice as big as 1/2^64.
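To see where the factor of two comes from in terms of seeds (a spot check against the randf above):

@assert randf(Float64, UInt64(0)) <  exp2(-64)   # the only seed whose output is below 2^-64
@assert randf(Float64, UInt64(1)) == exp2(-64)   # one more seed lands exactly on 2^-64
@assert randf(Float64, UInt64(2)) >  exp2(-64)   # so the ≤ event has twice the preimage of the < event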
So which condition should we use for random floating point generation? By one condition randf is giving exactly the right uniform distribution, by the other it’s overweighting some values by a factor of two! My argument is that < is the right condition for [0, 1) based on looking at the edge case of zero. If you use P(rand() ≤ p) = p and take p = 0 then you would have P(rand() ≤ 0) = 0, which is guaranteed to fail since we’ve stipulated that 0 can be produced by rand(). If, on the other hand, we had a support of (0, 1] then we should use ≤ by a symmetrical argument: using < would give P(rand() < 1) = 1 - P(rand() = 1) < 1 since we assumed that 1 is a possible output.
There’s also a question of which p values the identity can be expected to hold for. If ε is the smallest non-zero value that your function can produce, then P(rand() < ε/2) = P(rand() < ε) = P(rand() = 0) = ε, which is twice as big as the identity requires. More generally, if a < x < b where a and b are adjacent outputs of rand() and x is a real number between them (and therefore not a possible output since they’re adjacent), then, assuming the identity holds at the output value b, we have the following:

P(rand() < x) = P(rand() < b) = b > x
What this tells us is that in between adjacent output points, we cannot possibly expect the uniformity condition to hold. So the best we can hope for is that the condition holds at all the output values of the random generator.
To summarize my take, we should check the uniformity condition at all the output values of the random value generator and the condition we should use depends on the interval:
- If you’re sampling from [0, 1) use P(rand() < p) = p;
- If you’re sampling from (0, 1] use P(rand() ≤ p) = p.
If you’re sampling from [0, 1] then you’re kind of screwed as neither uniformity condition works, and if you’re sampling from (0, 1) then I’m not sure what you should do. In any case, I think we probably made a good choice by sticking with the traditional [0, 1), and it may not be an accident that this is the traditional support.
Somewhat bizarrely, defining rand() = 0.0 would be a perfectly good uniform RNG on [0, 1) by this criterion. Similarly, rand() = rand(Bool) ? 0.0 : 0.5 would also be perfectly uniform. All that means is that we also have to consider what the support of the function is when evaluating an RNG, but given a support set, I think this is the right way to evaluate whether the distribution is uniform or not.
Proof of Uniformity
Now that we’ve settled on a uniformity condition, can we actually prove that randf is uniform by this definition instead of just spot checking some examples? I haven’t worked this out yet and this is long enough, so I’ll post it now and work on the proof…