Output distribution of rand(Float32) and rand(Float64), thread 2

This is a fun problem. Here’s my crack at it:

function randf(
    ::Type{T},
    u::UInt64 = rand(UInt64),
) where {T<:Union{Float16,Float32,Float64}}
    # only depends on T
    s = 8*sizeof(T)
    p = Base._precision(T) - 1
    U = T === Float16 ? UInt16 :
        T === Float32 ? UInt32 :
        T === Float64 ? UInt64 : error()
    m = T === Float16 ? one(T) :
        inv(reinterpret(T, U(65 << p)))

    # real computation
    z = leading_zeros(u)
    if T !== Float16
        # significand: the p bits right after the leading one, left-aligned
        b = (u << (z + 1) >>> (64 - p)) % U
        # exponent: geometric, the highest value (64) has probability 1/2
        e = 64 - z
    else
        # b = (u << min(14, z + 1) >>> (64 - p)) % U
        # significand: just the last 10 bits of u; exponent: geometric from 0 to 14
        b = (u % U) & 0x03ff
        e = max(0, 14 - z)
    end
    # assemble the raw float, then scale into [0, 1) by an exact power-of-two multiply
    b |= (e % U) << p
    x = reinterpret(T, b)
    x *= m
end

randf(T::Type, u::Integer) = randf(T, UInt64(u))

It looks like a lot of code but it boils down to not so much:

julia> @code_native debuginfo=:none randf(Float64, rand(UInt64))
	clz	x8, x0
	add	x9, x8, #1
	lsl	x9, x0, x9
	lsr	x9, x9, #12
	cmp	x0, #2
	csel	x9, xzr, x9, lo
	sub	x8, x9, x8, lsl #52
	mov	x9, #288230376151711744
	add	x8, x8, x9
	fmov	d0, x8
	mov	x8, #8921630861820952576
	fmov	d1, x8
	fmul	d0, d0, d1
	ret

julia> @code_native debuginfo=:none randf(Float16, rand(UInt64))
	clz	x8, x0
	mov	w9, #14336
	sub	w8, w9, w8, lsl #10
	tst	x0, #0xfffe000000000000
	csel	w8, w8, wzr, ne
	bfxil	w8, w0, #0, #10
	fmov	s0, w8
	ret

I edited out some junk asm comments before and after the actual asm code. The code for Float32 is very similar to the Float64 code. This implementation is also not much of a slowdown compared to our current rand for floats:

julia> using BenchmarkTools

julia> @btime rand(Float64);
  2.708 ns (0 allocations: 0 bytes)

julia> @btime randf(Float64);
  2.833 ns (0 allocations: 0 bytes)

julia> @btime rand(Float32);
  2.791 ns (0 allocations: 0 bytes)

julia> @btime randf(Float32);
  2.917 ns (0 allocations: 0 bytes)

julia> @btime rand(Float16);
  2.666 ns (0 allocations: 0 bytes)

julia> @btime randf(Float16);
  2.916 ns (0 allocations: 0 bytes)

So about 4% slower for Float64 and Float32 and 9% slower for Float16. I didn’t benchmark vectorized versions because plugging this into our Sampler system was more involved than I managed to get through, but I can’t see any reason why this wouldn’t vectorize as long as there are vectorized clz instructions, which there are on ARM and on x86 with AVX512 or newer. Older x86 machines will just have to get by without this vectorizing.

What about behavior? I’ll address the Float64 and Float32 versions first and then the Float16 separately since it’s a bit different.

At the highest level, it uses the leading zeros of u to pick an exponent from 0-64 for Float64 and Float32, or from 0-14 for Float16, with a geometric distribution where the highest exponent has probability 1/2. It uses the bits after the leading one to fill the significand. For Float64 and Float32 it takes the bits right after the leading bit and fills the significand from left to right; for Float16 it just uses the last 10 bits of the number. For the higher precisions, when there aren’t enough bits after the leading one to fill the significand, the remaining low significand bits are zeros.

It combines the exponent and significand to get a float in the range [0, w) where w = reinterpret(T, U(65 << p)) for Float64 and Float32 and w = 1 for Float16. Finally, it scales that value up to [0, 1) via multiplication by m = inv(w), which is exact because w is a power of two, so m is also a power of two and multiplication by it only changes the exponent. For Float16 this is a no-op, which the compiler recognizes since m is a compile-time constant depending only on the type.
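To make this concrete, here’s a worked example stepping through the same computation randf does for Float64 (the particular input value is just an arbitrary choice for illustration):

u = 0x0000_4fe9_a308_c721          # arbitrary example input; it has 17 leading zeros
z = leading_zeros(u)               # 17
e = 64 - z                         # 47, the geometrically distributed exponent
b = u << (z + 1) >>> 12            # the 46 bits after the leading one, left-aligned in the 52-bit significand
x = reinterpret(Float64, b | UInt64(e) << 52)    # a float in [0, w) where w = 2^-958
x * exp2(958) == randf(Float64, u)               # expected: true; the scaling multiply is an exact power of two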

One thing worth noting about this is that, as a function from UInt64 --> Float{32,64}, randf is monotonically non-decreasing: i.e. u ≤ v ⟹ randf(T, u) ≤ randf(T, v). So we can explore the smallest values it can possibly generate by looking at small “seeds”:

julia> randf(Float64, 0)
0.0

julia> randf(Float64, 1)
5.421010862427522e-20

julia> randf(Float64, 2)
1.0842021724855044e-19

julia> randf(Float64, 3)
1.6263032587282567e-19

The probability of getting exactly zero is 1/2^64 and you’ll note that exp2(-64) == 5.421010862427522e-20, which is exactly the value of randf(Float64, 1). So if we do the Bernoulli thing with rand(Float64) < exp2(-64) we get exactly the right probability. At the low end of the outputs, we don’t have a lot of “random” bits to go in the significand:

  • 0 and 1 have no significand bits to use (and different exponents)
  • 2 and 3 have the same exponent and differ by only the leading significand bit
  • 4-7 have the same exponent and 2 bits of significand to distinguish them

And so on. By the time you get up to u = 2^52 you have enough random bits after the leading bit to fill the entire significand.
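Here’s a quick spot check of that (using the randf above): as long as u fits in 53 bits, every bit after the leading one fits in the significand, so the output should be exactly u * 2^-64.

# both sides are exact here: u ≤ 2^53 converts to Float64 exactly and the scaling is a power of two
all(randf(Float64, u) == u * exp2(-64) for u in (0, 1, 2, 3, 4, 7, 2^52, 2^53 - 1))   # expected: true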

We can also explore the largest values it can generate by looking at large seeds:

julia> randf(Float64, typemax(UInt64))
0.9999999999999999

julia> randf(Float64, typemax(UInt64)-1)
0.9999999999999999

julia> randf(Float64, (typemax(UInt64) - 0) << 11)
0.9999999999999999

julia> randf(Float64, (typemax(UInt64) - 1) << 11)
0.9999999999999998

julia> randf(Float64, (typemax(UInt64) - 2) << 11)
0.9999999999999997

At the low end, every change of the input produces a different output but at the high end, many contiguous inputs map to the same output—2^11 contiguous values at the top. This is necessary in order to produce the correct uniform distribution on [0, 1) given the uneven quantization of floats. The probability of getting 0.9999999999999999 is 2^11/2^64, which is 1/2^53 and you’ll note that 1/2^53 == 1-0.9999999999999999 so the probabilities look right up at the top too.
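Here’s a small check of that preimage count:

top = randf(Float64, typemax(UInt64))                                # 0.9999999999999999 == prevfloat(1.0)
all(randf(Float64, typemax(UInt64) - k) == top for k in 0:2^11-1)    # expected: true; 2^11 inputs share the top output
randf(Float64, typemax(UInt64) - UInt64(2^11)) == top                # expected: false; the next input down gives 0.9999999999999998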

Things are very similar for Float32. Low end:

julia> randf(Float32, 0)
0.0f0

julia> randf(Float32, 1)
5.421011f-20

julia> randf(Float32, 2)
1.0842022f-19

julia> randf(Float32, 3)
1.6263033f-19

High end:

julia> randf(Float32, typemax(UInt64))
0.99999994f0

julia> randf(Float32, (typemax(UInt64) - 0) << 40)
0.99999994f0

julia> randf(Float32, (typemax(UInt64) - 1) << 40)
0.9999999f0

julia> randf(Float32, (typemax(UInt64) - 2) << 40)
0.9999998f0

The probability of getting exactly zero is still 1/2^64, just like Float64, and moreover randf(Float32, 1) == Float32(randf(Float64, 1)), so the Bernoulli probabilities work out the same. It is not generally the case that randf(Float32, u) == Float32(randf(Float64, u)) for all u, but they are always within one ulp of each other, and the deviation is due to rounding when converting Float64 to Float32. We’ve already seen that this is a problem with trying to generate Float32 random values from Float64. Generating the Float32 values directly avoids the last-bit bias caused by rounding.
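Here’s a quick illustration of that last point; the topmost output makes the rounding problem with going through Float64 especially visible:

u = typemax(UInt64)
randf(Float32, u)              # 0.99999994f0, i.e. prevfloat(1f0)
Float32(randf(Float64, u))     # 1.0f0: the conversion rounds up and escapes [0, 1)
# over random inputs the two should never differ by more than one ulp of the Float32 value:
all(abs(randf(Float32, v) - Float32(randf(Float64, v))) <= eps(randf(Float32, v))
    for v in rand(UInt64, 10^6))   # expected: true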

The Float16 implementation is a little different, so let’s look at that as well. For Float16 I didn’t make randf(Float16, u) a monotonic function of u for a few reasons:

  • In higher precisions, you might not have enough random bits after the leading one, so you want to shift the trailing bits to the front of the significand; this naturally makes randf monotonic.
  • For Float16 we don’t need to do this since there are always enough bits: 15 bits to pick the exponent and 10 bits for the significand, with 39 bits to spare.
  • For Float16 we generate all values, including subnormals, which behave a bit differently and break the approach used for the higher precisions, so we do something simpler, which doesn’t end up being monotonic (a quick demonstration follows below).
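A minimal demonstration of the non-monotonicity (these particular inputs are just a convenient pair):

randf(Float16, 0x3ff)    # last 10 bits all ones, first 14 bits zero: the largest subnormal, about 6.1e-5
randf(Float16, 0x400)    # a larger input, but its first 14 and last 10 bits are zero: Float16(0.0)
randf(Float16, 0x3ff) > randf(Float16, 0x400)   # expected: true, despite 0x3ff < 0x400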

Let’s look at some low inputs for Float16:

julia> randf(Float16, 0)
Float16(0.0)

julia> randf(Float16, 1)
Float16(6.0e-8)

julia> randf(Float16, 2)
Float16(1.0e-7)

julia> randf(Float16, 3)
Float16(2.0e-7)

If randf were monotonic in this case, this would be bad, since Float16(6.0e-8) is way bigger than 1/2^64: it’s 1/2^24. Fortunately, there are a lot of seed values that produce Float16(0.0): we get zero as long as the first 14 and the last 10 bits of u are zeros, and there are 2^(64-14-10) = 2^40 such values. Lo and behold, 2^40/2^64 is exactly Float16(6.0e-8), so the chance of randf(Float16) < Float16(6.0e-8) is Float16(6.0e-8), as desired.
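A couple of quick checks of those counts:

randf(Float16, 1) == nextfloat(Float16(0))         # expected: true; the smallest subnormal, 2^-24
Float16(exp2(-24)) == randf(Float16, 1)            # expected: true; 2^40/2^64 == 2^-24 exactly
randf(Float16, UInt64(0x3ff) << 20) == Float16(0)  # expected: true; the middle 40 bits don't matter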

At the high end, we have this:

julia> randf(Float16, typemax(UInt64))
Float16(0.9995)

julia> randf(Float16, typemax(UInt64)-1)
Float16(0.999)

julia> randf(Float16, typemax(UInt64)-2)
Float16(0.9985)

julia> randf(Float16, typemax(UInt64)-3)
Float16(0.998)

Similar to the low end, non-monotonicity means we get many duplicates of these values. We want the probability of getting Float16(0.9995) to be equal to 1 - Float16(0.9995) == Float16(0.0004883). The output is going to be Float16(0.9995) as long as the first bit and the last 10 bits of u are all ones; there are 2^(64-1-10) = 2^53 such values, and 2^53/2^64 is exactly Float16(0.0004883), as desired.
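And the analogous checks at the top:

top16 = randf(Float16, typemax(UInt64))              # Float16(0.9995) == prevfloat(Float16(1))
1 - top16 == Float16(exp2(-11))                      # expected: true; 2^53/2^64 == 2^-11 exactly
randf(Float16, UInt64(1) << 63 | 0x03ff) == top16    # expected: true; only the first bit and last 10 bits matter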

Interlude: Are we checking the uniformity of the distribution the right way?

I’ve been checking for P(rand() < p) = p. This is a viable condition for uniformity with real numbers, but then again, so is this: P(rand() ≤ p) = p. For real numbers these are the same condition because the probability of getting exactly p is zero. But for floats, that’s not the case, and P(rand() < p) = p is a different condition than P(rand() ≤ p) = p. This doesn’t matter much for moderately sized p, but for very small p it can make a big difference. As we saw above, we have P(rand(T) < 1/2^64) = 1/2^64 for Float32 and Float64, which is what we want, but we also have P(rand(T) ≤ 1/2^64) = 2/2^64 = 1/2^63, which is twice as big as 1/2^64.
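To spell that out with the preimage counts from earlier (just arithmetic, assuming a uniform UInt64 input):

# only u == 0 maps strictly below 2^-64, but both u == 0 and u == 1 map to values ≤ 2^-64
p_lt = 1 / exp2(64)   # P(randf(Float64) <  2^-64) == 2^-64, exactly the target
p_le = 2 / exp2(64)   # P(randf(Float64) <= 2^-64) == 2^-63, twice the target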

So which condition should we use for random floating point generation? By one condition randf is giving exactly the right uniform distribution, by the other it’s overweighting some values by a factor of two! My argument is that < is the right condition for [0, 1), based on looking at the edge case of zero. If you use P(rand() ≤ p) = p and take p = 0 then you would have P(rand() ≤ 0) = 0, which is guaranteed to fail since we’ve stipulated that 0 can be produced by rand(). If, on the other hand, we had a support of (0, 1] then we should use ≤ by a symmetrical argument: using < would give P(rand() < 1) = 1 - P(rand() = 1) < 1, since we assumed that 1 is a possible output.

There’s also a question of which p values the identity can be expected to hold for. If ε is the smallest non-zero value that your function can produce, then P(rand() < ε/2) = P(rand() < ε) = P(rand() = 0) = ε which is twice as big as the identity requires. More generally, if a < x < b where a and b are adjacent outputs of rand() and x is a real number between them (and therefore not a possible output since they’re adjacent), then we have the following:

  • P(rand() < x) = P(rand() < b), since there are no possible outputs in (a, b); if the condition holds at the output b, this equals b, which is strictly greater than x

What this tells us is that in between adjacent output points, we cannot possibly expect the uniformity condition to hold. So the best we can hope for is that the condition holds at all the output values of the random generator.
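Here’s a concrete instance of this using two adjacent outputs at the low end, where the exact CDF is easy to write down. The cdf_lt helper is just for this illustration and is only valid in the region where the outputs are exactly u * 2^-64 with one preimage each:

cdf_lt(x) = ceil(x * exp2(64)) / exp2(64)   # P(randf(Float64) < x) for small x, per the low-end analysis above

a = randf(Float64, 5)    # 5 * 2^-64, a possible output
b = randf(Float64, 6)    # 6 * 2^-64, the next possible output
x = (a + b) / 2          # representable as a Float64, but not a possible output

cdf_lt(a) == a           # expected: true; the condition holds at the output a
cdf_lt(b) == b           # expected: true; and at the output b
cdf_lt(x) == x           # expected: false; between them it equals b, which is greater than x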

To summarize my take, we should check the uniformity condition at all the output values of the random value generator and the condition we should use depends on the interval:

  • If you’re sampling from [0, 1) use P(rand() < p) = p;
  • If you’re sampling from (0, 1] use P(rand() ≤ p) = p.

If you’re sampling from [0, 1] then you’re kind of screwed, as neither uniformity condition works, and if you’re sampling from (0, 1) then I’m not sure what you should do. In any case, I think we probably made a good choice by sticking with the traditional [0, 1), and it may not be an accident that this is the traditional support.

Somewhat bizarrely, defining rand() = 0.0 would be a perfectly good uniform RNG on [0, 1) by this criterion. Similarly, rand() = rand(Bool) ? 0.0 : 0.5 would also be perfectly uniform. All that means is that we also have to consider what the support of the function is when evaluating an RNG, but given a support set, I think this is the right way to evaluate whether the distribution is uniform or not.

Proof of Uniformity

Now that we’ve settled a uniformity condition, can we actually prove that randf is uniform by this definition instead of just spot checking some examples? I haven’t worked this out yet and this is long enough, so I’ll post it now and work on the proof…
