Output distribution of rand(Float32) and rand(Float64), thread 2

Indeed, structure from breaking ties to even in Float32(rand(Float64)) is present for all values larger than 2^{-29}, and most pronounced in the first octave between 2^{-29} and 2^{-28}. Notice the 3:1 ratio of values that round to “even” vs. “odd”:

julia> using UnicodePlots

julia> f32s = Float32.([0x1.0p-29 + i * 0x1.0p-53 for i in 0:31]);

julia> histogram(f32s; nbins=length(unique(f32s)))
                            ┌                                        ┐
   [1.86265e-9, 1.86265e-9) ┤████████████████████████▋ 2
   [1.86265e-9, 1.86265e-9) ┤████████████▍ 1
   [1.86265e-9, 1.86265e-9) ┤█████████████████████████████████████  3
   [1.86265e-9, 1.86265e-9) ┤████████████▍ 1
   [1.86265e-9, 1.86265e-9) ┤█████████████████████████████████████  3
   [1.86265e-9, 1.86265e-9) ┤████████████▍ 1
   [1.86265e-9, 1.86265e-9) ┤█████████████████████████████████████  3
   [1.86265e-9, 1.86265e-9) ┤████████████▍ 1
   [1.86265e-9, 1.86265e-9) ┤█████████████████████████████████████  3
   [1.86265e-9, 1.86265e-9) ┤████████████▍ 1
   [1.86265e-9, 1.86265e-9) ┤  0
   [1.86265e-9, 1.86265e-9) ┤█████████████████████████████████████  3
   [1.86265e-9, 1.86265e-9) ┤████████████▍ 1
   [1.86265e-9, 1.86265e-9) ┤█████████████████████████████████████  3
   [1.86265e-9, 1.86265e-9) ┤████████████▍ 1
   [1.86265e-9, 1.86265e-9) ┤█████████████████████████████████████  3
   [1.86265e-9, 1.86265e-9) ┤████████████▍ 1
   [1.86265e-9, 1.86265e-9) ┤████████████▍ 1
   [1.86265e-9, 1.86265e-9) ┤  0
                            └                                        ┘
                                             Frequency
1 Like

Flipping through the manual, it appears that TestU01 can only test Float64 (i.e., double) generators, not Float32. Certainly the RNGTest.jl wrapper only supports Float64.

Oh, I didn’t know someone already did! Nice!

Yes, but that shouldn’t be a problem for testing the statistical distribution of the PRNG for correlation, right? Float64 can hold all Float32 after all, and if I followed the discussion above correctly, any improvement in the density of rand(Float32) around small values would also be applicable (in principle) to rand(Float64).

In any case, their paper is also quite thorough and bitsize agnostic as far as I can tell, so it shouldn’t be too difficult to apply the same style of tests to Float32, even if the exact implementation in TestU01 can’t be used for some reason.

1 Like

Rounding can also be problematic if the rounding mode of the FPU is changed. If a specific rounding mode is required to give a correct result, it ought to be specified explicitly.

Simple and fast way to produce rand(Float32) values:

function randf32()
    u = rand(UInt64)
    lz = leading_zeros(u)
    res = (UInt32(126-lz)<<23) | UInt32((u ⊻ ((UInt64(0x1) << (63-lz)))) >> max(0,62-lz-22))
    return reinterpret(Float32, res)
end

This method is fast:

julia> @btime randf32()
  3.077 ns (0 allocations: 0 bytes)
0.03150859f0

and possibly quite SIMD friendly.
Also, obviously no rounding shenanigens and a (0,1) range. The code itself might be beautified a bit, and at a cost of a conditional, the near zero values can be fixed (adding subnormals/zero for ‘perfection’).

1 Like

Here’s the results of your PRNG on a small subset of the tests:

julia> f() = Float64(randf32())
f (generic function with 1 method)

julia> RNGTest.smallcrushJulia(f)
10-element Vector{Any}:
 0.0
 0.02831837983124974
 0.0
 0.0
 0.0
  (0.33372515690509996, 0.20834464696251898)
 0.0
 0.0
 0.0
  (0.0, 0.0, 0.0, 0.0, 0.0)

The results are various p-values obtained from a subset of the tests of the full testsuite. The null hypothesis that’s being tested is whether the RNG is “perfectly random”.

Definition from the paper

Here, we suppose that the goal is that the successive output values of the
RNG, say u0, u1, u2, . . ., imitate independent random variables from the uniform distribution over the interval [0, 1] (i.i.d. U[0, 1]), or over the two-element set {0, 1} (independent random bits). In both cases (independent uniforms or random bits) we shall denote the hypothesis of perfect behavior by H_0.

If I remember my statistics courses right, a p-value of 0.0 here means that we should reject the null hypothesis for your RNG, that is, the RNG is not random (or rather, the outputs of your RNG seem correlated)?

For comparison, this is what the regular rand produces:

julia> RNGTest.smallcrushJulia(rand)
10-element Vector{Any}:
 0.6098493790112907
 0.47477576740181004
 0.7499185910394682
 0.39189698849202037
 0.9863496089055992
  (0.28978340071294184, 0.018153485309939654)
 0.038676431829711366
 0.22365980295280385
 0.24631514274962396
  (0.6935142971894694, 0.40240018668164956, 0.34008902221806514, 0.0831391268076066, 0.9295955954065195)

The values vary, but are never exactly 0.0 as above.

Admittedly, I’m very shaky when it comes to p-values and their interpretation, but if my interpretation is the wrong way around, that would mean our current PRNG is extremely broken :grimacing: This sort of issue is exactly why we can’t just throw a “looks good & is fast” solution at the problem though.

2 Likes

Yeah, 0.0 not good. Something is wrong. Will check it out.

Practically speaking, 2^-53 is good enough – I am no fan, but would not raise a stink about it. I’m only objecting to including zero in the output distribution. That’s a 1 / core-year event, so your tests won’t find it until it blows up your run on a cluster.

2^-24 is not good enough – that’s a 60 / core-second event that is visible in non-hypothetical situations. So my primary issue is with Float32.

More poignantly BFloat16 with the same algorithm as Float32 / Float64 would be ridiculous – 1 / 128 outputs would be exact zero. That’s presumably why the relevant library doesn’t do that and instead uses rand(BFloat16) = convert(BFloat16, rand(Float64)).

There is a special SIMD codepath (large array rand!) that doesn’t throw away half of the random bits.

Outside of SIMD, it is a misapprehension of modern CPU to believe that smaller datatypes are faster for compute. Smaller datatypes make sense for non-CPU devices like embedded or GPU, SIMD and memory transfers. Of these, memory transfers are the big one: This is why Float32 is super important!

Outside of memory transfers, the vast majority of time / energy is spent on scheduling overhead (instruction decode, dependency trees, register renaming, dispatch to execution units, the speculative execution dance). This is why SIMD is so powerful!

Bulk random generation is plenty fast in SIMD:

julia> using Random, BenchmarkTools

julia> arr = rand(1<<30);

julia> t = @belapsed fill!(arr, 0.0); t * 1e9 / (8*length(arr))
0.053693858440965414

julia> t = @belapsed fill!(arr, 0.0); (8*length(arr)) * 1e-9 / t
18.612377560634947

julia> t = @belapsed rand!(arr); (8*length(arr)) * 1e-9 / t
11.760787855454568

shell> cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

As you see, we can almost saturate memory bandwith.

If we’re willing to play dirty, I think it should be possible to make a reasonably fast SIMD Float32 generator with an OK distribution. But I am not qualified to say something about GPU.

Playing dirty could mean:

Our current output distribution has a bad internal correlation: If the exponent is small (number close to 0) then the trailing mantissa bits are 0.

It should be cheap to instead have: If the exponent is small (number close to 0) for two subsequent output numbers, then their trailing mantissa bits are correlated.

There exists broad community consensus on the desired output distribution of rand(UInt64) (random independent bits).

There is no community consensus on what the correct output distribution of rand(Float32) should be (random linearly spaced fixed point number, losslessly converted to Float32? Random real (infinite-precision) number rounded down to next Float32?).

As such, the test suite TestU01 is meaningless when applied to floating point. The paper does not contain the string float (sorry for the paywalled link, it’s also available on scihub).

In my view, this is a failure by the wider scientific community. This may not be really our fault, but it is our problem.

The paper lays the theoretical groundwork for their implementation by looking at either bitstreams or words; their implementation requires Float64/double output. I’m not sure you can conclusively say “it’s meaningless” when their implementation actually requires us to output Float64 to be used as input for their tests.

All they’re doing, as I understand it, is check whether the generated numbers seem correlated/biased in some way, by applying well-known statistical tests from various parts of the existing literature (which looks at more than just integers too).

Can you link me some papers?

Does the following look like a good distribution to you?

julia> function testrand(N)
       hits = 0
       for i=1:N
       hits += (reinterpret(UInt32, rand(Float32)) & 0x01) == 1
       end
       @show N, hits, hits/N
       hits
       end
testrand (generic function with 1 method)

julia> testrand(1<<32)
(N, hits, hits / N) = (4294967296, 1073806067, 0.25001495773904026)
1073806067

julia> function testrand64(N)
       hits = 0
       for i=1:N
       hits += (reinterpret(UInt64, rand(Float64)) & 0x01) == 1
       end
       @show N, hits, hits/N
       hits
       end
testrand64 (generic function with 1 method)

julia> testrand64(1<<32)
(N, hits, hits / N) = (4294967296, 1073750961, 0.25000212737359107)
1073750961

I got curious how Julias random f32 would do, and it seems like it is not very good either. Maybe this is from using an f64 test for f32s?

julia> f1() = Float64(Float32(rand()))
f1 (generic function with 1 method)

julia> f2() = Float64(rand(Float32))
f2 (generic function with 1 method)

julia> RNGTest.smallcrushJulia(f1)
10-element Vector{Any}:
 0.0
 0.03207623089925982
 0.0
 0.0
 0.0
  (0.8513016848191678, 0.481245522840259)
 0.0
 0.0
 0.0
  (0.0, 0.0, 0.0, 0.0, 0.0)

julia> RNGTest.smallcrushJulia(f2)
10-element Vector{Any}:
 0.0
 0.7963828042776983
 0.0
 0.0
 0.0
  (0.45951642982173935, 0.8146179005918128)
 0.0
 0.0
 0.0
  (0.0, 0.0, 0.0, 0.0, 0.0)

I found the references of their paper very enlightening for that. You can also take a look at the references of dieharder, the tests of which are also contained in TestU01.

It should also be noted that merely searching for “float” is not a good indicator for whether or not a paper is talking about your usecase; the TestU01 paper for example also says this:

Tests are available for bit strings as well as for sequences of real numbers in the interval (0, 1).

right in the introduction. Coupled with requiring Float64/double input, I’d say they have done their due diligence in that regard.

I’m not sure what you’re asking; you’re showing me a single ratio of hits when looking at the last bit of a Float32, interpreted as a UInt32. A “good distribution” for what, exactly? There is no distribution here.

It’s possible - IMO that’s one more argument for being thorough with this and grounding it in something tangible/provable, not just “it’s fast”.

2 Likes

Fair enough, my fault, sorry for that.

Since you’re faster at trawling that literature – are you aware of any discussion by the authors about what a “uniform random floating point” number is supposed to mean?

In other words, where is generator of uniform random floats in a random oracle model that has ideal output distribution (by definition, i.e. this generator defines “ideal output distribution”)?

As far as I could see, there exists no real awareness that this is a potentially contentious point that requires clarification, but I’d be very happy to be proven wrong on that.

Apologies. This was meant to showcase how egregiously different the existing rand output distribution is from what I consider to be ideal.

After taking more of a look at the TestU01 paper, the above observation should be considered against the following from the TestU01 paper

  1. QUALITY CRITERIA FOR RANDOM NUMBER GENERATORS
    RNGs for all types of applications are designed so that their output sequence is a
    good imitation of a sequence of independent uniform random variables, usually
    over the real interval (0, 1) or over the binary set {0, 1}. In the first case, the
    relevant hypothesis H0A to be tested is that the successive output values of the
    RNG, say u0 , u1 , u2 , . . . , are independent random variables from the uniform
    distribution over the interval (0, 1), that is, i.i.d. U (0, 1). In the second case,
    H0B says that we have a sequence of independent random bits, each taking the
    value 0 or 1 with equal probabilities independently of the others.
    These two situations are strongly related, because under the i.i.d. U (0, 1)
    hypothesis, any prespecified sequence of bits (e.g., the bit sequence formed by
    taking all successive bits of u0 , or every second bit, or the first five bits of each
    ui , etc.) must be a sequence of independent random bits. So statistical tests for
    bit sequences can be used as well (indirectly) for testing the null hypothesis
    H0A .

And

  1. TESTS FOR A SEQUENCE OF RANDOM BITS
    Here we suppose that the generator produces a stream of bits b0 , b1 , b2 , . . . , and
    we want to test the null hypothesis that the bi are independent and take the
    values 0 or 1 with equal probability. These bits can come from any source. In par-
    ticular, they can be extracted from a sequence of real numbers in (0, 1) by taking
    specific bits from each number. The standard procedure for doing this in TestU01
    is to take bits r + 1, . . . , r + s from each number, that is, skip the first r bits and
    take the s bits that follow, for some integers r ≥ 0 and s ≥ 1, and concatenate
    all these bits in a long string. The values of r and s are parameters of the tests.

=============

Since I talked about “ideal float generation” so much, I’ll repeat my two definitions. Both are equivalent, i.e. define exactly the same distribution, and are parametrized by the usual floating point parameters (number of exponent bits, number of mantissa bits, exponent bias):

Definition 1: A uniform normal random float in (0,1) is generated by the following procedure: Generate a uniform real infinite precision number. Round it down to the next non-subnormal floating point number. If the result is zero, try again (rejection sampling).

Definition 2: The mantissa bits are sampled uniformly random (each bit is set with independent probability 0.5). The exponent is chosen according to geometric distribution, with truncation. An easy source if geometric distribution is to count the leading zeros of a uniform random bitstring. Truncation can be achieved via e.g. rejection-sampling if the number of leading zeros is too large to fit into the exponent.

Given that conception of “ideal non-subnormal float in (0,1)”, you can begin to judge quality of an RNG.

Where is the ideal float generator that TestU01 judges against?

Now that I finally have a firm understanding of what’s going on here, this whole “this distribution is bad” narrative just completely fails to resonate with me.

You have to discretize the possible values between 0 and 1, from which you uniformly choose one. The exact number of possible discrete values is a choice. There are some obviously bad choices (with too few discrete values or multiples that do not evenly divide into floating point values) and there are some “saturating” values beyond which it’s unobservable for a given datatype, but the line between “good” and “bad” is never going to be sharp or definite.

Your definitions change the perspective from “a discrete sampling from a continuous distribution” to “a sampling from all representable values of a datatype” — which is not necessarily a need for rand(). There could be a use-case for getting all random bit patterns of a floating point number, but that’s a different thing than drawing from a uniform [0,1) distribution.

6 Likes

That is correct. However, the fact that we return an ieee floating point value is suggestive of one very specific discretization, namely the one that underlies the definition of floating point values.

The word “uniform” is unfortunately very overloaded. If you’re saying that we MUST choose a set of N possible output values, each of which has equal probability 1/N, then – how on earth did you come to the idea that this is without alternative? If you mean by “uniform” that we want to approximate the uniform measure on real numbers in [0,1] then – how do you operationalize that? (I mean, this is what this discussion is about!)

That’s the issue though - you cannot actually do that on a computer. We don’t have space for “infinity”, there has to be a cutoff at some point, otherwise it’s possible that the call to rand never actually terminates, even for a single sample; before you even get to rejection sampling.

The way this is done is by sampling a uniformly random, iid integer that provides enough bits to discriminate all desired output values, and then map that integer to your set of output values. For the current RNG, we have chosen the set of output values to be “bins of equal width on the interval [0, 1)”, which means not producing all possible floating point values (since they have non-equal bin widths in that interval). When we say that the value that is spit out by this transform is “uniform”, we mean that no value is more likely to be output than any other of the possible outputs.

We could switch the RNG to use a mapping of unequal bin-sizes, provided we can still generate the same intervall (i.e., [0, 1)) and it’s reasonably fast - this must not and cannot have an effect on uniformity though; the bins we decide to split must, taken together, get as many “hits” as before we’ve done the split.

To circle that back around to my objection to removing 0 - the probability of actually getting 0.0 being (almost) 0 is not the same as saying that it’s impossible to get 0.0 at all - understanding that took quite a bit of thinking over things like this video.

It doesn’t need to physically exist; it only needs to discriminate whether the values produced by your RNG has a non-uniform distribution over the possible output values, i.e., whether some of the output values are more likely to occur than others.

Yeah, that was the “ideal” distribution. You need to mathematically define it – otherwise you lack the language to even talk about how good an implementation is. Now that we have defined the ideal, there is

An easy source of geometric distribution is to count the leading zeros of a uniform random bitstring. Truncation can be achieved via e.g. rejection-sampling if the number of leading zeros is too large to fit into the exponent.

A reasonable approximation of this procedure is given by terminating the program with a halt-and-catch-fire message if the number of leading zeros is too large to fit into the exponent. This guarantees program termination, and does not impact program correctness for BFloat16, Float32 or Float64. This is because the probability of obtaining more leading zeros from a random bitstring than fit in the exponent of these types is smaller than the probability of a random bitflip due to e.g. cosmic rays or radioactive decay leading to a wrong result anyways.

I’d actually go one step farther — I’m saying that every single algorithm for sampling from a continuous space can be described as dividing that space into N divisions and picking one with \frac{1}{N} probability. This is wholly independent of floating point particularities.

While it’s fun to do the “perfect” bit-twiddlings with Float32’s exponents and mantissas to ensure all possible Float32 values might get returned, doing so is exactly identical to having 2^{126} possible values and choosing one, rounding down upon conversion to Float.

4 Likes

I’ve been fiddling more with implementations. One family we’ve discussed is FloatXX(u) * exp2(-XX) (possibly with some slight adjustments for bias, and almost certainly with round-to-zero arithmetic) where u = rand(UIntXX). I’ll call these “scaling algorithms.” Another is to choose a random significand and a power-law exponent. And, if we wanted the support to extend to 0, we’d except the iszero(u) case to produce 0.0. I’ll call these “sigexp algorithms.”

Here’s an example implementation of a sigexp

function uint_to_ieeefloat_sigexp(u::U) where U<:Union{UInt16,UInt32,UInt64}
	# generates values in [0,1)
	F = Base.floattype(U)
	half = reinterpret(U, F(1//2))
	explsb = Base.exponent_mask(F) & ~(Base.exponent_mask(F) << 1) # LSB of exponent bits
	sigmask = Base.significand_mask(F)
	expshift = U(min(leading_zeros(u), leading_zeros(sigmask))) # for every leading zero, shift exponent down one
	usig = u & sigmask
	f = (half-expshift*explsb) | usig
	f *= !iszero(u) # extend range to include 0
	return reinterpret(F, f)
end

It was not intuitive to me, although I am sure it was obvious to others, that these produce “very” different supports. Consider the delta between the two smallest nonzero values. For scaling algorithms, it’s exp2(-XX). For sigexp algorithms, it’s eps(exp2(-2-exponent_bits)) == exp2(-XX-1) in the target float format. For Float32, these are 2^-32 for scaling and and 2^-33 for sigexp. Fine. But more significant is the dynamic range. The smallest nonzero scale is 2^-XX while the smallest nonzero sigexp is 2^-(2+exponent_bits) (or the nextfloat if we reserve a value for 0), 2^-32 vs 2^-10 for Float32. Notice that the step between 0 and the smallest sigexp nonzero is giant compared to the smallest nonzeros. This should impart a “notable” bias. Notice that the sigexp algorithm has much smaller dynamic range than the current significand-only scheme. The significand scheme’s smallest nonzero is 2^-(1+significand_bits). Despite consuming extra entropy, the sigexp scheme squanders this on representing all values rather than extending range – in fact it reduces it.

While sigexp algorithms represent every float within their range, scaling algorithms extend much lower but with much sparser support. Given that the initial concern here is with extending the dynamic range, a scaling type algorithm seems much better. Although it may be preferable/necessary to implement it with some form of twiddling if we can’t control rounding modes.

The moral of this story is that we cannot hope to support all floats within an interval with a reasonable dynamic range without consuming more entropy. To match the range of a scaling algorithm, sigexp requires an extra significand (minus 1 bit, I think) worth of entropy.

There’s a chance I’ve made some major mistake in this analysis. There’s also a chance that the sigexp algorithm (in any form resembling this) is not something anyone (other than me) seriously considered. But the sigexp algorithm as I’ve conceived it seems to make very bad tradeoffs.

1 Like