Output distribution of rand(Float32) and rand(Float64), thread 2

So on the [0,1) vs (0,1] thing, it has everything to do with how you interpret and work with the results of rand() — and I finally have a good mental model here.

Forget binary for a moment; let’s pretend our datatype of interest is a decimal-based float with two significant digits, and that it has a random function defined as randpct() = rand(0:99)/100. Given that definition, how do I implement a test that returns true 5% of the time? I know I just instinctively reach for randpct() < 0.05. But why <? And why not <=? And why not on the other side, against 0.95?

I think the best mental model for an RNG with a [0,1) interval is that it approximates choosing a real number by binning and identifying each bin by its leftmost value. If you were to use randpct() <= 0.05, you’d count the results from six bins, not five. And if you wanted to use 0.95, you’d need to test it as randpct() >= 0.95. Of course, you cannot observe finer gradations than hundredths — and if you try, you’re really just counting how many of those hundredths bins have their left edge below your threshold. So randpct() < 0.045 is true 5% of the time just the same.
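To make that concrete, here’s a quick exhaustive check over the hypothetical two-digit randpct() from above: just count how many of the 100 equally likely outcomes each test accepts.

```julia
outcomes = (0:99) ./ 100          # every possible randpct() output, each with probability 1/100
count(x -> x <  0.05,  outcomes)  # 5 -- the five bins labeled 0.00, 0.01, ..., 0.04
count(x -> x <= 0.05,  outcomes)  # 6 -- oops: also counts the 0.05 bin
count(x -> x >= 0.95,  outcomes)  # 5 -- the five bins labeled 0.95, ..., 0.99
count(x -> x <  0.045, outcomes)  # 5 -- same five left edges as the < 0.05 test
```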

A (0,1] interval would flip that interpretation. You’re now identifying each bin by its rightmost value. So tests like these should be written as randpct() <= 0.05 or randpct() > 0.95.
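The same exhaustive count works for a hypothetical right-closed generator, say randpct_rc() = rand(1:100)/100 (a made-up name), whose outcomes are the right edges 0.01 through 1.00:

```julia
outcomes_rc = (1:100) ./ 100        # every possible right-closed outcome
count(x -> x <= 0.05, outcomes_rc)  # 5 -- bins labeled 0.01, ..., 0.05
count(x -> x >  0.95, outcomes_rc)  # 5 -- bins labeled 0.96, ..., 1.00
count(x -> x <  0.05, outcomes_rc)  # 4 -- the [0,1)-style test now undercounts
```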

The (0,1) interval that was proposed long ago would identify each bin by its centerpoint. Now either comparator would equivalently select 5 bins when the threshold is 0.05, but what about randpct() < 0.045? That now selects only four bins (the ones centered at 0.005 through 0.035). How do you succinctly describe that behavior? And how do you describe the bin between 0.99 and 1.0? Its center is 0.995, but we only have two significant digits: 0.005 fits as 5.0e-3, yet 0.995 doesn’t fit at all! So do you round that value up (to a right edge) or down (to a left one)? This means that floating point semantics have forced us into a left- or right-labeled bin anyway. (nb. when the actual change here was considered, we had a free bit to do this, but we no longer do!). It’s so very good we didn’t do this!
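The binary version of the “we only have two decimal places” problem is easy to see. As a sketch (assuming, purely for illustration, an N = 24 construction for Float32, i.e. bins of width 2^-24): the center of the topmost bin is 1 - 2^-25, which is not representable as a Float32 and rounds up to 1.0f0, so the nominally open interval ends up labeled by a right edge after all.

```julia
center_of_top_bin = (2^24 - 0.5) / 2^24        # exactly 1 - 2^-25 as a Float64
Float32(center_of_top_bin) == 1.0f0            # true: the tie rounds to even, i.e. up to 1.0f0
Float32(center_of_top_bin) == prevfloat(1.0f0) # false: the bin center cannot be stored inside the interval
```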

A closed-closed [0, 1] interval would also identify each bin by its centerpoint but it shifts the bins themselves, making the bins for 0 and 1 half as big. Practically, this “feels” right: it’s “just” the correct rounding of all theoretical values! But it’s surprisingly difficult to get a correct definition of randpct, especially in the face of limited entropy and round-to-even. No matter what you do, it requires an extra bit of randomness — you need to do the closed-open thing, uniformly choosing a value in 0:99, and then you still need to split that 0 to be 50% 0 and 50% 1.
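A minimal sketch of that construction in the two-digit decimal analogy (randpct_closed is a hypothetical name): draw from 0:99 as before, then spend one extra random bit to split the 0 outcome evenly between 0.0 and 1.0, so the two end bins each get probability 0.005 while every interior bin keeps 0.01.

```julia
# hypothetical closed-closed generator in the two-digit decimal analogy
function randpct_closed()
    x = rand(0:99) / 100
    # the extra random bit is only needed when we land in the 0 bin:
    # half of those draws become 0.0 and the other half become 1.0
    x == 0 ? (rand(Bool) ? 1.0 : 0.0) : x
end
```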

So is a [0, 1) randpct() really biased by -0.005? It depends on how you use it! If you use a threshold test with greater precision than the step size of 0.01, then yes, it’ll give you “biased” answers. But as long as you only use < tests and stay within the supported precision of randpct(), there is no bias.
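In other words, the apparent bias shows up in moments but not in in-precision threshold tests. Reusing the exhaustive outcome list:

```julia
outcomes = (0:99) ./ 100
sum(0:99) / 100^2                    # 0.495 -- the raw mean looks biased by -0.005
count(x -> x < 0.25, outcomes) / 100 # 0.25  -- an in-precision < test is exact
```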

So now what about other arithmetic? Multiplications are “just” changing the decimal point — as long as you don’t saturate the supported exponents, everything works as expected. But of course additions of floating point values can reduce your precision. Both [0, 1) and (0, 1] definitions will rapidly collapse to a poorly-rounded [a, b] distribution following a translation where eps(a) > 2^-N (where N is from my rand(0:2^N-1)/2^N definition). And of course 1/rand() and log(rand()) will blow up on 0s. Here it’s interesting that the common idiom 1 - rand() to transform from one semi-closed interval to the other will throw away any extra precision we give rand.
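For instance, assuming an N = 53 construction for Float64 (outputs are multiples of 2^-53; I believe this matches the default rand(Float64), but treat it as an assumption), the fine structure near 0 does not survive reflection or a large translation:

```julia
x = 2.0^-53         # the smallest nonzero output of such a generator
1.0 - x == 1.0      # true: the reflection 1 - rand() rounds it straight back to 1.0
1.0e6 + x == 1.0e6  # true: eps(1.0e6) is far larger than 2^-53, so the translation erases it
```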

Somewhat interestingly, when 2^-N < eps(T(0.5)), there may be a slight advantage to a [0, 1) distribution in that its closed endpoint (0) is observed less frequently than the closed endpoint (1) would be in a (0, 1] distribution. I’m not sure if this is valuable, though, as few algorithms will have singularities at 1 (and many do at 0).
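As a rough illustration of why, assume an N = 32 construction for Float32 obtained by rounding i/2^32 to the nearest Float32 (again hypothetical, not the actual implementation). Near 0 every grid point is exactly representable, so only i = 0 lands on the closed endpoint of [0, 1); near 1 they are not, so the analogous right-closed construction piles about 2^7 grid points onto 1.0f0:

```julia
# scan only near the endpoints, where collisions with them can occur
count(i -> Float32(i / 2^32) == 0.0f0, 0:1000)                        # 1
count(i -> Float32((i + 1) / 2^32) == 1.0f0, 2^32 - 1001 : 2^32 - 1)  # 129, roughly 2^7
```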

Those two cases map to N=0 and N=1 from my definition above. Very bad choices indeed, but living on that same well-defined spectrum.
