Output distribution of rand(Float32) and rand(Float64), thread 2

If there was no consideration of performance, and this was pure math, I would do the following:

Sample a real number in [0,1], uniformly according to standard measure. Then round it down to the next float.

In other words: P(rand() == x) = |[x, nextfloat(x))| for every exactly representable x (i.e. x == round(x)), and P(rand() < p) = p for every exactly representable p.

We now have a distribution that includes zero and subnormals. However, for Float64, Float32 and BFloat16, the probability of hitting exact zero or a subnormal number is smaller than ldexp(1.0, -120). This probability means ā€œnever in our finite universeā€.

More precisely, it means ā€œonly if the random generator is badly seededā€. On the other hand, exact zeros and subnormals are liable to trigger bugs that can mess up worse than bad statistics – e.g. security bugs.

So my principled distribution-with-safety-belt has P(rand() < p) = p for p: p == round(p) and p > minnormal, where minnormal is the smallest positive non-subnormal float. Then we clamp P(rand() < minnormal) = 0. This safety-belt is statistically invisible in our finite universe, but saves the bacon if somebody is bad at seeding the RNG.
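For concreteness, here is a rough (and slow) sketch of what sampling from this ideal-with-safety-belt distribution could look like; ideal_rand64 is a made-up name and the clamp depth is arbitrary, this is not a proposal for Base:

# Conceptual sketch: sample a real uniformly in [0,1) bit by bit and round it
# down to the next Float64, with the safety belt clamping away the
# astronomically deep tail near zero.
function ideal_rand64()
    e = 0                          # number of leading zero bits seen so far
    chunk = rand(UInt64)
    while chunk == 0
        e += 64
        # Safety belt: if the first 128 bits are all zero (probability 2^-128,
        # "humanly impossible"), clamp instead of descending further.
        e >= 128 && return floatmin(Float64)
        chunk = rand(UInt64)
    end
    e += leading_zeros(chunk)
    # The 53 significand bits start at the leading 1; rounding the real down
    # to the next Float64 keeps exactly these bits and truncates the rest.
    sig = (rand(UInt64) >>> 11) | (UInt64(1) << 52)   # uniform in [2^52, 2^53)
    return ldexp(Float64(sig), -(e + 53))             # lies in [2^-(e+1), 2^-e)
end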

======================
Ok, we now need to make compromises with performance. E.g. the randf you proposed, @StefanKarpinski.

I am saying that we should retain the clamping close to zero from the ideal-with-safety-belt distribution. Clamp at something small, e.g. ldexp(1.0, -60). This clamping is no more or less principled than the choice of how to distribute our subset of floats where we have the desired equality P(rand() < p) = p.
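As a minimal sketch of that safety belt on top of whatever fast generator is chosen (fast_rand is just a stand-in for the real implementation, not an actual function):

fast_rand() = rand()                                 # placeholder for the fast scheme
clamp_low(x, lo = ldexp(1.0, -60)) = x < lo ? lo : x
safe_rand() = clamp_low(fast_rand())                 # P(safe_rand() < 2^-60) == 0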

======================

All these clamping arguments do not apply to Float16. This is because nextfloat(Float16(0)) == Float16(6.0e-8) is a pretty large number with a probability that is significant! Clamping that introduces real bias that can mess up real statistics!

Hence, I have no opinion on Float16.

======================

PS. I think rounding Bernoulli(1.0e-100) to Bernoulli(1.0e-20) is a much worse mistake to make than e.g. rounding Bernoulli(1.0e-18) to Bernoulli(0.0).


I have seen you say this multiple times, but like… how?? Isn’t that the whole point of this thread, to come up with algorithms to sample uniformly from the unit interval?


He’s giving the thing we’re supposed to model (he says ā€œand this was pure mathā€) and then describes the suggested compromises…

I think this is the key. Yes, excluding 0 induces a bias, but the bias is ā€œessentially unobservableā€ and I’m with @foobar_lv2 and also with him about Float16. You really can’t expect much from Float16 when it comes to accurately modeling low probability events.

This is a fundamental disagreement between our world views. Floats are not approximate. They exactly identify a real value.

The current rand() < p is exactly correct for all ps that are multiples of eps(0.5).
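Spelled out (modulo implementation details of the current generator), the claim is that rand(Float64) is equivalent to drawing 53 uniform bits:

# Every output is k * 2^-53 with k uniform on 0:2^53-1 (note eps(0.5) == 2^-53).
bits_rand() = (rand(UInt64) >>> 11) * eps(0.5)
# For p = m * eps(0.5), exactly m of the 2^53 equally likely outcomes satisfy
# bits_rand() < p, so P(bits_rand() < p) == m / 2^53 == p, with no rounding error.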


That is neither uniform under the P(rand() < p) = p criterion, since P(rand() < min(S)) = 0 < min(S), nor under the P(rand() ≤ p) = p criterion, since P(rand() ≤ max(S)) = 1 > max(S). More generally, if _rand() is uniform in [0, 1) then your rand() will have probability too small at the smallest value in the support and too big everywhere else. This means that doing rand() < p to get an event with probability p is always wrong and not fixable in any obvious way.


Except that no computer RNG has that level of accuracy/uniformity in bit generation, or even close, and as @foobar_lv2 says, if you ask for a Bernoulli(1f-50) you’ll get a Bernoulli(0.0).

More to the point, what are the observable consequences of cutting off the interval [0.0, nextfloat(0.0)] or [0.0f0, nextfloat(0.0f0)]?

Those are of size:

julia> nextfloat(0.0)
5.0e-324

julia> nextfloat(0.0f0)
1.0f-45

The first one is unobservable in our finite universe as @foobar_lv2 says, the second one might be observable… lemme think, a cluster of 1 million cores each operating at 1 billion operations per second would take on average …

julia> 1.0/nextfloat(0.0f0)/(1e9*1e6*3600*24*365)
2.2628863722506974e22

So 10^22 years to see the first consequence.

Again, unobservable in our universe.

However, computers aren’t perfect; bugs in RNG implementations or seeding could bias results towards 0.0 dramatically more than they should, so we’re talking about protecting against a known special case which will only be hit if the RNG is already ā€œbrokenā€ in some sense. Which is the more important source of bias: the risk of broken software, or the risk that you’ll miss out on 1 event in 10^22 years of massive supercomputing?

This thread has multiple points. We at first need to reach consensus what ā€œsample uniformly from unit intervalā€ means, mathematically speaking, in the context of floating point numbers. I had hoped that this consensus was reached by now?

Hence I answered with a mathematical definition, not with an algorithm.

Then we need to come up with fast algorithms to do that, approximately (there will be trade-offs between perf and precision).

And we may want appropriate safety rails. An important point I tried to make is that users are incapable of dealing with precise zeros in the rand output. I brought empirical evidence to the table.

This strongly suggests that the following property of the ā€œrealā€ distribution should not be traded away for performance: Exact zeros don’t happen with the computational resources available to mankind (unless the RNG is ill-seeded).

I was a little hyperbolic on that – ldexp(1.0, -1000) is cosmologically impossible, but ldexp(1.0, -120) is merely humanly impossible.

Wait, so the stance is that we should exclude zero for some float types but not others? That’s so much worse.


IMHO we shouldn’t provide rand for Float16 by default, but instead provide a function that errors with a message along the lines of: please define rand(Float16) yourself, according to whichever type of bias you prefer; here’s a palette of several implementations you could choose from.

Are you trolling or serious?


No I’m serious, the tradeoffs should be explicit and only a small number of people should be using Float16 I think? Or maybe I am not aware of people using Float16 like crazy in some current applications?

Or let me put it this way: the probability of getting Inf when sampling Exponentials in Float16 is… 6e-8, so ā€œ10 times a secondā€.
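Presumably that figure assumes a naive inverse-transform sampler along these lines (not necessarily what Base’s randexp actually does, see below):

naive_randexp(T) = -log(rand(T))   # Inf16 exactly when rand(Float16) == 0
# rand(Float16) hits exact zero with probability on the order of 6e-8, hence
# roughly "10 times a second" at ~10^8 draws per second.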

You and @foobar_lv2 are continuing to conflate the number of available bit patterns with the values themselves. Ironically, those two things are only the same in the [1, 2) interval.

Once you have floating point numbers with varying epsilons, you must be more deliberate about what it means to draw one such number.
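For concreteness, the spacing of representable Float32 values varies hugely across the unit interval:

eps(1.0f0)         # 1.1920929f-7, i.e. 2^-23
eps(0.5f0)         # 5.9604645f-8, i.e. 2^-24
nextfloat(0.0f0)   # 1.0f-45, the gap just above zero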


No, the size of the region that rounds to 0 is | [0,6e-8] | = 6e-8 (or within a factor of 2 of that, depending on rounding mode I guess)

I can’t seem to reproduce that:

julia> using Dates, Random

julia> try
       println(now())
       while true
           isinf(randexp(Float16)) && println("gotcha")
       end
       finally
       println(now())
       end
2023-10-28T23:41:26.450
^C2023-10-28T23:41:42.422
ERROR: InterruptException:
Stacktrace:
 [1] _randexp(rng::TaskLocalRNG, ri::UInt64)
   @ Random ~/julia/usr/share/julia/stdlib/v1.11/Random/src/normal.jl:129
 [2] randexp
   @ ~/julia/usr/share/julia/stdlib/v1.11/Random/src/normal.jl:127 [inlined]
 [3] randexp
   @ ~/julia/usr/share/julia/stdlib/v1.11/Random/src/normal.jl:198 [inlined]
 [4] randexp
   @ ~/julia/usr/share/julia/stdlib/v1.11/Random/src/normal.jl:199 [inlined]
 [5] top-level scope
   @ ./REPL[7]:4

Interesting. On the other hand if you ask it to print gotcha when rand(Float16) gives Float16(0.0) you get Gotcha 10 times a second. So evidently there’s a difference in the implementation for Exponential?

Are we posting in the same thread? Am I that bad at communicating?

julia> function foo(p, N)
           smaller = 0
           smallerEqual = 0
           for i = 1:N
               r = rand(typeof(p))
               r <= p && (smallerEqual += 1)
               r < p && (smaller += 1)
           end
           @show N, p, smaller, smallerEqual, smaller/N, smallerEqual/N
       end
foo (generic function with 1 method)

julia> foo(ldexp(eps(0.5f0), -30), 1<<32)
(N, p, smaller, smallerEqual, smaller / N, smallerEqual / N) = (4294967296, 5.551115f-17, 276, 276, 6.426125764846802e-8, 6.426125764846802e-8)
(4294967296, 5.551115f-17, 276, 276, 6.426125764846802e-8, 6.426125764846802e-8)

This is what I’m complaining about. This result is wrong by 9 orders of magnitude.


This literally made me laugh out loud.

Yes, and as I pointed out in message #37 of this thread, this is expected, because the current generation scheme does not allow for the precision you’re asking for.

Could it be improved? Maybe, if the performance of such a scheme were good AND it followed the currently guaranteed distribution of possibly generating 0.0. So far, you’ve either generated a different distribution, generated a biased one, or had much worse performance.


My point above is that generating a biased distribution when the bias is at the level of ā€œglobal scale supercomputing can’t observe the bias in billions of times longer than the duration of the universeā€ is the right thing to do because errors in computer code are much much more of a problem than biases of this size. From that perspective, excluding 0.0 is the right thing to do for Float64 and Float32 because it’s unobservable bias. (and note, it’s what R does when generating rexp)
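For what it’s worth, a minimal sketch of one way to get that ā€œno exact zeroā€ behavior (just an illustration, not what Base or R actually does): resample on the astronomically rare zero, leaving the rest of the distribution untouched.

function rand_nonzero(T = Float64)
    x = rand(T)
    while iszero(x)        # probability ~2^-53 per draw for Float64
        x = rand(T)
    end
    return x
end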

Float16 is problematic as a model for probability.