Sure, but at the same time you can say the same thing of any floating point number that we currently can generate. Or of any Int64, for that matter; the vast majority of them will be large, simply because the upper bits have more “weight” to them. That doesn’t mean that it’s ok to just remove any one number completely though; you’re changing the distribution. Consider changing the example from the OP, only testing for 0.5f0 instead of 0.0:
julia> function func()
           N = 0
           while true
               N += 1
               rand(Float32) == 0.5f0 && return N
           end
       end
func (generic function with 1 method)
julia> func()
1532588
julia> func()
4351377
julia> func()
39273158
Curious, isn’t it? The documented promise we make is that these numbers are all (roughly) in the same ballpark, no matter which x you choose to compare to. The example works for any Float32 outside of subnormals (because we don’t generate those). Whether the distribution is uniform or not can’t be checked by this kind of spot-checking though - you actually have to go in and create a histogram of the numbers. That shouldn’t be too difficult, since it’s only about 17GB of data: 32 bits per Float32 means ~4.3 billion possible values, and storing one UInt32 counter per value gives ~17GB. The claim in the OP that this is a “correctness footgun” is so far (to my eyes at least) unfounded.
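For what it’s worth, a scaled-down version of that census is easy to sketch. This is just my own illustration (names and sample size made up), tallying only the values that actually show up rather than keeping 2^32 counters:

# Hypothetical spot-check (scaled down): tally observed rand(Float32) outputs by exact value.
# A full census needs 2^32 UInt32 counters (~17GB); this only records what appears.
counts = Dict{Float32,Int}()
for _ in 1:10^7
    x = rand(Float32)
    counts[x] = get(counts, x, 0) + 1
end
length(counts)   # number of distinct values seen so far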
So to me at least, the question posed in the OP is less about “should we generate 0.0f0” or “is the default rand uniform” and more about “should we generate subnormals” - and for that I say “not in the default, but we ought to provide a generator specifically for those” (maybe in a package?), because once you do, you very easily get a dramatic performance hit due to bad hardware support for them. This can also be a security problem [1].
Additionally, I don’t think we ever guaranteed to generate every possible Float32 or Float64 in the real interval [0, 1), and I really wouldn’t expect us to by default, primarily for these performance reasons down the line (and not because generating the numbers itself is a tad bit slower).
If I’m understanding you correctly, that’s not the case - and no, this isn’t about subnormals at all, as neither scheme generates them. Your func will never terminate, for example, if you look for prevfloat(0.5f0).
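For concreteness (my own quick check, taking the eps(T)/2 grid described further down as given):

# prevfloat(0.5f0) == 0.5f0 - 2f0^-25, which is NOT a multiple of 2^-24:
prevfloat(0.5f0) / (eps(Float32)/2)   # == 8388607.5, halfway between two grid points
# whereas 0.5f0 itself sits exactly on the grid:
0.5f0 / (eps(Float32)/2)              # == 8388608, an exact multiple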
It would exist in the output (but be half as likely to occur) in the scheme proposed by @foobar_lv2 and the above blog post. But this makes some sense if you think about how floats represent rounded “bins” of real numbers, and the bins in [0.25, 0.5) are half as big.
That’s a good point, and would have been a better example in the OP than 0.0f0. Regardless, my point stands - we haven’t guaranteed to generate every possible floating point number (normal or subnormal, I don’t think) in the real interval [0, 1).
That’s true - but changing that to be bin-aware and actually generating every (normal) number possible in the bin definitely changes the distribution (as I understand it), and that’s not something we can do. The guarantee is on the distribution of the real line, not the floating point output, because that’s the distribution actually relevant when thinking about statistical guarantees of the desired randomness.
I don’t understand your point, then. Those last two sentences really seem to be at odds with each other, no? If the guarantee is on the distribution of the real line and not the floating point output, then the likelihood of any exact value being generated isn’t guaranteed. And, if you take the perspective of all the real numbers in [0, 1), the chances of a random one being closer to 0f0 than to any other nonzero value are on the order of 10^{-45} — astronomically unlikely. So are we really changing the distribution? Or are we making it a more faithful approximation of the uniform real numbers?
Admittedly I agree, the blog post and exposition here weren’t the easiest to follow, and I definitely wouldn’t call the current generator the result of catastrophic cancellation. But the core idea is quite straightforward:
The status quo is that we (effectively) sample from 0:eps(T)/2:1-eps(T)/2 with equal probability.
This change would sample from 0.5:eps(0.5):1-eps(0.5) with probability \omega, 0.25:eps(0.25):0.5-eps(0.25) with probability \frac{\omega}{2}, 0.125:eps(0.125):0.25-eps(0.125) with probability \frac{\omega}{4}, and so on… with (I think) 11 such regions for Float64 and 8* for Float32. This means that all values greater than 1.3234889800848443e-23 (for Float64) and 1.1641532f-10 (for Float32) are possible outputs.
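To make the region idea concrete, here is a rough sketch of such a sampler for Float32. This is purely illustrative - binned_rand32 is my own name, the details (e.g. drawing fresh mantissa bits) are assumptions, and it is not the actual implementation under discussion:

# Rough sketch of a bin-aware Float32 sampler (illustrative only).
function binned_rand32()
    # Each leading zero of a random UInt32 halves the probability, so the region
    # [2^-(k+1), 2^-k) is chosen with probability 2^-(k+1); k == 32 gives the 2^-33 floor.
    k = leading_zeros(rand(UInt32))
    lo = ldexp(0.5f0, -k)                # left edge of the chosen region
    frac = rand(UInt32) >> 9             # 23 more random bits
    return lo + Float32(frac) * eps(lo)  # one of the equally spaced values in the region
end

Whether the mantissa bits should come from a fresh draw or be reused from the same word (and what happens at the very bottom) is exactly the bit-budget question raised further down.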
On a slight tangent, Uniform(a,b) seems to need a documentation fix, as it says:
Uniform(a,b)
The continuous uniform distribution over an interval [a, b] has probability density function
f(x; a, b) = \frac{1}{b - a}, \quad a \le x \le b
Uniform() # Uniform distribution over [0, 1]
Uniform(a, b) # Uniform distribution over [a, b]
yet it uses a + rand()*(b-a), which is kind of [a, b).
* 125 such regions if we stick to non-subnormal Float32.
I’d be super happy if you could find better words to explain the issue, or maybe describe what makes it hard to follow?
What is the conceptual difficulty that makes this hard to grok?
Are people just unaware how floating point numbers work, conceptually? That’s why I added the reminder on floating point numbers, but I admit that my post will be much too concise for people who need an intro-to-float instead of a reminder-of-float.
— not saying these are the options, just setting up a strawman —
if a change to rand() were proposed that made the distribution more uniform, or with more bits of randomness, or more performant
I don’t think changing the precise categorical distribution over the set of floats is the kind of “breaking” change that will get (m)any complaints, besides those cases where known bias corrections are now over-correcting.
Hm, maybe this can illustrate my point: the bin between 1 and 2 that we currently use is a whole bin, with no smaller subdivisions. That is, the distance between each successive number in that bin is equal. We then effectively “shift” that bin towards zero on the real number line by subtracting one, keeping the distance between numbers the same. With the approach from the article that doesn’t happen anymore; because it truly picks from the available floating point numbers, the outputs end up denser the closer you are to zero. There simply are more numbers available there, and that’s not necessarily a good thing.
The numbers are still all equally likely to be picked (weighted by the size of the rounding bins they represent), but if you actually plot the numbers you get from that random generator, you’ll notice a cluster at the bottom (irrespective of the higher bins being more likely and therefore being chosen more often! In a histogram, all numbers from the possible set of floating point numbers are perfectly equally likely). That is the shift in distribution I’m talking about.
I may very well have something wrong in my logic, and I admittedly haven’t plotted the results (which would be the easiest way to convince people either way), but this is what I’m talking about.
Another way to put it - keeping the local density of the numbers we generate the same around any given value is IMO more important than being able to generate every possible floating point value.
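To put rough numbers on the density I mean (a quick aside of mine): the status-quo grid has one constant spacing across all of [0, 1), whereas a bin-aware generator’s local spacing is eps(x), which keeps shrinking towards zero:

# Local spacing of representable Float32 values, i.e. what a bin-aware sampler would use:
eps(0.9f0)    # ≈ 6.0e-8  - roughly the constant step of the current grid
eps(0.001f0)  # ≈ 1.2e-10 - outputs near 0.001 would be this finely spaced
eps(1f-7)     # ≈ 7.1e-15 - and finer still approaching zero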
This seems like a misunderstanding. Nobody is suggesting that each floating point value in [0, 1) should have equal weight. Rather, each value in sample space should be weighted by the width of the bin it represents on the real line. This width is determined by the distance to the nearest neighbors in sample space.
Nope, but the OP explains why it would be good to do so, or at least to improve on the current situation. It probably doesn’t matter in cases where U([0, 1)) is the actual distribution you’re interested in (e.g., for the if rand() < p trick), but it makes a huge difference for rare event statistics if you’re actually trying to sample a heavy-tailed distribution by feeding your uniform rand through the inverse CDF.
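As a hypothetical illustration (my own numbers, for a unit Pareto with tail index 1, taking the 2^-24 grid described above as given):

# Pareto(xm = 1, α = 1) by inversion: X = 1/U for U ~ Uniform(0, 1).
pareto_sample() = 1 / rand(Float32)
# If U only takes multiples of 2^-24, the smallest nonzero U is 2^-24, so the largest
# finite sample is 2^24 ≈ 1.7e7; the entire tail beyond that collapses onto U == 0,
# i.e. Inf. A generator with more dynamic range near 0 fills that tail back in.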
Actually, if you sample enough values and study them closely enough, you’ll notice that they are more clustered in the top end, not the bottom, because the sample space is sparser up there. But any histogram with bin size larger than the largest rounding bin will be flat (given enough samples).
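If anyone wants to check that: a coarse histogram along these lines (sizes picked arbitrarily) should come out essentially flat for either generator, since every coarse bin aggregates many rounding bins:

nbins = 1024
counts = zeros(Int, nbins)
for _ in 1:10^7
    x = rand(Float32)                   # or swap in the bin-aware variant
    counts[floor(Int, x * nbins) + 1] += 1
end
extrema(counts) ./ (10^7 / nbins)       # both extremes should be close to 1.0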
Ah, yes, of course it couldn’t be 8 regions (as that’d have a minimum value of 2^{-8}). But it’s also not 125 (at least not in your implementation that I played with yesterday). This is representative of my own confusion here — things are a bit of a mishmash of bit twiddling with multiple implementations, floating point shenanigans, passionate pushes about good and bad, and theoretical probabilities.
On the bit twiddling front: with 23 bits for the significand, there are only 8 or 9 bits left to use for the exponent, no? It appears you’re using all 32 bits in the lzcnt (the minimum value is 2^{-33})… which would impart quite the bias… so I’m obviously missing something here.
Yes, I’m aware. I’m not talking about that though.
I’m not talking about a histogram. Yes, if you plot a histogram there will be a cluster at the top (which is also bad). I’m talking about “did a number occur there at all” which has a cluster close to zero. That is not a good clustering, because it indicates that the numbers closer to 1 we can actually represent are less diverse than the numbers close to 0 - i.e., they don’t represent a uniform sampling, but a biased one.
You can also see exactly that in the histogram you describe.
Ultimately, the question here boils down to “sample from the floating point values representable in [0, 1)” or “sample from the real number line with uniform distances (== bin sizes) in [0, 1)”. My point is that we currently assume/guarantee the latter, while the article and OP advocate for the former.
Floats are logarithmically rather than linearly spaced because this is the useful and appropriate representation for arithmetic across orders of magnitude. It’s the reason you can do 1/x with impunity. This is still true if x originated from an RNG. In general, increasing the entropy of random numbers can only be a good thing, apart from performance regressions and “xkcd: Workflow”-style issues. Of course, any proposed conversion algorithm should be carefully analyzed for statistical bias.
Other thoughts: if X is uniform on (0,1) and Y is Normal(0, 1e-6), then X+Y restricted to (0,1) (i.e., rejection sampling) is still very nearly uniform on (0,1), but it is much more densely distributed on the floats.
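A sketch of that construction (jittered_rand and the σ = 1e-6 choice are mine, just to illustrate):

using Random

# Jitter a uniform draw with tiny Gaussian noise and reject anything outside (0, 1).
# The result stays (very nearly) uniform but lands on a much richer set of floats.
function jittered_rand(rng::AbstractRNG = Random.default_rng())
    while true
        z = rand(rng) + 1e-6 * randn(rng)
        0 < z < 1 && return z
    end
end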
Sure, but again, that then changes the question of what is sampled from. Are we sampling from the representable floating point values in [0, 1) (i.e., picking one of a logarithmically sized bin), or are we sampling (mathematically) from the real number line and binning to uniform bin sizes afterwards? I think the “uniform distribution” in the docs is talking squarely about the latter, and that should be represented in the binning of the actual values we generate.
If you take the sampled-from-representable-floating-point-values distribution and re-bin it to have uniform bin sizes, you of course get back a nice flat histogram - but then you could have just generated that in the first place by using the existing method. Not doing that, on the other hand, skews your histogram due to the uneven binning, and some actually representable values are then more likely to occur than others.
Yes, and I’m worried about bias being introduced due to the difference in bin sizes used across the sampling domain.
Good catch - the former certainly doesn’t imply the latter (though I had the latter in mind while writing the former - sorry about that!).
Again though, it comes down to where & how to do the binning. I think sticking to uniform bin sizes is a better default because it keeps the output likelihood of any single floating point value we are able to generate constant. It’s just one less potential source of bias.
as its accuracy, yes. I agree though that requiring users to know this is neither ergonomic nor good. On the other hand, one could also argue that this is inevitable depending on which value you pick; don’t you get the same “overweighting” on the other end of the interval, no matter whether this is fixed for small numbers? After all, the “proper” bin there also has higher precision than the interval we currently generate from:
julia> eps(prevfloat(1f0))
5.9604645f-8
Either way, if this is changed, it should definitely be clarified that rand() samples from the representable floating point values in the interval [0, 1), and not from the real number line.
A user who writes Bernoulli(1.0f0 - 1.0f-8) or more generally Bernoulli(p) for p close to 1 is very hard to help – on some level this is a question of basic computer literacy and the user “deserves” what’s coming to them.
My example (p close to 0) bites a user who does everything correctly. The expected deviation for Bernoulli(1.0f-8) is on the order of eps(1.0f-8) == 8.881784f-16.
In other words, the implementation sample(rng, b::Bernoulli{T}) where T = rand(rng, T) < b.p is the bug. This implementation looks intuitively right, even after consulting all the documentation of rand. Nevertheless, it is incorrect, because either the implementation or the documentation of rand is bad.
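To put a number on it (mine, assuming the eps(T)/2 grid described above): rand(Float32) < 1.0f-8 can only fire when the generator returns exactly 0.0f0, so the realized probability is 2^-24 rather than the requested 1e-8:

p = 1.0f-8
# On the 2^-24 grid, the only possible output below p is 0.0f0, hence:
realized = 2.0^-24      # ≈ 5.96e-8
realized / p            # ≈ 6.0, i.e. the event fires about six times too often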
Btw @adienes, Rust has a similar issue: there is no mismatch between docs and impl in rust-rand, but the docs are not emphatic enough in warning users about the limited dynamic range. I mean, not even rust-rand itself understands the (documented!) problem with its uniform random numbers, to the point that pareto misuses them and produces completely wrong (fat) tails.
Suppose we sample a float v \in [0,1). Define the smallest nonzero value we may observe as \eta. Afterward, we dutifully verify that v \notin \{0\} so that v \in (0,1). We might now reasonably expect that 1-v \in (0,1). But if 1-\eta rounds to 1 (e.g., because we apply “extra” precision to small v), we instead get 1-v \in (0,1]. Obviously, this is now the domain of floating point pitfalls more than random number generation, but it’s a sharp edge that a casual user might reasonably trip on. I imagine there are a number of related inconsistencies, possibly more insidious than this.
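A two-line version of that trap, plugging in the 2^-33 floor mentioned elsewhere in the thread (my numbers):

η = 2.0f0^-33          # a tiny but nonzero v that an extra-precision scheme could return
η > 0                  # true: we verified v ∈ (0, 1)
1.0f0 - η == 1.0f0     # also true: 1 - v rounds back up to 1, so 1 - v ∉ (0, 1)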
It’s usually not difficult to construct failure modes with floating point math, and a careless implementation will often run afoul of at least one. My example doesn’t seem intrinsically more or less of an issue than the “generate small-probability Bernoulli random variables” issue raised above. I don’t see how the two can be simultaneously resolved by any adjustment to the set of values returned by rand. Neither resolution avoids pitfalls for all possible uses.
This reminds me of a snippet of code I had in a recent project. I needed to compute \log(1-\exp(x)) for x\le0. My implementation was x > -1 ? log(-expm1(x)) : log1p(-exp(x)). Note that the first calculation is unsuitable when x \ll -1 and the second is unsuitable when x \approx 0. No single code path was suitable for the entire domain (at least not with standalone log/exp-like functions), so extra care was necessary.
If you truly want accurate Bernoulli(1f-40) random variables, applying a few bits of extra precision won’t fix a rand() < p implementation. The sampler itself must find a way to produce a reasonable value (or error) for such a challenging request. This probably means switching to a different algorithm in certain regimes.
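For instance, purely a sketch of the kind of regime switch I mean (tiny_bernoulli and its thresholds are made up):

# Split a tiny probability into cheap coarse rejections plus one residual comparison,
# so no single rand() comparison needs more dynamic range than it can deliver.
function tiny_bernoulli(p::Float64)
    while p < 2.0^-32
        rand(UInt32) == 0 || return false   # fires with probability exactly 2^-32
        p *= 2.0^32                         # exact rescaling by a power of two
    end
    return rand() < p                       # residual p ≥ 2^-32 is resolved comfortably
end

Each pass through the loop rescales the remaining probability by 2^32 while keeping the overall success probability essentially equal to p, so even Bernoulli(1f-40) only costs a handful of cheap integer comparisons.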
By the reasoning I’ve discussed here, plus some status-quo bias, I’m marginally more sympathetic to the current “fixed-point” output precision, though I’m willing to shift my stance given further examples.