Output distribution of rand(Float32) and rand(Float64), thread 2

A recent hackernews post led me to necro https://discourse.julialang.org/t/output-distribution-of-rand/12192. If we want to reopen that can, I think a new thread is in order.

Summary

rand(), and much more acutely rand(Float32), nominally generates a uniformly distributed random floating point number in the unit interval. In actuality, these functions generate a uniformly distributed fixed-point number in the unit interval and convert it to floating point. This means that the output distribution close to zero is quite counter-intuitive, and very much not what one would naively expect from “random floating point number in the unit interval”.

This is a footgun. Do we want to do something about it? What should we do about it?

Fixed point vs floating point numbers

Fixed point numbers have the property that adjacent numbers are a constant distance apart. Floating point numbers have the property that adjacent numbers have roughly constant quotients. In the unit interval, this means that representable floating point numbers are much denser close to zero. For floating point numbers, one typically measures errors as quotients, not differences (relative error, in order to be scale-free; if one only cared about absolute errors, there would conceptually be no point in using floating point numbers in the first place!).
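
For a sense of scale, here is a quick check (it relies only on the fact that nonnegative IEEE floats are ordered the same way as their bit patterns):

julia> reinterpret(UInt32, 1.0f0)   # count of Float32 values in [0, 1): about 1.07 billion
0x3f800000

julia> reinterpret(UInt32, 2.0f0) - reinterpret(UInt32, 1.0f0)   # count of Float32 values in [1, 2): 2^23
0x00800000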

If one adds a small and a large floating point number, then the excess precision of the small number is lost. This is not a problem in itself – it creates neither large relative nor absolute errors.

If one subtracts two numbers of the same rough magnitude, this is called “cancellation”, and it can be highly problematic: the result still has a small absolute error, but can have a very large relative error. If we are using floating point numbers for any reason other than convenience or CPU support, this is called catastrophic cancellation.

How julia generates random floats

Julia generates random floats by first generating a random number r = rand12() in the interval [1, 2). In this interval, floating point numbers are the same as fixed point numbers (literally!). This is therefore very easy to do and fast – just fill the significand/mantissa with a random bitstring.

Next we subtract 1, :tada:

rand() = rand12() - 1

At this point we should be very afraid: rand12() - 1 contains a catastrophic cancellation if our random number in [1,2) happened to be close to 1. In other words, we see that our random numbers are very bad for results close to zero.
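
A minimal sketch of this construction (hypothetical helper names, not the exact code in Base):

rand12_sketch() = reinterpret(Float64, 0x3ff0000000000000 | (rand(UInt64) >>> 12))  # exponent bits of 1.0 plus 52 random significand bits: uniform in [1, 2)
rand01_sketch() = rand12_sketch() - 1.0   # uniform on a 2^-52-spaced grid in [0, 1); all the extra resolution near 0 is gone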

When does it matter

This does not matter at all if we add our random numbers to something nonzero of constant size.

It does matter if we e.g. compute 1/rand(). Or to give a less contrived example: Suppose we want to sample from a random distribution that gives numbers in [1, Inf); say from a power law, with probability density ~ 1/x^a and parameter a > 1. The way this is typically done is via the inverse cumulative distribution function, F(u) = x such that probability(sample() > x) = u, where u \in [0,1]. For the power law this means u ~ 1/x^(a-1), i.e. F(u) ~ u^(-1/(a-1)). In other words, we will have code like the following, e.g. with a = 2:

sample() = constant / rand()
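
A runnable version for general a > 1 (a hypothetical helper; the constant above corresponds to the lower cutoff x_min = 1 here):

sample_power_law(a) = rand()^(-1 / (a - 1))   # density ∝ x^(-a) on [1, Inf); a = 2 reduces to 1 / rand()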

This is code that looks perfectly fine from a precision standpoint. We have a singularity at zero; but this is not an issue for floating point numbers. If we decided to instead write the real-arithmetic equivalent

sampleBad() = constant / (1 - rand())

i.e. decided to place the singularity of our mapping at 1 instead of 0, then alarm bells would start ringing – this is catastrophic cancellation. The statistics of extreme events will be completely wrong. But extreme events are one of the most important things when looking at fat-tailed distributions like power laws!

And, with the current julia random number generation, this is what we do:

sample() = constant / (rand12() - 1)

In other words, the output of rand() is by construction unusable for applications that care about precision close to zero.

An extreme example: Exact zeros in Float32

Consider: Take a uniform random real number u \in [0,1], and ask: What is the probability that convert(Float32, u) == 0.0f0?

Well, a good guess would be nextfloat(0.0f0) == 1.0f-45, but depending on rounding mode intricacies something like roughly 1.2f-38 (the smallest normal, i.e. non-subnormal, positive float) is also a valid guess.
It does not matter. The age of the universe is of rough order 1.0f27 nanoseconds, i.e. the correct answer should be “never observed in all past or future computations done by mankind”.
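
For reference, the two candidate thresholds guessed above (both are standard Base functions):

julia> nextfloat(0.0f0)   # smallest positive (subnormal) Float32
1.0f-45

julia> floatmin(Float32)  # smallest positive normal Float32
1.1754944f-38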

Will you observe convert(Float32, rand(Float64)) === 0.0f0? probably not today. Will you observe rand(Float32) === 0.0f0?

yeah, all the time:

julia> function func()
           N = 0
           while true
               N += 1
               rand(Float32) == 0 && return N
           end
       end
func (generic function with 1 method)

julia> func()
22098446

julia> func()
3274928

julia> func()
19735810

It is egregiously misleading to say that rand(Float32) generates a uniform distributed random floating point number from the unit interval.

What to do about it: Nothing

We can choose to ignore the problem. We can say: this is well-known among experts and they know how to work around it; this is how everyone does it; almost nobody is complaining.

What to do about it: Document it

One way to approach this is to document this footgun and provide an alternative function that does the right thing :tm:

An example is log, which links to log1p for avoiding cancellation when evaluating logs of numbers close to 1. So we could have a rand0p that is linked in the docs, and it is up to users to know / read the docs to use the right random function, on penalty of incorrect statistics of extreme events close to zero.

What to do about it: Fix it for real

The complementary approach to documenting this would be to provide an alternative fastrand function that produces the current counter-intuitive distribution, and change rand to produce the Right Thing :tm:

Thus we would replace a correctness footgun with a performance footgun: it is up to users to know / read the docs to use the right random function, on penalty of mysterious performance losses.

This has large compat issues (is there code that relies on the counter-intuitive cancellation?) and performance issues, depending on how we implement it: how fast can we generate the Right Thing? Can we SIMD it? With which instruction sets (i.e. does it need avx512, or is there an avx2 / neon way of doing this right)? On GPU? Will this lead to problems when people implement/port algorithms from papers that assume random numbers to be fixed-point-distributed instead of floating-point-distributed?

This is a can of worms that requires lots of discussion.

What to do about it: Document it, partially fix it

We can split the difference: i.e. do something reasonably fast that generates a distribution that is less egregiously misleading, and provide, document and link two alternatives, fastrand and rand0p. An example would be: exclude zero from the rand(Float64) output but otherwise keep it, and implement rand(Float32) = convert(Float32, rand(Float64)).
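
A sketch of that particular split, just to make it concrete (hypothetical names, not a proposal for the actual implementation):

# keep the current Float64 grid, but reject the single value 0.0
function rand64_nonzero()
    while true
        x = rand(Float64)
        x != 0.0 && return x
    end
end
rand32_partial() = Float32(rand64_nonzero())   # inherits the much finer Float64 grid near zero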

What do other languages do?

Java generates the same flawed distribution as julia. Numpy generates the same flawed Float64 distribution as julia, cf https://github.com/numpy/numpy/blob/85d956917903624289c451bd0fcfa4743ec86988/numpy/random/_common.pxd#L67. I think they do something better for Float32, though?

Shockingly, zig does the right thing (or something close to it), cf Zig. In fact it was this link in the related hackernews discussion that renewed my hope that something can be done at all!

28 Likes

What does Rust do?

1 Like

I don’t see why this couldn’t be fixed by an external specialised package built for this purpose, rather than by changing how rand() behaves.

I suspect “possible” in this case refers more to social obstacles than technical ones.

1 Like

Actually, specifically for Xoshiro, we generate floats slightly differently (https://github.com/JuliaLang/julia/blob/4ef353c6e4ad5156ad65cb40c3a2a61fbd324b59/stdlib/Random/src/Xoshiro.jl#L296): we uniformly generate an integer in [0, 2^53-1] and multiply it by 2.0^-53. How does this approach compare to “get a random float in [1, 2) and subtract 1”, or to the slower but more correct approach?
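
In code, that conversion is essentially the following (paraphrasing the linked line, not quoting it verbatim):

xoshiro_style_rand() = Float64(rand(UInt64) >>> 11) * 2.0^-53   # 53 random bits, scaled onto a 2^-53-spaced grid in [0, 1)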

7 Likes

The issue is the hidden footgun. If you are a person who would never write 1/rand() or log(rand()) or let it pass in code review, because you know in your heart that it suffers from catastrophic cancellation, and if all your upstream (i.e. authors of packages you use) are the same kind of expert, then there is no issue for you and there is nothing to fix.

It is fundamentally the same approach: it samples uniformly from a set of equispaced candidates. Floating point numbers are not equispaced; they have a lot of room close to zero, which is the point of floating the point (sorry for the pun).

3 Likes

That’s roughly equivalent, as I understand it, as it returns 2^{53} evenly spaced unique values between 0 and 1. The thing that makes the approach in Output distribution of rand() (thread 1) and the linked blog post clever is that it returns 2^{52} evenly spaced unique values in [0.5, 1), 2^{52} twice-as-closely-spaced unique values in [0.25, 0.5) (with half the probability), and so on…

They aren’t quite the same: the multiply approach gets 1 extra bit of randomness.

Newbie question:

Does Distributions.jl make the proper corrections for the rand() behavior when sampling from, e.g. an exponential distribution? (Or for whichever distributions it is relevant.)

3 Likes

As far as I can tell, they have the same issue, cf https://github.com/rust-random/rand/blob/9a02c819cc1e4ec6959ae25eafbb5cf6acb68234/src/distributions/float.rs

Rust gets a big :+1: because the source file documents quite clearly what they are doing. I know no rust and had to read no code, the comment explains it all.

They get a small :-1: because Open01 is even more egregiously misleading for f32 if you don’t read the fine print: Using Open01 implies that you care about not getting an exact zero result, presumably because you want to apply a mapping with singularity at zero. Well, you’re not getting a zero, so you won’t get a warning shot with crashes / Inf / NaN appearing in your statistics. But the statistics of extreme events are still egregiously counter-intuitive, just silently.

Sometimes? As far as I understand, Distributions.jl simply refuses to deal in Float32. I would accept “it’s good enough, it’s impossible to distinguish in Float64 by sampling”. Using 40 bits of randomness instead of 52 would also be OK by the same standard.

Stdlib already provides Random.randexp, so log(rand()) is already an RTFM-failure anyways.

But sure, somebody ought to audit that package. It’s probably a good canary in the “experts are already aware of this” coalmine.

1 Like

One question is how much entropy we intend to generate these numbers from? Do we generate a rand(Float32) from a rand(UInt32), a rand(UInt64), a system rand(UInt) (system-specific seems awful), or something bigger?

Let’s consider the case of Float32. I might be off by a bit in a spot or three here, but forgive me that. We have 23 bits of mantissa. If we attempt to generate this from a UInt32, we have 9 “spare” bits we can use to extend the dynamic range beyond just rand12()-1 (or rand(0:2^23-1) .* 2^-23). “Unfortunately,” those 9 bits are uniformly distributed but our exponent bits are not: each smaller octave is half as likely to be utilized as the one above it. So really, those 9 bits mean we can extend our range by an extra 9 octaves (certainly not 2^9 = 512 octaves).

An easy way to turn those top 9 bits into an appropriately distributed exponent would be to use leading_zeros/leading_ones, since they produce 0 with 50% probability, 1 with 25%, etc. We’ll need to clamp the result of leading_X so that it does not read into the significand bits.
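
A hypothetical sketch of that idea (illustrative only, not a vetted or exhaustively unbiased implementation):

function rand32_binades(r::UInt32 = rand(UInt32))
    k = min(leading_zeros(r), 8)                 # roughly geometric; clamped so the exponent never depends on the significand bits
    frac = ldexp(Float32(r & 0x007fffff), -23)   # low 23 bits, uniform in [0, 1)
    return ldexp(1f0 + frac, -1 - k)             # uniform within the binade [2^-(k+1), 2^-k)
end

With the clamp at 8 this only reaches down to the binade [2^-9, 2^-8); spending more spare bits (e.g. from a UInt64) would let the exponent walk further down.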

That said, isn’t it possible to just use a native machine conversion of a UIntXX to a FloatXX (except for the 16 bit case, since it can overflow), then scale by an appropriate power of two to get a value in [0,1)? The hardware should truncate the bits properly and maintain the distribution, no?
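
For the record, the naive form of that would be something like the sketch below; note that the conversion rounds to nearest by default rather than truncating, so the top of the range needs care:

naive_rand32() = ldexp(Float32(rand(UInt32)), -32)   # inputs within half an ulp of 2^32 round up, so this can return exactly 1.0f0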

I mean, it’s all a matter of tradeoffs. As I see it, there are really three key questions that’d need to be considered:

  • Is it breaking? It changes the interval from closed-open to open-open, and it’s a change in the documented behaviors. We have said, though, that changing RNG results is ok. It’s worth reviewing the time we looked into unbiasing rand() by making just the change from half-open to open: make rand unbiased for floats #33222. My read there is that maybe, but the weighting of tradeoffs in that change (which is different from this one) wasn’t worth it.
  • Complexity and correctness. It’s quite clear what the distribution of (0:2^53-1) ./ 2^53 looks like, flaws and all — and there’s additionally very well-established literature and norms in other languages. Apparently zig didn’t quite get its newfangled implementation right: Oh, cute. It looks like they ever-so-slightly overweight the probability of valu... | Hacker News.
  • Performance.
5 Likes

Even if there is no change to code, I do think that an edited version of this post deserves a spot in the manual. I had no idea about these problems, and this post was about the perfect level of complexity to educate me.

13 Likes

For this reason, I made a suggestion a while back to make this conversion the fallback option for other RNGs as well: Fallback `rand(Float32/64)` discards more entropy than necessary · Issue #44887 · JuliaLang/julia · GitHub. But a replacement that increases entropy even more and improves rare event statistics in the small floats would of course be even better.

We guarantee the distribution, so no, we cannot remove 0.

See also here:

The statistical distribution from which random samples are drawn is guaranteed to be the same across any minor Julia releases.

Could we introduce this as a new random number generator?

2 Likes

Thanks for bringing this up. I discovered this rabbit hole a while ago while debugging an issue, as documented here: Overflow in `randn` using CUDA.jl's native RNG · Issue #1464 · JuliaGPU/CUDA.jl · GitHub, but didn’t take it any further and I am glad that someone more knowledgeable is on it.

Just as a point of comparison, it’s clear based on my experiments that Nvidia’s cuRAND API is doing something smarter than the simplest conversions, at least for Float32, but it’s not going quite as deep as the algorithm proposed on HN. Specifically, based on the smallest value I could observe being 2^{-33}, and its recurrence rate being about the same, my guess is that it takes 32 random bits r::UInt32 and rounds (2r + 1) / 2^{33} to the nearest representable float without any intermediate rounding.
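
In Julia terms, that guess corresponds to something like the following (a hypothetical reconstruction, not cuRAND’s actual code):

curand_guess(r::UInt32) = Float32((2.0 * r + 1.0) * 2.0^-33)   # (2r + 1) / 2^33 computed exactly in Float64, then rounded once to Float32

With r = 0 this gives exactly 2^{-33}, the smallest value observed above.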

Note also that cuRAND is explicitly documented to return values in (0, 1]:

The curandGenerateUniform() function is used to generate uniformly distributed floating point values between 0.0 and 1.0, where 0.0 is excluded and 1.0 is included.

These discrepancies are perhaps relevant to the discussion of how much people rely on the low-level specifics of rand and how breaking a change would be. Clearly, commonly used APIs for different hardware choose different conventions.

(Note if you’re doing your own experiments: CUDA.jl does not use cuRAND by default, but makes it available through CUDA.CURAND.default_rng().)

3 Likes

If you interpret “statistical distribution” here as specifying the exact Float32/Float64 values on which the distribution has support, this was already broken when the Xoshiro RNG was introduced in 1.7 with the slightly improved bits->float conversion whose support is twice as dense.

4 Likes

I think it’s meant as the [0, 1) interval, not the exact output set. That is, it’s meant as an interval on the real number line, not on floating points.

1 Like

That makes sense. But as explained in the OP, if you exploit the full density of small floats (even excluding subnormals), you would never actually observe zero even when sampling over [0, 1): the subset of this interval that would round to zero rather than to a positive float would be so vanishingly small. But you could of course maintain a practically dead code path returning zero to formally preserve backward compatibility.

1 Like