Output distribution of rand(Float32) and rand(Float64), thread 2

Please do not forget that this is Julia, and everyone can program their ideal RNG and have it live in a package. It can even be a thin wrapper on an existing RNG that generates random bits, eg UInt64 or whatever is desired, and shuffles them to some desired pattern. From then on, one’s ideal random number is just a using BestRNGPackageEver away.

I apologize if the notation was unclear. In my reply above, \varepsilon(y) is just a function (that we characterize), and it has nothing to do with the machine epsilon eps intrinsically. I recognize now that I should have used a different letter — sorry for the confusion.

Possibly. My claim is normative: if you are using output of an RNG in nonlinear transformations, you have to understand its properties.

Nope, that person is still computer illiterate when it comes to numerical computation. One should of course use randexp.

I am wondering if you were aware of randexp before you started this topic. It solves the problem you propose. It has been available since 0.4.

Given that rand is not the right tool for the job, I fail to see the point of this whole discussion. One can tweak rand endlessly, but there is already a better solution — why not just use that?

Why is that numerically illiterate? It is OK to use Float32 if you know that you will have modeling errors of order e.g. ~1e-3 anyways and if you don’t plan to squander precision in cancellations. The construction will lead to cdfPareto(x^-2) = cdfUniform(x); you will expect that your tail is correctly resolved with error 2*eps(y) up to outputs of size y < ldexp(1.0, 60) (from squaring). This is a kind of precision that is perfectly reasonable for many applications, and I don’t see the illiteracy? (except of course that our current rand doesn’t work that way)

On the other hand, suppose you do what Distributions.jl is doing, i.e. exp(2 * randexp()). Then you expect errors of order C * eps(log(y)) where y > 1 is the sampled value. If you care about the near tail, e.g. y ~ 2^30, then it is not obvious to me that this is a better algorithm?

I’m not saying that exp(C*randexp()) is an inferior way of sampling, just that the rand()^C class of algorithms is not necessarily numerically illiterate, unless you read the fine print of the rand() implementation.
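To make the comparison concrete, here is a minimal sketch of the two constructions being compared, using the x^-2 example from above (my own illustration, not proposed implementations):

using Random  # for randexp

# Both produce the same Pareto-tailed variable (P(Y > y) = y^(-1/2) for y ≥ 1),
# since -log(rand()) is Exp(1)-distributed.
pareto_via_inverse_cdf() = rand(Float32)^-2    # the rand()^C construction: error ~ eps(y),
                                               # but it inherits the granularity of rand()
                                               # near 0, which caps the largest finite output
pareto_via_randexp() = exp(2 * randexp())      # the log-space construction: error ~ C * eps(log(y))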

That is correct, but the statistical argument (if you buy it) suggests that we should not really care about whether cdf(p) = p exactly anywhere; just that |cdf(p) - p| < sqrt(p * (1-p) / N) for some large N (N between 2^40 and 2^120 is the number of samples for which our distribution is good).
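For concreteness, that criterion could be written as a one-liner (my own phrasing; the name and the default N are made up, with N chosen inside the range quoted above):

# A deviation of the cdf at p is statistically invisible after N draws if it is
# smaller than one standard deviation of the empirical frequency at p.
statistically_invisible(p, deviation; N = 2^60) = abs(deviation) < sqrt(p * (1 - p) / N)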

Cool! I like that blog. I am going to accept your authority on stats questions then, and I’d be very happy if you could correct me where I’m wrong or where I’m using very non-standard concepts or language.

Statistics-wise I’m barely literate, and the questions posed by “how should we study the quality of a sampling scheme” reach the point where I have to sit down and think and am quite unsure, instead of knowing this by heart because I do this every day.

Then why not sample directly from there? (A truncation of a Pareto distribution is also Pareto). It is much more straightforward.

I am very confused by your use case (if there is one). Apparently you are expecting heavily nonlinear transformations (using the inverse transform method and Float32) to give you correct results. While you can always reverse-engineer an implementation for rand to provide this, it will not be correct in general.

Because textbooks on random numbers discuss various special cases of generating univariate random draws for heavy-tailed distributions, for a good reason. You can make the inverse transform method work with Float64 in some cases, but for Float32 you need specialized methods.


Ok, I am pretty convinced by this point (if your summary of the literature is correct, but I’m going to believe you on this). A user who samples Pareto with rand()^C is almost surely naive.

Can you recommend me such a textbook?

I’d like to see how they discuss the issue of “y = rand() gives fixed-point eps(1.0) precision, actually”. E.g. is it obvious to people steeped in the literature that Probability(rand() < p) ~ 1.0 - (1.0 - p), and would they be shocked by an “ideal” rand() output with “y = rand() gives floating point eps(y) accuracy”?
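To put a number on the distinction (my own example):

p = 1.0e-10
1.0 - (1.0 - p)   # ≈ 1.0000000827e-10: p is only resolved at the absolute
                  # resolution eps(1.0), i.e. the fixed-point behaviour
eps(p)            # ≈ 1.3e-26: the relative resolution an "ideal" floating-point
                  # rand() would aim for at this p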

The literature is huge, and it depends on what you are looking for. If you want to work with Float32, then the older literature is perhaps of interest, eg Knuth’s Seminumerical algorithms. Also a lot of papers from that era, when even Float32s were expensive.

I personally like

@book{gentle2003random,
  title={Random number generation and Monte Carlo methods},
  author={Gentle, James E},
  volume={381},
  year={2003},
  publisher={Springer}
}

Sorry, I don’t think that sentence is actually meaningful. How can a random number have “precision”?


Ok, I took a look at that book. Thanks for linking me a textbook for dummies instead of an “introduction to the fundamentals of xyz, vol. 1 / 4” monograph that would leave me hopelessly lost.

I grabbed the 2nd edition from scihub/libgen.

That being said, the book appears to be hopelessly confused? Or am I failing at reading comprehension?

It does suggest using the inverse cdf method from uniform U(0,1) (i.e. rand()^-2) for Pareto (p. 191f).

The book is aware of the “ideal” distribution on floating point numbers (p. 6f, “An ideal uniform generator …”). More poignantly (p. 7):

In numerical analysis, although we may not be able to deal with the numbers in the interval (1 − b^−p, 1), we do expect to be able to deal with numbers in the interval (0, b^−p), which is of the same length. (This is because, usually, relative differences are more important than absolute differences.) A random variable such as X above, defined on the computer numbers with e = 0, is not adequate for simulating a U(0, 1) random variable.

But then the book suggests algorithms that only work well with ideal U(0,1) variables, and at the same time the exercises suggest using non-ideal standard library generators for U(0,1) that generate a fixed-point approximation of U(0,1)?


A zero-bit precise random number is rand() = 0.5. A 10 bit precise random number would be rand(0 : ((1<<10) - 1 )) / 2^10.

If you want a more formal discussion, we did that up-thread at lengths. The way I was using the words here is the following:

The random generation process can be described by its cdf. The precision of the output tells us how well this approximates the abstract mathematical cdfMath(x) = x under perturbations of the input to cdf (comparing cdf(x) to cdf(x')), i.e. perturbations of the x-axis of the cdf, i.e. perturbations of the output of rand(). That is subject to numerical analysis, floating-point representations, etc.

The accuracy (with respect to probability) corresponds to perturbations in the y-axis of the cdf, and it is subject to statistical analysis, not floating point considerations.

This split in language between precision and accuracy is somewhat non-unique because differences between two cdfs can be equivalently viewed as either a precision or an accuracy problem.

On the one hand I agree with this perspective. Usually we do things like the ziggurat method for normals, or sample from the tails of standard distributions by special-casing them. On the other hand, we’ve already seen that Julia samples exponentials by the transformation method and does so in a way that limits the right tail.

I think the more concerning case, though, is where you’re not doing some standard transformation but rather something like a molecular dynamics simulation, an agent-based simulation, or a nonstandard transformation related to some special application. In particular, people are likely to do something like rand(Float32) < eps, where eps is calculated deep inside each iteration of some simulation routine and turns out to be something like 1e-11.
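A concrete illustration of that hazard (my own example, assuming the current fixed-point rand(Float32), which returns multiples of 2^-24):

p = 1.0e-11
n = 10^9
hits = count(_ -> rand(Float32) < p, 1:n)
# A correct sampler would give ≈ n * p ≈ 0.01 hits; the fixed-point output can only
# satisfy the comparison when rand(Float32) == 0, which happens with probability
# 2^-24, so we expect ≈ n * 2^-24 ≈ 60 hits instead.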

In general, I’d probably be using Float64 if I needed to do that, though. These days, is Float32 even faster anymore? I suspect only if you’re using GPU math.

The strongest evidence we have, IMHO, is that Julia’s randexp is limited in its right tail, and that probably means lots of other people are doing the naive stuff too.

This should produce a linter warning.
The correct way is to do:

rand(Bernoulli(eps))

No mention of Float32 and the implementation depends on eps.

As an extreme example, if eps is 0.5 or 1//2 it can do a rand(Bool); if eps is some BigFloat, it has to work appropriately.
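A hypothetical sketch of what that type-dependent behaviour could look like (illustrative only; the name is made up and this is not how any existing package implements it):

bernoulli(p::Rational) = p == 1//2 ? rand(Bool) :
                         rand(1:denominator(p)) <= numerator(p)  # exact for rationals
bernoulli(p::BigFloat) = rand(BigFloat) < p                      # extended-precision draw
bernoulli(p::AbstractFloat) = rand(typeof(p)) < p                # subject to the left-tail
                                                                 # issues discussed here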

I think this further argues for a belt-and-suspenders approach… try to make rand() have more of a floating-point distribution.

I don’t disagree, but ultimately the left tail of the uniform is special precisely because people DO treat it as special, textbooks fail to teach the right thing, and there are no linters that can catch this stuff.

I agree. I guess the upshot is the following:

  1. Provide an ideal Open01 generator (a rough sketch of what this could look like follows below).
  2. Provide a FastClosedOpen01 generator that produces fixed-point precision (same algorithm as the current default).
  3. Point people to these two alternatives in the documentation, with big fat scary warnings. Document what the default rand() does, and document that this is subject to change (aka “implementation-defined behavior”).
  4. Take a shot at improving the default rand(), subject to performance-precision tradeoffs. After some grace period, all remaining users of the default rand() can be assumed not to care particularly about either precision or performance, so the result doesn’t need to be perfect on either metric.

There will invariably be much gnashing of teeth about whatever tradeoff we take in (4). But since that is subject to some grace period anyway, we should try to go forward with 1-3 ASAP?
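As referenced in point 1, here is a rough sketch of what an “ideal” Open01-style generator could look like, using the binade idea discussed in this thread: spend some bits of one draw on choosing the binade, the rest on the significand, and fall back to a rare slow path. The function name and the exact 9/23 bit split are my own illustrative assumptions, not a proposed API.

# Sketch of an "ideal" open (0, 1) Float32 sampler: the top 9 bits of a UInt32
# pick the binade, the low 23 bits fill the significand, and the all-zero case
# (probability 2^-9) falls back to a recursive slow path.
function open01_sketch()
    u = rand(UInt32)
    top = u >>> 23                        # 9 binade-selection bits
    if top == 0                           # rare: the value lies below 2^-9
        return ldexp(open01_sketch(), -9) # recurse, scaled into (0, 2^-9)
    end
    z = leading_zeros(top) - 23           # 0 ≤ z ≤ 8 leading zeros within the 9-bit field
    frac = Float32(u & 0x007fffff) * ldexp(1.0f0, -23)  # 23 significand bits, in [0, 1)
    return ldexp(1.0f0 + frac, -(z + 1))  # uniform on the binade [2^-(z+1), 2^-z)
end

With this scheme each representable value x is produced with probability eps(x) within its binade, 0 and 1 are never returned (short of astronomically deep recursion), and the slow path fires once per 2^9 calls on average.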

That statistical argument assumes that the only use case for rand() values is generating Bernoulli samples, which is of course not the case. Because of that I’m unconvinced that satisfying that criterion is sufficient—it’s definitely necessary though. Ironically, much of this thread has been about taking the output of rand and applying arbitrary non-linear transforms to it and expecting the output to have a reasonable random distribution! That puts arbitrarily tight requirements on the accuracy of the CDF.

Even if we assume that the Bernoulli statistical criterion is good enough, we have sqrt(p*(1-p)/N) ≈ 2^-32 for p = prevfloat(Float32(1)) and N = 2^40, so we don’t have a ton of wiggle room for making the probability wrong (2^-32 is two orders of magnitude smaller than eps(p)). And of course, it really feels like a cop out to produce values with incorrect probabilities when we can—and currently do—produce them with exactly the correct probability. I do find it somewhat compelling to shift the CDF in the lower end of the rand() output so that it goes both above and below the diagonal instead of just above. But not at the cost of screwing up the CDF at the top of the output range.
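Spelling out that arithmetic (my own check of the figures above):

p = Float64(prevfloat(1.0f0))   # 1 - 2^-24
N = 2.0^40
sqrt(p * (1 - p) / N)           # ≈ 2.33e-10 ≈ 2^-32
eps(prevfloat(1.0f0))           # ≈ 6.0e-8 = 2^-24, about two decimal orders of magnitude larger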


The alternative is to just make the deviation smaller, something like

It cuts the size of those deviations to essentially 0 by comparison, and it is trivially easy to verify its correctness, compared to dealing with specialized instructions for different processors, etc.

Also, the probability of generating 0.0f0 suddenly becomes much, much lower: since Float32 can represent ldexp(1.0f0, -74) = 5.29e-23, the probability of 0.0 is now about that much, and that’s only observable by a million cores operating for 218 days.

The bias essentially goes away in terms of observability because it becomes so much smaller. And the cost of doing it is 1.35 times the cost of rand(Float32):

julia> @btime(rand32fallback()) ; @btime(rand(Float32))
  3.837 ns (0 allocations: 0 bytes)
  2.845 ns (0 allocations: 0 bytes)
0.0863567f0

julia> 3.837/2.845
1.3486818980667838

I don’t think we can use FloatXX in the course of generating a FloatYY for XX > YY. What about some embedded system or GPU where Float64 support is slow or nonexistent? I’m pretty sure I’ve never used rand(Float32) and I probably never would except on such a system. I don’t even like using UIntXX wider than FloatYY for related reasons, but at least over-wide ints can be emulated somewhat effectively.

There is absolutely zero performance difference for sequential code that doesn’t meaningfully touch memory on a modernish CPU.

There is a difference for SIMD, and memory consumption, and memory bandwidth.

I tend to use Float64 and Int64 as default in sequential code that doesn’t meaningfully touch memory, as well as for code where I know that Float32 has inadequate precision or range (e.g. if I’m doing naughty stuff where I know my algorithm to be numerically unstable, but I’m too lazy to do the right thing or not clever enough to come up with a right thing). Otherwise I try to default to Float32 (and Int32).

Right. I still think that my statistical argument has some merit, but it is definitely less watertight than I thought :frowning:

I like your proposal on grounds of pragmatism. But I think the issue is that your proposal is neither “close enough to ideal” to replace an ideal sampler, nor SIMD-able enough to replace the current default.

That being said, maybe a branchy ideal SIMD variant can be done by very careful coding after all?
The probability of needing a branch in Float32 generation is ldexp(1.0, -9); we’re generating 16 Float32 in a single block, so the chance of a block failing is ldexp(16.0, -9) which is 3%, i.e. once per 2 kB of output. Branch miss penalty is ~ 20 cycles, so there could be a way?

On Float64, the probability of needing a branch is ldexp(1.0, -12) and we generate 8 numbers in a block, so we need to stall the pipeline with a probability of ldexp(8.0, -12), which is 0.2% (once per 32 kB of output). That definitely sounds like an acceptable performance overhead!

Writing the code such that the compiler doesn’t spill any registers in the “good case” will suck, but maybe fast SIMD exact ideal uniform samples are achievable after all?

Although I would struggle to write such code, one could conceivably check whether all SIMD lanes produced “valid” values, and if not, generate an entire new set at a lower range, blending with the previous result to replace invalid lanes. This could be iterated repeatedly, especially since the branches would be well-predicted. I’m pretty sure all the vector instructions exist to do this (on x86, at least).
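For what it’s worth, a hedged sketch of that lane-blending idea in plain Julia, with broadcasts standing in for SIMD lanes and no actual vector intrinsics; fastpath and fill_open01! are made-up names, and the 9-bit fast path is the same illustrative scheme as in the earlier sketch:

# Fast path for one 32-bit draw: valid unless the 9 binade-selection bits are all zero.
function fastpath(u::UInt32)
    top = u >>> 23                                      # 9 binade-selection bits
    z = leading_zeros(top) - 23
    frac = Float32(u & 0x007fffff) * ldexp(1.0f0, -23)
    value = ldexp(1.0f0 + frac, -(z + 1))               # nonsense when top == 0, but flagged below
    return value, top != 0                              # (candidate value, lane is valid?)
end

# Fill a block: regenerate only the invalid lanes at the next lower range and blend them in.
function fill_open01!(out::Vector{Float32})
    draws = fastpath.(rand(UInt32, length(out)))
    out .= first.(draws)
    invalid = .!last.(draws)
    scale = 1.0f0
    while any(invalid)                        # rare: probability ≈ length(out) * 2^-9 per pass
        scale *= ldexp(1.0f0, -9)             # descend one range per retry
        draws = fastpath.(rand(UInt32, length(out)))
        out .= ifelse.(invalid, scale .* first.(draws), out)
        invalid .&= .!last.(draws)
    end
    return out
end

The loop body touches every lane unconditionally (the blend), so only the while condition branches, and it is taken rarely and predictably.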

I’ll also remark that our default RNG has only 256 bits of state. This means that values less than about 2^{-256} (or maybe more like 2^{-309}, I’d have to think harder) may be literally unrealizable. But we shouldn’t count on that, since the RNG is a customizable and orthogonal component.


So, I’ve been following this thread out of curiosity since, for the most part, I don’t truly understand a lot of the finer details. Given that, I hope it’s not impertinent of me to make a suggestion in the interest of finding a solution.

There is a ton of code flying around with various implementations of functions, statistical tests, plots, benchmarks, etc. Perhaps it is time for someone to take the lead and create a package that incorporates all the current best implementations of the various choices, some statistical tests to ensure the validity of the distributions, and benchmarks to compare the proposed changes to the current status quo. Then hopefully a “best candidate” will emerge.

Enough thought, effort, and time has been put into this thread that a nice comparison package wouldn’t be too heavy a lift, perhaps?

edit: such a package would also nicely organize and preserve all the fantastic knowledge being served up here.


I am not certain that there even exists an ideal floating point representation of an open-open \text{Uniform}\big((0, 1)\big). The cdf is easy (once you choose your comparator), and we need to get this right, especially for the “ideal” case:

\text{cdf}(p) = P(\texttt{rand()} < p) := p

But we’re sampling from a set of discrete values, so each value itself has some probability of being drawn. Ideally, we’d just use the spacing between floats (the fraction of the real line each value “covers”) to determine this probability:

P(\texttt{rand()} = p) = \begin{cases} \texttt{eps(p)},& \text{if } 0 \leq p \lt 1\\ 0, & \text{otherwise} \end{cases}
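As a quick sanity check of that assignment (my own arithmetic, for Float32 with its 23 explicit significand bits): the per-value masses really do sum to one, since each normal binade [2^{-(k+1)}, 2^{-k}) contains 2^{23} floats with spacing 2^{-(k+24)}, and the subnormals together with zero contribute the remaining 2^{-126}:

\sum_{p \,\in\, \mathbb{F}_{32} \cap [0,1)} \texttt{eps}(p) = \sum_{k=0}^{125} 2^{23} \cdot 2^{-(k+24)} + 2^{23} \cdot 2^{-149} = \left(1 - 2^{-126}\right) + 2^{-126} = 1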

These two definitions work together and form a wholly consistent [0, 1) universe in which the \text{cdf} is accurate for all floating point values. But P(\texttt{rand()} = 0) is necessarily nonzero. So what’s a definition of the discrete probability that removes zero and preserves the \text{cdf} over all floating point values? You can’t do it. No matter what you choose, \text{cdf}(\texttt{nextfloat(T(0))}) will always be wrong! Interestingly, if your \text{cdf} is defined with an <= operator, this problem hits you at the other endpoint (where it’s so much worse!).

If you do the naive thing and just change the definition above to 0 < p < 1, then the CDF is underestimated by eps(T(0)). But the interesting thing is that this underestimation is only observable in subnormal floating point numbers.


But I don’t know that this is particularly relevant because I agree that the elephant in the room here is how you want that “ideal” distribution to behave under arbitrary transformations. And here I posit that even the “ideal” fully open (0, 1) distribution is going to immediately fall on its face in ways that cannot be programmed against. Simply add one to it and it will collapse to a poor approximation of a fully closed [1, 2].

And this is where I’ve become quite certain that having a rigorous understanding of what rand() actually means is important, and why I think half-open intervals are the only intervals that make sense. Specifically, I’m going to claim that our current shoddy rand() can be perfectly transformed by any monotonically increasing function f if you perform the transformation as Float64(round(f(big(rand())), RoundDown; base=2, sigdigits=54)). If it’s monotonically decreasing, use RoundUp. Now the \text{cdf} of that transformed rand() will have the same guarantees as our rand currently makes, specifically that \text{cdf}(p) will be perfectly accurate for p in f.(0:eps(0.5):1-eps(0.5)).


Shouldn’t this be ldexp(1.0, -10)? Then for 16 of them it’s closer to 0.016.

I think open-open isn’t so important if you have an “ideal” generator, since exact 0 has probability nextfloat(0.0f0) = 1.0f-45 and that’s not really observable (it would take ~10^22 years for a million-core cluster).

But if you wanted to do a check for

if f == 0.0f0
  f = nextfloat(f)
end

I don’t think this breaks the CDF property in a meaningful way. It puts a point mass on an unobservable place and the CDF property is still correct for all floats bigger than 2*nextfloat(0.0f0)

2 Likes