Output distribution of rand(Float32) and rand(Float64), thread 2

That output distribution looks very strange. It’s monotonic and has a bag of 2^64 possible values, of which many are repeated. Looking just at the low-end of the range and looking at the numbers as multiples of 2^64 (to make them intelligible):

julia> @. rand64(UInt64($(0:12))) * $(2.0^64)
13-element Vector{Float64}:
  1.9999999999999998
  1.9999999999999998
  1.9999999999999998
  1.9999999999999998
  5.999999999999999
  5.999999999999999
  5.999999999999999
  5.999999999999999
  9.999999999999998
  9.999999999999998
  9.999999999999998
  9.999999999999998
 13.999999999999998

I can’t figure out for which values of p the identity I think we all want want — that is, P(\texttt{rand64()} < p) = p — is true.

There’s a 4-in-2^64 chance of getting values less than 2*2^{-64}. There’s an 8-in-2^64 chance of getting values less than 6*2^{-64}. Maybe the values for which it’s correct to test lie in-between its observable outputs? But then it’s not obvious how that applies at the other side of the range where it’s impossible to place a p between its observable outputs. It just seems… confused and fraught.

1 Like

My proposal was based on the observation that the distribution given by ldexp(convert(FloatT, rand(UInt64)), -64) is pretty OK if we must avoid leading_zeros / trailing_zeros due to vectorization concerns on avx2 machines.

Conceptually, this is bog standard naive linearly spaced 2^64 points on the unit interval.

There are only three issues with this:

  1. This can emit 1.0, because we can’t set the rounding mode.
  2. This can emit 0.0, which I don’t like for a lot of reasons.
  3. convert(Float32, UInt64) doesn’t actually vectorize well on avx2. But there is a juicy instruction to convert signed Int64 to Float32.

So I fixed all three issues:

  1. We multiply with prevfloat(1.0f0), which prevents the forbidden emission of 1.0
  2. I dislike emitting zero, so I exclude it by setting the lowest bit. We probably could also add ldexp(1.0, -64) or something (this addition doesn’t change largish output values, a typical float case of x + y == x with nonzero y).
  3. Can’t convert UInt64 to Float32 in a single instruction? Yeah, let’s just unset the leading bit. The compiler understands this and uses the signed-Int64 to Float32 instruction.
  4. Oh, we still need to shift the result into the unit interval, i.e. ldexp(..., -63). I can fuse that with the multiplication, by shifting the constant factor.

This doesn’t mean that the sampling is good, just that it should not be bizarre.

1 Like

Just on the observability of things between 0 and 2^-64,

julia> 2.0^64/(1e9*1e6*3600)
5.124095576030431

A million cores at a billion cycles per second takes on the order of 5 hours to observe this. It’s really really small, but not unobservable in some distributed cooperative SETI@home type thing (say with 50 million participants)

on the other hand the smallest non-subnormal float is ldexp(1.0f0,-127) right? that’s gonna take the same million cores: 4.7e19 hours to observe.

The big downside to something like rand32_for() or similar is speed. not because of the rare path where you go through the loop 2 or 3 or 4 times, but because of it not being vectorizable and/or taking some extra instructions in the hot path. Still, rand32_for() takes only: 4.067/2.865 = 1.42 times as long as rand(UInt32)

I’m not sure where the right tradeoff is. Is leaving the smallest possible nonzero Float32 at ldexp(1.0f0,-64) sufficient? I’d argue that’s the upper limit on that number. It might be acceptable but even smaller is better.

Sure, it’s just that in doing so you’ve broken the one thing that I’d consider the most fundamental property of a rand() on [0, 1), specifically that

P(\texttt{rand()} < p) = p

That identity holds for our current rand() and all p that are multiples of eps(T)/2. That identity holds for @mikmoore’s and my N=32 algorithms for all p that are multiples of 2^{-32}. That identity holds for @StefanKarpinski’s N=64 algorithm for all p that are multiples of 2^{-64}.

Even moreso, it’s possible to give exact error bounds for when p is not in that fully-supported set.

7 Likes

I think the speed of that is totally fine. A factor of 1.5 is a non-issue for me.

The problem is that we want the same output distribution in the SIMD case (up to statistical distinguishability with maybe 2^80 samples).

So we need to choose a flawed compromise output distribution that admits fast sequential and fast SIMD implementations (these are permitted to use completely different implementation strategies and algorithms!).

A slowdown factor of 1.5 for rand!(Vector{Float64}) and maybe 2.5 for rand!(Vector{Float32}) is probably OK.

A slowdown factor of 10x for rand!(Vector{Float32}) is completely unacceptable.

And we need fast SIMD variants to exist for all three of avx2, avx512, and arm (both apple and the nowadays common arm server chips).

I am really not into the wide world of arm vectorization, so I can’t even say what u-archs are sine-qua-non for us.

Sure, but eps(Float32) = 1.1920929f-7

So conceptually the cdf(p) = 1.19e-7 at p = nextfloat(0.0) and remains equal to that constant all the way up to around 2.3e-7 so the maximum discrepancy is on the order of 1e-7 whereas for his thing what’s the max discrepancy? 10^-19 or something?

also:

agreed 100%

I just don’t have low level cpu instruction knowledge to understand the issues.

Please please please read my post on descriptive vs. prescriptive. I’m not saying these errors are particularly small or good with the status quo, but I am saying that it’s good that we can compute them from first-principles — and I have. And it’s hugely important that there’s a very clear set for which there is zero inaccuracy.

7 Likes

So, it occurs to me that there’s a nice way to visualize the questions we’re discussing and visualizations can help

e = ecdf((1:2^10)/ldexp(1.0,32))
plot(x -> ldexp(e(x),-22),xlim=(0,ldexp(1.0,-28)),ylim=(0,ldexp(1.0,-28)))

The cdf is a staircase function if we generate on a grid of size ldexp(1.0,-32) and this staircase hits exact values at the corners (red is the line y = x)

Your argument is that even though the cdf is too big almost everwhere, it’s important that it’s exact at those corners.

My argument is that I would be fine with it being non-exact at ANY representable float, as long as the discrepancy is small, and it’s easy to make this discrepancy smaller than our current scheme.

rand32_for() only fixes the smallest step… but honestly maybe we should be fixing things more like on the range 0 to eps(1.0f0) which is more like 1e-7

edit: in fact just adding ldexp(1.0f0,-33) to the output of rand() cuts the maximum discrepancy in half and eliminates 0.0

2 Likes

Yes, that’s exactly my argument. But because floating point values are not real numbers, there’s one extra twist: once you move far enough out along that line only the corners exist! This is what happens in [0.5, 1) region with the status quo.

Now this is the fun part, we can also reduce the size of those stair-steps in in fully-defined mechanism. Adding just one extra bit means that only corners exist in [0.25, 0.5), and only every other corner exists in [0.5, 1). This is what happens as N increases. Increase it enough, and there are only corners everywhere (or perhaps just everywhere that’s probable).

3 Likes

True, so ldexp(1.0f0,-9) + ldexp(1.0f0,-33) == ldexp(1.0f0,-9)

So, if we add ldexp(1.0f0,-33) to the output of rand(Float32) it cuts the maximum discrepancy in the CDF in half near 0.0 and makes no difference for values larger than about 0.001953125f0 it also eliminates the possibility to output 0.0 exactly.

(not that I’m advocating this, just pointing out what happens).

magenta is the shifted cdf

Yes, that’s what make rand() unbiased for floats (#33222) attempted to do. (@foobar_lv2 that’s the reference you had asked for). The trouble is that now those “fully accurate” mid-points land between representable floating point numbers.

2 Likes

I don’t consider this a problem myself.

But let’s imagine rand32_for() for example. This fixes the entire problem with the CDF in the first step on my graph to within a factor that is unobservable on that graph… ie. the CDF would look like the line y=x up to the first step, and then look like the steps afterwards (the first step would be divided into 2^32 tiny steps)

Now, when I think about it, I feel like we should probably do better than that. Obviously the “perfect” sampling makes every possible floating point number in [0,1) one of the possible outputs. That’d be “as smooth as possible”. I think @foobar_lv2 has proposed something that does that except he’s got a few fixups in the range 2^-64 which you don’t like because they produce discrepancy from cdf(x) = x which is on the order of 2^-64 but that’s on the order of 2^-32 the size of our current discrepancy…

on the other hand a modification of rand32_for() using smaller

function rand32_for2()
    local exponent, a
    for outer exponent in -32:-12:-96
        (a = rand(UInt32)) < 1<<12 || break
    end
    return ldexp(Float32(a), exponent)
end
histogram([rand32_for2() for i in 1:1e7])

@btime(rand32_for2())

takes about 4.3 ns

This means stepsizes for rand floats in the 2^-32 range would be around 2^-44

Pretty sure I might have had some bugs in that code but I’m running around with my kids

The problem is that it breaks P(\texttt{rand()} < p) = p for all p >= 0.5, underestimating all such values by that offset you added. So you’re once again off that ideal CDF line, but now in such a way that’s harder to describe. To use your diagram, adding dots for the representable floating point values at this transition region:

f513eec10078962e834cc687334061f5b09ccfd0_2_690x456

You’re just now under-estimating the CDF in a different region (that is currently exact in the status quo!). There’s an additional problem that there are also floating point numbers in-between the ones I marked on the left-hand side… and they’ll also be under-estimates!

1 Like

How much precision is enough? I have 4 answers I like:

  • Full-significand: 53 bits for Float64. Very fast via native int->float instructions on many architectures. Currently implemented.
  • Full-bitwidth: 64 bits for Float64. Use non-default rounding (e.g., round-to-zero via hardware or emulation) to produce a float with precision matching a same-width integer.
  • Full-representation: 1074 bits for Float64, but often realized with much less (average 55 bits). Uniform distribution supporting every nonnegative float less than 1 except maybe subnormals.
  • Full-representation-full-range: 2098 bits for Float64, but often realized with much less (average 55 bits). Uniform distribution supporting every nonnegative finite float except maybe subnormals (semantically equivalent to T(rand(InfinitePrecisionFloat) * 2^N) where 2^N is the next float larger than the largest finite T) delivering full precision across the entire range of T.

Analogous definitions can be made for other Base.IEEEFloat types.

Full-significand and full-bitwidth are justified by performance, although if we ignore status-quo bias then the full-significand might be unjustifiably crude. The full-representation versions are the more “correct” versions, but have no hope of good performance because either they will not vectorize (branching code) or because they require too much entropy (branch-free).

Anything between same-bitwidth and full-representation is sacrificing performance to take a possibly-inadequate stance on “sufficient” precision. This isn’t being proposed a lot, but it does come up when people say “Float64 is fine but Float32 is not.” Remember, we don’t provide Float32 unless specifically asked – the default is Float64.

I don’t think there’s a one-size-fits-all rand we can offer. I think the full-representation versions are too non-performant for default use in vectors and I don’t want scalar and vector versions to have different output ranges. I think (perhaps wrongly) that making rand do better than full-bitwidth is ultimately untenable.

In my opinion, if we actually care about rand() < p, we (or a library) should provide a function to compute this accurately for any representable p. But we shouldn’t regress every possible use of rand to enable the reckless application of this specific idiom.

I’m increasingly sympathetic to an “infinite-precision” randf(T) function. Then randf() < p would be totally acceptable. I wouldn’t be so adverse to limiting that distribution to (0,1) either, since 0 would be “unlikely” anyway (except in Float16). We might decide to exclude subnormals, too, but I wouldn’t drop these unless we also drop 0 for stepsize reasons (they’re uncommon anyway except in Float16). This function’s performance cost would be negligible in scalar use but it would not vectorize well. My main “concern” with a (0,1) range would be that people would use it just to avoid 0 but then complain about poor performance. Maybe we don’t even offer a vectorized method because there’s minimal potential performance gain?

1 Like

We have two big objectives: speed and correctness. With two objectives, there is a pareto optimal frontier and not a single solution. We should provide multiple, clearly named paths.

Yes, we do care about this, and there is also a better notation for this IMHO which would allow improved specialization:

using Statistics

rand(Bernoulli(p))

Currently, this translated into rand() <= p. Note the <= which can be important for all the prevfloat(p) \ p and allow 0.0 arguments. BTW it is also efficient taking 3.5 ns on my machine.

For extra correctness for example:
rand(Bernoulli(1//3)) would be better translated into a rand(1:3)==1 or to a rejection sampling method, which would implement this accurately (assuming RNG is ‘perfect’).

PS Bernoulli(p) is a little long and inside a package, maybe a coinflip(p) function in Random would be just as readable and less painful.

Yes but the relative error around 0.5 is like 2^-32 so on this graph that’s like a millionth of a pixel width, whereas in the vicinity of 2^-32 the error is 2^-32 so that’s about as bad as it could be.

1 Like

I agree, but it is not necessarily self-evident that the concept of “relative error” is more relevant than “absolute error” when applied to probabilities. I think this is where you two are not reaching each other?

I tried to think about that in my big wall-of-text poll post. My argument was that, when you flip a large number N of coins, you will get a normal distribution ~ p ± sqrt(p * (1-p) / N ) . It may be more appropriate to measure probability errors in multiples of the standard deviation for some large N, i.e. in (p_wrong - p) / sqrt(p), instead of either absolute errors p_wrong - p nor relative errors (p_wrong - p) / p.

I’d like to summon a statistician to comment here, but I know none.

I mean, not technically trained as a statistician but I do statistical data analysis a lot and have spend 15 years discussing applied statistics daily on andrew gelman’s blog with a lot of other data analysis and stats people. Most of what I use Julia for is Bayesian data analysis.

imho the biggest concern is that we do transformations of random variables. The typical case might be to take [0,1) and push it through the inverse cdf of something else and want the tails to have some detail. In that case, errors near 0.5 are usually irrelevant, but errors near 0 or near 1 matter. The closest you can get to 1.0f0 is prevfloat(1.0f0) = 0.99999994f0 the closest you can get to 0.0 is technically nextfloat(0.0f0) = 1.0f-45 so these don’t have the same kind of resolution. However often one tail makes more “difference” than another. For example the exponential transform -log(rand()) will take the left tail of [0,1) to the right tail of the exponential, where the distances from the mean get unbounded, while the right tail of [0,1) goes to the left tail of the exponential where everything piles up on 0 so small errors probably don’t matter (think of say exponentially distributed energy of molecules… )

The thing is, there’s an unlimited number of transformations you might want to use… and if you need really detailed tail probabilities you should probably sample them in detail yourself. But I think people expect to be able to do things like g(rand()) for g some nonlinear function and expect that values in the range say [0,0.0001) are pretty densely represented. There’s no way for Float32 to have dense representation near 1.0 but there is a way to have dense representation near 0.0 and if that doesn’t reflect in the rand() call then it may well break moderately naive code in ways that it doesn’t have to be broken.

A person with knowledge of how floats work but not knowledge of the details of the rand algorithm would likely have reason to believe that very small floats are possible to get from rand(Float32). Or put another way, that a naive exponential sampler would still produce values out in the range bigger than 23 every so often and never produce Inf.

1 Like

To rephrase this: currently we have \mathrm{cdf}(p) = p for every representable p in the range [0.5, 1) and that is preserved by many versions of rand proposed but can be messed up by adding anything bigger than \mathrm{eps}(0.25) across the board.

3 Likes

Thought I had while at the library so I can’t test it

function rand32fallback()
   f = rand(Float32)
   if f < ldexp(1.0f0,-10)
       f = Float32(ldexp(rand(Float64),-10))
   end
   f
end

Basically, do the usual thing, if we get a value smaller than 1/1024 generate a 64 bit float and rescale it to the range [0, 1/1024)

It only costs extra 1/1024 of the times, and generates on a grid of size 2^-74 and rounds to the nearest Float32 so it should give resolution down into the 1e-23 range, without breaking any of the nice properties we’ve been discussing.

EDIT: home now, so here’s the timing

julia> @btime(rand32fallback())
  3.861 ns (0 allocations: 0 bytes)
0.6838771f0

julia> @btime(rand(Float32))
  2.860 ns (0 allocations: 0 bytes)
0.4580171f0

julia> 3.861/2.860
1.35

It’s also clear that the smallest nonzero float we would generate is 1/2^74 which is:

julia> Float32(ldexp(1.0,-74))
5.293956f-23

So getting an exact zero would take about 5000 hours on our notional million core machine.

We can make rand(Float64) work similarly:

julia> function rand64fallback()
       f = rand(Float64)
       if f < ldexp(1.0,-10)
          f = ldexp(rand(Float64),-10)
       end
       f
       end
rand64fallback (generic function with 1 method)

julia> @btime(rand64fallback())
  3.827 ns (0 allocations: 0 bytes)
0.41855979420533296

It costs the same more or less