Output distribution of rand(Float32) and rand(Float64), thread 2

All I’m trying to say is that if we can provide a mathematical proof, for example, that a certain process produces |cdf(x) - x| < eps(x) with eps(x) small enough everywhere, we should accept that as a uniform random generator. Basically I think that’s a good criterion for deciding that we have a good generator. I personally think a good eps for the whole space is probably about 2^-40 or better. I also think a few quirks on the order of 2^-30 would be acceptable, especially if they reduce the error in other places (for example, debiasing would be OK with me even if it produced some kink somewhere, provided it’s an understood and documented kink and doesn’t affect practical simulations, say KS tests after an hour on one core).

I’m just saying I’m planning to set up a Quarto document that can assess this question empirically as well, provided as a document other people can run to produce simulation evidence that the generator WORKS the way we say it does. This provides both assurance against bugs and assurance for people doing certain kinds of computations with singularities, which they can then verify without having to write the verification code themselves. I don’t recommend it as a continuous integration test.
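For a flavor of what such a document might contain, here’s a minimal sketch (illustrative only, not the actual Quarto document) that estimates the empirical CDF error of a candidate generator gen at a few probe points:

function cdf_error(gen, xs, n)
    # Monte Carlo estimate of |cdf(x) - x| at each probe point x in xs
    counts = zeros(Int, length(xs))
    for _ in 1:n
        r = gen()
        for (i, x) in enumerate(xs)
            r < x && (counts[i] += 1)
        end
    end
    return [abs(c/n - x) for (c, x) in zip(counts, xs)]
end

cdf_error(rand, [2.0^-20, 0.1, 0.5], 10^6)

Resolving deviations as small as 2^-40 this way takes far more draws than is practical at bulk values of x, which is why the interesting tests condition on the left tail, as discussed below.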

I think it’s an interesting question what happens if we propose algorithm A, which is fast, generates say on a grid of N=32, and has a sufficiently simple algorithmic description that it can be proven to have, say, 1.5*2^(-32) as an upper bound for the error in the cdf everywhere on [0,1); and then there’s another RNG, say B, which is even slightly faster for the rand! case, has no such easy manual verification procedure, but after a 5 hour computation on one core a KS test rejects A and doesn’t reject B as a uniform RNG in the left tail… It becomes a question at least worth documenting, and that’s what I’m doing: providing a way to document the tradeoffs.

To make it clear, the reason we should care about the left tail independently of the whole range is precisely because of the way floats work: almost all the floats in (0,1) are near 0; half of them are below ldexp(1.0f0,-64) or so.
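You can count this directly from the bit patterns, since positive floats are ordered the same way as their bits:

n_below(x::Float32) = reinterpret(UInt32, x) - 1  # representable Float32s in (0, x)

n_below(ldexp(1.0f0, -64)) / n_below(1.0f0)  # ≈ 0.496, i.e. about half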

The relevance of the smallest values diminishes as you go down: probabilities around 2^-110 are unobservable on computers in our universe, but 2^-40 is not: generating 2^40 samples only takes around 0.3 hours on a single core, so it’s a coffee break.

In any case, even if we choose a generator that fails some of these tests, it could be fine; but perhaps after all this discussion there will be a DetailedRandom.jl package you can use if you’re interested in increasingly detailed numbers, and having a testing document that can compare them could be valuable for those people.

I’m just not really sure what you’re advocating at this point. Up above we have examples that produce an exact finite or infinite precision uniform distribution (at least to the extent that rand(rng,UIntXX) is correct) of floating point numbers on [0,1). In their current (immature) state, the infinite versions take “only” 4x as long as the current 24/53-bit algorithm. Further, it appears we can make them perform within about 2x of any 32/64-bit algorithm we could write.

Are you suggesting to do one of the improved (finite or infinite) algorithms we’re proposing, but with sloppy rounding to improve speed at the expense of precision? There’s an argument to be made for it, but personally I’ll advocate for doing it right so long as it’s within reach to do so.

If you aren’t advocating for this, then I’m not understanding what you’re after. Please clarify.

If a statistical test rejects a correct algorithm, then either the underlying rand(UIntXX) is wrong (wow! we need to fix that and inform the authors of that RNG!) or we just so happened to fail the test by chance. Neither is the fault of the demonstrably correct code for the floating point conversion. Or are you advocating that we test our “improved” rand (whatever that is after this discussion) against some distribution other than the one it claims to produce (we should be more specific about what that is) and then be mad that it fails that test?

With considerable effort, I and others have shown that we can make this correct for good definitions of correct (up to and including exactly irrefutably correct by any measure on floating point numbers) within arguably-reasonable computational budgets. So I’m getting a little frustrated at continued suggestions that we could choose to do it “just a little wrong” instead (especially without clearly indicating the performance benefits).

By the way, you will not convince me that I should care about mistakes in the distribution that occur with probability 2^{-M} but not 2^{-N} for any finite exponents. The code should say what it does (we need to improve the documentation for rand) and do what it says, and we shouldn’t rely on the user having insufficient computational resources to prove it wrong. Proving it wrong is as simple as inspecting the source.

1 Like

I’m suggesting that there are speed vs precision tradeoffs to be made. If we’re not going to do bit-for-bit precision all the way down to nextfloat(0.0f0), then where are we going to set the cutoff for accuracy? Right now we have N=24 and I think everyone agrees that’s not good enough. Should we use N=32, N=40, N=48, N=56, N=64, etc.? Suppose we can get something that is “almost but not quite exactly N=56, but as fast as exactly N=32”. Which should we prefer? N=32 will have much bigger errors in the CDF than N=56, but suppose our algorithm doesn’t quite give us the N=56 grid; it gives us something with some kinks because of roundoff or whatever other situations come up. If we can’t quantify the size of the kinks in the distribution but we know they’re smaller than the spacing of an N=53 grid, shouldn’t we prefer it to N=32?

etc etc.

So anyway, it doesn’t matter; carry on, I will work on my thing. If someone comes up with a perfect bit-for-bit and fast generator then we can just forget the whole question. Obviously 0 error at each representable float is the best we can do. It’s only when we are accepting some error at some floats that we have a question of “which ones and how much?”

I really thought I did that above, but of course you’re welcome to say that performance cost is still too high.

I showed that infinite bits is doable within 2x the time of (most?) any Float32/Float64 algorithm that can sample from 32/64 to 32/64 bits. I also showed that 32/64 bits was doable (correctly) within 2x the speed of 24/53 bits. Combined, this puts a correct infinite-bits algorithm with existing code (that should be SIMD-able!) within 4x the time of the current “inadequate” algorithms. Ironically, of those in this thread I seem to be among the most okay with the existing algorithm.

These performance numbers have been demonstrated with concrete implementations. Get a wizard on the phone and those factors might be improved!

If you can come up with a 32-bit Float32 algorithm that runs in less than 2x that of the current 24-bit algorithm, it could be used on its own or to accelerate the infinite-bits algorithm from above.

The easiest “over-wide” finite case to advocate is probably 53-bit Float32 using Float64 intrinsics. I don’t love this because it requires Float64 hardware when only Float32s are requested. It also does not suggest an implementation to improve Float64.

That possibility aside, if you can demonstrate an N>32 bit Float32 algorithm (I’ll suggest you target N=64) that runs in much less than 2x that of a 32-bit algorithm and doesn’t use 64-bit math (or at least not Float64), there would be a case to argue for a 32<N_\text{bits}<\infty algorithm, and similarly for a wider Float64. It would still be hard to close the deal if it couldn’t SIMD, however.

3 Likes

Yep, it’s all about these factors of 2x or 4x speed or whatever. For now. I’ll just keep quiet and hope that people looking at it all come up with the perfect algorithm that costs like 1.3x our 24 bit algorithm and we can all just close the book.

But, again, the Float64 implementation is fine for all practical purposes. It is not “ideally dense” around 0, but it is dense as heck.

I still think that this topic is about improving a relatively unimportant use case that has much cheaper fixes, and that before we invest effort into this, it would be great to see an actual use case.

For all the examples (but one) that were mentioned above (extreme values, tail of Pareto, etc), there are trivial fixes that also improve runtime by orders of magnitude; in most cases, if you want the tail, sample from the tail. (The stat mech example above involves too much domain-specific knowledge for me to assess it, but I would bet that it can be improved, too).

One cool thing about the inverted floating point exponent algorithm is that it seamlessly scales to finer resolutions with wider inputs:

function bits2float(r::U, ::Type{Float32}) where {U <: Union{UInt32, UInt64, UInt128}}
    last_bit = r & -r                     # isolate the least significant set bit
    # its position k yields the exponent 2^(-1-k), from 2^-1 down to 2^-N
    exponent = UInt32(253)<<23 - reinterpret(UInt32, Float32(last_bit))
    exponent *= !iszero(r)                # map r == 0 to 0.0f0
    # the bits left of last_bit (the top 23 of them) become the fraction field
    fraction = ((r ⊻ last_bit)>>(8*sizeof(U) - 23)) % UInt32
    return reinterpret(Float32, exponent | fraction)
end

64 bits are indeed “free” for the scalar case. And at the cost of a branch that’s taken 0.00000000004% of the time, you could choose to iterate after using a UInt64 and get “perfect” precision. So I’d call that free in terms of performance — but now you have to do tricky things and ensuring you’ve gotten it right gets harder. Personally, I say 64 bits makes sense and is good enough.

SIMD is much more annoying, because there it’s hugely beneficial to use UInt32s in the common case. And I do think it’s a requirement to match the fundamentals between the two. But now we’re talking about a 0.2% chance per scalar, and annoyingly if you continue to iterate with UInt32s you’re gaining 23 bits at a time… which doesn’t add up right. (Edit: oh I was thinking of very simplistically re-sampling in the entire interval [0, 2^-9) by generating another float and scaling by 2^-9. But that’s very naive! We just need to add resolution to the particular [i*2^-32, (i+1)*2^-32) bin we already found! I think that’s a much simpler problem — and should be easier to get right. In every bin other than 0, that’s just the mantissa bits or less. )

Anyhow, doing a naive thing and just giving up and just falling back to the UInt64 algorithm isn’t too terrible for performance but it is subtly wrong.

Here’s my prototype: GitHub - mbauman/ShawshankRandemption.jl: Prototype for changes to stdlib/Random.jl. It implements 64-bit versions for scalar rand for both Float64 and Float32, exact Float16, and 64-bit SIMD’d rand! Float64 and Float32. It’s missing the non-SIMD rand! path, and there are still some missing optimizations.

Performance varies by arch. It’d be great to get a benchmark suite in.

julia> import Random, ShawshankRandemption

julia> using BenchmarkTools

julia> A = Array{Float32}(undef, 2^24);

julia> @benchmark Random.rand!($A)
BenchmarkTools.Trial: 940 samples with 1 evaluation.
 Range (min … max):  5.242 ms …  5.525 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.322 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.321 ms ± 23.727 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                             ▁▃ ▁ ▁▇█▆█▃▅  ▃▁
  ▂▂▂▂▂▃▁▃▂▂▂▃▂▂▃▃▂▃▃▄▄▃▄▅▅▅▆██████████████████▇▅▃▃▃▃▃▃▃▃▂▁▂ ▄
  5.24 ms        Histogram: frequency by time        5.37 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark ShawshankRandemption.rand!($A)
BenchmarkTools.Trial: 515 samples with 1 evaluation.
 Range (min … max):  9.372 ms …  10.023 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.760 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.704 ms ± 131.465 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                              ▂▁▆▇██▅▆▂
  ▂▁▂▄▃▆▆▄▆▆▃▄▃▅▄▄▄▂▄▂▃▂▃▃▃▃▃▁█▄▃▄▃▄▅▄▅▄▄▃▅▇▇▇█████████▆█▄▄▆▃ ▄
  9.37 ms         Histogram: frequency by time        9.87 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
2 Likes

Float64 is pretty darn good and there’s a reason it’s become the standard. My understanding though is that GPU tends to prefer Float32 (Float64 either being unsupported or supported at like 20x slower speeds, at least on affordable cards) and of course Float64 takes twice the memory bandwidth to transfer.

24 bit rand(Float32) fails simple KS tests of uniformity in the left tail within seconds. To me that makes it unsuitable as a general purpose random generator. 32 bit versions pass those tests even after a minute or so. I’ll be running more stress tests to see how long it takes to fail a 32 bit generator, but it takes something like 10 minutes to produce a trillion floats, so I should be able to do a 10000 point KS test of values smaller than 1e-8 in about that kind of time.
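For concreteness, a left-tail test along these lines could look like the sketch below (a hypothetical helper; the conditional distribution of a true uniform below a threshold t, rescaled by 1/t, is again uniform on [0,1)):

using HypothesisTests, Distributions  # ExactOneSampleKSTest, Uniform

function left_tail_ks(gen, t, n)
    xs = Float64[]
    while length(xs) < n
        x = gen()
        x < t && push!(xs, x / t)  # keep and rescale tail samples
    end
    return ExactOneSampleKSTest(xs, Uniform(0, 1))
end

left_tail_ks(() -> rand(Float32), 1f-4, 10_000)  # smaller t needs ~1/t more draws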

If it fails a stochastic statistical test in a reasonable time by a big margin (p = 1e-6 or less, for example) then I don’t think it’s suitable; after all, doing Monte Carlo is the point of random numbers.

Are there better methods to code up algorithms when you know you’ll be sampling the tail? Yeah. Importance sampling, basically. Generate a uniform [0,1), scale it down by 1e-9, and then account for the fact that it’s a billion times less likely than your sampling frequency. Still, we would like floats to have a distribution that you can’t prove beyond a doubt is broken in less time than it takes to sip your coffee.
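In sketch form (illustrative names, not a proposed API), that looks like:

# E[f(U) * (U < t)] == t * E[f(t*V)] for U, V uniform on [0,1), so sample the
# tail [0, t) directly and carry the weight t explicitly:
const t = 1.0e-9
tail_sample() = t * rand()                               # uniform on [0, t)
tail_estimate(f, n) = t * sum(f(tail_sample()) for _ in 1:n) / n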

Fully agree that if something fails a test like this then that is bad. But of course no one here is defending the current sampling for Float32; we will change it for one of these improved versions.

1 Like

Glad we are in agreement on that. As a piece of interesting information, I tested a 32 bit float generator (ldexp(Float32(rand(UInt32)), -32), with rejection of 1.0f0) against a convolution generator (generate two of the above, a and b, then compute a + ldexp(b,-20), shift and rescale, and reject results outside the interval)
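Reconstructed as code, this is my reading of the two generators (the exact shift/rescale constants in the second are my guesses, not from the post):

function gen32()  # UInt32 scaled into [0,1), rejecting the 1.0f0 that rounding can produce
    while true
        x = ldexp(Float32(rand(UInt32)), -32)
        x < 1.0f0 && return x
    end
end

function genconv()  # convolution: b contributes bits below a's resolution
    while true
        a, b = gen32(), gen32()
        x = a + ldexp(b, -20) - ldexp(0.5f0, -20)  # shift by the perturbation's mean
        0.0f0 <= x < 1.0f0 && return x             # reject outside [0, 1)
    end
end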

I tested by rejection sampling for 3000 samples less than \pi \times 2^{-29} \approx 5.85\times 10^{-9}

The first one fails at p = .00015 in about 31 minutes on one normal desktop core

The second one has p = 0.06 at about 42 minutes. Effective bit depth is not really clear, but the smallest possible nonzero value is something like 2^{-52}. Uniformity is clearly better than 32 bit.

A perfect 64 bit resolution generator is likely enough for everyone but the sort of person who should know better.

Just some useful data points, not meant to advocate for anything in particular except maybe better than 32 bit resolution.


┌ Warning: This test is inaccurate with ties
└ @ HypothesisTests ~/.julia/packages/HypothesisTests/r322N/src/kolmogorov_smirnov.jl:68
Exact one sample Kolmogorov-Smirnov test
----------------------------------------
Population details:
    parameter of interest:   Supremum of CDF differences
    value under h_0:         0.0
    point estimate:          0.0396667

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.0002

Details:
    number of observations:   3000

Maximum deviation from theoretical CDF was about 2.32e-10 or about 3.97 percent of the maximum value on this interval
the p value for that size deviation with 3000 samples was p = 0.00015
1868.850455 seconds (27.78 k allocations: 1.929 MiB, 0.00% compilation time)



┌ Warning: This test is inaccurate with ties
└ @ HypothesisTests ~/.julia/packages/HypothesisTests/r322N/src/kolmogorov_smirnov.jl:68
Exact one sample Kolmogorov-Smirnov test
----------------------------------------
Population details:
    parameter of interest:   Supremum of CDF differences
    value under h_0:         0.0
    point estimate:          0.0240976

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.0604

Details:
    number of observations:   3000

Maximum deviation from theoretical CDF was about 9.63e-12 or about 0.16 percent of the maximum value on this interval
the p value for that size deviation with 3000 samples was p = 0.06
2496.429806 seconds (578 allocations: 6.036 MiB)

Here’s my latest entry in the Float64 version:

function randf(u::Integer = rand(UInt64)); u = UInt64(u)
    h = Float64(u >>> 11 << 1)
    l = Float64((u % Int32) << 21 >> 20 + 1)
    muladd(h, exp2(11), l) * exp2(-65)
end

As you can see, it’s very simple. On x86 it benchmarks faster than the current rand() and on aarch64 it is 5% slower. I haven’t benchmarked a vectorized version but it only uses simple instructions for which vectorized versions exist. It has cdf(x) == x for all x in the region where Float64 values are densely sampled. In the region near zero where it doesn’t densely sample values, the sampled values are in the middle of each cdf region instead of the left, so |cdf(x) - x| is minimized and zero is never sampled. I.e. here’s the behavior at the low end:

julia> u0 = typemin(UInt64)
0x0000000000000000

julia> randf(u0)
2.710505431213761e-20

julia> randf(u0+1)
8.131516293641283e-20

julia> randf(u0+2)
1.3552527156068805e-19

julia> randf(u0+3)
1.8973538018496328e-19

julia> randf(u0+4)
2.439454888092385e-19

julia> (randf(u0) + randf(u0+1))/2 == exp2(-64)
true

julia> (randf(u0+1) + randf(u0+2))/2 == exp2(-63)
true

julia> (randf(u0+3) + randf(u0+4))/2 == exp2(-62)
true

This has the following behavior in terms of CDF error:

julia> function cdf(x::Float64)
           c = 0
           for u = typemin(UInt64):typemax(UInt64)
               randf(u) < x || break
               c += 1
           end
           return c*exp2(-64)
       end

julia> [k => cdf(x) - x for k = 0:24 for x = k*exp2(-67)]
25-element Vector{Pair{Int64, Float64}}:
  0 => 0.0
  1 => -6.776263578034403e-21
  2 => -1.3552527156068805e-20
  3 => -2.0328790734103208e-20
  4 => -2.710505431213761e-20
  5 => 2.0328790734103208e-20
  6 => 1.3552527156068805e-20
  7 => 6.776263578034403e-21
  8 => 0.0
  9 => -6.776263578034403e-21
 10 => -1.3552527156068805e-20
 11 => -2.0328790734103208e-20
 12 => -2.710505431213761e-20
 13 => 2.0328790734103208e-20
 14 => 1.3552527156068805e-20
 15 => 6.776263578034403e-21
 16 => 0.0
 17 => -6.776263578034403e-21
 18 => -1.3552527156068805e-20
 19 => -2.0328790734103208e-20
 20 => -2.710505431213761e-20
 21 => 2.0328790734103208e-20
 22 => 1.3552527156068805e-20
 23 => 6.776263578034403e-21
 24 => 0.0

At the high end we have the usual exact CDF behavior:

julia> U = typemax(UInt64)
0xffffffffffffffff

julia> randf(U)
0.9999999999999999

julia> randf((U - 0) << 11)
0.9999999999999999

julia> randf((U - 0) << 11 - 1)
0.9999999999999998

julia> randf((U - 1) << 11)
0.9999999999999998

julia> randf((U - 1) << 11 - 1)
0.9999999999999997

julia> randf((U - 2) << 11)
0.9999999999999997

julia> randf((U - 2) << 11 - 1)
0.9999999999999996

This seems like a pretty good candidate due to simplicity, performance and behavior. It minimizes CDF error and avoids zero without compromising distribution elsewhere. I haven’t worked out a 32-bit version of this, but that could either use the same technique with Float32 or go through Float64.

For Float16, I think we should generate all possible values including zero. This seems like a reversal from my previous position, and it is, but I still don’t buy the reasons given before in the thread. The general principle that makes sense to me is that for the region where you sample a floating-point type densely, you should have cdf(x) == x for all values in that region. What makes Float16 qualitatively different from Float32 and Float64 is that with 64 bits of randomness, one can sample the entire float domain everywhere, including the smallest values. So for Float16 we need to generate zero, because eps(zero(Float16)) == Float16(6.0e-8) is a value we want to sample, and if we’re going to have cdf(Float16(6.0e-8)) == P(rand(Float16) < Float16(6.0e-8)) == Float16(6.0e-8) then we need to generate zero with a probability of 1/2^24. The same logic doesn’t apply to Float32 and Float64 if you’re not sampling them densely, and if you want to minimize CDF error across all representable floats rather than just the generated ones, then you want to avoid zero.
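For illustration, here is one way to get that dense, round-down behavior for Float16 from 24 random bits (a sketch of mine, not code from this thread; rand_dense_f16 is a hypothetical name):

function rand_dense_f16(u::UInt32 = rand(UInt32) >>> 8)
    # u/2^24 is exact in Float32 since u has at most 24 bits
    x = Float32(u & 0x00ffffff) * Float32(exp2(-24))
    h = Float16(x)                            # converts with round-to-nearest
    return Float32(h) > x ? prevfloat(h) : h  # correct nearest to round-down
end

Every Float16 in [0,1) is a multiple of 2^-24, so this gives cdf(x) == x at each of them, and zero occurs exactly when u == 0, i.e. with probability 1/2^24.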

6 Likes

This seems like a pretty good candidate due to simplicity, performance and behavior. It minimizes CDF error and avoids zero without compromising distribution elsewhere. I haven’t worked out a 32-bit version of this, but that could either use the same technique with Float32 or go through Float64.

I am utterly and completely unconvinced that this general goal (avoiding 0 and attempting to “minimize the CDF error”) is a good idea. My gut says it’s a pretty terrible and fundamentally broken idea, but I’m trying to remain open-minded.

Regardless, I’m not seeing how this particular implementation is compelling.

How am I meant to see this? Is it meant to be monotonic? It’s not:

julia> u = 3025000233158292479
3025000233158292479

julia> randf(u) > randf(u+1)
true

How am I to think about the possible outputs and their respective probabilities?

And, really, neither the performance nor behaviors of Float64 were the challenge here (nor, honestly, is its status quo 53-bit behavior meaningfully “more problematic” than a 64-bit version, enough to make this worth a headache). The hard part is batch Float32s without a 2x+ performance regression, and then having a meaningful definition that unites the varying floating point behaviors.


That’s what I have defined and hooked up and working at mbauman/ShawshankRandemption.jl. No extra performance cost on AVX512; ~50% performance regression on ARM.

It’s just one definition* that works for all unsigned bit integers and all IEEE floats. At its core, it’s a fundamentally simple idea: you take a bag of random bits and restructure them so that they represent a uniformly distributed floating point value in the interval [0, 1). For example, given a value with the structure:

0bXXXXX100
       ^^^
        ┗ The location of the least significant set bit determines the exponent;
          In this example, it's the third bit; this represents 2^-3. This is an
          exponentially-distributed value that will have possible values from
          2^-1 to 2^-N for an `N`-bit unsigned type.
  ^^^^^
    ┗ The bits to the left of the least significant bit define the fraction.
      These are uniformly distributed bits, but the _number_ of available
      bits to use in the fraction may exponentially decrease as the exponent
      becomes more negative.

In this example, we’d end up with a floating point value of 0b1.XXXXX * 2^-3 — regardless of the floating point type!

demonstration
julia> reinterpret(Float16, ShawshankRandemption.bits2float(0b10011100, Float16))
Float16(0.1992)

julia> reinterpret(Float32, ShawshankRandemption.bits2float(0b10011100, Float32))
0.19921875f0

julia> reinterpret(Float64, ShawshankRandemption.bits2float(0b10011100, Float64))
0.19921875

julia> (1 + (0b10011/0b100000)) * 2^-3
0.19921875

This is way more compelling to me for multiple reasons:

  • It’s a single algorithm that has exhaustively proven-correct behaviors at 32-bits and below.
  • It’s a single definition for all types that can be described in a single sentence:

    It returns values in [0, 1) as though a random variable were drawn from a theoretically ideal uniform distribution and then rounded down to the previous multiple of 2^-64 (or to the previous floating point number).

  • This is something that can be programmed against: for Float64, this means that an x \in [2^{-11},1) gets chosen with probability eps(x) and represents drawing a random variable in [x, \texttt{nextfloat}(x)). And an even x \in [2^{-12}, 2^{-11}) represents drawing a random variable in [x, \texttt{nextfloat}(x, 2)). And you can add precision by tossing one more coin to determine whether it was in the x bin or the nextfloat(x) one. So this is where we could fundamentally get an implementation with “bit-perfect” semantics at the cost of some highly improbable branches, but making sure we don’t over-saturate at zero is just troublesome enough that I don’t find it worthwhile.

Here’s where the bit-twiddling lives.

8 Likes

Ahaha, never mind. I just realized that this produces negative values around u = 2^11-1 :grimacing:. It can be fixed by adding exp2(-53) at the end but that’s more work. Might be redeemable but I definitely posted prematurely.

3 Likes

It occurs to me that “avoiding zero and minimizing the CDF error” is isomorphic to the following with our existing rand implementation:

function better_random(::Type{T}) where {T}
    x = rand(T)
    return x < 0.5 ? x + eps(T)/4 : x
end

In short, you’re saying that rand rounds the real random variable down when you’re in the “dense” region and rounds to nearest when you’re below it. So you could do an arbitrarily “better” N-bit version, too: just add 2^(-N-1) once you drop below 2^(significand_bits(T)-N). It’d probably be ~free to add to my implementations (and might actually speed things up!).

But why?? Why is this a defensible thing? I mean, maybe it makes the result slightly more random-variable like when applied through transforms, but it also makes it harder to know how to interpret, no? It kinda feels like the compromise design-by-committee flavor of issue #33222.

3 Likes

It’s really not a different algorithm mathematically. @StefanKarpinski is basically doing a minor modification to “take a uniform number on 64 bit integers and divide it by 2^64”: the number of zero bits on the left of the integer determines the exponent, and the bits to the right of the leading one determine the mantissa. Conversion to float will shift the bits left until the first nonzero bit is shifted out; this is how the exponent gets determined, and it’s the same process as your bit twiddling, except presumably there are hardware instructions to do it directly?

The thing that’s different is the effect of rounding mode and that he’s twiddling the low order bits to bias the result upwards since otherwise it’s biased low.

ldexp(Float64(u),-64) is basically bit twiddling to determine the exponent from the leftmost string of zeros and the mantissa from the bits that follow.

ldexp(Float64(u),-64) + ldexp(1.0,-65) does the biasing if we want.

I find division conceptually more understandable than bit twiddling, and I’m kind of shocked if it doesn’t work out to be just as fast; shouldn’t there be hardware instructions to convert a UInt64 to a float?
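For what it’s worth, the three spellings agree exactly, since scaling by a power of two is exact:

julia> u = rand(UInt64);

julia> ldexp(Float64(u), -64) == Float64(u) * exp2(-64) == Float64(u) / exp2(64)
true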

1 Like

I would love to just use multiplication of a UInt64 by exp2(-64), but processors are unfortunately terrible at rounding modes. It’s gotta round down to get the properties we all want (well, the properties I want everywhere and that we seem to have consensus on for values greater than 2^-11).

2 Likes

I don’t have an intuition for how rounding modes work against us… so I decided to just see what the ECDF looks like if you collect integers and multiply by exp2(-64)

I started down below the range where roundoff matters, and … sure enough it doesn’t matter… click for details

Some graphs where roundoff has no effect

# assumes a session with `using StatsBase` (for ecdf) and a plotting package
# with an ECDF recipe, e.g. Plots plus StatsPlots
plot(ecdf(collect(0:100)*exp2(-64))) |> display
plot(ecdf(collect(1:101)*exp2(-64))) |> display

plot(ecdf(collect(0:1024)*exp2(-64))) |> display
plot(ecdf(collect(1:1025)*exp2(-64))) |> display

So, I guess that shouldn’t be surprising: the above are exactly representable, so there is no rounding… rounding first occurs when we get to numbers that have more than 52 bits to the right of the first 1, so numbers like 2^53+1?

Here’s that range:

plot(ecdf(collect((2^53-50):(2^53+50))*exp2(-64))) |> display
plot!(x->(x-(2^53-50)*exp2(-64))/(100*exp2(-64)))

It’s kind of funny because we’re at the limit of floating point to represent the line so the “line” is itself wavy.

But basically, the result of rounding is that we are splitting the difference around the ideal cdf… sometimes we’re low and sometimes we’re high; the deviation is always about 2^-52 times the size of the number we’re concerned with… so for numbers like 2^-11, as you mention, the errors are 2^-63 or so; for numbers like 0.5 they’re 2^-53 ≈ 1.1e-16

I don’t find the concerns compelling for 64 bit… “the CDF is off from cdf(x) = x by 1e-16” is not the kind of thing that damages the actual uses that rngs are put to.

Now for 32 bit the story is a little different… it’s still really hard to detect any deviations for ldexp(Float32(u),-32) using very large Monte Carlo simulations and a KS test, and the ones I was able to detect were for thresholds below ldexp(Float32(pi),-29), where obviously it’s a noticeable set of discrete steps.

We could say “why not use the bit twiddling version which has no roundoff error” and I’d say yeah sure, of course. But if u * exp2(-32) is faster and more SIMD-able we have a tradeoff to discuss, and I don’t find the small notches caused by roundoff error to be compelling problems for a random number generator. The RNG should be fast and uniform to a reasonable approximation, and I think deviations caused by roundoff are within acceptable limits if they get us speed.

The algorithms u*exp2(-32) and u*exp2(-64) are both mathematically compelling and suffer precisely controlled errors in the cdf caused by roundoff… so, well understood.

So it comes down to speed tests and particularly in the SIMD case I think.

So the “trick” in my approach is shifting things around to adjust the rounding behavior. In its most straightforward form:

function randf(u::Integer = rand(UInt64)); u = UInt64(u)
    hi = Float64(u >>> 32) * exp2(32)
    lo = Float64(u % UInt32)
    x = exp2(-1) - exp2(10) + lo
    x += hi + exp2(10)
    x * exp2(-64)
end

Or here’s the version that behaves like your version:

function randf(u::Integer = rand(UInt64)); u = UInt64(u)
    hi = Float64(u >>> 32) * exp2(32)
    lo = Float64(u % UInt32)
    sh = exp2(-1) - exp2(10)
    x = sh + lo
    x += hi - sh
    x * exp2(-64)
end

Subtracting exp2(10) and later adding it back has the effect of forcing rounding down in each 1/2^53 “bucket”; adding exp2(-1) has the effect of breaking ties so that the “ties to even” rule never kicks in. Not subtracting exp2(-1) at the end avoids zero and balances the cdf errors.

The problem with this approach is that it’s too many floating point operations, which are expensive and don’t follow mathematical identities that can be optimized. So the manual optimization problem becomes: how much of this can we “precompute” with cheap integer operations before we convert to the floating point domain? My integer version essentially computes 2u - 2^11 + 1, separated into a 64-bit integer for the high 53 bits and a 32-bit integer for the low bits, converts those to float, should have added 2^11 back, and then divides by 2^65. I’m not really committed to this approach, but I do think it’s worth exploring different ways to do this. You never know what trick is going to end up being the most efficient.

1 Like

Yes, the properties of round-to-even multiplication are very well understood. This will:

  • Generate 1.0s with a probability of approximately eps(T)/4.
  • Over-represent even values. For Float64, this happens in the range [2^{-11}, 1), with the value 2^-11 being three times more likely than nextfloat(2^-11). Same for Float32s in [2^{-40}, 1) and at 2^-40.
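The first point is easy to see at the REPL: typemax(UInt64) rounds up to 2.0^64 during conversion, so the product lands exactly on 1.0:

julia> Float64(typemax(UInt64))  # 2^64 - 1 rounds up to 2.0^64
1.8446744073709552e19

julia> Float64(typemax(UInt64)) * exp2(-64)
1.0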
3 Likes

To be clear, I’m still ambivalent between this and just doing rand(0:2^64-1)/2^64 rounded down. The most compelling reason to me is the following.

If we emit zero with any positive probability, e.g. 2^{-64}, then even though the absolute cdf error is fixed, the relative cdf error goes to infinity as we approach zero, i.e.

\lim_{x \to 0^+} \frac{|P(\mathrm{rand}() < x) - x|}{x} = \lim_{x \to 0^+} \frac{|2^{-64} - x|}{x} = \infty.

Another way to look at this is P(\mathrm{rand}() < x)/x, which gives the RNG’s “oversampling factor” for the value x, and this also blows up (a bit faster). This is a special problem near zero because there are so many representable floats there and we (probably) aren’t going to generate all of them. To be concrete, if we take x = floatmin(), the smallest normal (as in not subnormal) positive float, then we have a relative error of ~2.4e288, which is huge.
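That figure is just the ratio of the zero probability to floatmin():

exp2(-64) / floatmin(Float64)  # ≈ 2.4e288, the relative cdf error at x = floatmin()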

On the other hand, if we don’t emit zero, then we have this:

\lim_{x \to 0^+} \frac{|P(\mathrm{rand}() < x) - x|}{x} = \lim_{x \to 0^+} \frac{|0 - x|}{x} = 1.

This maximum relative error of 100% is attained for any value of x below the smallest value that rand() can generate.

This relative error argument is the heart of what @foobar_lv2 is getting at with the randexp example: you can oversample in the tail by many orders of magnitude because of this limit blowing up. If you omit zero, you don’t get the same blow up; you undersample, but only by at most 100%.

Note that this argument doesn’t hold if we generate values densely, since the limit goes to zero instead, which helps explain why Float16 is qualitatively different: we almost certainly want to sample densely for Float16, since there are so few values and even the smallest subnormal value is pretty large.

A weaker argument but still a small benefit is that by letting the cdf go both above and below the ideal value in the sparsely sampled region, you can cut the maximum and average absolute cdf error in half. That’s not nearly as big a deal, but it’s not nothing. I agree that this by itself might not be worth the additional complexity, but I find the relative error argument quite compelling.

3 Likes