Output distribution of rand(Float32) and rand(Float64), thread 2

Let me get this straight… the argument is that we should exclude zero from rand() for types like Float64 and Float32, where the probability of getting zero can be made small enough that, even though excluding it technically makes all the probabilities incorrect, it doesn’t matter in practice (unless you draw billions of numbers, but who does that?). But then when it comes to a type like Float16, where that argument doesn’t work because the type is too coarse, you want to just give up and force users to define `rand` themselves? Or include zero to avoid biasing things unacceptably much. I’m finding this argument less than persuasive.

In fact, it seems to me that on the contrary, we should use Float16 and Float8 (which is even a thing people are starting to use thanks to neural nets) as reductio ad absurdum stress test cases for any proposal to see if it is a good idea. If it falls apart for Float16, then we shouldn’t do it for any float type. Consider that efficient generic programming is one of the primary showcases of the language. If rand() doesn’t return zero for some float types but does for others—or worse, isn’t even defined—then how can we expect anyone to write generic code that actually works? If you think it’s bad that people might write code for Float64 that blows up if rand() returns a zero, isn’t it worse if that code is generic (as it often is) and will only possibly hit that case for a niche type like Float16? I’d prefer that people be required to explicitly check for zero for all types. In fact, this suggests that a good way to test code for corner cases like this is to try the code on Float16!
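To make that concrete, here’s the sort of stress test I have in mind (just a sketch; the helper name is made up): see how many draws it takes before rand(T) returns exactly zero.

function draws_until_zero(::Type{T}; limit = 10^8) where {T<:AbstractFloat}
    # index of the first exact zero, or nothing if none shows up within `limit` draws
    for n in 1:limit
        rand(T) == zero(T) && return n
    end
    return nothing
end

draws_until_zero(Float16)   # with Float16's coarse output, a zero shows up well within the limit
draws_until_zero(Float64)   # essentially always nothing: the per-draw chance is about 2^-53

Generic code that mishandles an exact zero will show it almost immediately at Float16, while at Float64 you could wait a very long time.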

12 Likes

That is correct in a very narrow and tautological sense.

I have also argued that handing this distribution out is like giving a loaded footgun to a group of toddlers, as evidenced by the fact that randexp() returns Inf too often by 2^1000 orders of magnitude.

It really isn’t. You can have all of Float16, Float32 and Float64 share a PDF, binned differently depending on the precision the type can represent. In the limit, that must be the exact same as a uniform distribution over the reals in [0, 1), and no scheme you’ve proposed so far achieves that, due to excluding 0.0 and shifting the distribution of everything else upwards.

There have been multiple attempts to communicate that further up in the thread, but if you don’t like that, there’s not much we can say or do, it seems :person_shrugging:

2 Likes

No, we don’t need to make other probabilities incorrect. We just need to choose some small positive number that absorbs the probability mass of all the smaller (excluded) numbers.

(unless you draw billions of numbers, but who does that?).

Unless you draw quintillions of numbers. For billions of numbers, this is a non-issue.

That’s why I’m so unsure about Float16. Because everybody draws billions of numbers.

from @edit(randexp())

### fallback randexp for float types defining rand:
randexp(rng::AbstractRNG, ::Type{T}) where {T<:AbstractFloat} =
    -log1p(-rand(rng, T))

I guess since we never generate Float16(1.0) this never generates Inf. On the other hand, the largest uniform value it can generate is 0.9995, so the smallest nonzero Exponential value it can generate is -log(0.9995), which is 0.0004883.

Correctly rounded, a random exponential should produce 0.0 with probability on the same order as 0.0004883, so we’ve got a bias bigger than anything I’ve suggested, by exp(1e38) times or something.

But for Float32 or Float64 the proposal of just rejecting exact 0.0 makes the probabilities incorrect by an unobservable amount. You could globally generate random numbers with every computer in the universe for the duration of the universe so far and not detect it.

Float16 not so much.
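To put rough numbers on that contrast (my own back-of-the-envelope, assuming the chance of an exact zero per draw equals one grid step of the output):

# expected number of draws between exact zeros
2.0^53    # Float64 at 53-bit granularity: about 9e15 draws per zero
2.0^11    # Float16 at the coarse granularity above (grid step 0.0004883): about 2048 draws per zero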

1 Like

The two of you seem unwilling to engage with the interval perspective Stefan and I have been talking about. I do see where you’re coming from — but your arguments only make sense if rand() could actually generate uniform numbers as though they were drawn from a theoretically ideal Set((0:ε:1-ε) .+ ε/2) where ε = eps(T(0)). But that’s axiomatically impossible, because you’d need to return numbers that lie between the representable numbers!

So then the only perspective that remains is to identify all the numbers between two floating point numbers by either the left side or the right side. The ε here could actually be bigger than eps(T(0)), but that’s part of the beauty — it’s a fully-defined meaning all the way down to a one-bit datatype! And it also describes the status quo.

Do you see this perspective?

2 Likes

What? This is diametrically opposed to our viewpoint.

Can you maybe give the exact cumulative distribution function cdf(p) = probability(rand() < p) that you believe is appropriate?

You may of course equivalently talk in terms of cdf_eq(p) = probability(rand() <= p), which is conveniently given by cdf_eq(p) = cdf(nextfloat(p)), and conversely cdf(p) = cdf_eq(prevfloat(p)).

These two functions are defined on all representable non-NaN floating point numbers, i.e. on all bitpatterns with isnan(p) == false (by construction cdf_eq(-0.0) == cdf_eq(0.0)).

We must have cdf(p) == 0 for p <= 0, and cdf(p) == 1 for p >= 1, since we promise that 0 <= rand() < 1. In cdf_eq terms, cdf_eq(p) == 0 for p < 0 and cdf_eq(p) == 1 for p >= prevfloat(1).

PS. What I called the “ideal distribution” is given by cdf(p) = p for 0 <= p <= 1.

What I called the “clamped ideal” distribution is given by cdf(p) = p for MINNORM < p <= 1 and cdf(p) = 0 for 0 <= p <= MINNORM, where MINNORM is the smallest non-subnormal positive float. MINNORM < ldexp(1.0, -120) for the relevant (non-Float16) types.
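To make those two definitions concrete, here is a minimal sketch of them as code (for Float64, with MINNORM = floatmin(Float64); purely illustrative):

# "ideal": cdf(p) = p, clamped to [0, 1]
ideal_cdf(p::Float64) = clamp(p, 0.0, 1.0)

# "clamped ideal": identical above MINNORM, but nothing below MINNORM is ever produced;
# MINNORM itself absorbs that probability mass
function clamped_ideal_cdf(p::Float64)
    p <= floatmin(Float64) && return 0.0
    return clamp(p, 0.0, 1.0)
end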

Participants in this thread may find this blog interesting: http://marc-b-reynolds.github.io/

He seems to have put quite a lot of thought into floating-point concerns for RNGs.

2 Likes

How? You seem to be pointing at the interval that rounds to zero and using that to formulate how frequently zeros should appear in the output.

But then how do you formulate how frequently prevfloat(1.0) appears?

Comparing programmers to toddlers isn’t really helpful and you definitely don’t need to talk about them having guns. You can make the “it’s easy to accidentally misuse” argument without such disturbing imagery.

The randexp argument is kind of unraveling imo. Initially the argument was that returning zero is uniquely dangerous with this code as an example of why—because it causes the output to be Inf. But it turns out that it’s not actually incorrect for randexp to return Inf, it just returns it too often. Which is no longer an argument for zero being uniquely dangerous; instead it’s an argument that zero is returned too often. Moreover, zero is no longer unique in this argument—you can make the same argument that randexp returns 43.7407708592482 (ziggurat_exp_r - log(rand(rng)) when rand returns 2.220446049250313e-16) way too often, which is just as true. So what is this argument really showing? Two things:

  1. That it would help to return exactly zero far less often, e.g. by the various approaches that we’ve been discussing here to return more fine grained random values in [0, 1).
  2. That a different algorithm which doesn’t rely on such fine-grained precision for rand() would be better for all unlikely values.

So this has decayed from an argument about the unique danger of rand() ever returning zero to a much weaker argument that it should be more fine-grained at the lower end of the unit interval, and that if you’re writing code that generates random distributions based on the unit interval you have to be aware of the granularity of the output.

8 Likes

Well, let’s compute that! The probability of hitting exactly p is given by cdf_eq(p) - cdf(p) = nextfloat(p) - p. Plugging that in gives 1.0 - prevfloat(1.0) == eps(0.5).

How often does 0.3 / (2^10) appear? eps(0.3 / (2^10)). How often does unclamped zero appear? nextfloat(0.0) == 5.0e-324.
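Those point probabilities are easy to sanity-check at the REPL (plain Float64 arithmetic, nothing specific to any proposal):

julia> 1.0 - prevfloat(1.0) == eps(0.5)
true

julia> eps(0.3 / (2^10))
5.421010862427522e-20

julia> nextfloat(0.0)
5.0e-324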

2 Likes

Ok, awesome, so we agree on each floating point value representing the left side of the bin of all numbers between it and the next one greater. I had thought you were looking at rounding, which is where that +ε/2 factor came from. So now your universe is akin to drawing from rand(0:2^1022-1)/2^1022 (rounding down)… except you’re throwing away that zero. So now there’s not a direct correspondence between the likelihood of drawing less than some number and its value. Sure, it’s off by an absurdly tiny amount, but it’s not absurdly tiny for Float16.

And it’s also our status quo, but with N=53. And perhaps we can increase N for some of the types, but Float16 still caps out at 24. And throwing away zero is fiddling with the output in a manner that adds a bias, however small. Seems like a pretty bad idea.

In my view, we should increase N as much as is practically possible and then document it with specific examples on how to properly transform and work with its output.
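To illustrate the kind of documented transformation I mean (an off-the-cuff example of mine, not an official recommendation):

# naive: log(rand()) is -Inf whenever rand() happens to return exactly 0.0
# safer: reflect the uniform away from zero, relying on the guarantee that rand() < 1
u = 1 - rand()      # lies in (0, 1], so log(u) is always finite
x = -log(u)         # an Exponential(1) draw; or just call randexp() directly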

Fair enough. I got carried away with the imagery. Apologies for that, I will rein myself in.

…I am sticking to my finitary guns here that 2^1000 orders of magnitude (!) too often deserves the description “incorrect”.

Also, lest someone cry that my ideal cdf is an incomputable theoretical construct:

julia> function ideal64()
           # random 53-bit significand in [2^52, 2^53)
           base = convert(Float64, ((Base.significand_mask(Float64) & rand(UInt64)) | (UInt(1) << Base.significand_bits(Float64))))
           # geometric shift: each additional halving is half as likely
           shift = 53
           while shift < 1200
               u = rand(UInt64)
               shift += trailing_zeros(u)
               u != 0 && break
           end
           ldexp(base, -shift)
       end

Now the loop is bad for performance, and reasonable tradeoffs need to be made.

My point about the exact zeros and the randexp issue is that it is a very important property of this cdf that cdf_eq(0) is infinitesimal.

Due to information theory, we cannot produce infinitesimal probabilities from 64 bits of uniform randomness – the smallest we can do is ldexp(1.0, -64). If we want anything with smaller probability to be possible at all, then we need more randomness, or a branch or loop.

So I am arguing that, if we take 64 bits of randomness due to performance tradeoffs, the smallest achievable number should be of order ldexp(1.0, -64), in order to avoid the extreme overweighting of small numbers.

1 Like

I do think that 0.0 is a particularly problematic value. But if Float64 and Float32 had finer granularity near 0, it would become an ignorable problem, provided there were never bugs that introduced excess 0.0 values. On the other hand, it seems likely to me that protecting against such bugs by introducing ignorable biases would be a good idea, because bugs are a thing. The least problematic bias would be to check for 0.0 and output nextfloat(0.0) in that case. This introduces a point mass of ignorable probability and doesn’t break any probability relations for values larger than nextfloat(0.0)…
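In code that fixup is just a one-line wrapper (a sketch; the function name is mine):

# replace an exact 0.0 with the smallest positive subnormal; everything else passes through unchanged
fixup_rand(::Type{T}) where {T<:AbstractFloat} =
    (x = rand(T); x == zero(T) ? nextfloat(zero(T)) : x)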

Float16 on the other hand just has all kinds of biases no matter what you do. I’m out at a soccer game so can’t check Julia, but what’s the mean of a billion rand(Exponential(Float16(1.0))) values? I’m guessing we will find it’s not 1 to more than 2 or 3 decimal places for example.

For Float32 (I’m going to kick myself for trying to write this on an Android phone… please try to interpolate between autocorrects):

function rand32()
    i = rand(UInt64)
    a = i & 0xffffffff                           # low 32 bits
    b = i >> 32                                  # high 32 bits
    if a == 0x0
        b == 0 && (b = 1)                        # move the zero point mass up to 1/2^64
        c = Float32(b * 2.0^-64, RoundDown)      # uniform over (0, 2^-32) at 2^-64 granularity
    else
        c = Float32(a * 2.0^-32, RoundDown)      # uniform over [2^-32, 1); round down so the result stays below 1
    end
    return c
end

This implements sampling from [1/2^64, 1): a mixture of a point mass at 1/2^64 with probability 1/2^64, and cdf(x) == x for all x greater than 2/2^64, right?

Note: edited several times for correctness; hopefully correct now.

Cool cool, so now I can say you’re applying the statistics from the maximum/best value of N and erroneously expecting our N=24 universe to comply with them.

That’s the fundamental disconnect.

?
I am saying that if we somehow are magically restricted to only 1<<24 possible output values, then I’d like them logarithmically spaced please (denser close to zero) instead of linearly spaced, as appropriate for FLOATING point numbers (as opposed to fixed point numbers where linear spacing is appropriate).

Also, I am very willing to trade off larger absolute errors in the cdf if I get smaller relative errors in exchange.

Because, like, relative errors are the point of floating the point.

As always, there is an end of the scale where floating point numbers cannot give good relative errors any longer. This is when they hit the limits of their dynamic range.

So any kind of trade-off will need to decide on a range where the cdf has good fit (in terms of relative errors), and will then need to decide what to do below that range.

Below that range there are only two possibilities: We either overweight the tail of tiny numbers, or we underweight it. I am arguing that underweighting it is better than overweighting it.
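For concreteness, here is my own toy sketch of what “logarithmically spaced” could look like for Float16 (not a proposal anyone has made in this thread): pick the binade geometrically, then fill the significand uniformly, so values are denser close to zero and everything below 2^-14 is folded into the lowest binade:

function rand_log16()
    u = rand(UInt32)
    sig = UInt16(u & 0x03ff)                            # 10 uniform significand bits
    k = trailing_zeros((u >> 10) | (UInt32(1) << 13))   # geometric in 0:13, each step half as likely
    e = UInt16(14 - k)                                  # biased exponent: 14 gives [0.5, 1), 13 gives [0.25, 0.5), …
    return reinterpret(Float16, (e << 10) | sig)        # never 0, never subnormal, never ≥ 1
end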

As I’ve mentioned in this thread before (which would soon include the works of Shakespeare by chance), this is a multi-objective problem and as such without careful consideration of goals, it can go on forever.

To perhaps move things forward, here is another rand which I think can serve as a baseline from which to find faster ones. It has many nice qualities:

function slow_rand(::Type{T}; rounding = :down) where {T <: Real}
    lo, hi = zero(T), one(T)
    # repeatedly bisect [lo, hi), picking the lower half with probability
    # proportional to its width, until lo and hi are adjacent
    while nextfloat(lo) < hi
        mid = (hi/2) + (lo/2)
        (mid == lo || mid == hi) && break
        p = (mid - lo)/(hi - lo)
        if rand() < p
            hi = mid
        else
            lo = mid
        end
    end
    return rounding == :down ? lo : hi
end

Some of the good qualities are attested by:

julia> Iterators.filter(==(1), (slow_rand(Float16; rounding = :up) for _ in 1:100_000)) |> collect
44-element Vector{Float16}:
 1.0
 1.0
 ⋮
 1.0
 1.0

julia> Iterators.filter(==(0), (slow_rand(Float16; rounding = :up) for _ in 1:100_000)) |> collect
Float16[]

julia> Iterators.filter(==(1), (slow_rand(Float16; rounding = :down) for _ in 1:100_000)) |> collect
Float16[]

julia> Iterators.filter(==(0), (slow_rand(Float16; rounding = :down) for _ in 1:10_000_000)) |> collect
1-element Vector{Float16}:
 0.0

It produces subnormals too. And is pretty uniform. And will most likely work for other Reals, even ones yet to be defined. Feel free to add more rounding methods.

So: can we close this particular circle? We’ve all long agreed it’d be beneficial to increase the precision of rand.

We’ve not all agreed about going open-open, but given the agreement on that first point, it’s not very helpful to use our current limited precision as an argument to do so. We’ve also agreed that throwing away zeros biases some uses of rand() — but the size of the bias depends upon the final precision we land on.

We’ve not agreed on what that precision should be, but that’s where there’s a spectrum of tradeoffs. I’ve been thinking that all possible algorithms can be described by my single N parameter in the linearly spaced set of (0:2^N-1)/2^N, but the idea of fully-logarithmically spacing those points is interesting (though note that once that linear N exceeds the number of mantissa bits it starts approximating a logarithmic spacing).

4 Likes

There is also performance to consider, which is often an important factor. Also, whether subnormals will be produced is another question…
Anyway, just a demonstration of slow_rand on strange Reals:

using FixedPointNumbers

Base.nextfloat(fp::Fixed{T,N}) where {T,N} =
    Fixed{T,N}(reinterpret(T, fp) + 1, 0)   # construct from the raw underlying integer

and now we can:

julia> histogram([slow_rand(Fixed{Int8, 4}) for _ in 1:100_000])
                ┌                                        ┐ 
   [0.0 , 0.05) ┤█████████████████████████████████  6 350  
   [0.05, 0.1 ) ┤████████████████████████████████▉ 6 338   
   [0.1 , 0.15) ┤████████████████████████████████▋ 6 279   
   [0.15, 0.2 ) ┤███████████████████████████████▉ 6 150    
   [0.2 , 0.25) ┤  0                                       
   [0.25, 0.3 ) ┤████████████████████████████████▋ 6 278   
   [0.3 , 0.35) ┤████████████████████████████████▋ 6 272   
   [0.35, 0.4 ) ┤████████████████████████████████▋ 6 279   
   [0.4 , 0.45) ┤████████████████████████████████▍ 6 237   
   [0.45, 0.5 ) ┤  0                                       
   [0.5 , 0.55) ┤████████████████████████████████▋ 6 271   
   [0.55, 0.6 ) ┤████████████████████████████████▋ 6 291   
   [0.6 , 0.65) ┤████████████████████████████████▎ 6 175   
   [0.65, 0.7 ) ┤████████████████████████████████▌ 6 243   
   [0.7 , 0.75) ┤  0                                       
   [0.75, 0.8 ) ┤████████████████████████████████▍ 6 236   
   [0.8 , 0.85) ┤████████████████████████████████▎ 6 180   
   [0.85, 0.9 ) ┤████████████████████████████████▍ 6 230   
   [0.9 , 0.95) ┤████████████████████████████████▎ 6 191   
                └                                        ┘ 
                                 Frequency                 

There is a little side lesson: how do you define nextfloat for types other than floats? Perhaps a typenext that would work more generally.
Another lesson: the need to extend Base.midpoint to all Reals, since it currently works just for integers; it would be a good replacement for mid = (lo/2) + (hi/2), which can be tricky.

That seems like a good summary. The algorithm rand32 above is intended to partially logarithmically space things: it uses the uniform algorithm with N=32 unless it would generate 0, in which case it uses the uniform algorithm with N=32 on the range (0, 2^-32) (and then shifts the 0 point mass up to 1/2^64, but that could be removed).

At 1/2^64, a million cores at a billion cycles per second would make a zero about every 0.0006 years, so I still like the 0 fixup.