Don’t feel alone with this approach. You might want to take a look a few messages back in this thread:
You’re exactly right — and I think your conclusions there all stem from that theoretical equivalence of evenly dividing a continuous space into N possible values and selecting one (before truncation). Specifically, there are 2^{32}-2^{23} evenly spaced divisions from 2^{-9} to 1 with deltas of eps(2f0^-9). So if you want the algorithm to support all representable floating point values greater than 2^{-9}, there go 32 bits. It doesn’t matter that you’re counting leading zeros to do a fancy exponential distribution thingy; those bits are just as meaningfully selecting exponents as they would be pointing to an exact individual value! And that’s pretty cool.
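A quick sanity check of that arithmetic:

# my own verification of the spacing/count claims above
eps(2f0^-9) == 2f0^-32                        # true: the spacing really is 2^-32
(1 - 2.0^-9) / 2.0^-32 == 2.0^32 - 2.0^23     # true: exactly that many steps up to 1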
This is where I got totally confused by @foobar_lv2’s prototype from years ago: I had thought it purported to select from all possible floating point values greater than its minimum output of 2^{-33} — that’d be selecting from 2^{56} possible values (before truncation) using just 32 bits of entropy! That’s of course not possible, and what’s going on is that it’s re-using the same bits for both the significand and exponent (all 32 bits are in the count_zeros part, allowing 33 different exponents… and it re-uses 24 of those bits for the significand, which means that there are very few unique values represented in the very small exponents). So while its smallest value is 2^{-33}, its next smallest is 2^{-32}, its next smallest is 2^{-32}+2^{-33}, and so on. I think this is actually a very valid and clever re-use of the bits — but I’m still a little confused about how this can seem to imply selecting from 2^{33} values with 32 bits.
Edit: oh, of course, it only returns the odd significands in the range [0.5,1), leading to the strange effect of it having less entropy in that octave than the status quo.
This is why in my implementation (message 86 in this thread), the leading_zeros count is discarded and the rest of the bits are used for the significand. They are independent and random. The problem is the case where there are too many leading zeros, leaving too few bits for the significand, which is very rare when using a UInt64 for a Float32.
The fun part about doing this as a transformation from a UInt32 is that you can exhaustively demonstrate that it’s doing what you want it to do, even on a relatively dinky laptop. Here’s a transformer inspired by Dan and foobar’s posts that selects from 2^{32} evenly-spaced values in [0, 1) and outputs them as correctly-truncated Float32s:
function bits2float(r::UInt32)
    exponent = trailing_zeros(r << 1) # 1 - 32
    exp_bits = UInt32(127 - exponent) << 23
    fraction = (r & (0xffff_ffff ⊻ UInt32(1) << (exponent - 1))) >> 9 # unset the trailing one
    return reinterpret(Float32, exp_bits | fraction) * !iszero(r)
end
This thing outputs a grand total of 2^{26.3} distinct values. There is exactly one UInt32 that is transformed to each Float32 less than 2^{-8}, there are two UInt32s per float in [2^{-8}, 2^{-7}), four in [2^{-7}, 2^{-6}), and so on.
julia> function bits2float(r::UInt32)
           exponent = trailing_zeros(r << 1) # 1 - 32
           exp_bits = UInt32(127 - exponent) << 23
           fraction = (r & (0xffff_ffff ⊻ UInt32(1) << (exponent - 1))) >> 9
           return reinterpret(Float32, exp_bits | fraction) * !iszero(r)
       end
bits2float (generic function with 1 method)
julia> A = @time [bits2float(u) for u in UInt32(0):typemax(UInt32)];
7.475025 seconds (85.02 k allocations: 16.004 GiB, 0.41% gc time, 0.33% compilation time)
julia> @time sort!(A, alg=MergeSort);
120.746061 seconds (3 allocations: 8.000 GiB)
julia> function check_range(outputs)
           for i in eachindex(outputs)
               if outputs[i] != round(2^-32*(i-1), base=2, sigdigits=24, RoundDown)
                   error("index $i: expected $(round(2^-32*(i-1), base=2, sigdigits=24, RoundDown)), got $(outputs[i])")
               end
           end
       end
check_range (generic function with 1 method)
julia> @time check_range(A)
47.876296 seconds (20.92 k allocations: 1.079 MiB, 0.03% compilation time)
julia> A_unique = unique(A);
julia> length(A_unique)
83886080
julia> log2(length(A_unique))
26.32192809488736
julia> count(x->bits2float(x)==prevfloat(2f0^-8), UInt32(0):typemax(UInt32))
1
julia> count(x->bits2float(x)==2f0^-8, UInt32(0):typemax(UInt32))
2
julia> count(x->bits2float(x)==prevfloat(2f0^-7), UInt32(0):typemax(UInt32))
2
julia> count(x->bits2float(x)==2f0^-7, UInt32(0):typemax(UInt32))
4
julia> count(x->bits2float(x)==prevfloat(1f0), UInt32(0):typemax(UInt32))
256
Since we keep talking about it, here’s a version of a “scaling” algorithm that manually imposes the round-to-zero behavior on the unsigned → float conversion. It does this by manually truncating the integer to the representable value so that no rounding occurs.
function scale_to_float(u::U) where U<:Union{UInt16,UInt32,UInt64}
    # map UIntXX linearly into FloatXX on range [0,1)
    F = Base.floattype(U)
    bitwidth = 8 * sizeof(U) # bits in type U
    tokeep = Base.significand_bits(F) + 1 # keep this many significant bits
    mask = typemax(U) << (bitwidth - tokeep) # mask to retain `tokeep` MSBs
    u &= mask >>> leading_zeros(u) # for each leading zero we can accommodate a lower bit
    return F(u) * F(exp2(-bitwidth)) # lossless float conversion and scaling
end
This version does not SIMD very well, so there’s room to improve that. Manual truncation is unnecessary with a suitable rounding mode, so that would likely perform much better.
I’m not advocating bits2float(u::UInt32)-style algorithms any longer. The reason is that random bits are so damn cheap since we moved away from MersenneTwister that we can just take 64 bits for a Float32.
They are literally, not just figuratively, free, since the random generator does not even bother to track a finer-grained number of bits than batches of 64!
If we insist on a bits2float(u::UInt32), then we’re screwed somewhere and need to do hacky things.
@mbauman I think your “count how many uints are mapped to each result” style of argument is very close to the following information content argument.
So in that context: the information content of x is -log2(probability(x)). The expected / mean information content of the distribution is the Shannon entropy.
The desired ideal distribution has an entropy of 25 bits (it’s more like 24.9... bits: 23 from the mantissa, plus almost 2 from the truncated geometric exponent).
Nice, so it looks good if we start with 32 random bits?
Nope! Half the numbers are in [0.5, 1) and have 24 bits of information each. A quarter of the numbers are in [0.25, 0.5) and have 25 bits of information each. An eighth have 26 bits each… and a significant chunk of numbers have more than 32 bits of information.
This is bad! Unless we can stash random bits for later use, it’s not helpful that we often only use 24 bits, or that we only need 25 bits on average – the rest will be discarded. And there is no stash to help us when the rainy day comes and we need more than our allotment of 32 bits.
So entropy (expected / mean information) is not really the relevant quantity.
On the other hand, maximal information content is not the relevant quantity either.
Consider the task of generating a geometrically distributed BigInt: P(0) = 1/2, P(1) = 1/4, P(2) = 1/8, etc. How many bits do you need?
Shannon entropy is 2 bits, maximal information content is infinite.
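Concretely, drawing such a BigInt might look something like this (a rough sketch, not anything proposed earlier in the thread):

# Sample P(k) = 2^-(k+1) by counting trailing zeros of random UInt64s,
# pulling a fresh word only when the current one is all zeros.
function rand_geometric()
    k = 0
    while true
        r = rand(UInt64)
        r != 0 && return big(k + trailing_zeros(r))
        k += 64   # all 64 bits were zero; keep counting into the next word
    end
end

It consumes at least one 64-bit word per call even though the Shannon entropy is only 2 bits, and in principle no fixed number of words is ever enough, which is exactly the point.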
So the relevant quantity is a quantile. Given N, let P_N be the probability that a random output has more than N bits of information content. Now we have the language to say: we need enough random bits that P_N is vanishingly small.
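For our Float32 case this is easy to write down (back-of-envelope, treating the target as the full-precision ideal where every octave gets a full mantissa, and ignoring the subnormal cutoff):

# Outputs below 2^(23-N) carry more than N bits of information,
# so the probability of needing more than N bits is simply:
needs_more_than(N) = 2.0^(23 - N)
needs_more_than(32)   # ≈ 0.002, not remotely "vanishingly small"
needs_more_than(64)   # ≈ 4.5e-13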
If we end up in the bad (unlikely, probability < P_N) situation that we don’t have enough bits to select an output, then we can truncate, do hacky stuff, or halt and catch fire.
What should P_N, i.e. the threshold for “vanishingly rare”, be? Depending on the social contract, that could be:
- Once in a blue moon (a core-year). Nothing bad happens in case of failure, so we can afford rare failures. That would be log2(3e16), where 3e16 is the number of nanoseconds in a year, i.e. 55 bits.
- So rarely that you can seriously consider disbelieving your CPU rather than believing that you got so unlucky. Afaiu 1 error per 1000 core-years is expected (limited reliability of silicon). That would be ~64 bits.
- So rare that you definitely rather disbelieve the CPU. Or you rather disbelieve your lying eyes (you could be having a stroke right this second!). Or your software has a bug. It must be a bug. The CIA has spiked your coffee with LSD. This is a nightmare. Wake up. Wake up now! (That’s 128-bit unlikeliness.)
One counterpoint to this is that, for vectorized random numbers, being able to use entropy efficiently may be useful.
Is such a blanket statement warranted? I’ll reiterate that the bits2float conversion in Base is the fallback for RNGs throughout the ecosystem, including CUDA.jl, where things are still happening 32 bits at a time. GPU is a key motivation for using Float32.
It would be interesting to compare actual GPU performance once we have a viable alternative for both 32-bit and 64-bit conversion (perhaps even two different 32-bit candidates: one that only ever uses a single UInt32, and one that grabs a second one in a rare branch).
The conclusion could of course be that CUDA should implement its own conversion.
Note also that the fallback conversion (used by every RNG out there except Xoshiro since 1.7) is not even the 24-bit one we’ve been using as a baseline, but the old-fashioned 23-bit one that maps via [1, 2). Improving on this seems overdue. (Issue: Issues · JuliaLang/julia · GitHub)
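For reference, that old-fashioned scheme amounts to something like the following (a sketch of the general idea, not the exact Base code): fill the 23 mantissa bits of a float in [1, 2) with random bits and subtract 1.

# Sketch of the classic 23-bit conversion (not the exact Base implementation):
# OR the top 23 bits of u into the mantissa of 1.0f0, giving a uniform value
# in [1, 2), then shift down to [0, 1) in steps of 2^-23.
classic23(u::UInt32) = reinterpret(Float32, 0x3f80_0000 | (u >>> 9)) - 1f0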
Beyond that, let me just say I appreciate everyone’s insights and attempts at implementations here. I’m learning a lot! There seems to be a consensus that rand(Float32) must do better, though it’s still unclear exactly how good it needs to be. We have damning evidence against relying on any specific rounding mode in the implementation (I have some more up my sleeve, but I don’t think the thread needs it).
Going forward, it seems to me that the question of how well we need to do would be best served by converging on a small number of actually viable candidate implementations consuming different numbers of bits, so folks can discuss their pros, cons, and tradeoffs, including suitability for different hardware. With that in mind, I’ll just be plodding away at some ideas whenever I have a spare moment, hopefully improving my bit-twiddling-fu as I go, and report back if I learn something important.
(Btw., my plodding will assume we’re sticking to [0, 1). I’m personally a fan of (0, 1], as it makes more sense to be open at the end where representable numbers are denser, but, as I read it, changing bounds would be breaking, and I don’t agree that occasionally returning 0 is inherently problematic.)
Continuing to jump up and down shouting “this is bad!” is simply not a useful argumentation technique here. Many of us are engaging here, trying to figure out whether we can do better and what better would look like.
There are tradeoffs, and correctness is crucial. To wit, the corsix blog post does not advocate trying to generate all possible Float64s, but rather finds a reasonable trade-off with gradations on the scale of 2^{-76}. But that number could just as well be 2^{-64} or 2^{-53}, and I’ve yet to see cases where improving this would be meaningful in the same way that changing Float32 could be.
So then, yeah, what about Float32? Let’s start getting some performance numbers together on some of these implementations so we can weigh those tradeoffs. If it’s cost-prohibitive, then we document it and tell people to watch out or use Float64.
Have we reached consensus that, in the absence of performance considerations, the ideal uniform random Float32 generation is given by roughly convert(Float32, rand(UInt128)>>2) / convert(Float32, UInt128(1)<<126)?
And especially not by uniform random [1f0, 2f0) - 1f0, i.e. by linearly spaced points that each have the same probability? Some people have argued that each number in the set of possible outputs should have the same probability.
If that consensus is reached, then we can start thinking about tradeoffs and tricks and precision to performantly approximate this slightly inconvenient distribution.
We can especially discuss whether convert(Float32, rand(UInt32)) / convert(Float32, 1<<32) is good enough, and whether convert(Float32, rand(UInt64)) / convert(Float32, UInt128(1)<<64) is fast enough.
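To make that concrete, the candidates could be wrapped up like so for side-by-side testing (a sketch; the names are arbitrary, and these use the default round-to-nearest conversions exactly as written above):

# Wrappers for comparing the candidates above (names are made up).
ideal_f32()  = convert(Float32, rand(UInt128) >> 2) / convert(Float32, UInt128(1) << 126)
cheap_f32()  = convert(Float32, rand(UInt32)) / convert(Float32, 1 << 32)          # 32 bits in
better_f32() = convert(Float32, rand(UInt64)) / convert(Float32, UInt128(1) << 64) # 64 bits in

From there it’s just a matter of checking the resulting distributions and timing them (e.g. with BenchmarkTools).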
We can also think about tradeoffs between accuracy and testability: There is a “Bermuda interval” of rare events, from maybe 40-80 bits, that is rare enough to be invisible in testing and likely enough to potentially spoil a cluster run. If we think that user code bugs triggered by exact zeros are prevalent (1/rand() style), then it would be helpful to keep the probability of exact zeros outside of this “Bermuda interval”.
PS.
I did not see that before, but my proposed getfloat_trail(r::UInt32) from 5 years ago is exactly equivalent in terms of distribution to convert(Float32, rand(UInt32)) / convert(Float32, 1<<32) with round-to-zero. Just think about what convert(Float32, u::UInt32) is doing: it counts the leading zeros, and that determines the exponent. Then it throws away all the “consumed bits”, i.e. the leading zeros and the first one-bit, and shifts the remainder of u into the mantissa. This discards some random bits if the number of leading zeros was small, and it shifts in zeros from the right if there are too many leading zeros.
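A bit-level sketch of that round-to-zero conversion-and-scale (rough illustration, not my original getfloat_trail): keep only the 24 highest significant bits of u so the conversion is exact, then scale by 2^-32.

# Truncating UInt32 -> [0, 1) Float32 conversion; the low bits that can't fit
# the 24-bit significand are dropped instead of rounded.
function trunc_scale(u::UInt32)
    u == 0 && return 0f0
    sigbits = 32 - leading_zeros(u)          # position of the leading one-bit
    drop = max(sigbits - 24, 0)              # excess low bits beyond the significand
    return Float32(u >> drop) * exp2(Float32(drop - 32))  # exact product, lands in [0, 1)
end

This mirrors what the scale_to_float version above does by masking the low bits before converting.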
Yes, exactly. The equivalence between a correct algorithm and a correctly-rounded rand(0:2^N-1)/2^N is precisely what I’ve been trying to express in a myriad of ways throughout this. Bigger N is better. Smaller N is worse. You need N bits of entropy to do this correctly, even with clever trailing-zeros or leading-ones things.
@mikmoore’s latest algorithm is also provably correct and functionally equivalent to mine, pulling from rand(0:2^32-1)/2^32.
Your getfloat_trail is not correct; it’s pulling from something like rand(setdiff(1:2^33, 2^9:2:2^33))/2^33.
Both methods are equivalent, but I disagree about the use of entropy. The rand(0:2^N-1) method uses N bits of entropy and then discards entropy when rounding.
When using the leading_zeros methods, the amount of entropy used is variable, but mostly smaller than N for the same distribution. Essentially, after using the leading bits for sampling a geometric distribution (on average about 2-3 bits, though this method uses a variable number), and using a fixed number of bits for the significand, a few bits are left over. These bits can potentially be used subsequently.
Setting those details aside, the mere use of rounding usually means the method is not informationally optimal.
The entropy isn’t thrown away upon rounding. It’s just not in the bit pattern. It was quite effectively used to increase the propensity of those “rounded” values.
To make things concrete, here is a function to sample a block of Float32s (of length 1000 by default):
function rand_f32_block(len = 1000)
    res = Vector{UInt32}(undef, len)
    # First pass: set the exponent bits by counting trailing zeros across a
    # stream of random words (roughly one geometric draw per output).
    c = 0
    r = 0x0
    tt = 0
    while c < len
        t = trailing_zeros(r) + tt
        tt = 0
        r >>= t + 1
        c += 1
        res[c] = UInt32(126 - t) << 23
        if r == 0x0
            r = countrand(UInt64)
            tt = leading_zeros(r)
        end
    end
    # Second pass: fill in 23 mantissa bits per output, carrying leftover
    # random bits over from one output to the next.
    r2 = 0x0
    bitsleft = 0
    for i in 1:len
        if bitsleft < 23
            r2 = countrand(UInt64)
            r = r | (r2 << bitsleft)
            r2 = r2 & ((1 << bitsleft) - 1)
            bitsleft += 64
        end
        res[i] |= r & ((1 << 23) - 1)
        bitsleft -= 23
        r = (r >> 23) | (r2 << (64 - 23))
        r2 >>= 23
    end
    return reinterpret(Float32, res)
end
and to test it:
julia> nbits::Int = 0
0
julia> countrand(T) = ( global nbits ; nbits += 8*sizeof(T); rand(T) )
countrand (generic function with 1 method)
julia> rand_f32_block();
julia> nbits
25024
julia> 32*1_000 # the bits used in a 1_000 rand(UInt32)s
32000
25_024 is much lower than 32_000. Actually, there might be some speed improvement due to this bitwise consumption of randomness, especially if the random generator produces larger outputs (say, 512 bits).
(of course, bugs may still lurk in this hasty code)
Prove to me that you’re generating a valid distribution and then we’ll talk about how to exploit this newfound compression algorithm that breaks information theory.
This is not a new-found compression algorithm that breaks information theory. It’s not even fancy – it is a Huffman table for Float32 for the probability distribution induced by the real line. No insane ANS / zstd-style fractional bits involved.
There won’t be a speed improvement. Random generation is faster than decompression in 99% of cases.
I’m not talking about the entropy in the bits or even of the numeric values themselves. You can generate all possible random Float32s between 0 and 1 with 30 bits:
obviouslybadrand() = reinterpret(Float32, UInt32(rand(0:2^7-2)) << 23 | UInt32(rand(0:2^23-1)))
But that’s not going to be a uniform distribution in the value domain. Y’all can pontificate all you want; until you prove that you can generate a uniform distribution with greater granularity and/or fewer bits than @mikmoore or I can, I’m not going to believe you.
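A quick empirical check (rough sketch) makes the problem obvious: every exponent is equally likely regardless of how wide its octave is, so the mean collapses to roughly 1.5/127 ≈ 0.012 instead of 0.5.

# Empirical check that obviouslybadrand() is far from uniform in value:
# each exponent field 0..126 is equally likely, so the tiny octaves are
# massively over-weighted and the mean collapses toward zero.
using Statistics
mean(obviouslybadrand() for _ in 1:10^7)   # ≈ 0.012, not 0.5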
No one has commented on my convolution distribution. Is this because it wasn’t understood, or thought to be wrong, or thought to be inefficient, or insufficiently fast? I haven’t thought carefully about it, but it may not provide any advantages over just converting from a Float64.
You mean rand32float()? I think it’s all of the above: it uses at least 64 bits of randomness (and sometimes more), has a while loop, I think it returns 1s, and I wouldn’t be surprised if it has some rounding issues.
It doesn’t return 1.
All but about 1 in 4096 calls go through that while loop once, and @btime shows that its amortized cost is fairly reasonable compared to converting from Float64. But converting from Float64 is a bit faster.
It is clearly capable of producing a wide variety of very small floating point numbers; for example, if a is of order 2^-20 and b is of order 2^-10, you’ll get a full mantissa’s worth of bits for a value near 2^-20. That’s the main value, I think.