Are rand(Float32) samples really uniform in [0,1)?

Hello, this is my first post.

I started tinkering with Julia and wanted to understand better how random floating point numbers are generated. I usually just call rand(Float32) and am happy. I make the following statement:

If a floating point number x is uniformly distributed in [0,1], then mantissa(x) (as a bitstring) is uniformly distributed in [0..2^m).

I’m pretty sure of the result, because if x is uniform in [0,1] it must be uniform in every interval [2^i, 2^(i+1)), and being uniform in such an interval means having a uniform mantissa.

I have produced a notebook (which I cannot upload for some reason) in which I test the distribution of the mantissa bits for different generation policies.
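
For reference, here is one way such a test can extract and tally the mantissa bits in Julia; this is just my own sketch, not necessarily what the notebook does:

# Extract the 23 mantissa bits of a Float32 as an integer in [0, 2^23).
mantissa_bits(x::Float32) = reinterpret(UInt32, x) & 0x007fffff

# Frequency with which mantissa bit i (0 = least significant) is set, for i = 0:22.
function bit_frequencies(n)
    counts = zeros(Int, 23)
    for _ in 1:n
        m = mantissa_bits(rand(Float32))
        for i in 0:22
            counts[i+1] += (m >> i) & 0x1
        end
    end
    return counts ./ n
end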

Why have the designers of the language chosen this particular generation style? Isn’t it better to have (mostly) uniform mantissa bits? By the way, here’s the alternative generation scheme, adapted from the C++ standard library (there may be errors):

using Random

struct CppRNG <: Random.AbstractRNG
    rng::Random.AbstractRNG
    CppRNG(rng::Random.AbstractRNG) = new(rng)
end


function Random.rand(
    rng::CppRNG,
    ::Type{T}
) where { T <: Union{Float16, Float32, Float64} }
    U = Base.uinttype(T)

    x_int = rand(rng.rng, U)
    max_int = 2^(8*sizeof(U) + 1)

    converted = T(x_int)
    normalized = converted / T(max_int)

    @assert 0.0 <= normalized < 1.0
    @assert typeof(normalized) == T

    return normalized
end
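
If I’m not mistaken, a wrapper like this should be picked up by the ordinary two-argument rand call, e.g.:

using Random

crng = CppRNG(Random.default_rng())
x = rand(crng, Float32)    # dispatches to the method defined above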

I’m interested in the motivation for the choice. I understand that a change would break all current programs, so that’s out of scope.

4 Likes

Not quite. If x is uniformly distributed over all positive normalized floating point numbers (not over the real interval [0,1]), then the mantissa bits would be uniformly distributed.

When sampling over the real interval [0,1] the results are biased, because there are intervals, such as [0.5,1), that are larger than others, such as [0.25,0.5), and those larger intervals soak up proportionally more samples. The implicit leading 1 combined with the mantissa bits then creates a non-uniform distribution when viewed as raw bits.

12 Likes

Your observation is correct. Random floating point numbers in julia are generated as random fixed point numbers, then rounded to floating point. This is equivalent to sampling a random number in [1,2) and subtracting 1.
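
For concreteness, here is a minimal sketch of that “[1,2) minus 1” construction (my own code, not what Base literally does — and, as pointed out below, Base keeps one extra bit):

# Put 23 random bits into the mantissa of a float with the exponent of 1.0f0,
# giving a value in [1, 2), then subtract 1 to land on a 2^-23 grid in [0, 1).
function rand_one_two_minus_one(u::UInt32)
    bits = (u >> 9) | reinterpret(UInt32, 1.0f0)
    return reinterpret(Float32, bits) - 1.0f0
end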

The most egregious violation is that the probability of hitting exactly zero with rand(Float32) should be on the order of nextfloat(0.0f0), i.e. never observed in human history, but in fact it happens all the time.

A corollary is that you somewhat counter-intuitively cannot use random samples from (0,1] to sample from an unbounded distribution with known cdf, like e.g. the otherwise attractive 1/rand() algorithm for sampling from a power law: you might naively expect that this works, because you can evaluate 1/x with good precision in floating point, but you really are evaluating 1/((1+rand()) - 1), which suffers from catastrophic cancellation / floating point precision loss.
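
One visible symptom, assuming the current 2^-24-grid output of rand(Float32) (the numbers below are arithmetic, not measurements):

# The smallest nonzero value rand(Float32) can currently return is 2^-24, so
# 1/rand(Float32) can never exceed 2^24, even though a genuine Pareto(1) tail
# should exceed that with probability 2^-24 per draw.
smallest_nonzero = 2.0f0^-24
hard_ceiling     = 1 / smallest_nonzero   # 1.6777216f7 -- the tail is simply cut off here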

This is a known problem; we had a giant thread about it. The standard computer science literature is pretty silent on the issue. Many real-world existing codebases get this very wrong, e.g. the Float32 output of Pareto in rand_distr - Rust has catastrophically wrong tails.

One of the issues is that we cannot quickly evaluate rand(Float64) in a precise and branch-free manner. Sampling such a number has ~54 bits of entropy, but entropy is really just the expected number of random bits needed, and each number has a ~0.1% chance of needing more than 64 bits.

In the end, we did not go forward with fixing the issue in julia. Maybe we will at some future time.

But code that relies on “correct” floating point sampling is pretty problematic anyway, since to our knowledge almost no other language / framework generates “correct” floating point samples. Writing such code is a landmine for anyone who ports it to a different language.

9 Likes

Not quite. As I remember, the bits-to-float transformation used in Base is slightly more sophisticated and preserves one more bit of entropy. It’s equivalent to sampling a random number in [0.5, 1) and then flipping a coin whether to subtract 0.5 or not.
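
A sketch of a construction equivalent to that description (again my own code, not the literal Base implementation):

# 23 random bits give a value on the 2^-24 grid in [0.5, 1); one extra random bit
# decides whether to shift it down by 0.5, covering [0, 1) with spacing 2^-24.
function rand_half_one_coinflip(u::UInt32)
    bits = (u >> 9) | reinterpret(UInt32, 1.0f0)   # float in [1, 2)
    x = 0.5f0 * reinterpret(Float32, bits)          # exact: now in [0.5, 1)
    return isodd(u) ? x - 0.5f0 : x                 # coin flip uses a bit not in the mantissa
end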

I think CUDA.jl’s device RNG library may be using the [1, 2) - 1 variant.

Only the brave enter here: Output distribution of rand(Float32) and rand(Float64), thread 2

6 Likes

My thesis is that rand(Float32) should return the floating point representation of a real number uniformly distributed in [0,1]. By what I have written above, the mantissa (what comes after the 1.xxxx) should be uniformly distributed in every exponent range, and so it should be uniform overall (regardless of how those exponent ranges are represented).

With the current generation scheme, a sampled x does not always have m significant bits, with m being the size of the mantissa; how many it has depends on the exponent range.
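
A small illustration of that loss of significant bits under the current 2^-24-grid scheme (the value below is a hand-picked grid point, not a random draw):

# Any output x = k * 2^-24 with a small k has its low mantissa bits forced to zero
# once it is normalized, so only a few significant bits survive.
low_zero_bits(x::Float32) = trailing_zeros(reinterpret(UInt32, x) & 0x007fffff)

x = 5 * 2.0f0^-24     # = 1.01 (binary) * 2^-22, a grid point far below 0.5
low_zero_bits(x)      # 21 of the 23 stored mantissa bits are zero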

How do I interpret the graphs properly? What does the 0-22 mean for the mantissa bits, and why does the frequency for each bit peak at 0.5 instead of them collectively adding to 1?

If you’re talking about the bit sequence following the radix point, then it wouldn’t be uniform if we’re rounding samples from a continuous standard uniform distribution to floating point numbers, because of the different interval lengths per exponent value. I made a rough diagram of the intervals for each 2-bit sequence, and you can see that 00 is allotted a smaller interval than the others at each exponent value except for the smallest normals, and the largest normals contribute a larger tail interval to 11. The bit sequences that aren’t all 0s or all 1s do seem evenly matched.

I don’t know how that would translate to your graphs though, it could very well balance out to a flat line, and this doesn’t imply anything about the actual algorithms or rounding.

Just checking, isn’t the maximum of an unsigned integer 2^(8*sizeof(U))-1? That’s about what typemax does anyway. If we’re going for [0, 1) as documented, then maybe it’s just 2^(8*sizeof(U)).

I think the expectation that sampling values uniformly from [0,1] implies that all Float32 values in that interval can be returned from rand is just not true. And once that is not guaranteed, there is no reason to expect that the mantissas are uniformly distributed.

The question is kind of: down to what bin size is the output of rand uniformly distributed? I think a good part of the thread linked above discusses basically this question.

7 Likes

We are mostly talking about rounding modes here.

We are really considering 4 aspects here:

  1. Expectations from prior work: What do users expect who are knowledgeable about prior work in numerical computing. Arguments about this aspect talk about what some famous fortran library in the 80s was doing.
  2. Expectations from mathematical intuition: What do users expect who are knowledgeable about real numbers, but have not widely read implementations of this thing in other languages / frameworks. Arguments about this aspect can talk both about subjective “I think it should be obvious that…”, or can point out more objective empirically observable misunderstandings, like Rust rand_distr having very wrong tail statistics for sampling from pareto, because the implementors of “uniform distribution” used “fixed point” interpretation / rounding mode, while the implementors of “pareto distribution” applied an inverse cdf sampling strategy that is only correct if “uniform distribution” uses the “floating point” interpretation / rounding mode.
  3. Usefulness. How useful would it be to have the “floating point” interpretation of sampling from the unit interval? Arguments in that direction would be of the style “using inverse cdf to sample from pareto would be quite neat”, or conversely “only a fool would use inverse cdf to sample from pareto – that works if you have a floating-point-style uniform source, but this is such a big footgun that you shouldn’t write that code, somebody will port it to different languages / architectures / run on cuda / run on apple accelerator, and everything will silently break”.
  4. Implementation convenience. How hard / performant would it be to get floating-point-style uniform numbers? On CPU, on GPU, on TPU, on arm, on amd64, etc etc. Arguments in this direction write and benchmark beautiful avx512 code that minimizes branch-misses, and then have a big discussion how an analogue performs on arm or on this and that nvidia chip.

FWIW, I very strongly agree with you on (2). The current state offends my mathematical aesthetics, and it empirically leads to bugs because people thoughtlessly apply inverse cdfs with a singularity at zero. We could mitigate these bugs.

That being said, the other 3 aspects carry a lot of weight.

Pragmatically you need to accept that “uniform random float from the unit interval” as an API predates and is much bigger than julia; and it is a hopelessly broken legacy API that you cannot, in general, rely on to do anything specific.

Such is life when working with any culture / technology, everything builds on ill-considered decisions of our forebears, and we cannot fix their mistakes, only work around them.

5 Likes

The plots show, for each mantissa bit position of the generated Float32 values, the frequency with which that bit is set (i.e. has value 1). The color encodes whether the bit passes a binomial test with the null hypothesis “the bit is 0 or 1 with equal probability”.

Yes, you are right.

1 Like

In other words, the frequency of each bit being 1. In that case, the earlier-mentioned skew of rounding uniform samples of [0,1) should just slightly raise each frequency to the same number slightly >0.5. This is a very indirect statistic, so it’s not clear which intervals are underrepresented; a particular bit being 1 is true for many non-adjacent floating point numbers with varying exponent values that aren’t evenly spaced over [0,1). It’d probably help if the exact methodology of this statistic were provided.

FWIW, I can roughly corroborate it with a naive program.
julia> let n = 100_000
         freqs = zeros(Int, 23)
         for j in 1:n
           num = rand(Float32)
           mantissa = bitstring(num)[10:end]
           for i in 1:23
             if mantissa[i] == '1'
               freqs[i] += 1
             end
           end
         end
         freqs
       end
23-element Vector{Int64}:
 49979
 50010
 49904
 49754
 50018
 50095
 49938
 49806
 50041
 50010
 49879
 50031
 49945
 49747
 49941
 49893
 49550
 49545
 48269
 47024
 43837
 37529
 25079

Also worth pointing out that while uniform sampling should result in a flat graph, the converse isn’t generally true. For example, if the generator ignored some specific exponent value, we’d still get a flat line slightly >0.5.

I designed the test to specifically target the “mantissa bits are uniformly distributed” hypothesis. For example, the C++ floating point generator passes the test, but the Julia one does not. I have ruled out doing a proper uniformity test because it would require a lot more samples.

If all the generators proposed are uniform, then I’d say

Range(From12Generator) < Range(JuliaGenerator) < Range(CppGenerator)

with “<” being the inclusion sign. I’d argue that having a bigger range is better anyway, assuming the generator is correct.

I categorized the set-bit frequencies by subintervals with a specific exponent value (bounds in my code are off by 1, sorry). Not a perfect map of where the missing bit sequences are, but it does help. What’s interesting is that the lower rows (bits farthest from radix point, lower numbers on your graph) diagonally zero out as the exponent values decrease from 126 (largest normals from 0.5f0 to prevfloat(1.0f0)), which decreases the total contribution to the lower bits plotted in the original graphs. In other words, as the exponent values decreased by 1, the effective number of bits in the mantissa decreased by 1 to keep the precision at 2^-24. I don’t know the exact reason in the algorithm, but it’s acceptable for uniform sampling to use about the same precision throughout [0,1) instead of getting more precise for smaller subintervals of normals with the same exponent value. Aside: subintervals halve in width as exponent value decreases by 1, which is consistent with the sampling frequencies halving.

julia> freqs = let n = 100_000
         freqs = zeros(Int, 23, 2^8) # mistaken 1-based indexing
         for j in 1:n
           num = bitstring(rand(Float32))
           exponent = parse(Int, num[2:9], base=2)
           mantissa = num[10:end]
           for i in 1:23
             if mantissa[i] == '1'
               freqs[i, exponent] += 1 # thankfully exponent never hit 0
             end
           end
         end
         freqs
       end;

julia> freqs[:,105:130]
23×26 Matrix{Int64}:
 0  0  0  1  0  1  0  1  3  6  13  27  45  101  208  404  804  1549  3061  6308  12568  25341  0  0  0  0
 0  1  0  0  0  1  0  2  2  4  13  34  41   99  184  395  801  1503  3107  6247  12489  25091  0  0  0  0
 0  0  0  1  0  0  0  0  2  3  12  28  50   79  195  416  783  1565  3106  6359  12557  25010  0  0  0  0
 0  0  0  0  0  0  0  1  2  2   7  30  45   96  189  398  814  1533  3027  6404  12590  25038  0  0  0  0
 0  0  0  1  0  0  0  1  2  5  11  32  41  104  198  417  851  1530  3041  6290  12425  25081  0  0  0  0
 0  0  0  0  0  0  0  1  2  4   9  32  47   96  185  396  795  1536  3144  6288  12573  24973  0  0  0  0
 0  0  0  0  0  1  0  0  3  4   9  35  38   80  205  413  822  1613  3029  6245  12448  24800  0  0  0  0
 0  0  0  0  0  0  0  1  3  5   8  20  48   94  216  395  807  1558  3032  6240  12607  24986  0  0  0  0
 0  0  0  0  0  0  0  1  2  7  11  32  49  100  188  378  818  1544  3048  6220  12383  24980  0  0  0  0
 0  0  0  0  0  0  0  0  2  2  12  28  46   99  194  394  779  1525  3024  6234  12520  25002  0  0  0  0
 0  0  0  0  0  0  0  0  0  6  12  26  47   98  200  406  812  1515  3086  6273  12509  25128  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  10  28  52   98  211  374  819  1574  3094  6251  12464  25070  0  0  0  0
 0  0  0  0  0  0  0  0  0  0   0  25  49   95  217  405  773  1539  3110  6218  12407  25000  0  0  0  0
 0  0  0  0  0  0  0  0  0  0   0   0  32   97  190  403  829  1563  3121  6176  12541  24982  0  0  0  0
 0  0  0  0  0  0  0  0  0  0   0   0   0   87  196  395  810  1539  3050  6231  12402  24867  0  0  0  0
 0  0  0  0  0  0  0  0  0  0   0   0   0    0  185  392  833  1607  3082  6346  12496  25036  0  0  0  0
 0  0  0  0  0  0  0  0  0  0   0   0   0    0    0  417  749  1610  3079  6252  12515  24991  0  0  0  0
 0  0  0  0  0  0  0  0  0  0   0   0   0    0    0    0  789  1555  3118  6278  12450  25038  0  0  0  0
 0  0  0  0  0  0  0  0  0  0   0   0   0    0    0    0    0  1569  3088  6283  12598  25388  0  0  0  0
 0  0  0  0  0  0  0  0  0  0   0   0   0    0    0    0    0     0  3075  6201  12520  25137  0  0  0  0
 0  0  0  0  0  0  0  0  0  0   0   0   0    0    0    0    0     0     0  6152  12658  25058  0  0  0  0
 0  0  0  0  0  0  0  0  0  0   0   0   0    0    0    0    0     0     0     0  12502  24981  0  0  0  0
 0  0  0  0  0  0  0  0  0  0   0   0   0    0    0    0    0     0     0     0      0  25095  0  0  0  0

ECDFs approach the CDF as the sample size increases, so there are no visibly missing subintervals:

julia> using CairoMakie; let T = Float32, N = 1_000
         lines(0:1, 0:1, color=:red)
         ecdfplot!(rand(T,N), color=:blue)
         ecdfplot!(rand(T,100*N), color=:green)
         current_figure()
       end

The reason is that when an integer in [0..2^(m+1)) is converted to a float, the bits are shifted left until the leading set bit lands in the implicit-1 position. This fills the low mantissa bits with zeros, so the LSBs of the mantissa are more likely to be zero and significant digits are lost.
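
A concrete instance of that zero-padding (deterministic, so the output is reproducible):

julia> bitstring(Float32(0b101))   # the leading 1 of binary 101 becomes the implicit bit
"01000000101000000000000000000000"

Only the two bits after the leading 1 survive into the mantissa; the remaining 21 mantissa bits are zero-filled.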

1 Like

This observation is explained by the previous post about integer-to-float conversion.

Rephrasing this a bit: sampling uniformly from fixed point numbers means that the spacing between sample points is constant, which directly translates into the constant precision that was observed.

As for the reason: it is a reasonable approximation of a uniform distribution and fast to compute (no branches, no extra calls to the randomness source, so amenable to SIMD).

IIRC, a large part of the previous thread was an exploration of whether one 1) should do better and 2) could do better while still being performant. I think it turned out that sampling all floating point values cannot be done branchless, since sometimes you need more than 64 bits of randomness (i.e. you need a second call to the randomness source). IIRC the current algorithm uses the randomness bits pretty well, so it is in some sense as good as one can be when restricting oneself to a single call to the randomness source. I do not recall whether most of the discussion applied to Float64 or Float32, though.

1 Like

Well, it is possible to do better. Take the case of a random Float32 generated from a single call to a UInt32-producing randomness source. You have 32 bits of entropy to work with. The current approach uses 24 (it’s uniform over a sample space of 2^24 numbers: \{0 \cdot 2^{-24}, 1 \cdot 2^{-24}, 2 \cdot 2^{-24}, \ldots, (2^{24} - 1) \cdot 2^{-24}\}) and discards the remaining 8. What you could do instead is:

  • Take the first 23 bits as the mantissa
  • Look at the next bit. If 0, set the exponent to -1, you got a number in [0.5, 1), return. If 1, continue.
  • Look at the next bit. If 0, set the exponent to -2, you got a number in [0.25, 0.5), return. If 1, continue.
  • …and so on until exhausting the 32 bits.

This still doesn’t use all 32 bits of entropy, as half of the time it discards 8 of them like before. But in the other half, it uses them to refine the sample space, without having to sample another UInt32. If you work it out, the resulting distribution has about 25 bits of entropy. I’m sure you can find a better scheme than this, making even better use of the original 32 bits.

The problem is to implement something like this in a performant way using branch-free, simdable bit twiddling.
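
For illustration, here is a scalar (branchy) rendering of the scheme described above; the names are made up, and the handling of the case where all nine spare bits are 1 is my own choice (the description above leaves it open):

# Use 23 bits for the mantissa, then consume the remaining 9 bits one at a time to
# pick the exponent; if all 9 are 1 (probability 2^-9), fall back to a fixed-point
# value in [0, 2^-9), which keeps the overall distribution uniform.
function rand_float32_refined(u::UInt32)
    mantissa = u & 0x007fffff
    rest = u >> 23
    expo = -1
    for _ in 1:9
        iszero(rest & 0x1) && return ldexp(Float32(0x00800000 | mantissa), expo - 23)
        rest >>= 1
        expo -= 1
    end
    return ldexp(Float32(mantissa), -32)
end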

1 Like

Indeed, and that also occurs in your CppRNG implementation when a random UInt32 is converted to Float32. I guess another way of thinking about it is that while the default implementation has a fixed precision equal to that of the largest normals, eps(prevfloat(1.0f0)) == 2.0f0^-24 == 5.9604645f-8, your CppRNG implementation’s division by T(max_int) reintroduces varying precision and allows the smaller values to have apparently greater precision.

I suspected that this would be balanced by worse precision at larger values, and it’s worse than I thought. In your first CppRNG implementation for T == Float32, the denominator 2^(8*sizeof(UInt32) + 1) == 2^33 would be 8589934592, which is 2x too large for UInt32’s range 0:4294967295. The earlier suggested fix would be 1+Int(typemax(UInt32)) == 2^32, intending the division to stay in the documented [0, 1) interval. However, the Float32 conversion eats enough UInt32 precision that the 128 largest generated values map to 1.0f0 after the corrected division, which falls outside the interval. This does not occur if we instead pair Float64 with UInt32, of course.

julia> denominator = Float32(1+Int(typemax(UInt32)))
4.2949673f9

julia> Float32(typemax(UInt32)-128) / denominator
0.99999994f0

julia> Float32(typemax(UInt32)-127) / denominator
1.0f0

julia> Float32(typemax(UInt32)) / denominator
1.0f0

The 128 UInt32 values resulting in 1.0f0 are part of a more widely biased map; only 1 value can result in 0.0f0, but 257 values can result in 0.5f0. The straightforward adjustment to map each generated integer to a separate floating point value is to match the mantissa’s nominal size and divide 0:(2^24-1) by 2^24, but then we’re back where we started. I don’t see a way to reconcile a uniform usage of mantissa bits (effective floating point precision) with the removal of redundant values, even if we don’t hit 1.0f0. If we increase a uniform sampling resolution on the real line to reach more precise small floating point values, we’d also hit a larger floating point value’s surrounding real subinterval more times and round to it more often. Only exponential sampling can match the exponential scaling of floating point values without redundancies.

FWIW, your denominator-corrected CppRNG implementation isn’t biased enough for a visible impact on ECDFs (2^-24 << 1), but it and the rounding to 1.0f0 seem to be good enough reasons to generate Float32 in other ways.

I assume float division wasn’t workable in other ways?

There is an instruction for that, it’s trailing_zeros / tzcnt.
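
For example, the bit-by-bit exponent search in the sketch above collapses into a single count (my own sketch; as far as I know, the trailing-zeros count compiles to a tzcnt-style instruction on x86):

# Same scheme as the loop-based sketch above, with the 9-iteration search replaced by
# one trailing-ones count (trailing zeros of the complement).
function rand_float32_refined_tzcnt(u::UInt32)
    mantissa = u & 0x007fffff
    k = trailing_zeros(~(u >> 23))          # run of 1-bits among the 9 spare bits
    k >= 9 && return ldexp(Float32(mantissa), -32)
    return ldexp(Float32(0x00800000 | mantissa), -24 - k)
end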

In simd context, i.e. for bulk generation, there are various okayish ways of doing that. One could either improve the distribution, by better using the random bits we have, or get the correct distribution with very unlikely branches. Some people in the old thread discovered that various avx versions could use int->float conversion instructions as a simd ersatz lzcnt instruction.

But getting the “right” distribution, such that the output of rand(Float32) becomes practically indistinguishable from convert(Float32, rand(Float128)), is only one part.

If we commit to always generating the right output distribution, then:

  1. All random code becomes slower
  2. We will need to write a lot of custom different functions for scalar / simd on aarch64, avx2, avx512, and multiple GPU
  3. Julia becomes a unicorn that does something completely different from the majority of the existing literature and scientific software. Hence, julia users who rely on the hypothetical “actual floating point” distribution are in for a bad time when porting to different languages.
  4. The “port alg to other language” thing also applies when using well-established libraries via FFI, like e.g. curand, or whatever apple ships.

I think it would be a good first step to offer a source of “actually floating point” numbers, then point that out in the documentation, and open PRs to use it where applicable (e.g. in Distributions.jl. Last time I reviewed, Distributions.jl uses Float64 internally, even when configured for Float32 output; so yes, many distributions have slightly inaccurate tail statistics, but that is probably invisible in practice).

I used to believe that making that the default would be a good idea – mitigate the bugs that are bound to happen because people write 1/rand().

By now, I am rather unsure: even if we fix this, 1/rand() would still be terrible code, because it would depend on julia idiosyncrasies. It would be more appropriate for people to be explicit and write 1/rand(Random.FloatOpen01()), because that explicitly marks the problem and protects people from even more insidious porting bugs (having incorrect tail statistics for a power-law distribution is bad; having the distribution correct in julia prototypes / on CPU / during testing, and then becoming subtly incorrect when deployed on GPU or ported to some micro-controller is much worse).

PS. “correct distribution” means: mantissa bits are all independent 0/1, and the exponent is exponentially distributed. An exponentially distributed exponent fits even in Float32, because the probability of an integer overflow in the exponent is so rare as to be “probably unobservable in human history” / more likely to be caused by a software or hardware fault.
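
As a rough scalar illustration of that definition (my own sketch; it rounds toward zero rather than to nearest and doesn’t treat subnormals carefully):

using Random

# Exponent: geometric, by counting leading zero bits of a conceptually infinite bit
# stream; mantissa: 23 fresh independent bits. Values below the Float32 range simply
# underflow to zero, which is the "unobservable in human history" case.
function rand_float32_geometric(rng::Random.AbstractRNG = Random.default_rng())
    g = 0
    while true
        word = rand(rng, UInt64)
        z = leading_zeros(word)
        g += z
        z < 64 && break                 # found the first 1-bit of the stream
    end
    e = -(g + 1)                         # the value lies in [2^e, 2^(e+1))
    mantissa = rand(rng, UInt32) & 0x007fffff
    return ldexp(Float32(0x00800000 | mantissa), e - 23)
end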

1 Like

Would it help for the analysis to have a table of the frequencies of every floating point number in [0,1) when generated with a random generator that systematically returns every integer?

What is the metric to compare all the generation schemes proposed so far?

Here’s a post from near the end of that megathread where I offered a SIMDing infinite-precision implementation that was only a few times slower than the (2-years ago) status quo (and had a plausible path to being only marginally slower).

Someone is free to take that work and make some PRs to Random. The main tasks (prototypes already in that post!) would be to:

  • Make a block-RNG implementation that can be used to produce a SIMD vector of random bits on demand. Integrate it with Random with a nice API, possibly updating other SIMD generation functions to use it also (or rather, see the next sub-bullet, which should actually be the common part).
    • Make a (maybe-not-public) immutable form to avoid the linked-post-mentioned performance loss of a mutable version (which is basically already written inline in the SIMDing rand functions, if I recall correctly). This would probably nearly eliminate the performance gap from the status quo, as discussed in that post.
  • Make a SIMDing block-based infinite-precision rand function.
  • Make a scalar version (easy!).
  • Make sure these implementations are fast on different architectures.

mbauman also did some other work that someone could try to adapt.

2 Likes

First you need to define what a uniform distribution means in the first place. The best formal meaning I have for our current definition is:

rand(Float32) < p has probability p of being true for all p that are multiples of 2^{-24} in [0, 1]. Note that the Float32s in [0.5, 1] are the multiples of 2^{-24}. The exact inequality operator here matters, as does the precision of p. It matters how you use it!
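
A quick empirical check of that statement (my own sketch; p is deliberately chosen as an exact multiple of 2^{-24}):

# For p on the 2^-24 grid, the fraction of draws with rand(Float32) < p should
# converge to p itself, up to Monte Carlo noise.
p = 1_234_567 * 2.0f0^-24                                 # an exact grid point, ≈ 0.0736
frequency = count(rand(Float32) < p for _ in 1:10^7) / 10^7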

2 Likes