Output distribution of rand(Float32) and rand(Float64), thread 2

This is true for the default RNG, but the bits-to-float conversion function in Base is used by many RNGs in the ecosystem, including CUDA.jl’s native RNG, which produces 32 bits at a time for hardware-related reasons. That’s why I’d like to see how well one can do starting from a single UInt32. But perhaps the conclusion is that we’d be better off with separate conversion functions depending on the RNG’s native sampling type.

Those values would be rounded in exactly the same way if you did the calculation in full precision and rounded after. The only issue with my first version is that it rounds 2r before adding 1.

Here’s a version of (2r+1) / 2^{33} without the unnecessary rounding:

uint_to_float(u::UInt32) = ((1//2 + Float32(u & 0x0000ffff)) + Float32(u & 0xffff0000)) * Float32(exp2(-32))
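A quick check of the endpoints (with the default round-to-nearest mode):

julia> uint_to_float(0x00000000)  # smallest output: (2*0 + 1)/2^33 = 2^-33
1.1641532f-10

julia> uint_to_float(0xffffffff)  # largest: (2^33 - 1)/2^33, rounds up to 1
1.0f0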

Of course, facepalm moment! Your improved version definitely does the job for (0, 1] (with the obvious caveat that this straightforward conversion hasn’t been statistically scrutinized). But to get [0, 1) in a similar way, i.e., r / 2^{32} rounded down, it seems like the rounding mode of Float32 would have to be changed somehow?

Yes. Everything I wrote obviously uses the current rounding mode, which is usually round-to-nearest. It is presently very difficult (i.e., I don’t know how) to change the rounding mode for anything but BigFloat, probably because it is usually a task-global processor flag and thus tremendously prone to accidental mis-setting. But for “simple” operations like these, given the known integer inputs and the fact that only a single rounding operation is involved, it may be possible to emulate a different rounding mode (exactly or well enough) by simply applying an appropriate offset.
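For what it’s worth, BigFloat’s MPFR backend does accept a rounding mode via setrounding, so one can at least write a reference implementation of the round-down conversion. A sketch (the round_down_ref name is just for illustration); far too slow for rand itself, but maybe useful as a ground truth to test fast candidates against:

round_down_ref(u::UInt32) = setrounding(BigFloat, RoundDown) do
    # exact in BigFloat, then rounded toward -Inf in the Float32 conversion
    Float32(big(u) / big(2)^32)
end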

I definitely agree that getting the endpoints and their frequencies entirely correct is much easier if the rounding mode can be adjusted. In this case, round-down/round-to-zero seems desirable.

That said, it seems difficult to get the bias to zero for any half-open range. A closed range [0,1] would either need 0 and 1 to be “half” as likely as the others (maybe undesirable) or have the wrong covariance due to their over-representation (also maybe undesirable). An open range may be slightly nicer in this way but uglier in others and would still likely have a similar covariance mismatch. There don’t seem to be any “fantastic and obvious” solutions here. Floating point numbers just have too many idiosyncrasies to represent a “continuous” uniform distribution in a fully satisfying way.

So to answer my own question here: doing any sort of thresholding (like that Bernoulli example way up above) will “snap” its effective threshold to the next-highest representable value in the output. So, testing rand(Float32) < 1e-8 is effectively the same as thresholding against ceil(1e-8*2^24)/2^24 ≈ 5.96e-8 (there’s that factor of 6). This means any threshold x may be effectively inflated to as much as x + 2^{-24}, a relative error of up to \frac{2^{-24}}{x}: that’s ~6% for values on the order of 10^{-6}, 0.6% for 10^{-5}, 0.06% for 10^{-4}. Now we’re getting into very common territory.
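Concretely:

julia> ceil(1e-8 * 2^24) / 2^24  # effective threshold behind rand(Float32) < 1e-8
5.960464477539063e-8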

So that’s pretty conclusive to me that we definitely need to be doing better for rand(Float32). The exact same analysis holds for all those other “density” values: with the Float64 status quo’s step size of 2^{-53}, we’re talking about errors of ~0.0001% for thresholds around 10^{-10}.


This is already the case at the upper end using the conversion you came up with: 1f0 is half as likely as prevfloat(1f0). In fact, there is no UInt32 value that maps to 1.0 without rounding, but there are many that get rounded up to it, though only half as many as those that get rounded to prevfloat(1f0), which collects values from both sides. Near the bottom end, every possible value of (2r + 1) / 2^{33} can be represented exactly, so nothing gets rounded down to 0f0. Thus the resulting sample space approximates (0, 1].

This in effect demonstrates how, among half-open intervals, (0, 1] has less built-in bias than [0, 1) in a floating-point representation, but I guess that ship sailed a long time ago.

julia> uint_to_float(u::UInt32) = ((1//2 + Float32(u & 0x0000ffff)) + Float32(u & 0xffff0000)) * Float32(exp2(-32))
uint_to_float (generic function with 1 method)

julia> myrandf32() = uint_to_float(rand(UInt32))
myrandf32 (generic function with 1 method)

julia> function test_hits(x::Float32, N)
           hits = 0
           for _ in 1:N
               (myrandf32() == x) && (hits += 1)
           end
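           # expected probability: half the gap between x's Float32 neighbors, clamped to [0, 1]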
           bin = (min(nextfloat(x), one(x)) - max(prevfloat(x), zero(x))) / 2
           @show N, hits, hits / N, bin
           return nothing
       end
test_hits (generic function with 1 method)

julia> test_hits(1f0, 2^32)
(N, hits, hits / N, bin) = (4294967296, 131, 3.050081431865692e-8, 2.9802322f-8)

julia> test_hits(prevfloat(1f0), 2^32)
(N, hits, hits / N, bin) = (4294967296, 256, 5.960464477539063e-8, 5.9604645f-8)

With that proof-of-concept UInt32-based implementation out of the way and its basic properties described, we should probably heed @foobar_lv2’s and @mbauman’s pleas to focus more on the conceptual aspects here.

For me, the question now is: would jumping from 24 to 32 bits be enough to be worth doing something different for Float32? It’s an improvement, but using that Bernoulli example above, even with an “ideal” [0,1) implementation, you’d still see an error of up to 2% for thresholds on the order of 10^{-8}. Is that good enough to warrant a change? Maybe it would be if we can preserve a 2x speed difference over rand(Float64)… but that surely requires that simdy rounding-down multiplication that y’all were discussing.
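That 2% figure is just the ideal 32-bit step size relative to the threshold:

julia> 2.0^-32 / 1e-8  # worst-case relative error of an ideal 32-bit [0,1) at x ≈ 1e-8
0.023283064365386963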


I haven’t looked at the code, but could you just generate random Float32s by generating a 64-bit int, then taking the top 32 bits to generate a 32-bit float evenly spaced on (0,1) and the bottom 32 bits to generate a 32-bit float evenly spaced on, say, (0,2^-12)?

c = -1.0
while c < 0.0 || c >= 1.0
    i = rand(UInt64)
    a = (i >> 32) / 2^32      # top 32 bits, evenly spaced on [0, 1)
    b = mod(i, 2^32) / 2^32   # bottom 32 bits, evenly spaced on [0, 1)
    c = a + b / 2^12 - 1 / 2^13
end

Doesn’t xoshiro generate 64 bits at a time anyway?

(actually thinking about it, you need to do a little more arithmetic to get uniformity… the distribution will be the convolution of the two uniform distributions… which will be uniform from slightly above 0 to slightly below 1)

Given your analysis, it’s probably worthwhile to do better for RNGs that sample 64 bits at a time anyway. Maybe @foobar_lv2’s Float32(rand(Float64)) suggestion is indeed the way to go? Though as an occasional CUDA user, I’m still invested in the solution for 32-bit RNGs as well, and I’m not quite sure what I’d prefer between a conversion like the one discussed and occasionally pulling a second UInt32 to improve density even further.

And while cuRAND seems to be doing essentially the conversion described above, based on my experiments, the documentation leaves open the possibility that it may, now or in the future, pull extra bits as needed:

Distribution functions may use any number of unsigned integer values from a basic generator. The number of values consumed is not guaranteed to be fixed.

https://docs.nvidia.com/cuda/curand/device-api-overview.html#distributions

I don’t really feel comfortable taking a stance on how much precision a user needs in their rand outputs. If they need “a lot”, they shouldn’t be relying on the rand() < threshold idiom in the first place. The default rand() (i.e., rand(Float64)) seems to handle this adequately, with “reasonable” performance for threshold > 1e-12.

That is to say that I don’t hate the status quo. If a user goes out of their way to call rand(Float32) rather than rand(), then the consequences are on them.

That said, I’ve been trying and haven’t yet mustered any blocking objections to a XX-bit rand(FloatXX) implementation. If a particular RNG only produces YY-bit values, I could even be further persuaded to support YY-bit Float32 and Float16 (although Float16 can only utilize 14 or 24 bits, depending on whether we produce subnormals). Although I am hesitant to commit to these “wide rand” outputs because they seem detrimental to vectorized rand generation.


Well, that’s precisely why I think performance is now an important aspect of this discussion: the only reason I go out of my way to grab a Float32 is performance (perhaps on GPUs). Why should we provide rand(::Type{Float32}) = Float32(rand(Float64)) when the user can easily do that themselves? We should document the pitfalls and let the user beware.


I think it’s better to think of this idiom as a smoking gun for the more general issue of rare event statistics. It maps directly to the issue of power-law tails raised in the OP.


Good point! I guess the objection is that, if one decides on a rand(Float32) implementation that uses 32 bits the majority of the time but reaches for an extra 32 bits once in a blue moon (which shouldn’t affect performance much as long as the common branch is fast [EDIT: scratch that, I guess it would be hard to vectorize]), it just seems a bit wasteful if the underlying RNG actually generated 64 bits for each call, half of which were discarded. Hence the idea of having different conversion paths depending on the RNG. That’s not to say that the one for 64-bit RNGs should go via Float64.

But if the decision is for rand(Float32) to only do one generator call per value (assuming all RNGs are at least 32-bit), then I agree it makes sense to always use the 32-bit conversion and let any additional bits be discarded (with proper documentation).

Ok, I had some time to think about the problem and came up with this, though I don’t think it’s really “better” than Float32(rand(Float64)):

function rand32float()
    c = -1.0
    while c < 0.0 || c >= 1.0
        i = rand(UInt64)
        a = Float32(i >> 32) / 2^32
        b = Float32(mod(i, 2^32) / 2^32)
        c = (a + b / 2^12 - 1 / 2^12) / (1 - 1 / 2^12)
    end
    return c
end

EDIT: error, above I don’t preserve Float32 type, so this fixes it I think:

function rand32float()
    c = -1.0f0
    while c < 0.0f0 || c >= 1.0f0
        i = rand(UInt64)
        a = Float32(i >> 32) / 2^32
        b = Float32(mod(i, 2^32)) / 2^32
        c = (a + b / 2^12 - 1.0f0 / 2^12) / (1.0f0 - 1.0f0 / 2^12)
    end
    return c
end

Simple histograms suggest that I’m right, but I didn’t write a proof. (Sketch of one: s = a + b/2^{12} is the sum of a uniform on [0, 1) and a uniform on [0, 2^{-12}), so its density is trapezoidal and flat exactly on [2^{-12}, 1); the while loop rejects the two ramps, and the affine map rescales the flat part onto [0, 1), up to discretization.)

On my machine:

julia> @btime rand32float()
  4.969 ns (0 allocations: 0 bytes)
0.8600827395260989

julia> @btime Float32(rand(Float64))
  2.845 ns (0 allocations: 0 bytes)
0.4434461f0

EDIT: timing of the newer version:

julia> @btime rand32float()
  4.408 ns (0 allocations: 0 bytes)
0.06446942f0

julia> @btime Float32(rand(Float64))
  2.855 ns (0 allocations: 0 bytes)
0.5801686f0


Here’s a version using @mikmoore’s conversion up to twice, first to sample x \in (0, 1], and then, if there exist other representable values within (x - 2^{-33}, x + 2^{-33}], to subsample that interval.

The smallest element in this algorithm’s sample space is 2^{-65}, and the smallest spacing between elements is 2^{-64}.

Intuitively the reasoning seems statistically sound, but those are famous last words in RNG land. The issue of adapting this approach from (0, 1] to [0, 1) remains unresolved.

function uint_to_float(u::UInt32, w::Float32)
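    # both 16-bit halves convert to Float32 exactly; adding 0.5 first makes the
    # effective numerator the odd integer 2u + 1, rounded only once (in the sum)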
    num = (0.5f0 + Float32(u & 0x0000ffff)) + Float32(u & 0xffff0000)
    return num * (w * Float32(0x1.0p-32))
end

function randf32()
    u = rand(UInt32)
    x = uint_to_float(u, 1f0)
    if u < 0x00ffffff
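        # finer representable values exist near x, so draw 32 more bits
        # to subsample the interval (x - 2^-33, x + 2^-33]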
        dx = uint_to_float(rand(UInt32), Float32(0x1.0p-32))
        x = (x - Float32(0x1.0p-33)) + dx
    end
    return x
end

With the default CPU RNG, this is slower than Float32(rand(Float64)), even though the common case only calls the generator once. Perhaps there’s room for further optimization in uint_to_float? Anyway, it should be better suited to GPUs and 32-bit RNGs, though I suppose vectorization might be an issue.

julia> @btime randf32();
  4.723 ns (0 allocations: 0 bytes)

julia> @btime Float32(rand(Float64));
  3.443 ns (0 allocations: 0 bytes)

EDIT: Slight speedup from moving the comparison into the UInt32 domain. No qualitative change.

It should be mentioned that Float32(rand(Float64)) also needs proper scrutiny before it’s considered a viable candidate. Float32(x) rounds to nearest and breaks ties by rounding to even (i.e., least significant bit zero). This can lead to small-scale structure that’s especially pronounced in certain octaves, where even (odd) numbers are significantly oversampled (undersampled). I haven’t figured out whether this is in fact a problem when the domain before rounding is the sample space of rand(Float64). It would show up most clearly in regions where rounding ought to be an N-to-1 map for a relatively small N > 1, with N = 2 possibly being the worst case.

The function uint_to_float above avoids this issue due to its careful two-step rounding combined with the numerator always being odd, but it’s not at all obvious. A simple x = Float32(u) * Float32(0x1.0p-32) conversion (an obvious thing to try for sampling in [0, 1)) would produce structure for any u >= 2^24, and it would be especially bad for 2^24 <= u < 2^25, where even numbers would be oversampled by a ratio of 3:1 over odd numbers.
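A throwaway check making that 3:1 structure concrete, counting preimages of the naive conversion over the first ten thousand values of that range:

counts = Dict{Float32,Int}()
for u in UInt32(2)^24 : UInt32(2)^24 + UInt32(9999)
    x = Float32(u) * Float32(0x1.0p-32)  # naive round-to-nearest conversion
    counts[x] = get(counts, x, 0) + 1
end
# Half of the attained outputs are hit three times and half only once (odd
# multiples of 2^-32 in this range are never hit at all):
count(==(3), values(counts)), count(==(1), values(counts))  # ≈ (2500, 2500)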


Before people jump to conclusions about “this is a good algorithm” purely based on performance, I think it should be run through TestU01, the most widely used framework for statistical testing of PRNGs. Our Xoshiro at least passes these tests as far as I’m aware, so replacing the generation with something different ought to pass these statistical tests just the same. I’ve been meaning to wrap this in a JLL & package for a while, but never got around to doing that.
