Output distribution of rand(Float32) and rand(Float64), thread 2

Ok, please clarify two points:

  1. The uniform distribution you want on Float32 is the one where each x has probability ~ nextfloat(x)-x – i.e. exact 1/2 has a larger probability than exact 1/8, right? I.e. the distribution coming from the real line, the distribution we’re talking about in this thread, and not the one coming from “each distinct float has same probability”?
  2. With “fewer bits” you mean “fewer bits on average”, i.e. fewer bits/float if you want to generate 10000 floats?

*gush* This is gonna be so cool. Today you are one of the lucky ten thousand. Brace yourself: @dlakelan and I are running to the grocery store to fetch mentos and coke, and we will show you the beauty of variable-bitlength encoding.

Maybe I’m misunderstanding Dan’s algorithm or your objection, but one can use somewhat fewer bits than either of us have proposed to actually use. Commit S bits to the significand in any given octave. Now generate random bits one at a time until you generate a 1. The number of draws required determines the octave (1 draw for [0.5,1), 2 for [0.25,0.5), etc.). On average, this will require 2 draws. So one can generate uniform-distribution floating point numbers on the interval (0,1) for an average of S+2 bits per value (even for an infinite number of potential octaves). If you give up at some point, call the bottom octave 0 to extend the support to [0,1).
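As a concrete illustration, here is a sketch of that scheme in Julia (purely illustrative, not proposed production code; `octave_rand32` and the `maxoctaves` cutoff are made-up names):

```julia
# Sketch of the variable-length scheme above, with S = 23 significand bits for
# Float32. Bits are drawn one at a time until the first 1 appears; the number
# of draws selects the octave, then 23 fresh bits fill the significand.
function octave_rand32(maxoctaves::Int = 64)
    for octave in 1:maxoctaves
        if rand(Bool)                      # first 1 after (octave - 1) zeros
            frac = rand(UInt32) >>> 9      # 23 random significand bits
            return (1.0f0 + frac * 2.0f0^-23) * 2.0f0^(-octave)
        end
    end
    return 0.0f0  # gave up: call the bottom octave 0, extending support to [0,1)
end
```

Octave k is hit with probability 2^-k, so the loop makes two draws on average and the whole thing consumes about 23 + 2 = 25 bits per value.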

That said, this algorithm requires conditionally generating additional bits (or blocks of bits, in practice). There is virtually no chance that this is performant. There is no way to SIMD this for vectorized generation of random numbers. This is why mbauman and I have been advocating for a fixed number of additional bits (providing only a limited range). Whatever that fixed number is, we anticipate it will ultimately be faster to acquire potentially-wasteful amounts of entropy and perform the calculation deterministically.

We already generate a UInt64 when building a rand(Float64), so we proposed to use the leftover 11 bits discarded by the old algorithm and call that a worthy improvement. Of course, we could generate additional bits (probably in multiples of 64 at that point), but at some point we’re doing a lot of extra work to fill out a vanishing tail of the distribution. I think the current rand is fine and one that uses the currently-wasted bits is better (although we’ll need to see the performance cost). Something that gives up SIMD or requires significantly more bits than the type width is probably not something we need in our general-purpose rand.
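For reference, the fixed-width construction being discussed can be sketched like this (a simplified formulation of the 53-bit approach, not necessarily the exact code in Random; `standard_rand64` is a made-up name):

```julia
# Build a Float64 in [0, 1) from the top 53 bits of a UInt64; the bottom
# 11 bits are the "leftover" entropy the post proposes to put to use.
standard_rand64(u::UInt64) = Float64(u >>> 11) * 2.0^-53

standard_rand64(UInt64(0))        # 0.0
standard_rand64(typemax(UInt64))  # prevfloat(1.0), i.e. 1 - 2^-53
```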

  1. yes
  2. no, I was considering the scalar case and not the aggregate bit usage

I see where you’re coming from now. It’s hard to see through the noise here sometimes; thanks for clarifying and sorry for the mistaken impression.

I’m fully on board with @mikmoore’s assertions that — while possible — this almost certainly won’t be practical due to performance considerations. And I have a strong bias towards algorithms that we can prove correct, especially since there’ve already been algorithms proposed that have been proven incorrect.


Oh, no happy 10000 moment after all. Damn. But compression via variable length encoding is seriously cool as a concept.

Could you re-read my post from above? I think that got lost in the noise.

The entire variable-length business (where 32 bits are not always enough, even though only ~25 bits of entropy are needed on average) is fundamentally about the exponent, which follows a geometric distribution.

That post above computes how many bits we need to allocate in the scalar case for what level of fidelity. It is a hard bound coming from information theory.
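To spell out the bit-count: the octave/exponent follows a geometric distribution with p = 1/2, so the expected number of exponent draws works out as follows (a quick check, assuming 23 significand bits for Float32):

```julia
# E[draws] = sum_k k * 2^-k = 2, so a Float32 costs 23 + 2 = 25 bits on
# average, even though the worst case is unbounded.
avg_draws = sum(k * 2.0^-k for k in 1:64)  # ≈ 2.0
avg_bits  = 23 + avg_draws                 # ≈ 25.0
```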

This is the question I’ve been trying to ask since laying out the spectrum of possibilities. The answer will be one of real performance tradeoffs — which is precisely why I’ve not been giving any real attention to the variable length loopy things in this thread.


If we do rand(UInt64)/2^64, the smallest number larger than 0 that can occur is 1/2^64, but the next smallest is 2/2^64. These are obviously extremely small values and I don’t think anyone should complain about this in “normal” usage (i.e., if you need extremely, extremely small probability events to be distinguished, you should write your own RNG).

Float32(rand(Float64)) still has 1/2^64 as its smallest increment of probability, right? Since 1/2^64 is representable as a Float32.

At 5 ns per generation, an event of probability 2^-64 takes place about once every 3000 years of computation (if you’re just computing random floats and nothing else).

My suggestion is our resolution of small events should be something that takes more like 1 year to run at 5ns per generation. If you’re planning to run 1 year of continuous RNG generations you should certainly be vetting the RNG yourself.

If I’m calculating that right, it’s around 1/2^52 so anything that allows Float32 to produce values smaller than 1/2^52 should be in our ballpark… so I think Float32(rand(Float64)) looks easy and attractive.
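The arithmetic behind those figures (at the assumed 5 ns per draw):

```julia
secs_per_year = 365.25 * 24 * 3600
gens_per_sec  = 1 / 5e-9                 # 5 ns per generation

# A probability-2^-64 event occurs on average once per ~2900 years:
years_for_2to64 = 2.0^64 / gens_per_sec / secs_per_year

# One year of continuous generation produces about 2^52.5 draws,
# which is where the ~1/2^52 resolution target comes from:
log2_draws_per_year = log2(gens_per_sec * secs_per_year)
```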

It’s not correctly rounded and outputs 1f0s.

But the error in roundoff is below our threshold for caring, right? ~2^-52. We can solve the “outputs 1” by

function f32_rejection()
    c = 2.0f0
    while c >= 1.0f0
        c = Float32(rand(Float64))
    end
    return c
end

About once every 3000 years of continuous computation the while loop will go through twice. (EDIT: This is an exaggeration, but it will happen extremely rarely)

Right, so then it’s even slower yet. Tradeoffs, people. A better fixup is to map those 1s to 0s (edit: I think?); your solution has zeros as being half as likely as they should be. But it’s still papering over a rounding issue that makes even Float32s ever-so-slightly more likely in the output (sure, it’s on the order of 2^-29, but it’s the kind of thing an RNG could — and should — get right).

This is why I call this the work of magicians and have been (perhaps overly) critical here. There’s so many ways you can get RNGs wrong.


or, if you have 10k cores, every couple of weeks


Any Float64 greater than or equal to 1-2^{-25} will round to 1f0. So there are roughly 2^{53-25} = 2^{28} Float64 values that will round to 1f0. Since we sample the area near 1 densely in the Float64s, the probability that Float32(rand(Float64)) == 1f0 is a whopping 2^{-25}. Definitely not 2^{-53}.
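The boundary is easy to check directly (under the default round-to-nearest-even mode):

```julia
x = 1 - 2.0^-25          # the smallest Float64 that rounds up to 1f0
Float32(x)               # == 1.0f0 (the tie rounds to the even significand)
Float32(prevfloat(x))    # == prevfloat(1.0f0), just below the boundary
```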

This point aside, anything with a branch or loop is a performance killer, no matter how rarely it is used, because it is impossible to SIMD.


You’re not wrong! Maybe the real solution is to provide an implementation that throws an error message telling the user to choose one of several implementations explicitly.

Yeah, I realized this after I wrote it. thanks for that clarification.

and this is a good point.

a sufficiently smart computer could make it fast, but in practice it will be slow.

I don’t know much about SIMD but how does this kind of thing work?

function retried()
    a = try1()
    if a == 1        # unrolled retry #1
        a = try1()
    end
    if a == 1        # unrolled retry #2
        a = try1()
    end
    while a == 1     # fallback loop; vanishingly rare to enter
        a = try1()
    end
    return a
end

if a == 1 is already quite low probability, the probability of entering the loop is extremely low (like one in billions or trillions). I don’t know if SIMD can make the unrolled portion of the loop fast?

No. Not even these trailing_zeros versions are SIMD-friendly on my Apple M1. This leads to the following performance for the functions below: our 32-bit versions are competitive in the scalar case, but are 4.5x slower over arrays. Note that very special care has been taken to make our current rand SIMD-y (and even its trivial implementation doesn’t SIMD by default). I don’t have the chops to write those LLVM SIMD intrinsics, and it’s not even clear to me which architectures would support such a thing.

the code
julia> builtin_rand32() = rand(Float32)
builtin_rand32 (generic function with 1 method)

julia> @inline function mbauman_rand32()
           r = rand(UInt32)
           exponent = trailing_zeros(r << 1) # 1 - 32
           exp_bits = UInt32(127-exponent)<<23
           fraction = (r ⊻ (UInt32(1) << (exponent-1)))>>9
           return reinterpret(Float32, exp_bits | fraction) * !iszero(r)
       end
mbauman_rand32 (generic function with 1 method)

julia> @inline function mikmoore_rand32()
           U = UInt32
           u = rand(U)
           F = Base.floattype(U)
           bitwidth = 8 * sizeof(U) # bits in type U
           tokeep = Base.significand_bits(F) + 1 # keep this many significant bits
           mask = typemax(U) << (bitwidth - tokeep) # mask to retain `tokeep` MSBs
           u &= mask >>> leading_zeros(u) # for each leading zero we can accommodate a lower bit
           return F(u) * F(exp2(-bitwidth)) # lossless float conversion and scaling
       end
mikmoore_rand32 (generic function with 1 method)

julia> @btime builtin_rand32();
  3.000 ns (0 allocations: 0 bytes)

julia> @btime mbauman_rand32();
  3.042 ns (0 allocations: 0 bytes)

julia> @btime mikmoore_rand32();
  3.041 ns (0 allocations: 0 bytes)

julia> function r!(f, A)
           @inbounds @simd for i in eachindex(A)
               A[i] = f()
           end
           return A
       end
r! (generic function with 2 methods)

julia> @btime r!(builtin_rand32, $(Array{Float32}(undef, 10^6)));
  1.179 ms (0 allocations: 0 bytes)

julia> @btime r!(mbauman_rand32, $(Array{Float32}(undef, 10^6)));
  1.622 ms (0 allocations: 0 bytes)

julia> @btime r!(mikmoore_rand32, $(Array{Float32}(undef, 10^6)));
  1.444 ms (0 allocations: 0 bytes)

julia> @btime rand!($(Array{Float32}(undef, 10^6)));
  329.333 μs (0 allocations: 0 bytes)

Yes, in examining some of these implementations (and eventually leading_zeros in isolation), I realized that my computer (and perhaps all of x86) does not have a vectorized leading_zeros instruction. That will make any of these schemes extremely difficult to vectorize. It would probably be easy enough to have a sequence of unrolled comparisons for a small number of octaves, but even that may not be great.

I do wonder whether LLVM is actually doing a good job here. It appears to synthesize a vectorized leading_zeros from many other instructions, but I’m not convinced it’s actually better than just doing that part of the computation devectorized. Compare

julia> code_native(x->map(leading_zeros,x), (NTuple{1,UInt32},); debuginfo=:none)

julia> code_native(x->map(leading_zeros,x), (NTuple{4,UInt32},); debuginfo=:none)

The 1x version uses 5 instructions (only 1 or 2 of which aren’t boilerplate), while the 4x one uses 30; upping this to 8x yields 51. But compilers are usually pretty smart, so maybe this is still better than scalarized code?

Changing the floating point rounding mode might be the only potentially performant way to do this. Or maybe a wizard can make something with leading_zeros performant, but it probably isn’t me.

Control flow (if, while, and some other constructs) is not generally SIMD-able. Sometimes, trivial branches can be written using conditional moves (“pick one of these two values based on this boolean”), but they involve evaluating both possibilities so are only really suitable for selecting among a small number of computed values.
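In Julia this distinction shows up as `ifelse` versus `?:`. A minimal illustration (the function names are made up): `ifelse` unconditionally evaluates both arguments and then selects, which is exactly the conditional-move shape, while the ternary is control flow (though the compiler can often convert simple cases itself):

```julia
branchy(x) = x < 0.5f0 ? 2f0 * x : 2f0 * x - 1f0       # control flow: a branch
blendy(x)  = ifelse(x < 0.5f0, 2f0 * x, 2f0 * x - 1f0)  # both arms computed, one selected
```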

I’m not really comfortable with code that can fail (no matter how vanishing the probability) for a simple operation like this where it isn’t intrinsically necessary. We’re powerless to stop cosmic rays, but we can make sure that we write sound algorithms that will never fail on their own merit (even in the hands of an adversarial RNG).

I think it’s also ok if it “never fails” in the cryptographic / negligible sense. Because even some well-vetted encryption schemes have (known) non-zero decryption failure probabilities (in the 100s of bits range)

the problem is that 52 bits is not even close to negligible. that’s brute-forceable by a medium sized cluster


No one should be using these RNGs for crypto.

How about something like

function float32rand()
    i = rand(UInt64)
    a = mod(i, 2^20)
    b = Float32(i >> 20)

    if a == 0 # 1 in 2^20 chance
        b = b * 2.0f0^-64 # uniform on [0, 2^-20); note: a literal 2^64 overflows Int64
    else
        f220 = 1.0f0 / 2^20
        b = b / 2^44 # uniform on the [0,1) grid
        b = b * (prevfloat(1.0f0) - f220) + f220 # rescale to the upper "register"
    end
    return b
end

It either creates a float on [0, 2^-20) (probability 1/2^20) from equally spaced values on a fine grid, or a float on [2^-20, prevfloat(1.0)] from a coarser grid.

It can’t generate 1.0f0, and its smallest nonzero values are on the order of 2^-64.

Its one branch depends only on comparing a number to 0, so if you can take both branches simultaneously and then choose based on a == 0, that should be SIMD-worthy?

julia> @btime float32rand()
  3.596 ns (0 allocations: 0 bytes)
0.6185959f0

julia> @btime Float32(rand(Float64))
  2.845 ns (0 allocations: 0 bytes)
0.6428215f0
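Along those lines, a branch-free variant of the `float32rand` above could look like this (a sketch only, with both arms always computed and `ifelse` doing the select; `2.0f0^-64` is used for the fine-grid scale because the integer literal `2^64` overflows Int64):

```julia
function float32rand_branchless()
    i = rand(UInt64)
    a = i % 2^20                                        # low 20 bits pick the branch
    b = Float32(i >> 20)                                # high 44 bits of entropy
    f220 = 1.0f0 / 2^20
    lo = b * 2.0f0^-64                                  # fine grid on [0, 2^-20)
    hi = (b / 2^44) * (prevfloat(1.0f0) - f220) + f220  # coarse grid on [2^-20, 1)
    return ifelse(a == 0, lo, hi)                       # select without a branch
end
```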

So, just to give some inspiration.

The AMD library for GPU random uses this.

It generates floats excluding 0 and including 1.

The nvidia curand library documents

The curandGenerateUniform() function is used to generate uniformly distributed floating point values between 0.0 and 1.0, where 0.0 is excluded and 1.0 is included.

As far as I can tell, Cuda.jl uses curand if available. I guess they use the same algorithm as rocRand.

As such, I don’t think we should sweat too much about GPU – it already produces a quite different distribution.

========================

In some sense this is an API problem: Generating random floats including 0 and excluding 1 is bloody stupid, considering how floating point numbers work.

========================

I guess I can comment on that, given that I wrote the original implementation (under the chethega account on github; Jeff had to clean up my PR because I dropped out for personal reasons; apologies for that).

Xoshiro is of block design, not counter design. As such it is fundamentally impossible to SIMD via auto-vectorizer (there’s a reason that counter-based designs are becoming popular!).

So we don’t need to worry about SIMD for rand. Not possible anyways.

We do need to worry about SIMD for rand! on large arrays. The way we currently do this is to take our single RNG and use it to seed a vector of 8 independent Xoshiro RNGs. And now we go cranking out numbers into our large array! Once the rand! request is served, we discard our vector of Xoshiro RNGs. (I am quite proud of that design for serving bulk requests of random numbers!)
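As an illustration of that bulk design (a toy sketch only; the real implementation interleaves lanes in SIMD registers rather than a scalar loop, and `bulk_rand!` is a made-up name):

```julia
using Random

# Seed N independent child Xoshiro states from one parent RNG, then
# round-robin their outputs across the destination array.
function bulk_rand!(parent::AbstractRNG, A::AbstractVector{Float64}; N::Int = 8)
    rngs = [Xoshiro(rand(parent, UInt64)) for _ in 1:N]
    for i in eachindex(A)
        A[i] = rand(rngs[mod1(i, N)])   # lane (i mod N) produces element i
    end
    return A
end
```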

So we can use entirely different strategies for the SIMD and non-SIMD cases. But it would be very desirable if both produced the exact same distribution.

========================

If we want to use 64 bits, this is 100% free in the sequential (rand) case. For the bulk (simd / rand!) case it is not free, but probably affordable.

Temporarily changing the rounding mode to round-to-zero is probably OK for the SIMD case.

Touching the rounding is absolutely not OK for the sequential case.

I really don’t know how to switch rounding modes across different CPU architectures (the various ARM flavors? RISC-V? PowerPC? Is there MIPS support?). I could drop some inline assembly for x86 and maybe ARM, but I really don’t know the POWER or RISC-V architectures, and I have no idea how to do this portably.

========================

PS, regarding branching. We don’t care about rare branches in the sequential case.

If the branch is rare enough, we can maybe do that in the SIMD case – we’d need to stop all 8 lanes at the same time, do our rare fix-up, and then continue. We would need to look very carefully at the generated code to make sure that the branch induces no spilling to the stack on machines with 256-bit vectors and on ARM’s 128-bit vectors.


Yeah, this is what I called the “bermuda interval” – rare enough to be invisible for testing, frequent enough to spoil cluster runs if it triggers bugs / Inf / NaN.

Biases of that rarity are OK, though.

It is available in AVX-512, but is missing in AVX2.

We do have LLVM’s uitofp / the x86 vectorized CVTDQ2PS though – and as I mentioned, integer-to-float conversion includes counting of leading zeros.

So we could use a leading/trailing zero algorithm for single float generation and some uint-to-float thingy for SIMD.

We could also use some kind of entropy pooling. That is: A single float needs on average 25 bits, but there is a smallish-but-significant probability of needing more than 32 bits.

We always have one SIMD block of 512 bits and need to output 16 floats, which take on average 400 random bits. The probability that 512 bits are insufficient to generate 16 floats is astronomically small (thank god for the law of large numbers!).
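To put a number on “astronomically small” (under the assumption of 23 fixed significand bits plus one geometric exponent run per Float32): 16 floats consume 16 × 23 = 368 fixed bits, leaving 144 of the 512 bits for exponent runs, and the pool falls short exactly when 144 fair coin flips contain fewer than 16 ones:

```julia
# P(Binomial(144, 1/2) < 16), computed exactly with BigInt arithmetic.
p_short = sum(binomial(big(144), k) for k in 0:15) / big(2)^144
Float64(p_short)   # tiny: well below 1e-20
```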