Output distribution of rand(Float32) and rand(Float64), thread 2

Why not just do b = (i >> 32) + 1? It adds a 1/2^64 glitch around 1/2^32 but saves a conditional (maybe not so crucial for speed; needs measurement).
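
(For concreteness, a sketch of that variant applied to the rand32 code quoted just below; untested, and the function name is made up:)

function rand32_plus1()
    i = rand(UInt64)
    a = i & 0xffffffff
    if a == 0x0
        b = (i >> 32) + 1    # b can never be 0 now, so the inner check goes away
        return ldexp(Float32(b), -64)
    else
        return ldexp(Float32(a), -32)
    end
end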

1 Like

This is a good line of argument that I want to continue. Not in a position to do so right now but this feels like the right direction to discuss.

1 Like

@StefanKarpinski I agree, and it’s more or less the argument I thought was always being made, so I like that we’re converging on some common understanding.

Here’s some data on timing for my slightly modified code above:

julia> function rand32()
           i = rand(UInt64)
           a = i & 0xffffffff
           b = i >> 32
           if a == 0x0
               if b == 0; b = 1; end
               c = ldexp(Float32(b), -64)
           else
               c = ldexp(Float32(a), -32)
           end
           return c
       end
rand32 (generic function with 1 method)

julia> @btime(rand32())
  4.308 ns (0 allocations: 0 bytes)
0.4268289f0

julia> @btime(rand(UInt64))
  2.866 ns (0 allocations: 0 bytes)
0xbcb048922324e5f6

julia> 4.308/2.866
1.5031402651779482

So it’s about 1.5 times the cost of just generating the 64 bits, and it improves the left tail a LOT: the smallest nonzero value it can generate is ldexp(1.0f0, -64), compared to (I think) ldexp(1.0f0, -24) for the current code?

The smallest positive Float32, on the other hand, is only about 2.58e-26 times as big as ldexp(1.0f0, -64).
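
(For reference, the numbers behind that comparison:)

ldexp(1.0f0, -64)                     # ≈ 5.4e-20, smallest nonzero output of the code above
nextfloat(0.0f0)                      # ≈ 1.4e-45, smallest positive (subnormal) Float32
nextfloat(0.0f0) / ldexp(1.0f0, -64)  # ≈ 2.6e-26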

So we also have the option to extend the dynamic range even further: we could do something like ldexp(Float32(rand(UInt32)), -64-32) as a third step, and it would only get hit once every 2^64 calls.
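
(Roughly like this, as a third step tacked onto the same structure; hypothetical and untested:)

function rand32_3step()
    i = rand(UInt64)
    a = i & 0xffffffff
    a != 0 && return ldexp(Float32(a), -32)
    b = i >> 32
    b != 0 && return ldexp(Float32(b), -64)
    # third step, reached on average once every 2^64 calls
    return ldexp(Float32(rand(UInt32)), -64 - 32)
end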

How about:

function rand32_while()
    exponent = -32
    while ((a = rand(UInt32)) == 0)
        exponent -= 32
    end
    return ldexp(Float32(a), exponent)
end

(chance of 0.0 is around 2^(-128))

or same semantics but simpler to prove finite run-time:

function rand32_for()
    local exponent, a
    for outer exponent in -32:-32:-196
        (a = rand(UInt32)) == 0 || break
    end
    return ldexp(Float32(a), exponent)
end

Just from a speed perspective: my understanding was that the RNG generates 64 bits at a time, so I don’t know if we are throwing bits away and calling the generator more times than needed? Maybe it stores the unused bits and returns them quickly on the second call?

Note, maybe it doesn’t matter: we only make a second call once every 4 billion (2^32) calls, and a third call once every 2^64 calls.

Presumably a real implementation would eliminate the branching/loop anyway by getting all the random bits upfront?
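
(As a sketch of that idea, truncated to two stages: draw the 64 bits once and pick between the two candidates with ifelse, so there is no data-dependent branch. With only two stages, 0.0f0 becomes possible with probability about 2^-64; the function name is made up.)

function rand32_upfront()
    u = rand(UInt64)
    a = u % UInt32           # low 32 bits
    b = (u >>> 32) % UInt32  # high 32 bits
    # both candidates are computed; ifelse selects without branching
    return ifelse(a != 0, ldexp(Float32(a), -32), ldexp(Float32(b), -64))
end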

Which is better: avoiding a branch but calling the RNG two or three times as often as needed every time, or almost always avoiding the second RNG call? I guess it’s time to benchmark. I’m guessing the loop is probably faster, since it almost always runs once and the RNG won’t SIMD, per the previous post.

julia> @btime(rand32_for())
  4.067 ns (0 allocations: 0 bytes)
0.15846866f0

julia> @btime(rand(UInt64))
  2.875 ns (0 allocations: 0 bytes)
0x7ae0be730a4bf57b

julia> @btime(rand(UInt32))
  2.865 ns (0 allocations: 0 bytes)
0x43bc04e7

So, I love it. I also like the idea of doing the same thing for Float64, since it shouldn’t be the case that Float64 is more problematic than Float32.

julia> function rand64_for()
           local exponent, a
           for outer exponent in -64:-64:-256
               (a = rand(UInt64)) == 0 || break
           end
           return ldexp(Float64(a), exponent)
       end
rand64_for (generic function with 1 method)

julia> @btime(rand64_for())
  4.548 ns (0 allocations: 0 bytes)
0.26893953364756196

julia> @btime(rand(Float64))
  2.845 ns (0 allocations: 0 bytes)
0.8204255370807716

julia> 4.548/2.845
1.5985940246045693

This tries to measure the latency of a single call, but I guess it could be more interesting to benchmark applying rand to an array.

1 Like

There are really 3 distinct cases we care about (a rough benchmark sketch for each follows the list):

  1. Calling rand often, but in between there is something that the compiler cannot optimize through. In this case, the internal state of the RNG is read, modified, and written back for each call. The read/write probably doesn’t hit L1 cache and instead goes via store buffer.
  2. Calling rand in a tight loop that the compiler understands. The internal RNG state is read only once and written only once, otherwise it is kept in register. Register is faster than store buffer.
  3. Calling rand! on an array. This is a completely separate codepath and algorithm that uses SIMD.
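
For concreteness, the three cases could be probed roughly like this (a sketch only; the sizes are arbitrary, and case 1 is only approximated since the benchmarking harness adds its own loop):

using BenchmarkTools, Random

# 1. single calls in isolation (approximately; @btime still loops internally)
@btime rand(Float32);

# 2. many calls inside a loop the compiler can see through
function sum_rand(n)
    s = 0.0f0
    for _ in 1:n
        s += rand(Float32)
    end
    return s
end
@btime sum_rand(1000);

# 3. bulk generation with rand!, the separate SIMD code path
v = Vector{Float32}(undef, 1000)
@btime rand!($v);
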
3 Likes

So, to summarize: we have a simple candidate algorithm for rand() for both Float64 and Float32 which uses a finite loop, is guaranteed to terminate, and generates 0.0 with probability less than 2^-128 = 2.9e-39 for either type, which means a million-core, billion-Hz supercomputer would generate it less than once every 10^16 years in the absence of software bugs. It also has the CDF property that cdf(x) = x for values between 0.0 and 1.0, to within the capacity of the RNG to be fully uniform, and it costs about 1.5 times as much as generating a single UInt64.
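
(A quick sanity check of that arithmetic, assuming 10^6 cores each making 10^9 calls per second:)

p = 2.0^-128                               # ≈ 2.9e-39 chance of 0.0 per call
calls_per_year = 1e6 * 1e9 * 3.156e7       # cores × calls per second × seconds per year
years_per_zero = 1 / (p * calls_per_year)  # ≈ 1e16 years between expected zeros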

I consider this a job well done. What’s the consensus?

Pretty good.

Two points:

  1. There is a related conundrum I’ve encountered:
julia> @benchmark Float32(rand(UInt64))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  6.384 ns … 30.382 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.842 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.878 ns ±  0.384 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▂▅▆██▇▆▃
  ▂▂▂▂▃▃▄▆██████████▇▄▄▃▂▂▂▂▂▂▁▂▁▂▂▂▂▁▁▂▂▂▁▁▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂ ▃
  6.38 ns        Histogram: frequency by time        8.39 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark Float32(rand(UInt32))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.075 ns … 150.727 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.159 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.190 ns ±   1.488 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                           ▁ β–β–†β–‚β–†β–ƒβ–„β–ˆβ–‚β–‡β–‚β–ƒβ–‡β–ƒβ–ƒβ–‡β–β–„                 
  β–‚β–β–β–β–β–‚β–‚β–β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–ƒβ–ƒβ–ƒβ–„β–†β–…β–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–…β–„β–„β–„β–ƒβ–„β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚ β–„
  3.08 ns         Histogram: frequency by time        3.22 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

The two are identical in terms of the rand calculation, so why does the conversion cost more for UInt64? Looking at the generated code, there is an extra branch in the UInt64 case, but I’m not sure what it is for. Anyone? (See the snippet after the next point for one way to dump the generated code.)

  2. Officially I’m renaming Float16 to BraindeadFloat16. Float16 cannot hold 0xFFFF, which is the maximum UInt of the same size. IEEE made the wrong decisions and BFloat16 should replace it entirely. My theory for why Float16 is this way is that its 10 bits of significand reach 1024, which means it can be translated into 3 decimal digits. It looks good for all the decimal-system fans. Today humans are held in much less regard, and I think it wouldn’t have gone the same way.
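
Coming back to the first point, one way to dump and compare the generated code for the two conversions (just a suggestion; the output depends on the CPU and Julia version):

code_native(Float32, (UInt32,); debuginfo=:none)   # conversion from UInt32
code_native(Float32, (UInt64,); debuginfo=:none)   # conversion from UInt64; the extra branch reported above should show up here
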
1 Like

That’s true for any FP format, except possibly for a trivial FP format which only has a mantissa and is thus actually just an integer…? EDIT: OK, thanks for the clarification.

I mean 65535 cannot be represented as a float with an exponent! It is Inf!

julia> Float64(floatmax(Float16))
65504.0

julia> Float16(0xffff)
Inf16

or to spell it out: the exponent doesn’t have more than ~log2(#bits)+1 bits.

I was also thinking along these lines. However, for reasons discussed earlier, I’d rather throw away some bits than have implicit rounding, and I’d like to avoid branches, so here’s the way I would do it:

function randf32()
    u = rand(UInt64)
    a = Float32(u >>> 40) * Float32(0x1.0p-24)
    b = Float32((u >>> 16) & 0xffffff) * Float32(0x1.0p-48)
    return ifelse(a === 0f0, b, a)
end

This takes the top 24 bits to create a \in [0, 1) and the next 24 bits for b \in [0, 2^{-24}), returning b if a == 0 and a otherwise. This could of course be iterated further if desired.

Performance looks alright:

julia> @btime randf32();
  3.446 ns (0 allocations: 0 bytes)

julia> @btime rand(Float32);
  2.942 ns (0 allocations: 0 bytes)

What I don’t like so much about this approach is the step change in density at the chosen cutoff. The sample space remains as coarse as the current implementation down to 2^{-24} and is only refined for numbers smaller than that. I think we should aim to do better but also to avoid rounding and branches.

There is no good conversion instruction uint64->float32 in x86; there’s only one for signed integers. That’s why my proposal upthread did Float32(rand(UInt64)>>1).

On the upside, if you do use this, it has a vectorized version on AVX2.
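
(For reference, a guess at what that upthread proposal looks like written out in full; the 2^-63 scale factor here is my assumption, since only the conversion part is quoted above.)

randf32_signed() = Float32(rand(UInt64) >> 1) * Float32(0x1.0p-63)  # >> 1 keeps the value in signed-Int64 range, so x86 can use its signed conversion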

1 Like

In a scheme like this, you can also stagger the stages such that you don’t exhaust the dynamic range of one stage before switching to the next. Here’s a 3-stage version with no rounding that first samples in [0, 1) with resolution 2^{-24}, but if the outcome is < 2^{-8} resamples in [0, 2^{-8}) with resolution 2^{-32}, and if that outcome is < 2^{-16} resamples in [0, 2^{-16}) with resolution 2^{-40}.

function randf32()
    _24BITS = 0x0000000000ffffff
    _2POW16 = 0x0000000000010000
    u = rand(UInt64)       
    u1 = u >>> 40
    u2 = (u >>> 32) & _24BITS
    u3 = (u >>> 24) & _24BITS
    x = Float32(u3) * Float32(0x1.0p-40)
    x = ifelse(u2 < _2POW16, x, Float32(u2) * Float32(0x1.0p-32))
    x = ifelse(u1 < _2POW16, x, Float32(u1) * Float32(0x1.0p-24))
    return x
end

julia> @btime randf32();
  3.894 ns (0 allocations: 0 bytes)

julia> @btime rand(Float32);
  2.937 ns (0 allocations: 0 bytes)

It’s possible to iterate three more times and get down to a resolution of 2^{-64} before exhausting a UInt64, but performance suffers since the algorithm computes every stage unconditionally by design.

In general, the stage overlap can be adjusted to trade resolution near zero for resolution near the transition points.
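
For what it’s worth, here is a sketch of what the full six-stage cascade might look like (untested and not benchmarked; the function name is made up):

function randf32_6stage()
    _24BITS = 0x0000000000ffffff
    _2POW16 = 0x0000000000010000
    u = rand(UInt64)
    # deepest stage: lowest 24 bits, resolution 2^-64
    x = Float32(u & _24BITS) * Float32(0x1.0p-64)
    # walk outward through the remaining five stages, staggered by 8 bits
    for (shift, scale) in ((8, Float32(0x1.0p-56)), (16, Float32(0x1.0p-48)),
                           (24, Float32(0x1.0p-40)), (32, Float32(0x1.0p-32)),
                           (40, Float32(0x1.0p-24)))
        uk = (u >>> shift) & _24BITS
        x = ifelse(uk < _2POW16, x, Float32(uk) * scale)
    end
    return x
end

With all six stages the smallest nonzero output would be 2^-64, still with no rounding and no data-dependent branches.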

1 Like

Threadmates, remind me what was wrong with:

randf32() = abs(ldexp(Float32(rand(Int64)),-63))

?

ADDED: Seems to me, the only problem is:

rand(r::Union{TaskLocalRNG, Xoshiro}, ::SamplerTrivial{CloseOpen01{Float16}}) =
    Float16(Float32(rand(r, UInt16) >>> 5) * Float32(0x1.0p-11))

rand(r::Union{TaskLocalRNG, Xoshiro}, ::SamplerTrivial{CloseOpen01{Float32}}) =
    Float32(rand(r, UInt32) >>> 8) * Float32(0x1.0p-24)

in Xoshiro.jl, which doesn’t use the full 64 bits to generate smaller floats. That should indeed be fixed (at least in an RNG which generates 64-bit chunks anyway, so no tradeoff is involved).

It rounds to even for values of Int64 > 2^24, so not only does it return 1.0, but floats with an even significand in the 2^24 octave are overrepresented by 3x (and in each octave thereafter by ((N+2)/N)x).
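
(A quick way to see which numbers are affected: count how many integers just above 2^24 land on each Float32 after conversion. Deterministic, no RNG involved:)

counts = Dict{Float32,Int}()
for n in (2^24 + 1):(2^24 + 16)
    counts[Float32(n)] = get(counts, Float32(n), 0) + 1
end
foreach(println, sort!(collect(counts); by=first))
# Floats whose significand is even collect 3 integers (both neighbouring ties round
# toward them); floats with an odd significand collect only 1. The endpoints of the
# small range above are truncated, so their counts come out lower.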

1 Like

But it rounds to nearest first. So which specific numbers are 3x more common? When using 63 bits, is this rounding issue only with numbers ~ 2^(-40)?