Output distribution of rand(Float32) and rand(Float64), thread 2

One thing that I’ve found as I’ve continued to noodle on this here and there is how easy it is to get things subtly wrong, especially when chaining together many iterations of adding bits. That’s where it’s really nice to have algorithms that are optimizations of a simple and obviously correct (but slow) version.

I’ve also found it super useful to break things down into very small definitions that are easy to test. I’ve not finished this implementation yet, but I think this idea for “perfect precision” is looking promising — or at least should have useful components for rounding-down floating point bit twiddles.

Here’s the slow-but-obviously-correct premise:

# Bit width of an unsigned chunk (works on values and types: nbits(0x00) == nbits(UInt8) == 8)
nbits(x) = 8*sizeof(x)
# Rounding down an unsigned fixed-point [0,1) representation to float
# (big(2)^nbits(r) avoids overflowing Int for 64-bit chunks)
fixed2float(::Type{F}, r::Unsigned) where {F} = F(big(r)/big(2)^nbits(r), RoundDown)
# Divide a positive floating point value by 2^N (N > 0), rounding down
# (big(2.0)^-N avoids underflowing to zero for very large N)
÷₂ꜛ(value::F, N) where {F} = F(value * big(2.0)^-N, RoundDown)
# Add a `smaller` floating point value into `bigger` (where `smaller` is below the ULP of `bigger`), rounding down
+₎(bigger::F, smaller::F) where {F} = F(big(bigger) + big(smaller), RoundDown)
# Get the base-2 exponent of a value's epsilon
log2eps(value) = round(Int, log2(eps(value)), RoundDown)

# Iteratively getting to "perfect precision"
function randf(rng, ::Type{F}, ::Type{U}) where {F <: Base.IEEEFloat, U <: Unsigned}
    value = fixed2float(F, rand(rng, U))
    nb = 8*sizeof(U)
    while log2eps(value) <= -nb
        value = value +₎ (fixed2float(F, rand(rng, U)) ÷₂ꜛ nb)
        nb += 8*sizeof(U)
    end
    return value
end
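To make the sanity checking concrete, here’s one property-check sketch; it restates the definitions above (plus an `nbits` bit-width helper) so it can run as its own script, and the seed and draw count are arbitrary choices of mine:

```julia
using Random

# Restated from above so this runs standalone; `big` arithmetic keeps
# the intermediate fixed-point values exact before the final round-down.
nbits(x) = 8*sizeof(x)
fixed2float(::Type{F}, r::Unsigned) where {F} = F(big(r)/big(2)^nbits(r), RoundDown)
÷₂ꜛ(value::F, N) where {F} = F(value * big(2.0)^-N, RoundDown)
+₎(bigger::F, smaller::F) where {F} = F(big(bigger) + big(smaller), RoundDown)
log2eps(value) = round(Int, log2(eps(value)), RoundDown)

function randf(rng, ::Type{F}, ::Type{U}) where {F <: Base.IEEEFloat, U <: Unsigned}
    value = fixed2float(F, rand(rng, U))
    nb = 8*sizeof(U)
    while log2eps(value) <= -nb
        value = value +₎ (fixed2float(F, rand(rng, U)) ÷₂ꜛ nb)
        nb += 8*sizeof(U)
    end
    return value
end

# Endpoint checks: the conversion rounds down, so 1.0 is never produced
@assert fixed2float(Float16, 0x00) === Float16(0)
@assert fixed2float(Float16, 0x80) === Float16(0.5)
@assert fixed2float(Float16, 0xff) < one(Float16)

# Every draw lands in [0, 1), however many refinement rounds it took
rng = MersenneTwister(1234)
for _ in 1:10_000
    x = randf(rng, Float16, UInt8)
    @assert zero(Float16) <= x < one(Float16)
end
```

Swapping in `Float32, UInt32` (or any other pairing) exercises the same properties, which is what makes the slow version handy as an oracle.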

The huge benefit is that the above is the test suite. And it’s all super-generic; you can test iteratively getting every single Float16 value, three bytes at a time.

Anyhow, my current optimizations of the above are looking promising. Scalar performance is the easy part: it’s so rare to hit those branches that it’s basically an even match. I’ve only roughed out the SIMD implementation and haven’t tested it much, but it does the same iterative refinement, and I’m still just barely within that 2x slowdown mark for Float32 on my M1:

julia> @btime Random.rand(Float32);
  2.958 ns (0 allocations: 0 bytes)

julia> @btime ShawshankRandemption.rand(Float32);
  3.084 ns (0 allocations: 0 bytes)

julia> A = Array{Float32}(undef, 2^20);

julia> @btime Random.rand!($A);
  325.541 μs (0 allocations: 0 bytes)

julia> @btime ShawshankRandemption.rand!($A);
  645.125 μs (0 allocations: 0 bytes)