Why not just do b = (i >> 32) + 1? It adds a 1/2^64 glitch around 1/2^32 but saves a conditional (maybe not so crucial for speed; needs measurement).
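A minimal sketch of that branchless variant (rand32_nobranch is a hypothetical name; this only removes the nested conditional, nothing else):

```julia
# Hedged sketch: replace the inner b == 0 check by unconditionally adding 1,
# so b can never be zero. When i >> 32 == 0xffffffff, b becomes 2^32 and the
# result lands on 2^-32, which is the 1/2^64 glitch mentioned above.
function rand32_nobranch()
    i = rand(UInt64)
    a = i & 0xffffffff
    b = (i >> 32) + 1
    return a == 0 ? ldexp(Float32(b), -64) : ldexp(Float32(a), -32)
end
```

As with the rand32 below, Float32(a) still rounds values near 2^32, so this sketch shares that caveat.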
This is a good line of argument that I want to continue. I'm not in a position to do so right now, but this feels like the right direction to discuss.
@StefanKarpinski I agree, and it's more or less the argument I thought was always being made, so I like that we're converging on some common understanding.
Here's some data on timing for my slightly modified code above:
julia> function rand32()
           i = rand(UInt64)
           a = i & 0xffffffff
           b = i >> 32
           if a == 0x0
               if b == 0; b = 1; end
               c = ldexp(Float32(b), -64)
           else
               c = ldexp(Float32(a), -32)
           end
           return c
       end
rand32 (generic function with 1 method)
julia> @btime(rand32())
4.308 ns (0 allocations: 0 bytes)
0.4268289f0
julia> @btime(rand(UInt64))
2.866 ns (0 allocations: 0 bytes)
0xbcb048922324e5f6
julia> 4.308/2.866
1.5031402651779482
So it's about 1.5 times the cost of just generating the 64 bits, and it improves the left tail a LOT: the smallest nonzero value it can generate is ldexp(1.0f0, -64), compared to (I think) ldexp(1.0f0, -24) for the current code.
The smallest positive Float32, on the other hand, is about 2.58f-26 times as big as ldexp(1.0f0, -64).
So we also have the option to extend the dynamic range even further: we could do something like ldexp(Float32(rand(UInt32)), -64-32) as a third step, and it would only get hit once every 2^64 calls.
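Spelled out, that three-step version might look like this (a sketch; rand32_3step is a hypothetical name):

```julia
# Sketch of the three-step extension: fall back to a fresh 32-bit draw at
# scale 2^-96 only when both halves of the first UInt64 are zero, which
# happens once every 2^64 calls on average.
function rand32_3step()
    i = rand(UInt64)
    a = i & 0xffffffff
    b = i >> 32
    a != 0 && return ldexp(Float32(a), -32)
    b != 0 && return ldexp(Float32(b), -64)
    return ldexp(Float32(rand(UInt32)), -96)   # can still be 0.0, with probability 2^-96
end
```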
How about:
function rand32_while()
    local a   # declare in function scope so `a` survives the while loop's scope
    exponent = -32
    while (a = rand(UInt32)) == 0
        exponent -= 32
    end
    return ldexp(Float32(a), exponent)
end
(chance of 0.0 is around 2^(-128))
or same semantics but simpler to prove finite run-time:
function rand32_for()
    local exponent, a
    for outer exponent in -32:-32:-196
        (a = rand(UInt32)) == 0 || break
    end
    return ldexp(Float32(a), exponent)
end
Just from a speed perspective: my understanding was that the RNG generates 64 bits at a time, so I don't know if we are throwing those away and calling the generator more often than needed? Maybe it stores the unused bits and returns them quickly on the second call?
Note that it may not matter: we only make a second call once every 4 billion (2^32) calls, and a third call once every 2^64.
Presumably a real implementation would eliminate the branching/loop anyway by getting all the random bits upfront?
Which is better: avoiding a branch but calling the RNG two or three times as often as needed every time, or avoiding the second RNG call almost always? I guess it's time to benchmark. I'm guessing the loop is probably faster, since it almost always runs once and the RNG won't SIMD per the previous post.
julia> @btime(rand32_for())
4.067 ns (0 allocations: 0 bytes)
0.15846866f0
julia> @btime(rand(UInt64))
2.875 ns (0 allocations: 0 bytes)
0x7ae0be730a4bf57b
julia> @btime(rand(UInt32))
2.865 ns (0 allocations: 0 bytes)
0x43bc04e7
So, I love it. I also like the idea of doing the same thing for Float64, since it shouldn't be the case that Float64 is more problematic than Float32.
julia> function rand64_for()
           local exponent, a
           for outer exponent in -64:-64:-256
               (a = rand(UInt64)) == 0 || break
           end
           return ldexp(Float64(a), exponent)
       end
rand64_for (generic function with 1 method)
julia> @btime(rand64_for())
4.548 ns (0 allocations: 0 bytes)
0.26893953364756196
julia> @btime(rand(Float64))
2.845 ns (0 allocations: 0 bytes)
0.8204255370807716
julia> 4.548/2.845
1.5985940246045693
This tries to measure the latency of a single call, but I guess it could be more interesting to benchmark applying rand to an array.
There are really 3 distinct cases we care about:
- Calling rand often, but with something in between that the compiler cannot optimize through. In this case, the internal state of the RNG is read, modified, and written back on each call. The read/write probably doesn't hit L1 cache and instead goes via the store buffer.
- Calling rand in a tight loop that the compiler understands. The internal RNG state is read only once and written only once; otherwise it is kept in a register, which is faster than the store buffer.
- Calling rand! on an array. This is a completely separate codepath and algorithm that uses SIMD.
So, to summarize: we have a simple candidate algorithm for rand() for Float64 and Float32 which uses a finite loop, is guaranteed to terminate, and generates 0.0 with probability less than 2^-128 = 2.9e-39 for either type, which means a million-core, billion-Hz supercomputer generates it less than once every 10^16 years in the absence of software bugs. It also has the CDF property that cdf(x) = x for values between 0.0 and 1.0, to within the capacity of the RNG to be fully uniform, and it costs about 1.5 times as much as generating a single UInt64.
I consider this a job well done. What's the consensus?
Pretty good.
Two points:
- There is a related conundrum I've encountered:
julia> @benchmark Float32(rand(UInt64))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  6.384 ns … 30.382 ns  | GC (min … max): 0.00% … 0.00%
 Time  (median):     6.842 ns              | GC (median):    0.00%
 Time  (mean ± σ):   6.878 ns ±  0.384 ns  | GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted]
 6.38 ns    Histogram: frequency by time   8.39 ns <
 Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark Float32(rand(UInt32))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.075 ns … 150.727 ns  | GC (min … max): 0.00% … 0.00%
 Time  (median):     3.159 ns               | GC (median):    0.00%
 Time  (mean ± σ):   3.190 ns ±   1.488 ns  | GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted]
 3.08 ns    Histogram: frequency by time    3.22 ns <
 Memory estimate: 0 bytes, allocs estimate: 0.
The two are identical in terms of the rand calculation, so why does the conversion cost more for UInt64? Looking at the generated code, there is another branch in the UInt64 case, but I'm not sure what it is for. Anyone?
- Officially I'm renaming Float16 to BraindeadFloat16. Float16 cannot hold 0xFFFF, the maximum UInt of the same size. IEEE made the wrong decisions, and BFloat16 should replace it all. My theory for why Float16 is this way: its 10 significand bits mean it reaches 1024, so it can be translated into 3 decimal digits, which looks good to all the decimal-system fans. Today humans are held in much less regard, and I think it wouldn't have gone the same way.
That's true for any FP format, except possibly a trivial FP format which only has a mantissa and is thus actually just an integer…? EDIT: OK, thanks for the clarification.
I mean that 65535 cannot be represented as a float with an exponent! It is Inf!
julia> Float64(floatmax(Float16))
65504.0
julia> Float16(0xffff)
Inf16
or to spell it out: the exponent doesn't have more than ~log2(#bits)+1 bits.
I was also thinking along these lines. However, for reasons discussed earlier, I'd rather throw away some bits than have implicit rounding, and I'd like to avoid branches, so here's the way I would do it:
function randf32()
    u = rand(UInt64)
    a = Float32(u >>> 40) * Float32(0x1.0p-24)
    b = Float32((u >>> 16) & 0xffffff) * Float32(0x1.0p-48)
    return ifelse(a === 0f0, b, a)
end
This takes the top 24 bits to create a ∈ [0, 1) and the next 24 bits for b ∈ [0, 2^-24), returning b if a == 0 and a otherwise. This could of course be iterated further if desired.
Performance looks alright:
julia> @btime randf32();
3.446 ns (0 allocations: 0 bytes)
julia> @btime rand(Float32);
2.942 ns (0 allocations: 0 bytes)
What I don't like so much about this approach is the step change in density at the chosen cutoff. The sample space remains as coarse as the current implementation down to 2^-24 and is only refined for numbers smaller than that. I think we should aim to do better, while still avoiding rounding and branches.
There is no good uint64->float32 conversion instruction on x86; there's only one for signed integers. That's why my proposal upthread did Float32(rand(UInt64) >> 1).
On the upside, if you do use this, it has a vectorized version on AVX2.
In a scheme like this, you can also stagger the stages so that you don't exhaust the dynamic range of one stage before switching to the next. Here's a 3-stage version with no rounding that first samples in [0, 1) with resolution 2^-24; if the outcome is < 2^-8, it resamples in [0, 2^-8) with resolution 2^-32, and if that outcome is < 2^-16, it resamples in [0, 2^-16) with resolution 2^-40.
function randf32()
    _24BITS = 0x0000000000ffffff
    _2POW16 = 0x0000000000010000
    u = rand(UInt64)
    u1 = u >>> 40
    u2 = (u >>> 32) & _24BITS
    u3 = (u >>> 24) & _24BITS
    x = Float32(u3) * Float32(0x1.0p-40)
    x = ifelse(u2 < _2POW16, x, Float32(u2) * Float32(0x1.0p-32))
    x = ifelse(u1 < _2POW16, x, Float32(u1) * Float32(0x1.0p-24))
    return x
end
julia> @btime randf32();
3.894 ns (0 allocations: 0 bytes)
julia> @btime rand(Float32);
2.937 ns (0 allocations: 0 bytes)
It's possible to iterate three more times and get down to a resolution of 2^-64 before exhausting a UInt64, but performance suffers since the algorithm takes all branches by design.
In general, the stage overlap can be adjusted to trade resolution near zero for resolution near the transition points.
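For reference, the fully iterated six-stage version might look like this (a sketch; randf32_full is a hypothetical name, staggering by 8 bits per stage down to a resolution of 2^-64):

```julia
# Sketch: six staggered 24-bit stages from one UInt64, overlapping by 8 bits.
# Stage k uses bits (64 - 24 - 8(k-1)) and up; the deepest stage is computed
# first and overwritten by any shallower stage whose draw is >= 2^16.
function randf32_full()
    _24BITS = 0x0000000000ffffff
    _2POW16 = 0x0000000000010000
    u  = rand(UInt64)
    x  = Float32(u & _24BITS) * Float32(0x1.0p-64)                  # stage 6
    u5 = (u >>> 8)  & _24BITS
    x  = ifelse(u5 < _2POW16, x, Float32(u5) * Float32(0x1.0p-56))  # stage 5
    u4 = (u >>> 16) & _24BITS
    x  = ifelse(u4 < _2POW16, x, Float32(u4) * Float32(0x1.0p-48))  # stage 4
    u3 = (u >>> 24) & _24BITS
    x  = ifelse(u3 < _2POW16, x, Float32(u3) * Float32(0x1.0p-40))  # stage 3
    u2 = (u >>> 32) & _24BITS
    x  = ifelse(u2 < _2POW16, x, Float32(u2) * Float32(0x1.0p-32))  # stage 2
    u1 = u >>> 40
    return ifelse(u1 < _2POW16, x, Float32(u1) * Float32(0x1.0p-24)) # stage 1
end
```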
Threadmates, remind me what was wrong with:
randf32() = abs(ldexp(Float32(rand(Int64)),-63))
?
ADDED: Seems to me the only problem is:
rand(r::Union{TaskLocalRNG, Xoshiro}, ::SamplerTrivial{CloseOpen01{Float16}}) =
    Float16(Float32(rand(r, UInt16) >>> 5) * Float32(0x1.0p-11))
rand(r::Union{TaskLocalRNG, Xoshiro}, ::SamplerTrivial{CloseOpen01{Float32}}) =
    Float32(rand(r, UInt32) >>> 8) * Float32(0x1.0p-24)
in Xoshiro.jl, which doesn't use the full 64 bits to generate smaller floats; that should indeed be fixed (at least in an RNG which generates 64-bit chunks anyway, so no tradeoff is relevant).
It rounds to even for values of Int64 > 2^24, so not only does it return 1.0, but even-significand floats in the 2^24 octave are overrepresented by 3x (and each octave thereafter by (N+2)/N).
But it rounds to nearest first. So which specific numbers are 3x more common? When using 63 bits, is this rounding issue only with numbers ~ 2^(-40)?
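The tie-breaking can be checked directly. A quick, purely illustrative count of integer preimages under Float32 conversion (round-to-nearest-even) shows even-significand floats in the 2^24 octave absorbing three integers each while odd-significand ones absorb only one:

```julia
# In [2^24, 2^25) the Float32 spacing is 2, so odd integers are exact ties.
# Round-to-nearest-even sends both odd neighbors of an even-significand float to it.
preimages(f, lo, hi) = count(v -> Float32(v) == f, lo:hi)

n = 2^24
preimages(Float32(n + 4), n, n + 8)   # 16777219, 16777220, 16777221 all map here: 3
preimages(Float32(n + 2), n, n + 8)   # only 16777218 maps here: 1
```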