# Output distribution of rand(Float32) and rand(Float64), thread 2

It occurs to me that the relative cdf error argument is also implicitly related to why we’d want to sample more densely at the lower end of [0,1) at all. If all we care about is maximum absolute cdf error, then why would we care about sampling more finely than `eps(0.5)` since that’s the finest we can sample at the top of the interval? Partly, it’s appealing just because we can do better relatively easily and bit twiddling is fun.

The real problem that we keep coming back to, however, is the fact that `eps(Float32(0.5))` is pretty big (relative to what?)—around `6e-8`. But we don’t have an issue with that at the high end of the interval, so why is it a problem at the low end? In principle you can’t make an argument that distinguishes these based solely on absolute cdf error, so by even admitting that we should sample more finely at the lower end of the interval, we are implicitly admitting that something besides absolute cdf error must matter. And all the arguments seem to boil down to picking a very small `x` and demonstrating that `P(rand() < x)` is way too big relative to `x`. Which is precisely the relative cdf argument.

So I think that if we’re conceding that we need to do something different for `rand(Float32)` in the first place (which we are), then we need consider why we’re conceding that and what it says about our goodness criteria for `rand`.

3 Likes

Awesome, that’s what I was looking for — both the algorithm foundation and the theoretical motivation. I see you.

On the algorithm, I do really like having a monotonic transform function if possible — it’s eminently more testable. If we can get something working there, that’d be awesome. I think that’s what @mikmoore’s versions did above.

On the “de-biasing” motivation, I’m not as convinced. But I don’t yet have a coherent counter argument beyond the fact that it’s a complicated description of what rand means, so if we’re gonna do something complicated let’s use the simple description (but slightly fiddly algorithm) and just support returning all values `x` with probability `eps(x)`. And that directly answers the parenthetical question you just posited (and then edited out, hah) — `eps(T)/2` is pretty big relative to `eps(x)`!

I truly think “perfection” can be done at near-zero performance cost over my current implementation. The fiddly part that I was referring to is that you must be careful around over-saturating 0… and getting the algorithms written down in a way that it’s obviously correct and testable is tricky. But maybe that’s worth it.

The part that makes it SIMD-able is that you just check each batch of generated values at once; it’s unlikely enough that it’s not a huge cost. I already do one iteration of it for Float32. For Float64, If you’re in `0 < f < 2^-11`, you just add the bits directly to the significand. And if you win the Powerball and get a 0, you re-do your `Float64(rand(UInt64)/big(2)^64, RoundDown)` and multiply by `0x1p-64`. The fiddly part is when you get close to an exponent of `significand_bits(T)+exponent_bias(T)`, because then you need to multiply by a smaller exponent lest you generate zeros with a probability of 8e-320 instead of the correct 5e-324 (). (Edit: actually, doing this as multiplication like I described here would be ok, but for performance I’m just subtracting 64 from the exponent… which then does make this is a problem).

3 Likes

Yes, generating all possible values certainly addresses the issue. And the clever approach of deciding if you need more precision as a batch does allow it to vectorize, which may make the performance good enough. Maybe that’s the best solution.

I edited the parenthetical back in (it felt like it was just confusing the argument, but apparently not).

Monotonicity is really nice, but if it’s faster to do something non-monotonic we should. One approach is to do a little invertible preamble bit twiddling to change the function to be monotonic and then test that version. Then you can just drop the invertible preamble and what remains is guaranteed to have the right distribution without being monotonic by itself.

3 Likes

I’d like to re-inject the consideration of GPU.

A batchwise-branching-SIMD approach should work very well on modern CPU and may turn out fast enough to become default `rand`.

But it won’t work as well on GPU

So are we in consensus about the following?

1. GPU gets as default something based on UInt64->Float64 and UInt32->Float32 (hopefully better than the current 24 bit Float32 construction!). Call it `fastrand` for now.

2. CPU gets access to the same (for testing/compat and cases where speed matters).

3. CPU and GPU both get access to an ideal (“perfect” but branchy) generator, for cases where this matters (like users who want to apply an inverse cdf with singularity at zero). Call it `idealrand`.

4. CPU will default to either `fastrand` or `idealrand`, depending on tradeoffs and discussion – consensus on that is not yet reached, but will be informed by e.g. @mbauman 's code that will hopefully demonstrate that idealrand can be quite fast with SIMD, and will also be informed by GPU people who can tell us how bad GPU-CPU divergence would be.

Why would we want or need to use the same algorithm on the CPU as the GPU? There are GPU specialized random number generators that can be used.

1 Like

It’s not so much about the algorithm as the definition and meaning. If `rand()` means rounding down to the previous floating point value on CPU, but rounding down to the previous multiple of `eps(T)/2` on GPU, is that ok? Or should that also change? GPUs are more likely to use 32 or 16 bit floats, too.

1 Like

Gotcha. I was under the impression though that the GPU folks were specifically not overriding regular calls to `rand` because of this, and instead ask that you use `CUDA.rand` to be clear that you’re using something with a different meaning and algorithm.

Not an expert on either side of this, but that distinction makes me feel it’s okay to design `rand` specifically around the CPU (though I must say that I’d be pretty unhappy with any changes that negatively impact SIMD)

Hah, I was under the same impression. This is wrong, though, and @maleadt corrected me in Kernel random numbers generation entropy / randomness issues

So CPU and GPU currently use different algorithms (xoshiro vs philox) to arrive at the same (questionable) distribution for unqualified `rand()`.

Divergence between distributions would make it harder to port code from CPU to GPU without possibly introducing very subtle accuracy bugs. Especially it would be harder to test code for correctness and accuracy on CPU before deploying to GPU, for people who are unaware of the entire issue.

This is a real problem!

But we must swallow some frog, we’re only haggling about which one is more palatable. I guess divergence is grudgingly acceptable. But I’m not a GPU person, so my opinion on that is not qualified enough to be taken seriously.

(the two frogs would be: Throw the `1/rand(T)` people under the bus immediately on CPU, or wait until they have tested their algorithms to satisfaction and deploy to GPU to present them a little surprise debugging puzzle)

Why not? I’ve been thinking that SIMT on GPU is less fussy than SIMD on CPU. The threads that don’t enter the branch will be on hold for a couple of cycles, but then you just sync the threads and carry on. However, I’m no expert on this, I just remember that Tim had no objections to using rejection sampling in CUDA.jl’s implementation of Box-Muller: Overflow in `randn` using CUDA.jl's native RNG · Issue #1464 · JuliaGPU/CUDA.jl · GitHub

What GPU considerations might inform is whether the raw data for `rand(Float32)` should be 32 or 64 bits. However, GPU libraries could always implement their own transforms if the fallback ones are undesirable.

2 Likes

I agree; it seems likely that an algorithm that can be implemented as efficient SIMD code can also be implemented efficiently on a GPU.

3 Likes

I think I’m almost there.

Here is some SIMDable code for generating infinite-precision round-down floats. It uses a “significand-exponent” algorithm like I described far above, generating the significand and exponent separately (but with appropriate distributions) and then combining them to produce the result. By using batched rejection sampling, it manages to SIMD. I have verified that it produces the correct distribution on `Float16`, so if there are mistakes at other sizes they’re fixable.

float sampling
``````@inline randtup(rng::AbstractRNG, ::Type{T}, N::Val) where T = ntuple(_->rand(rng, T), N)

function block_randf_sigexp(rng::Random.AbstractRNG, ::Type{T}) where T<:Base.IEEEFloat
N = Val(32 ÷ sizeof(T)) # number of values to generate
U = Base.uinttype(T) # same-width Unsigned
estep = reinterpret(U, floatmin(T)) # least significant exponent bit
eincr1 = U(leading_zeros(sigmask) * estep) # first time we fail to produce a valid exponent, decrement by this
eincr = U(trailing_ones(sigmask) * estep) # additional times we fail to produce a valid exponent, decrement by this
ebase = reinterpret(U, T(1//2)) - eincr1 # exponent base that we start with - we'll offset upward from here
numscales,remscale = fldmod(ebase, eincr) # number of eincr steps to reach the bottom
rembits = fld(remscale, estep) # how many leftover bits to reach floatmin

# expoffset(u::U) where U = U(8sizeof(U) - leading_zeros(u)) * estep # valid for any input, but SIMDs poorly
expoffset(u::U) where U = reinterpret(U, convert(Base.floattype(U), u)) & ~sigmask - reinterpret(U, convert(Base.floattype(U), 1//2)) # good SIMD but only correct for 1 <= u <= maxintfloat

u = randtup(rng, U, N)
sig = u .& sigmask # we'll use this for the significand
uhi = u .>>> trailing_ones(sigmask) # get our remaining bits
ebits = ebase .+ expoffset.(uhi)
good = .!iszero.(uhi)
all(good) && return reinterpret.(T, ebits .| sig)
for _ in 1:numscales
ebase -= eincr
u = randtup(rng, U, N) .& sigmask # use to generate exponent
ebits = ifelse.(good, ebits, ebase .+ expoffset.(u))
good = good .| .!iszero.(u)
all(good) && return reinterpret.(T, ebits .| sig)
end
if iszero(rembits)
# move directly to subnormals
ebits = ifelse.(good, ebits, zero(U))
else
# still have space to fill down to subnormals
u = randtup(rng, U, N) .& remmask # use to generate exponent
ebits = ifelse.(good, ebits, ifelse.(iszero.(u), zero(U), expoffset.(u))) # finally need to handle iszero(u) correctly
end
return reinterpret.(T, ebits .| sig)
end
``````

Upthread, we found that `leading_zeros` did not SIMD well on all but the most recent x86’s, so this implementation doesn’t use it. It’s possible that we could do better still on machines with native SIMD `leading_zeros`. In any case, this one uses int->float conversions to do that part at the cost of restricting the input range (such that we can only use `significand_bits` bits per conversion instead of the full width). But the chances of redraws even at this rate are so vanishing that the waste is negligible.

I also took a crack at a SIMDing random number generator to feed this.

block RNG
``````using Random
using Random.XoshiroSimd: Xoshiro, forkRand, _plus, _rotl23, _shl17, _xor, _rotl45

mutable struct XoshiroBlock{N} <: Random.AbstractRNG
s0::NTuple{N,Core.VecElement{UInt64}}
s1::NTuple{N,Core.VecElement{UInt64}}
s2::NTuple{N,Core.VecElement{UInt64}}
s3::NTuple{N,Core.VecElement{UInt64}}
end

XoshiroBlock{N}(rng::Union{TaskLocalRNG, Xoshiro}) where N = XoshiroBlock{N}(forkRand(rng, Val(N))...)

@inline function Base.rand(rng::XoshiroBlock{N}) where {N}
# based on Random.XoshiroSimd.xoshiro_bulk_simd

res = _plus(_rotl23(_plus(s0,s3)),s0)
t = _shl17(s1)
s2 = _xor(s2, s0)
s3 = _xor(s3, s1)
s1 = _xor(s1, s2)
s0 = _xor(s0, s3)
s2 = _xor(s2, t)
s3 = _rotl45(s3)
rng.s0 = s0 # write
rng.s1 = s1 # write
rng.s2 = s2 # write
rng.s3 = s3 # write

return map(x->x.value, res) # result as a NTuple{N,UInt64}
end

@inline Base.rand(rng::XoshiroBlock, ::Type{T}) where T = @inline tuplecast(T, rand(rng))
@inline function randtup(rng::XoshiroBlock{N}, ::Type{T}, ::Val{M}) where {M,N,T}
M*sizeof(T) == N*sizeof(UInt64) || error("XoshiroBlock{\$N} will only generate NTuple{\$(N*sizeof(UInt64)÷sizeof(T))}")
return rand(rng,T)
end

for m in 0:4
u = 8*sizeof(UInt64)
M = 2^m
MVU = NTuple{M,Core.VecElement{UInt64}}
for T in Base.uniontypes(Base.BitUnsigned)
t = 8*sizeof(T)
N = fld(M*u, t)
if N > 0
NVT = NTuple{N,Core.VecElement{T}}
code = """
%res = bitcast <\$M x i\$u> %0 to <\$N x i\$t>
ret <\$N x i\$t> %res
"""
@eval @inline _tuplecast(::Type{\$T}, x::\$MVU) = Core.Intrinsics.llvmcall(\$code, \$NVT, Tuple{\$MVU}, x)
end
end
end

"""
tuplecast(T::Type, x::NTuple{N,U})

Reinterpret the bits of `x` as a NTuple{M, T} of matching size.
Currently only supports `U == UInt64` and `T isa \$(Base.BitUnsigned)`.
"""
tuplecast(::Type{T}, x::NTuple) where {T} = map(x->x.value, _tuplecast(T, map(Core.VecElement, x)))
``````

For `UIntXX`, this block RNG is considerably slower than the builtin implementation and I haven’t fully investigated why. My guess is that the difference is that it’s constantly updating the mutable struct that holds the RNG. The builtin keeps the state in local immutables, which probably avoids memory access.

Here’s a performance comparison between these and the current RNG:

``````julia> using Random, BenchmarkTools

julia> r = Random.default_rng();

julia> r4 = XoshiroBlock{4}(r); # generate 4*64 bits at a time

julia> @btime rand(\$r, UInt64, 2^17); # raw RNG performance
42.500 μs (2 allocations: 1.00 MiB)

julia> @btime [randtup(\$r4, UInt64) for _ in 1:2^15]; # raw RNG performance
120.400 μs (2 allocations: 1.00 MiB)

julia> @btime rand(\$r, Float32, 2^18);
49.500 μs (2 allocations: 1.00 MiB)

julia> @btime [block_randf_sigexp(\$r4, Float32) for _ in 1:2^15];
130.000 μs (2 allocations: 1.00 MiB)

julia> @btime rand(\$r, Float64, 2^17);
94.000 μs (2 allocations: 1.00 MiB)

julia> @btime [block_randf_sigexp(\$r4, Float64) for _ in 1:2^15];
135.700 μs (2 allocations: 1.00 MiB)
``````

One thing that stands out to me is that the current `rand` is 1.8x slower for `Float64` than `Float32` for the same number of bytes. With proper SIMD, one would expect them to be comparable (which we see in my `block_randf_sigexp`). I don’t know why this is, but it seems that `Float64` could be improved.

In any case, this in-hand implementation is 2.5x slower for `Float32` and 1.5x slower for `Float64`. I’ll point out that time to produce floats is only slightly longer than the cost of `UInt64` generation for all schemes (except the aforementioned `rand(Float64,N)` slowness), so a better block RNG than the one I provided might shrink this gap considerably. A very optimistic compensation for the 10us extra to produce 1MiB `Float32` or 15us extra to produce `Float64` rather than `UInt64` with my clumsy block generator suggests that we might produce infinite-precision `Float32` in 1.1x the current time (55 vs 50us) and `Float64` in 0.6x the current time (60 vs 95 us), although I suspect the actual numbers will be somewhat worse.

2 Likes

Hunh, I didn’t see that in my benchmarking. I’m seeing ~2x slower for 2x the bytes (the same number of elements) on both M1 and Intel (M1 copied below):

I can reproduce the discrepancy I saw in a fresh session. Maybe it’s just an issue with my outdated v1.9.1 install? Whatever.

The comparisons with `Float32` should still be accurate and should translate approximately to a properly-working `Float64`.

Ah, yes, I’ve been testing against master; perhaps that’s it.

One thing that I’ve found as I’ve continued to noodle on this here and there is how easy it is to get things subtly wrong, especially when chaining together many iterations adding bits. That’s where it’s really nice to have algorithms that are optimizations of a simple and obviously correct (but slow) version.

I’ve also found it super useful to break things down into very small definitions that are easy to test. I’ve not finished this implementation yet, but I think this idea for “perfect precision” is looking promising — or at least should have useful components for rounding-down floating point bit twiddles.

Here’s the slow-but-obviously-correct premise:

``````# Rounding down an unsigned fixed-point [0,1) representation to float
fixed2float(::Type{F}, r::Unsigned) where {F} = F(big(r)/2^nbits(r), RoundDown)
# Divide a positive floating point value by 2^N (N > 0), rounding down.
÷₂ꜛ(value::F, N) where {F} = F(value * big(2.0^-N), RoundDown)
# Add a `smaller` floating point value into `bigger` (where smaller is less than the smallest nonzero value of bigger), rounding down
+₎(bigger::F, smaller::F) where {F} = F(big(bigger) + big(smaller), RoundDown)
# Get the epsilon of a value
log2eps(value) = round(Int, log2(eps(value)), RoundDown)

# Iteratively getting to "perfect precision"
function randf(rng, ::Type{F}, ::Type{U}) where {F <: Base.IEEEFloat, U <: Unsigned}
value = fixed2float(F, rand(rng, U))
nbits = 8*sizeof(U)
while log2eps(value) <= -nbits
value = value +₎ (fixed2float(F, rand(rng, U)) ÷₂ꜛ nbits)
nbits += 8*sizeof(U)
end
return value
end
``````

The huge benefit is that the above is the test suite. And it’s all super-generic; you can test iteratively getting every single Float16 value, three bytes at a time. Anyhow, my current optimizations of the above are looking promising. Scalar performance is the easy part — it’s so rare to hit those branches that it’s basically an even match. And I’ve just done the rudiments of the SIMD implementation without much testing there — but it is doing the above iterative thing and I’m still just barely within that 2x slowdown mark for Float32 on my M1.

``````julia> @btime Random.rand(Float32);
2.958 ns (0 allocations: 0 bytes)

julia> @btime ShawshankRandemption.rand(Float32);
3.084 ns (0 allocations: 0 bytes)

julia> A = Array{Float32}(undef, 2^20);

julia> @btime Random.rand!(\$A);
325.541 μs (0 allocations: 0 bytes)

julia> @btime ShawshankRandemption.rand!(\$A);
645.125 μs (0 allocations: 0 bytes)
``````
4 Likes

I now better understand the different flavors of bit twiddles here:

• There’s the exact fixed to float conversions, akin to division rounding down. The exponent is set by the position of the most significant set bit, with the mantissa bits following it. But regardless of how you do it (either clz or float conversion), you need to read back what the exponent was in order to know how to shift the remaining bits into their proper place in the mantissa.
• That’s why the trailing zero implementation I initially used is faster. It uses the least set bit as the negative exponent, unsets it, and then the entire thing is the mantissa at a fixed shift from the correct place for it to be interpreted as the mantissa.

Both algorithms depend upon the result being normal — that’s why you ignore/unset that most/least significant bit: it’s the implicit one in the floating point fraction.

Now there’s a third class that could be worth exploring — and that’d be something that is only working correctly for values in [2^-9, 1). For example, you could use the most significant bit as the exponent and just always use the 23 least significant bits as the exponent. Now that’ll be horribly biased for smaller numbers, but we’re resampling in that case anyhow.

Are there any other flavors with promising performance characteristics I’m missing?

2 Likes

Another version of a poll taken in this thread. There is no consensus on this point.

Any reason not to take 64-bits of random for 32 floats if there is no overhead? (which is the case in Xoroshiro on the CPU)
The 2^-9 would be much smaller in that case.