Output distribution of rand(Float32) and rand(Float64), thread 2

It occurs to me that the relative cdf error argument is also implicitly related to why we’d want to sample more densely at the lower end of [0,1) at all. If all we care about is maximum absolute cdf error, then why would we care about sampling more finely than eps(0.5) since that’s the finest we can sample at the top of the interval? Partly, it’s appealing just because we can do better relatively easily and bit twiddling is fun.

The real problem that we keep coming back to, however, is the fact that eps(Float32(0.5)) is pretty big (relative to what?)—around 6e-8. But we don’t have an issue with that at the high end of the interval, so why is it a problem at the low end? In principle you can’t make an argument that distinguishes these based solely on absolute cdf error, so by even admitting that we should sample more finely at the lower end of the interval, we are implicitly admitting that something besides absolute cdf error must matter. And all the arguments seem to boil down to picking a very small x and demonstrating that P(rand() < x) is way too big relative to x. Which is precisely the relative cdf argument.
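
To put rough numbers on that: under the current 24-bit Float32 construction the only output below any x ≤ 2^-24 is 0 itself, so for a tiny threshold the absolute cdf error is fine while the relative cdf error explodes:

x = 1.0f-30
p = 2.0f0^-24   # P(rand(Float32) < x) under the current scheme: only the output 0 qualifies
p - x           # absolute cdf error, about 6e-8: negligible
p / x           # relative cdf error, about 6e22: enormous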

So I think that if we’re conceding that we need to do something different for rand(Float32) in the first place (which we are), then we need to consider why we’re conceding that and what it says about our goodness criteria for rand.


Awesome, that’s what I was looking for — both the algorithm foundation and the theoretical motivation. I see you.

On the algorithm, I do really like having a monotonic transform function if possible — it’s eminently more testable. If we can get something working there, that’d be awesome. I think that’s what @mikmoore’s versions did above.

On the “de-biasing” motivation, I’m not as convinced. But I don’t yet have a coherent counter argument beyond the fact that it’s a complicated description of what rand means, so if we’re gonna do something complicated let’s use the simple description (but slightly fiddly algorithm) and just support returning all values x with probability eps(x). And that directly answers the parenthetical question you just posited (and then edited out, hah) — eps(T)/2 is pretty big relative to eps(x)!
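
To put rough numbers on that parenthetical:

eps(Float32)/2   # ≈ 6.0e-8: the grid the current scheme samples on, across all of [0,1)
eps(1f-6)        # ≈ 1.1e-13: the spacing of actual Float32 values down near 1e-6

Near the bottom of the interval, the current grid is coarser than the floats themselves by roughly five orders of magnitude.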

I truly think “perfection” can be done at near-zero performance cost over my current implementation. The fiddly part that I was referring to is that you must be careful around over-saturating 0… and getting the algorithms written down in a way that it’s obviously correct and testable is tricky. But maybe that’s worth it.

The part that makes it SIMD-able is that you just check each batch of generated values at once; it’s unlikely enough that it’s not a huge cost. I already do one iteration of it for Float32. For Float64, if you’re in 0 < f < 2^-11, you just add the bits directly to the significand. And if you win the Powerball and get a 0, you re-do your Float64(rand(UInt64)/big(2)^64, RoundDown) and multiply by 0x1p-64. The fiddly part is when you get close to an exponent of significand_bits(T)+exponent_bias(T), because then you need to multiply by a smaller exponent lest you generate zeros with a probability of 8e-320 instead of the correct 5e-324 (:joy:). (Edit: actually, doing this as multiplication like I described here would be ok, but for performance I’m just subtracting 64 from the exponent… which then does make this a problem).
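
For the record, here’s a rough scalar sketch of that scheme (not the actual batched code; the name is made up, and BigFloat is used only to keep the round-down arithmetic obviously exact):

using Random

# Scalar sketch: one base draw of 64 bits, one refinement step for small results,
# and the "Powerball" recursion on zero. A second, ~2^-75-rare refinement is omitted.
function perfectish_rand64(rng::AbstractRNG=Random.default_rng())
    u = rand(rng, UInt64)
    f = Float64(u / big(2)^64, RoundDown)   # round-down draw on the 2^-64 grid
    if iszero(f)                            # u == 0: recurse into the next 2^-64 band
        return perfectish_rand64(rng) * 0x1p-64
    elseif f < 0x1p-11                      # here eps(f) <= 2^-64, so more bits matter:
        v = rand(rng, UInt64)               # fold 64 further bits into the significand
        f = Float64((big(u) << 64 | v) / big(2)^128, RoundDown)
    end
    return f
end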


Yes, generating all possible values certainly addresses the issue. And the clever approach of deciding if you need more precision as a batch does allow it to vectorize, which may make the performance good enough. Maybe that’s the best solution.

I edited the parenthetical back in (it felt like it was just confusing the argument, but apparently not).

Monotonicity is really nice, but if it’s faster to do something non-monotonic we should. One approach is to do a little invertible preamble bit twiddling to change the function to be monotonic and then test that version. Then you can just drop the invertible preamble and what remains is guaranteed to have the right distribution without being monotonic by itself.
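
The fact being relied on there is that composing with any bijection of the input bits leaves the output distribution unchanged. A toy check of that fact, with a stand-in transform small enough to enumerate exhaustively:

# f is a stand-in for the real bit-twiddling transform; g is the invertible preamble.
# Because g is a bijection, f∘g produces exactly the same multiset of outputs as f,
# so we can test the (e.g. monotonic) composition and ship plain f.
f(u::UInt8) = Float32(u >>> 4) / 16
g(u::UInt8) = bitreverse(u)

inputs = typemin(UInt8):typemax(UInt8)
@assert sort(f.(inputs)) == sort(f.(g.(inputs)))   # identical output distributions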


I’d like to re-inject the consideration of GPU.

A batchwise-branching SIMD approach should work very well on modern CPUs and may turn out fast enough to become the default rand.

But it won’t work as well on GPU :frowning:

So are we in consensus about the following?

  1. GPU gets as default something based on UInt64->Float64 and UInt32->Float32 (hopefully better than the current 24 bit Float32 construction!). Call it fastrand for now.

  2. CPU gets access to the same (for testing/compat and cases where speed matters).

  3. CPU and GPU both get access to an ideal (“perfect” but branchy) generator, for cases where this matters (like users who want to apply an inverse cdf with singularity at zero). Call it idealrand.

  4. CPU will default to either fastrand or idealrand, depending on tradeoffs and discussion – consensus on that is not yet reached, but will be informed by e.g. @mbauman's code that will hopefully demonstrate that idealrand can be quite fast with SIMD, and will also be informed by GPU people who can tell us how bad GPU-CPU divergence would be.

Why would we want or need to use the same algorithm on the CPU as the GPU? There are GPU specialized random number generators that can be used.


It’s not so much about the algorithm as the definition and meaning. If rand() means rounding down to the previous floating point value on CPU, but rounding down to the previous multiple of eps(T)/2 on GPU, is that ok? Or should that also change? GPUs are more likely to use 32 or 16 bit floats, too.


Gotcha. I was under the impression though that the GPU folks were specifically not overriding regular calls to rand because of this, and instead ask that you use CUDA.rand to be clear that you’re using something with a different meaning and algorithm.

Not an expert on either side of this, but that distinction makes me feel it’s okay to design rand specifically around the CPU (though I must say that I’d be pretty unhappy with any changes that negatively impact SIMD)

Hah, I was under the same impression. This is wrong, though, and @maleadt corrected me in Kernel random numbers generation entropy / randomness issues

So CPU and GPU currently use different algorithms (xoshiro vs philox) to arrive at the same (questionable) distribution for unqualified rand().

Divergence between distributions would make it harder to port code from CPU to GPU without possibly introducing very subtle accuracy bugs. In particular, it would be harder to test code for correctness and accuracy on CPU before deploying to GPU, at least for people who are unaware of the entire issue.

This is a real problem!

But we must swallow some frog; we’re only haggling over which one is more palatable. I guess divergence is grudgingly acceptable. But I’m not a GPU person, so my opinion on that is not qualified enough to be taken seriously.

(the two frogs would be: Throw the 1/rand(T) people under the bus immediately on CPU, or wait until they have tested their algorithms to satisfaction and deploy to GPU to present them a little surprise debugging puzzle)

Why not? I’ve been thinking that SIMT on GPU is less fussy than SIMD on CPU. The threads that don’t enter the branch will be on hold for a couple of cycles, but then you just sync the threads and carry on. However, I’m no expert on this, I just remember that Tim had no objections to using rejection sampling in CUDA.jl’s implementation of Box-Muller: Overflow in `randn` using CUDA.jl's native RNG · Issue #1464 · JuliaGPU/CUDA.jl · GitHub

What GPU considerations might inform is whether the raw data for rand(Float32) should be 32 or 64 bits. However, GPU libraries could always implement their own transforms if the fallback ones are undesirable.


I agree; it seems likely that an algorithm that can be implemented as efficient SIMD code can also be implemented efficiently on a GPU.


I think I’m almost there.

Here is some SIMDable code for generating infinite-precision round-down floats. It uses a “significand-exponent” algorithm like I described far above, generating the significand and exponent separately (but with appropriate distributions) and then combining them to produce the result. By using batched rejection sampling, it manages to SIMD. I have verified that it produces the correct distribution on Float16, so if there are mistakes at other sizes they’re fixable.

float sampling
@inline randtup(rng::AbstractRNG, ::Type{T}, N::Val) where T = ntuple(_->rand(rng, T), N)

function block_randf_sigexp(rng::Random.AbstractRNG, ::Type{T}) where T<:Base.IEEEFloat
	N = Val(32 ÷ sizeof(T)) # number of values to generate
	U = Base.uinttype(T) # same-width Unsigned
	sigmask = Base.significand_mask(T)
	estep = reinterpret(U, floatmin(T)) # least significant exponent bit
	eincr1 = U(leading_zeros(sigmask) * estep) # first time we fail to produce a valid exponent, decrement by this
	eincr = U(trailing_ones(sigmask) * estep) # additional times we fail to produce a valid exponent, decrement by this
	ebase = reinterpret(U, T(1//2)) - eincr1 # exponent base that we start with - we'll offset upward from here
	numscales,remscale = fldmod(ebase, eincr) # number of eincr steps to reach the bottom
	rembits = fld(remscale, estep) # how many leftover bits to reach floatmin
	remmask = U(2^rembits - 1) # mask including bottom rembits

	# expoffset(u::U) where U = U(8sizeof(U) - leading_zeros(u)) * estep # valid for any input, but SIMDs poorly
	expoffset(u::U) where U = reinterpret(U, convert(Base.floattype(U), u)) & ~sigmask - reinterpret(U, convert(Base.floattype(U), 1//2)) # good SIMD but only correct for 1 <= u <= maxintfloat

	u = randtup(rng, U, N)
	sig = u .& sigmask # we'll use this for the significand
	uhi = u .>>> trailing_ones(sigmask) # get our remaining bits
	ebits = ebase .+ expoffset.(uhi)
	good = .!iszero.(uhi)
	all(good) && return reinterpret.(T, ebits .| sig)
	for _ in 1:numscales
		ebase -= eincr
		u = randtup(rng, U, N) .& sigmask # use to generate exponent
		ebits = ifelse.(good, ebits, ebase .+ expoffset.(u))
		good = good .| .!iszero.(u)
		all(good) && return reinterpret.(T, ebits .| sig)
	end
	if iszero(rembits)
		# move directly to subnormals
		ebits = ifelse.(good, ebits, zero(U))
	else
		# still have space to fill down to subnormals
		u = randtup(rng, U, N) .& remmask # use to generate exponent
		ebits = ifelse.(good, ebits, ifelse.(iszero.(u), zero(U), expoffset.(u))) # finally need to handle iszero(u) correctly
	end
	return reinterpret.(T, ebits .| sig)
end

Upthread, we found that leading_zeros did not SIMD well on all but the most recent x86s, so this implementation doesn’t use it. It’s possible that we could do better still on machines with native SIMD leading_zeros. In any case, this one uses int->float conversions to do that part, at the cost of restricting the input range (such that we can only use significand_bits bits per conversion instead of the full width). But the chances of redraws even at this rate are so vanishingly small that the waste is negligible.
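
To spell out the trick that expoffset leans on: converting a positive integer to a float stores floor(log2(u)) in the exponent field, so reading those bits back is a SIMD-friendly stand-in for leading_zeros. A quick sanity check:

u = UInt32(5)
bits = reinterpret(UInt32, Float32(u))                    # exact for 1 <= u <= maxintfloat(Float32)
@assert Int(bits >> 23) - 127 == 31 - leading_zeros(u)    # both equal floor(log2(u)) == 2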

I also took a crack at a SIMDing random number generator to feed this.

block RNG
using Random
using Random.XoshiroSimd: Xoshiro, forkRand, _plus, _rotl23, _shl17, _xor, _rotl45

mutable struct XoshiroBlock{N} <: Random.AbstractRNG
    s0::NTuple{N,Core.VecElement{UInt64}}
    s1::NTuple{N,Core.VecElement{UInt64}}
    s2::NTuple{N,Core.VecElement{UInt64}}
    s3::NTuple{N,Core.VecElement{UInt64}}
end

XoshiroBlock{N}(rng::Union{TaskLocalRNG, Xoshiro}) where N = XoshiroBlock{N}(forkRand(rng, Val(N))...)

@inline function Base.rand(rng::XoshiroBlock{N}) where {N}
    # based on Random.XoshiroSimd.xoshiro_bulk_simd

    s0 = rng.s0 # read
    s1 = rng.s1 # read
    s2 = rng.s2 # read
    s3 = rng.s3 # read
    res = _plus(_rotl23(_plus(s0,s3)),s0)
    t = _shl17(s1)
    s2 = _xor(s2, s0)
    s3 = _xor(s3, s1)
    s1 = _xor(s1, s2)
    s0 = _xor(s0, s3)
    s2 = _xor(s2, t)
    s3 = _rotl45(s3)
    rng.s0 = s0 # write
    rng.s1 = s1 # write
    rng.s2 = s2 # write
    rng.s3 = s3 # write

    return map(x->x.value, res) # result as a NTuple{N,UInt64}
end

@inline Base.rand(rng::XoshiroBlock, ::Type{T}) where T = @inline tuplecast(T, rand(rng))
@inline function randtup(rng::XoshiroBlock{N}, ::Type{T}, ::Val{M}) where {M,N,T}
    M*sizeof(T) == N*sizeof(UInt64) || error("XoshiroBlock{$N} will only generate NTuple{$(N*sizeof(UInt64)÷sizeof(T))}")
    return rand(rng,T)
end

for m in 0:4
    u = 8*sizeof(UInt64)
    M = 2^m
    MVU = NTuple{M,Core.VecElement{UInt64}}
    for T in Base.uniontypes(Base.BitUnsigned)
        t = 8*sizeof(T)
        N = fld(M*u, t)
        if N > 0
            NVT = NTuple{N,Core.VecElement{T}}
            code = """
                %res = bitcast <$M x i$u> %0 to <$N x i$t>
                ret <$N x i$t> %res
                """
            @eval @inline _tuplecast(::Type{$T}, x::$MVU) = Core.Intrinsics.llvmcall($code, $NVT, Tuple{$MVU}, x)
        end
    end
end

"""
    tuplecast(T::Type, x::NTuple{N,U})

Reinterpret the bits of `x` as a NTuple{M, T} of matching size.
Currently only supports `U == UInt64` and `T isa $(Base.BitUnsigned)`.
"""
tuplecast(::Type{T}, x::NTuple) where {T} = map(x->x.value, _tuplecast(T, map(Core.VecElement, x)))

For UIntXX, this block RNG is considerably slower than the builtin implementation and I haven’t fully investigated why. My guess is that the difference is that it’s constantly updating the mutable struct that holds the RNG. The builtin keeps the state in local immutables, which probably avoids memory access.
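
If that guess is right, a restructuring along these lines (untested, purely hypothetical) might close some of the gap by holding the state in locals for a whole batch and writing it back once:

# Hypothetical: generate a batch of outputs with the Xoshiro state held in locals,
# touching the mutable struct only once on entry and once on exit.
function rand_block!(rng::XoshiroBlock{N}, out::AbstractVector{NTuple{N,UInt64}}) where {N}
    s0, s1, s2, s3 = rng.s0, rng.s1, rng.s2, rng.s3   # read state once
    @inbounds for i in eachindex(out)
        res = _plus(_rotl23(_plus(s0, s3)), s0)
        t = _shl17(s1)
        s2 = _xor(s2, s0); s3 = _xor(s3, s1)
        s1 = _xor(s1, s2); s0 = _xor(s0, s3)
        s2 = _xor(s2, t);  s3 = _rotl45(s3)
        out[i] = map(x -> x.value, res)
    end
    rng.s0, rng.s1, rng.s2, rng.s3 = s0, s1, s2, s3   # write state back once
    return out
end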

Here’s a performance comparison between these and the current RNG:

julia> using Random, BenchmarkTools

julia> r = Random.default_rng();

julia> r4 = XoshiroBlock{4}(r); # generate 4*64 bits at a time

julia> @btime rand($r, UInt64, 2^17); # raw RNG performance
  42.500 μs (2 allocations: 1.00 MiB)

julia> @btime [randtup($r4, UInt64) for _ in 1:2^15]; # raw RNG performance
  120.400 μs (2 allocations: 1.00 MiB)

julia> @btime rand($r, Float32, 2^18);
  49.500 μs (2 allocations: 1.00 MiB)

julia> @btime [block_randf_sigexp($r4, Float32) for _ in 1:2^15];
  130.000 μs (2 allocations: 1.00 MiB)

julia> @btime rand($r, Float64, 2^17);
  94.000 μs (2 allocations: 1.00 MiB)

julia> @btime [block_randf_sigexp($r4, Float64) for _ in 1:2^15];
  135.700 μs (2 allocations: 1.00 MiB)

One thing that stands out to me is that the current rand is 1.8x slower for Float64 than Float32 for the same number of bytes. With proper SIMD, one would expect them to be comparable (which we see in my block_randf_sigexp). I don’t know why this is, but it seems that Float64 could be improved.

In any case, this in-hand implementation is 2.5x slower for Float32 and 1.5x slower for Float64. I’ll point out that the time to produce floats is only slightly longer than the cost of UInt64 generation for all schemes (except the aforementioned rand(Float64, N) slowness), so a better block RNG than the one I provided might shrink this gap considerably. Very optimistically, compensating for the 10 μs extra to produce 1 MiB of Float32 (or 15 μs extra for Float64) rather than UInt64 with my clumsy block generator suggests we might produce infinite-precision Float32 in 1.1x the current time (55 vs 50 μs) and Float64 in 0.6x the current time (60 vs 95 μs), although I suspect the actual numbers will be somewhat worse.


Hunh, I didn’t see that in my benchmarking. I’m seeing ~2x slower for 2x the bytes (the same number of elements) on both M1 and Intel (M1 copied below):

[performance graph from ShawshankRandemption.jl]

I can reproduce the discrepancy I saw in a fresh session. Maybe it’s just an issue with my outdated v1.9.1 install? Whatever.

The comparisons with Float32 should still be accurate and should translate approximately to a properly-working Float64.

Ah, yes, I’ve been testing against master; perhaps that’s it.

One thing that I’ve found as I’ve continued to noodle on this here and there is how easy it is to get things subtly wrong, especially when chaining together many iterations that each add bits. That’s where it’s really nice to have algorithms that are optimizations of a simple and obviously correct (but slow) version.

I’ve also found it super useful to break things down into very small definitions that are easy to test. I’ve not finished this implementation yet, but I think this idea for “perfect precision” is looking promising — or at least should have useful components for rounding-down floating point bit twiddles.

Here’s the slow-but-obviously-correct premise:

nbits(x) = 8*sizeof(x) # bit width of x (small helper assumed here)
# Rounding down an unsigned fixed-point [0,1) representation to float
fixed2float(::Type{F}, r::Unsigned) where {F} = F(big(r)/big(2)^nbits(r), RoundDown) # big(2): a plain Int 2^64 would overflow
# Divide a positive floating point value by 2^N (N > 0), rounding down.
÷₂ꜛ(value::F, N) where {F} = F(value * big(2.0^-N), RoundDown)
# Add a `smaller` floating point value into `bigger` (where smaller is less than the smallest nonzero value of bigger), rounding down
+₎(bigger::F, smaller::F) where {F} = F(big(bigger) + big(smaller), RoundDown)
# Get the epsilon of a value
log2eps(value) = round(Int, log2(eps(value)), RoundDown)

# Iteratively getting to "perfect precision"
function randf(rng, ::Type{F}, ::Type{U}) where {F <: Base.IEEEFloat, U <: Unsigned}
    value = fixed2float(F, rand(rng, U))
    nbits = 8*sizeof(U)
    while log2eps(value) <= -nbits
        value = value +₎ (fixed2float(F, rand(rng, U)) ÷₂ꜛ nbits)
        nbits += 8*sizeof(U)
    end
    return value
end
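
As a usage sketch (hypothetical calls), the reference generator works with any combination of float type and chunk type:

using Random

rng = Random.default_rng()
x = randf(rng, Float32, UInt64)   # one "perfect precision" Float32 from 64-bit chunks
@assert 0 <= x < 1
y = randf(rng, Float16, UInt8)    # or a Float16 built one byte at a time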

The huge benefit is that the above is the test suite. And it’s all super-generic; you can test iteratively getting every single Float16 value, three bytes at a time. Anyhow, my current optimizations of the above are looking promising. Scalar performance is the easy part — it’s so rare to hit those branches that it’s basically an even match. And I’ve just done the rudiments of the SIMD implementation without much testing there — but it is doing the above iterative thing and I’m still just barely within that 2x slowdown mark for Float32 on my M1.

julia> @btime Random.rand(Float32);
  2.958 ns (0 allocations: 0 bytes)

julia> @btime ShawshankRandemption.rand(Float32);
  3.084 ns (0 allocations: 0 bytes)

julia> A = Array{Float32}(undef, 2^20);

julia> @btime Random.rand!($A);
  325.541 μs (0 allocations: 0 bytes)

julia> @btime ShawshankRandemption.rand!($A);
  645.125 μs (0 allocations: 0 bytes)

I now better understand the different flavors of bit twiddles here:

  • There’s the exact fixed-to-float conversion, akin to division rounding down. The exponent is set by the position of the most significant set bit, with the mantissa bits following it. But regardless of how you do it (either clz or float conversion), you need to read back what the exponent was in order to know how to shift the remaining bits into their proper place in the mantissa.
  • That’s why the trailing zero implementation I initially used is faster. It uses the least set bit as the negative exponent, unsets it, and then the entire thing is the mantissa at a fixed shift from the correct place for it to be interpreted as the mantissa.

Both algorithms depend upon the result being normal — that’s why you ignore/unset that most/least significant bit: it’s the implicit one in the floating point fraction.
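
Here’s a rough sketch of that first flavor for Float32 fed by 32 bits (my own illustration of the structure, not code from any package): treat u as the fixed-point value u/2^32, locate the most significant set bit with clz, and truncate the rest into the mantissa.

function fixed2float_rounddown(u::UInt32)
    iszero(u) && return 0.0f0
    p = 31 - leading_zeros(u)                 # position of the most significant set bit
    e = p - 32                                # result lies in [2^e, 2^(e+1))
    # the 23 bits immediately below bit p become the mantissa; truncation = round down
    frac = p >= 23 ? u >>> (p - 23) : u << (23 - p)
    frac &= Base.significand_mask(Float32)    # drop the implicit leading 1
    return reinterpret(Float32, (UInt32(e + 127) << 23) | frac)
end

# spot-check against an exact (slow) reference
u = rand(UInt32)
@assert fixed2float_rounddown(u) == Float32(big(u) / big(2)^32, RoundDown)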

Now there’s a third class that could be worth exploring — and that’d be something that only works correctly for values in [2^-9, 1). For example, you could use the position of the most significant set bit as the exponent and just always use the 23 least significant bits as the mantissa. Now that’ll be horribly biased for smaller numbers, but we’re resampling in that case anyhow.

Are there any other flavors with promising performance characteristics I’m missing?


Another version of a poll taken in this thread. There is no consensus on this point.

Any reason not to take 64 bits of randomness for 32-bit floats if there is no overhead? (which is the case with Xoshiro on the CPU)
The 2^-9 cutoff would be much smaller in that case.