I think I’m almost there.

Here is some SIMDable code for generating infinite-precision round-down floats. It uses a “significand-exponent” algorithm like I described far above, generating the significand and exponent separately (but with appropriate distributions) and then combining them to produce the result. By using batched rejection sampling, it manages to SIMD. I have verified that it produces the correct distribution on `Float16`

, so if there are mistakes at other sizes they’re fixable.

##
float sampling

```
@inline randtup(rng::AbstractRNG, ::Type{T}, N::Val) where T = ntuple(_->rand(rng, T), N)
function block_randf_sigexp(rng::Random.AbstractRNG, ::Type{T}) where T<:Base.IEEEFloat
N = Val(32 ÷ sizeof(T)) # number of values to generate
U = Base.uinttype(T) # same-width Unsigned
sigmask = Base.significand_mask(T)
estep = reinterpret(U, floatmin(T)) # least significant exponent bit
eincr1 = U(leading_zeros(sigmask) * estep) # first time we fail to produce a valid exponent, decrement by this
eincr = U(trailing_ones(sigmask) * estep) # additional times we fail to produce a valid exponent, decrement by this
ebase = reinterpret(U, T(1//2)) - eincr1 # exponent base that we start with - we'll offset upward from here
numscales,remscale = fldmod(ebase, eincr) # number of eincr steps to reach the bottom
rembits = fld(remscale, estep) # how many leftover bits to reach floatmin
remmask = U(2^rembits - 1) # mask including bottom rembits
# expoffset(u::U) where U = U(8sizeof(U) - leading_zeros(u)) * estep # valid for any input, but SIMDs poorly
expoffset(u::U) where U = reinterpret(U, convert(Base.floattype(U), u)) & ~sigmask - reinterpret(U, convert(Base.floattype(U), 1//2)) # good SIMD but only correct for 1 <= u <= maxintfloat
u = randtup(rng, U, N)
sig = u .& sigmask # we'll use this for the significand
uhi = u .>>> trailing_ones(sigmask) # get our remaining bits
ebits = ebase .+ expoffset.(uhi)
good = .!iszero.(uhi)
all(good) && return reinterpret.(T, ebits .| sig)
for _ in 1:numscales
ebase -= eincr
u = randtup(rng, U, N) .& sigmask # use to generate exponent
ebits = ifelse.(good, ebits, ebase .+ expoffset.(u))
good = good .| .!iszero.(u)
all(good) && return reinterpret.(T, ebits .| sig)
end
if iszero(rembits)
# move directly to subnormals
ebits = ifelse.(good, ebits, zero(U))
else
# still have space to fill down to subnormals
u = randtup(rng, U, N) .& remmask # use to generate exponent
ebits = ifelse.(good, ebits, ifelse.(iszero.(u), zero(U), expoffset.(u))) # finally need to handle iszero(u) correctly
end
return reinterpret.(T, ebits .| sig)
end
```

Upthread, we found that `leading_zeros`

did not SIMD well on all but the most recent x86’s, so this implementation doesn’t use it. It’s possible that we could do better still on machines with native SIMD `leading_zeros`

. In any case, this one uses int->float conversions to do that part at the cost of restricting the input range (such that we can only use `significand_bits`

bits per conversion instead of the full width). But the chances of redraws even at this rate are so vanishing that the waste is negligible.

I also took a crack at a SIMDing random number generator to feed this.

##
block RNG

```
using Random
using Random.XoshiroSimd: Xoshiro, forkRand, _plus, _rotl23, _shl17, _xor, _rotl45
mutable struct XoshiroBlock{N} <: Random.AbstractRNG
s0::NTuple{N,Core.VecElement{UInt64}}
s1::NTuple{N,Core.VecElement{UInt64}}
s2::NTuple{N,Core.VecElement{UInt64}}
s3::NTuple{N,Core.VecElement{UInt64}}
end
XoshiroBlock{N}(rng::Union{TaskLocalRNG, Xoshiro}) where N = XoshiroBlock{N}(forkRand(rng, Val(N))...)
@inline function Base.rand(rng::XoshiroBlock{N}) where {N}
# based on Random.XoshiroSimd.xoshiro_bulk_simd
s0 = rng.s0 # read
s1 = rng.s1 # read
s2 = rng.s2 # read
s3 = rng.s3 # read
res = _plus(_rotl23(_plus(s0,s3)),s0)
t = _shl17(s1)
s2 = _xor(s2, s0)
s3 = _xor(s3, s1)
s1 = _xor(s1, s2)
s0 = _xor(s0, s3)
s2 = _xor(s2, t)
s3 = _rotl45(s3)
rng.s0 = s0 # write
rng.s1 = s1 # write
rng.s2 = s2 # write
rng.s3 = s3 # write
return map(x->x.value, res) # result as a NTuple{N,UInt64}
end
@inline Base.rand(rng::XoshiroBlock, ::Type{T}) where T = @inline tuplecast(T, rand(rng))
@inline function randtup(rng::XoshiroBlock{N}, ::Type{T}, ::Val{M}) where {M,N,T}
M*sizeof(T) == N*sizeof(UInt64) || error("XoshiroBlock{$N} will only generate NTuple{$(N*sizeof(UInt64)÷sizeof(T))}")
return rand(rng,T)
end
for m in 0:4
u = 8*sizeof(UInt64)
M = 2^m
MVU = NTuple{M,Core.VecElement{UInt64}}
for T in Base.uniontypes(Base.BitUnsigned)
t = 8*sizeof(T)
N = fld(M*u, t)
if N > 0
NVT = NTuple{N,Core.VecElement{T}}
code = """
%res = bitcast <$M x i$u> %0 to <$N x i$t>
ret <$N x i$t> %res
"""
@eval @inline _tuplecast(::Type{$T}, x::$MVU) = Core.Intrinsics.llvmcall($code, $NVT, Tuple{$MVU}, x)
end
end
end
"""
tuplecast(T::Type, x::NTuple{N,U})
Reinterpret the bits of `x` as a NTuple{M, T} of matching size.
Currently only supports `U == UInt64` and `T isa $(Base.BitUnsigned)`.
"""
tuplecast(::Type{T}, x::NTuple) where {T} = map(x->x.value, _tuplecast(T, map(Core.VecElement, x)))
```

For `UIntXX`

, this block RNG is considerably slower than the builtin implementation and I haven’t fully investigated why. My guess is that the difference is that it’s constantly updating the mutable struct that holds the RNG. The builtin keeps the state in local immutables, which probably avoids memory access.

Here’s a performance comparison between these and the current RNG:

```
julia> using Random, BenchmarkTools
julia> r = Random.default_rng();
julia> r4 = XoshiroBlock{4}(r); # generate 4*64 bits at a time
julia> @btime rand($r, UInt64, 2^17); # raw RNG performance
42.500 μs (2 allocations: 1.00 MiB)
julia> @btime [randtup($r4, UInt64) for _ in 1:2^15]; # raw RNG performance
120.400 μs (2 allocations: 1.00 MiB)
julia> @btime rand($r, Float32, 2^18);
49.500 μs (2 allocations: 1.00 MiB)
julia> @btime [block_randf_sigexp($r4, Float32) for _ in 1:2^15];
130.000 μs (2 allocations: 1.00 MiB)
julia> @btime rand($r, Float64, 2^17);
94.000 μs (2 allocations: 1.00 MiB)
julia> @btime [block_randf_sigexp($r4, Float64) for _ in 1:2^15];
135.700 μs (2 allocations: 1.00 MiB)
```

One thing that stands out to me is that the current `rand`

is 1.8x slower for `Float64`

than `Float32`

for the same number of bytes. With proper SIMD, one would expect them to be comparable (which we see in my `block_randf_sigexp`

). I don’t know why this is, but it seems that `Float64`

could be improved.

In any case, this in-hand implementation is 2.5x slower for `Float32`

and 1.5x slower for `Float64`

. I’ll point out that time to produce floats is only slightly longer than the cost of `UInt64`

generation for all schemes (except the aforementioned `rand(Float64,N)`

slowness), so a better block RNG than the one I provided *might* shrink this gap considerably. A very optimistic compensation for the 10us extra to produce 1MiB `Float32`

or 15us extra to produce `Float64`

rather than `UInt64`

with my clumsy block generator suggests that we *might* produce infinite-precision `Float32`

in 1.1x the current time (55 vs 50us) and `Float64`

in 0.6x the current time (60 vs 95 us), although I suspect the actual numbers will be somewhat worse.