Is it easy to make the algorithm that rand and rand! utilize user-selectable globally?
What is the benefit of using globals here when we can simply specify the behavior via the usual function arguments, e.g. rand(FloatWithNBits(72)) or such?
That has the same problem as rounding modes: it composes really, really badly (e.g. one package needs ideal-rand while another needs fast-rand in order to be comparable to GPU results).
Users who are aware of the issue should be able to manually select the Random.Sampler they need.
The default determines whether we get surprising correctness problems for unaware users (who naively do 1/rand()), surprising perf issues (if IdealUniform turns out to be much slower and becomes the default), or surprising divergences between CPU and GPU (I think that GPU IdealUniform cannot be done without major perf impact).
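For illustration, explicit opt-in selection could look something like the following sketch, which uses the documented Random sampler extension hooks. IdealUniform is a hypothetical name borrowed from this discussion, and the method body is only a placeholder, not a real algorithm:

using Random

# Hypothetical opt-in distribution type; not an existing API.
struct IdealUniform{T<:AbstractFloat} end

# The documented extension point: defining this method makes both
# rand(IdealUniform{Float32}()) and rand(rng, IdealUniform{Float32}()) work.
function Random.rand(rng::AbstractRNG, ::Random.SamplerTrivial{IdealUniform{Float32}})
    # Placeholder body: 31 bits scaled down. The Float32 conversion rounds,
    # so a real implementation would need proper endpoint handling.
    ldexp(Float32(rand(rng, UInt32) >>> 1), -31)
end

rand(IdealUniform{Float32}())               # explicit, composable opt-in
rand(Xoshiro(42), IdealUniform{Float32}())  # works with any RNG

With this shape, one package can request the careful sampler while another keeps the fast default, with no global mode to fight over.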
An unspoken assumption underlying all these examples and f(rand()) desiderata (be they @Tamas_Papp’s or @dlakelan’s or the OG blog post’s) is that people want to treat rand()’s output as a true random variable, like an rv from the probability-and-statistics literature.
But it’s not. It never will be. It can mostly stand in for one, but there are caveats no matter how it is defined. An “ideal” generator won’t save you from numerical analysis, but it might reduce some classes of error.
The thing that initially convinced me was the errors in rand(Float32)’s CDF, but even then you’d need to sample an astronomical number of times to actually observe any such error for values > eps(1f0).
For Float64, yeah, but for Float32 it takes less than a few minutes to make those graphs I made, on one core.
I agree with this with respect to the question of a global “mode”.
Based on the discussion here, I’m thinking we should do two things. One is to work on SIMD implementations so that the default Julia rand(Float32) and rand(Float64) have smallest nonzero values of ldexp(1.0f0, -32) and ldexp(1.0, -64), respectively.
The second is to create a package, something like DetailedRand.jl, where we create structures derived from AbstractRNG that represent the choice of different algorithms. Something like:
using Random

struct SimpleDetailedXoshiRand <: AbstractRNG
    res32::Int32    # requested resolution (bits) for Float32 draws
    res64::Int32    # requested resolution (bits) for Float64 draws
    res128::Int32   # requested resolution (bits) for 128-bit-float draws
    resbf::Int64    # requested resolution (bits) for BigFloat draws
    rng::Xoshiro    # underlying bit source
end
Then let those who really want specific resolution near 0 use this kind of structure and the algorithms designed in the package.
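For illustration, one hypothetical way such a wrapper could honor a requested resolution (a sketch only; the endpoint rounding is deliberately left naive, and none of this is an existing package API):

# One possible method: honor `res32` by drawing that many bits and scaling.
function Random.rand(r::SimpleDetailedXoshiRand, ::Type{Float32})
    nbits = Int(r.res32)                      # e.g. 40 bits of entropy
    u = rand(r.rng, UInt64) >>> (64 - nbits)  # keep nbits of the draw
    # NOTE: the final Float32 conversion rounds to nearest, so this can
    # still produce 1.0f0; a real implementation must handle the endpoints.
    return Float32(ldexp(Float64(u), -nbits))
end

r = SimpleDetailedXoshiRand(40, 64, 113, 256, Xoshiro(1))
rand(r, Float32)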
As it seems this thread is winding down (did we break the length record?) I’ll circle back to my first contribution:
using FixedPointNumbers

function rand_fpn(w::Val)
    if w == Val(8)   S, U = Int8,   UInt8   ; end
    if w == Val(16)  S, U = Int16,  UInt16  ; end
    if w == Val(32)  S, U = Int32,  UInt32  ; end
    if w == Val(64)  S, U = Int64,  UInt64  ; end
    if w == Val(128) S, U = Int128, UInt128 ; end
    f = 8sizeof(U) - 1  # all bits fractional except the sign bit
    # drop the sign bit, force the raw value odd, and build the Fixed directly:
    # odd raw values can never encode 0 or 1, and are symmetric within (0, 1)
    return Fixed{S, f}(reinterpret(S, rand(U) >> 1) | 1, 0)
end
rand_fpn returns a fixed-point number of any width (8, 16, 32, 64, and even 128; there is no Float128). Furthermore, it never returns 0.0 or 1.0: a true (0,1) sampler, with exactly uniform density and symmetric rounding (it’s nice to do impossible things once in a while). As you can see:
julia> @btime rand_fpn(Val(8));
       @btime rand_fpn(Val(16));
       @btime rand_fpn(Val(32));
       @btime rand_fpn(Val(64));
       @btime rand_fpn(Val(128));
4.706 ns (0 allocations: 0 bytes)
4.707 ns (0 allocations: 0 bytes)
4.705 ns (0 allocations: 0 bytes)
4.709 ns (0 allocations: 0 bytes)
4.695 ns (0 allocations: 0 bytes)
The timings are good and uniform across widths.
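A quick empirical check of those claims (a sketch, assuming the definitions above):

# The raw value is forced odd by `| 1`, so for Val(8) every output is an
# odd multiple of 2^-7: never 0 or 1, and symmetric about 1/2.
vals = [rand_fpn(Val(8)) for _ in 1:10^6]
@assert all(v -> 0 < v < 1, vals)
@assert length(unique(vals)) <= 64  # only the 64 odd multiples of 2^-7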
I can even safely do a nice histogram of log(rand(.)), on 8-bit samples!
julia> histogram(log.([rand_fpn(Val(8)) for _ in 1:1_000_000]))
┌ ┐
[-5.0, -4.5) ┤█▎ 15 550
[-4.5, -4.0) ┤ 0
[-4.0, -3.5) ┤█▎ 15 754
[-3.5, -3.0) ┤█▎ 15 674
[-3.0, -2.5) ┤██▌ 31 160
[-2.5, -2.0) ┤████▉ 62 395
[-2.0, -1.5) ┤██████▎ 78 450
[-1.5, -1.0) ┤████████████▍ 156 195
[-1.0, -0.5) ┤██████████████████▌ 233 987
[-0.5, 0.0) ┤███████████████████████████████ 390 835
something which produces an exception with rand(Float32):
julia> histogram(log.([rand(Float32) for _ in 1:100_000_000]))
ERROR: InexactError: Int64(NaN)
Stacktrace:
[1] Int64
@ ./float.jl:900 [inlined]
[2] histrange(lo::Float32, hi::Float32, n::Int64, closed::Symbol)
@ StatsBase ~/.julia/packages/StatsBase/WLz8A/src/hist.jl:99
Of course, this goodness is mainly thanks to the great FixedPointNumbers developers.
Anyway, I’ve learned a lot. Glad to help make this the longest thread (so is it?)
I mostly agree with that. Yes, we should extend Float64 to at least 64-bit precision and Float32 to at least 32-bit precision; I’m actually rather in favor of 64-bit precision for all three types (Float64, Float32, Float16), which is what my proposed implementation did. That means perfect sampling for Float16 and sampling such that the least probable values of either Float64 or Float32 have probability 1/2^64.
The other point of debate is whether zero should be included in the output or not. I’m ok with continuing to emit zero sometimes and sticking with a design where cdf(x) ≥ x for all representable x. But I do think that it’s kind of appealing to consider a design where cdf(x) goes above and below x in the range where we don’t densely sample the representable floats, since that minimizes the CDF error and has the side effect of avoiding emitting zero; it should continue to have cdf(x) = x for all x in the region where we sample densely. We don’t yet have a good design for that, although I have been playing with some approaches.
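In symbols, the two candidate designs are roughly (my paraphrase of the above):

$$\text{(a) round-down: } \operatorname{cdf}(x) \ge x \quad \forall x \in [0,1)$$

$$\text{(b) centered: } \operatorname{cdf}(x) = x \text{ where floats are sampled densely, and } |\operatorname{cdf}(x) - x| \text{ minimized (sign unconstrained) below that range}$$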
Since I keep advocating performance comparisons, here’s one. Since I already made a preliminary demonstration of SIMDable rejection sampling, I’m going to put that aside for a moment. Below is a series of scalar implementations of sampling schemes: randf1 produces an XX-bit float using a full XX bits of entropy; randf1_badrounding does the same but uses native round-nearest-ties-even rounding, so it is not correct, although it is indicative of performance if we could change the rounding mode or a wizard could give us a better implementation; randf produces a float with infinite precision; and randf_badrounding does the same with the same flawed rounding as the finite-precision version.
using Random
function trunctofloat(u::U) where U<:Base.BitUnsigned
F = Base.floattype(U)
lomask = typemax(U) >>> (8sizeof(F) - Base.significand_bits(F) - 1) # masking with this number ensures exact representability
count_ones(lomask) == trailing_ones(lomask) >= 4sizeof(U) || error("invalid mask") # mask should be dense on the low bits and at least half the bitwidth
hi = F(u & ~lomask)
lo = F(u & lomask)
f0 = hi + lo # combine the parts
roundup = f0 - hi > lo # was the roundoff error of f0 upward?
f = reinterpret(F, reinterpret(U, f0) - roundup) # decrement if rounding was upward
return f
end
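# Sanity check at the worst case (my annotation, not part of the original):
# a plain Float32 conversion of the all-ones input rounds *up* to 2f0^32,
# which would become 1f0 after scaling; trunctofloat stays strictly below.
let u = typemax(UInt32)
    @assert Float32(u) == 2f0^32
    @assert trunctofloat(u) == prevfloat(2f0^32)
    @assert trunctofloat(u) * 2f0^-32 == prevfloat(1f0)
end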
randf1(t::Type) = randf1(Random.default_rng(),t)
function randf1(rng::Random.AbstractRNG, ::Type{T}) where T<:Base.IEEEFloat
U = Base.uinttype(T) # same-width Unsigned
scale = T(exp2(-8sizeof(U)))
u = rand(rng, U)
return trunctofloat(u) * scale # compute the value
end
randf1_badrounding(t::Type) = randf1_badrounding(Random.default_rng(),t)
function randf1_badrounding(rng::Random.AbstractRNG, ::Type{T}) where T<:Base.IEEEFloat
U = Base.uinttype(T) # same-width Unsigned
scale = T(exp2(-8sizeof(U)))
u = rand(rng, U)
return T(u) * scale # compute the value
end
randf(t::Type) = randf(Random.default_rng(),t)
function randf(rng::Random.AbstractRNG, ::Type{T}) where T<:Base.IEEEFloat
U = Base.uinttype(T) # same-width Unsigned
mineps = eps(zero(T)) # minimum spacing between values that we want to achieve
umax = U(trunctofloat(typemax(U)))
sparebits = trailing_zeros(umax) # how many "extra" bits we have per draw
usebits = 8sizeof(U) - sparebits # how many bits we can possibly consume per draw
redrawbelow = U(exp2(usebits)) # values smaller than this lose precision and must be redrawn
redrawscale = T(exp2(-sparebits)) # if a value fails, scale by this and redraw
scale = T(inv(redrawbelow))
todrop = Int(log2(scale) - log2(mineps)) # how many octaves we need to drop via `redrawscale`
numscales,leftover_bottom = fldmod(todrop, sparebits) # how many times we must apply `redrawscale` to get within reach of `mineps`
remainingbits = leftover_bottom + usebits # bits remaining beyond the lowest usable scale
lastlevelmask = typemax(U) >>> (8sizeof(U) - remainingbits) # mask for values at lowest scale
for _ in 1:numscales
scale *= redrawscale # prepare to evaluate next scale
u = rand(rng, U)
if u >= redrawbelow
return trunctofloat(u) * scale # compute the value
end
end
# if we made `scale` smaller it would underflow to 0
u = rand(rng, U)
u = u & lastlevelmask # we can't use all the bits at this level
return trunctofloat(u) * mineps # compute the value
end
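# Worked constants for T = Float32 (my annotation of the definitions above):
#   umax == 0xffffff00, so sparebits == 8 and usebits == 24
#   redrawbelow == 2^24, redrawscale == 2f0^-8, scale starts at 2f0^-24
#   todrop == 125 (from 2^-24 down to eps(0f0) == 2^-149)
#   numscales, leftover_bottom == fldmod(125, 8) == (15, 5)
#   remainingbits == 29
# So one Float32 draw consumes at most 16 UInt32 draws: 15 geometric
# levels plus a final masked draw at the subnormal scale.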
randf_badrounding(t::Type) = randf_badrounding(Random.default_rng(),t)
function randf_badrounding(rng::Random.AbstractRNG, ::Type{T}) where T<:Base.IEEEFloat
U = Base.uinttype(T) # same-width Unsigned
mineps = eps(zero(T)) # minimum spacing between values that we want to achieve
umax = U(trunctofloat(typemax(U)))
sparebits = trailing_zeros(umax) # how many "extra" bits we have per draw
usebits = 8sizeof(U) - sparebits # how many bits we can possibly consume per draw
redrawbelow = U(exp2(usebits)) # values smaller than this lose precision and must be redrawn
redrawscale = T(exp2(-sparebits)) # if a value fails, scale by this and redraw
scale = T(inv(redrawbelow))
todrop = Int(log2(scale) - log2(mineps)) # how many octaves we need to drop via `redrawscale`
numscales,leftover_bottom = fldmod(todrop, sparebits) # how many times we must apply `redrawscale` to get within reach of `mineps`
remainingbits = leftover_bottom + usebits # bits remaining beyond the lowest usable scale
lastlevelmask = typemax(U) >>> (8sizeof(U) - remainingbits) # mask for values at lowest scale
for _ in 1:numscales
scale *= redrawscale # prepare to evaluate this scale
u = rand(rng, U)
if u >= redrawbelow
return T(u) * scale # compute the value
end
end
# if we made `scale` smaller it would underflow to 0
u = rand(rng, U)
u = u & lastlevelmask # we can't use all the bits at this level
return T(u) * mineps # compute the value
end
And a benchmark comparing them all on my 2020-ish x86 laptop running Julia v1.9.1. I include vectorized rand for reference, but really the comparison should be made to the scalarized versions (those using a comprehension).
julia> using BenchmarkTools
julia> @btime rand(Float32,2^16); # vectorized performance, for reference
15.700 μs (2 allocations: 256.05 KiB)
julia> @btime [rand(Float32) for _ in 1:2^16]; # current (24 bits)
77.400 μs (2 allocations: 256.05 KiB)
julia> @btime [randf1_badrounding(Float32) for _ in 1:2^16]; # invalid rounding (32 bits)
79.200 μs (2 allocations: 256.05 KiB)
julia> @btime [randf1(Float32) for _ in 1:2^16]; # valid (32 bits)
137.800 μs (2 allocations: 256.05 KiB)
julia> @btime [randf_badrounding(Float32) for _ in 1:2^16]; # invalid rounding (infinite bits, 149 in practice)
192.200 μs (2 allocations: 256.05 KiB)
julia> @btime [randf(Float32) for _ in 1:2^16]; # valid (infinite bits, 149 in practice)
265.800 μs (2 allocations: 256.05 KiB)
julia> @btime rand(Float64,2^16); # vectorized performance, for reference
49.300 μs (2 allocations: 512.05 KiB)
julia> @btime [rand(Float64) for _ in 1:2^16]; # current (53 bits)
74.300 μs (2 allocations: 512.05 KiB)
julia> @btime [randf1_badrounding(Float64) for _ in 1:2^16]; # invalid rounding (64 bits)
95.100 μs (2 allocations: 512.05 KiB)
julia> @btime [randf1(Float64) for _ in 1:2^16]; # valid (64 bits)
151.800 μs (2 allocations: 512.05 KiB)
julia> @btime [randf_badrounding(Float64) for _ in 1:2^16]; # invalid rounding (infinite bits, 1074 in practice)
206.100 μs (2 allocations: 512.05 KiB)
julia> @btime [randf(Float64) for _ in 1:2^16]; # valid (infinite bits, 1074 in practice)
305.200 μs (2 allocations: 512.05 KiB)
Note that these runtimes were quite variable, so we should only compare the numbers roughly. Benchmark more on your own, if you like.
Since the vectorized implementations are largely just parallelizing this entire function at the register level (and I have shown that this is possible), I expect the scalar comparison to be somewhat indicative of a comparison of vectorized algorithms.
The _badrounding variants are only here to show a limit on how things might perform if a faster rounding scheme were used (either by changing the rounding mode or via a better implementation). They are not suitable for actual use, but they show that there is significant performance left on the table.
I will call the scalarized rand calls (24 or 53 bits) the baseline. Relative to the baseline, the “full-width” versions carry a 1.5x (if fast rounding is achieved) to 2x (with my current rounding) performance penalty. Relative to the baseline, the “infinite precision” variants carry a 2.5x (fast rounding) to 4x (my current rounding) performance penalty.
These gaps could shrink as people make superior implementations.
A 4x regression is getting to be a lot. But I could be convinced to eat a 2x or maybe 3x penalty to end this discussion once and for all. Or, if that’s too much, we make this functionality a separate function and a user can choose whether they want speed or precision.
This also shows that there is very little space (roughly a 2x performance gap to attempt to bridge) for implementations with precision wider than native but less than infinite.
I don’t think the rounding is a problem, and I’d like to propose an alternative, more statistical criterion à la @foobar_lv2. Something along the lines of:

If a random oracle produced a random real number in [0,1) and rounded it to the nearest Float32, the probability that the absolute error in the true CDF at that chosen x exceeds some epsilon is 0.

Or similar. Right now the CDF is perfectly accurate at the output values, but it’s well off for other values (on the order of 2^-24), so if the CDF is closer to y = x than 3*2^-32 everywhere, that’s way better than today, even if rounding makes some grid points more probable than they should be by factors of 2, etc.
This is a breaking change; I see no way it will be accepted. If you want such a function, it cannot be rand(Float32), because it changes the interval to [0,1]. People may have written code assuming 0.0 was impossible out of ignorance, but other people have read the docs, done everything right, and implemented code that is entirely dependent on always having rand() < 1. We cannot break their correct and valid use.
Hoping to be “lucky enough” to never draw an endpoint is not a defensible coding practice, especially when other people are counting on the exact opposite. As you point out, this also makes some values more likely than their neighbors, which is bad.
My point about breaking changes stands in addition to the philosophical objections that others have raised when advocating for round-down behavior. Round-up behavior could be equally valid, except we have already committed to the [0,1) interval.
I think the separate-function thing is unavoidable anyway, because GPUs exist and would have a hard time replicating “ideal” rand.
We already have various things in Random: if there are already internal CloseOpen01 and CloseOpen12 samplers, then Open01 and FastCloseOpen01 are no big stretch?
On the other hand, if you’re doing the inverse-CDF thing with a singularity, then you need ideal rand and probably should not go with compromises. But the vast majority of rand uses are totally fine with the current distribution.
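For reference, the existing internal sampler types can already be invoked explicitly today; a sketch relying on unexported Random internals (Julia 1.9-era, subject to change between versions):

using Random

rng = Xoshiro(42)
sp  = Random.SamplerTrivial(Random.CloseOpen01{Float64}())
rand(rng, sp)  # the usual [0, 1) draw, selected explicitly rather than by default

Adding Open01-style siblings would mostly be a naming and API commitment, not new machinery.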
So we’re haggling about:
I like that part of the construction a lot, beautiful!
As mikmoore said, the issue is that 1 gets included in the output. That breaks API promises; can’t do. We could multiply by prevfloat(1f0) to fix that up, but that’s probably more expensive than the cool bitslinging tricks.
Use bitslinging for the floating-point stuff, and floating-point instructions to emulate leading_zeros / trailing_zeros on AVX2/SIMD, I absolutely love it. Much cooler than my proposals.
I think you’re misreading my proposal. Another way of saying what I said is that cdf(x) = x ± epsilon for some chosen epsilon, uniformly. Our current RNG satisfies this for epsilon = 2^-24. If another RNG satisfies it everywhere for epsilon = 2^-31 but doesn’t equal 0 exactly anywhere, I would consider that a big improvement, wouldn’t you? Does our current documentation preclude that? What do we promise? We were considering “debiasing” proposals.
Oh sure, I agree that 1 shouldn’t happen.
EDIT:
Technically speaking, rand() satisfies its user contract provided it returns a number in the interval [0,1), so it would be satisfied by:

function rand(::Type{Float32})
    return 4.0f0/6  # randomly chosen by roll of a cubical die
end

I mean, we all agree that’s stupid, it’s basically an XKCD joke. But all the stuff about CDF conditions is not imposed by any “contract” we’ve provided to the user; we don’t even mention that it’s UNIFORM. So adjustments to the CDF are technically allowed, provided only that the CDF (using the less-than criterion) equals 0 at 0.0f0 and 1 at 1.0f0; it could technically be anything in between by contract.
randn and randexp, on the other hand, do specify their distributions: Normal(0,1) and Exponential(1).
I’m not proposing to make rand less uniform, though. I’m just proposing that the criterion we use to determine whether it’s uniform be basically what @Tamas_Papp suggested:

|CDF(x) - x| < \epsilon(x) \ \forall x \in [0,1)

So we should probably rewrite the description of rand() to say something like: “for floating point we return an approximation of Uniform over [0,1); the approximation is subject to change and may vary over the range, but will never be worse than |CDF(x) - x| < 3\times 2^{-24} uniformly over the floating-point numbers in the interval, for Float32 or Float64 (this is satisfied by our current RNGs)”,
rather than saying that the error equals 0 at a certain set of floating-point values. The latter is a plausible criterion, but it limits our choice of implementations, and implementations that are much smoother and better than our current one might be excluded if we require “exact” values at a particular set of floats.
We might also provide something in the documentation along the lines of: “in testing of the current algorithm, the empirical CDF for 1_000_000_000 generated numbers had absolute error less than the reported value in each of 100 batches”. It would take a few minutes to run that test, but it would provide useful, actionable information to the user.
Another useful test might be to provide a Kolmogorov-Smirnov test statistic for 1e9 generations with a fixed seed equal to the date on which the algorithm was checked in (or another non-special value such as 314159265). (Note K-S tests are about maximum deviation from the CDF, so basically similar but summarized differently.)
So I spent some time coding up some K-S tests. Here I generate rand() and throw out anything over eps = pi * 2^-20; once I’ve collected 100_000 of them, I K-S test against the Uniform(0, eps) distribution. I also compare that to a modified version of ldexp(Float32(rand(UInt32)), -32) that rejects and retries the 1.0f0 that can occur due to rounding.
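A minimal sketch of such a tail test, assuming Distributions.jl and HypothesisTests.jl (the exact script behind the numbers below may differ):

using Distributions, HypothesisTests

# Collect n sub-threshold draws from `gen` and K-S test them against
# the uniform distribution on (0, eps).
function tail_ks(gen; eps = Float32(pi * 2.0^-20), n = 100_000)
    samples = Float64[]
    while length(samples) < n
        x = gen()
        x < eps && push!(samples, x)
    end
    ExactOneSampleKSTest(samples, Uniform(0, eps))
end

tail_ks(() -> rand(Float32))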
Current rand(Float32) fails spectacularly. This test takes on the order of 37 seconds to run.
Results of testing the left tail of rand(Float32) against Uniform(0, 2.9960563e-6):
max deviation: 0.01866
p value = 1.1206444407203054e-30
Meanwhile, the rejection sampler that does ldexp(Float32(rand(UInt32)), -32) and rejects 1.0f0 passes with flying colors. This took around 126 seconds to run, so about 3x slower.
Results of testing the left tail of rand(LdexpRNG, Float32) against Uniform(0, 2.9960563e-6):
max deviation: 0.0008175587654113969
p value = 0.9134684465077278
That difference in timing doesn’t show up in @benchmark, where the mean of one execution was 4.98 ns vs 2.94 ns, so more like a factor of 1.69.
On the other hand, if you test 100_000_000 samples from the full range, both generators do ok as a model of Uniform(0,1).
Result of testing rand() against Uniform(0,1):
max deviation: 9.6462083892912e-5
p value: 0.3098683626725186
Result of testing rand(ldexp) against Uniform(0,1):
max deviation: 3.4763462677056545e-5
p value: 0.6469426326038142
I recommend the use of these kinds of tests to validate the uniformity of the distribution, rather than the “exact at certain points” criterion. From a statistical perspective, small smeared-out deviations from perfect which nevertheless pass statistical tests like this are preferable to a proof of exactness at certain locations along the interval. Ultimately, tests of statistical hypotheses are the appropriate way to evaluate the success of Monte Carlo methods, as they are the sorts of things one does with Monte Carlo simulations.
I’m building a Quarto document to summarize information about any proposed generators we come up with, and to compare timings… will be happy to share with others once it’s stabilized.
Sorry, I do not understand the motivation for this kind of test. Effectively you get very few possible values below your threshold for Float32, so it fails. But this holds for any sufficiently small interval over the whole range, simply because floats are discrete, not continuous. Even with the proposed fixes in this thread, K-S will reject all over the place with narrow windows.
The uniformity is assumed in UInt32 (etc.) space by the underlying RNG. We should test that the mapping preserves the properties we are after, and it should be possible to test this in a more straightforward manner. Stochastic tests are a last resort; they are a nightmare in CI.
I agree that using statistical tests here seems misguided. We know what the distribution we’re generating is. I think it’s not entirely accidental that people who are proposing generating values with “approximate” methods are also proposing statistical tests. Frankly, neither one is good enough. If you really understand the distribution you’re generating you should be able to predict the result of any statistical test. Running such tests as a sanity check is fine but it’s absolutely not how we should be designing these things.
Also: what two distributions are being tested for sameness with the K-S test? Generated versus what? Mathematically uniform? Ideally generated uniform? How are you doing this?
And yet the alternative I coded up doesn’t fail. Most of the alternatives we are considering won’t fail, because they resolve down into the N=32 range. The point of this test is that a good Float32 generator won’t fail it! The other point is that it mimics the outcome of a statistical computation a practitioner cares about, and that the criterion for the test is precisely maximum deviation from the theoretical CDF. I’m planning to add some more tests to the document.
For example, when it comes to rounding modes, perhaps individual possible outcomes wind up overrepresented: effectively a sparse set of delta-function holes or point masses. This breaks the “cdf equal to x at each generated quantity” heuristic. That heuristic isn’t what a statistician wants, though. Current rand() satisfies it but fails my left-tail uniformity test by a country mile.
Note I’m not suggesting doing these tests in CI, only generating a batch report which people can read so that they can have confidence in the quality of the numbers. After all, the distribution isn’t going to change unless the code changes.
Happy to share the Quarto output once I’ve made it nicer and a little more general. This uses an exact one-sample K-S test against the exact CDF of a uniform distribution on the range being tested; the K-S test statistic is precisely the maximum deviation from the CDF.
Yes, if we are doing a good job, we should be able to predict the outcome of the test. But suppose we find a fast way to get buttery-smooth numbers that results in slight lumps around certain roundoff values, or debiases the CDF in the process but produces a pattern of slightly-over and slightly-under density related to floating-point representation patterns. Suppose this method is fast like current methods, debiased unlike current methods, and results in vastly smaller maximum deviations from cdf(x) = x than the current 24-bit float resolution, but its exact bit-for-bit distribution is harder to analyze. Should we reject it? Or should we determine whether it acts like a uniform generator for the purpose it was designed for, which is to pass statistical tests and be a suitable generator for Monte Carlo simulations?
My point is that if we insist on y = x at each float we generate, it limits our options and isn’t what a statistician wants; after all, we have y = x at the current rand(Float32) outputs, and it handily fails a test of uniformity in less than 30 seconds.
I really don’t understand why you want to use an approximate process based on heuristics when it’s possible to have a rigorously defined and exact method to generate arbitrarily many bits. The exact method is even looking to be faster.
The status quo fails your test, but we know why it fails and can predict it. That’s because the status quo is also an exact implementation of one such rigorous definition, one that simply lacks precision at the low end.
Heuristics and approximations simply don’t pass muster and aren’t the path forward. More tests are always cool though.
The “exact” method satisfies cdf(x) >= x for all values below a certain value in your table, something substantially bigger than 1e-6.
In other words, it’s not exact.
What mbauman discusses is “exact” in the sense that it satisfies a specific inequality, something like x <= cdf(x) < next(x), that holds universally across the entire domain. Here, next(x) returns nextfloat(x) (infinite precision) or the smallest strictly-larger representable multiple of 2^-N (finite precision).
There may be “off-by-one” or swapped-inequality errors in my above statements, but this is the gist.
With these sorts of strict transformational guarantees, we define what it means for the float transformation to be correct (e.g., something like my bounds above) and outsource any further correctness discussion to the UInt RNG itself. If someone applies a busted RNG (it’s a user input, after all), we’re powerless to deliver a correct distribution here.
I should have been more careful with the word “exact”. I meant “an exact implementation of a rigorously defined process”. For examples of what a rigorous definition might look like, see my docs post above.