Consensus building poll
The discussion has run in circles, a lot. I’d like to establish whether we have reached consensus on some basic points, such that we don’t need to endlessly re-litigate them. As a tool for this, I will use public polls. Don’t fret, you can change your vote after thinking about it! Each of these basic points will have at least three options: to agree, to disagree, and to ask for clarification.
Ideally the consensus notes would be written by someone who disagrees with me. But someone has to start, so I’ll do that, and I will be happy to add your formulations. It is not the goal to reach consensus on everything – there are tradeoffs and valid differences of opinion. But let’s fix the non-contentious points first!
Does this procedure sound sane to you?
@mbauman Do we want to move this post to a separate thread that we try to keep low-noise and where we try to keep out too detailed implementation discussions?
-
I understand the goal of polling to build a minimal consensus, and I agree that this is a good idea.
-
I understand the goal of polling to build a minimal consensus, and I think this is a bad idea (for any reason – e.g. waste of time, or risk of leading questions warping the discussion, or privacy of the vote).
-
Other / I do not understand what these polls are about. Please clarify.
Cumulative distribution functions
It is important to talk about probability distributions `X` of real numbers, extended to include `-Inf` and `Inf`. Such distributions are commonly described by their cumulative distribution function:

```
cdf(p)  = probability(X() < p)
cdfQ(p) = probability(X() <= p)
```

By construction, `cdf(-Inf) = 0` and `cdfQ(Inf) = 1`. By construction, both `cdf` and `cdfQ` are monotonically non-decreasing, and `cdf <= cdfQ`. These functions can be recovered from each other, via `cdfQ(p) = infimum {cdf(q): q > p}` and `cdf(p) = supremum {cdfQ(q): q < p}` for finite `p`, and the definitional constraints for infinite `p`. These functions allow us to recover `probability(X() == p) = cdfQ(p) - cdf(p)`, and similarly the probabilities for all intervals. We will assume without further mention that our `cdf` is sane (corresponds to a Lebesgue measurable distribution – bring your Cantor staircases, but no axiom-of-choice monstrosities, please).
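As a tiny illustration of how a point mass shows up as a jump between the two functions, here is a hypothetical fair coin on `{0, 1}` (the names `cdf`/`cdfQ` are local to this sketch, not an API):

```julia
# A fair coin on {0, 1}: cdf(p) = P(X < p), cdfQ(p) = P(X <= p).
cdf(p)  = p <= 0 ? 0.0 : (p <= 1 ? 0.5 : 1.0)
cdfQ(p) = p < 0  ? 0.0 : (p < 1  ? 0.5 : 1.0)
cdfQ(0) - cdf(0)  # == 0.5, recovering probability(X() == 0)
```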
-
I understand the relation between probability distributions and `cdf`, and I agree that this is a good idea.
-
I understand the relation between probability distributions and `cdf`, and I either think that this modeling choice is a bad idea for this discussion, or that the above description is wrong or misleading.
-
Other / I do not understand what you’re saying. Please clarify!
We ultimately want to talk about probability distributions `X` of floating point numbers, including `-Inf` and `Inf`, and excluding all `NaN` values and excluding `-0.0`, i.e. negative signed zero. For this, we can use the same construction to build `cdf` and `cdfQ`, which now are maps from floating point numbers to the unit interval of real numbers `[0, 1]`. Any probability distribution on a floating point type can be described exactly by its `cdf`, and the `cdf` is always defined on every floating point number of this type (excluding all `NaN` values and negative signed zero).

The relation between `cdf` and `cdfQ` now becomes `cdf(x) = cdfQ(prevfloat(x))` and `cdfQ(x) = cdf(nextfloat(x))`, except for the special cases `cdf(-Inf) = 0` and `cdfQ(Inf) = 1`.
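In code, that relation could be written as follows (a minimal sketch; the helper names `cdfQ_from_cdf` / `cdf_from_cdfQ` are made up for illustration):

```julia
# The two cdf flavors determine each other on floats via neighboring values,
# with the definitional conventions at the infinite endpoints.
cdfQ_from_cdf(cdf, x)  = x == Inf  ? 1.0 : cdf(nextfloat(x))
cdf_from_cdfQ(cdfQ, x) = x == -Inf ? 0.0 : cdfQ(prevfloat(x))
```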
The values `cdf(p)` are mathematical real numbers, and are not necessarily exactly representable in the floating point type from which `p` or `X()` is drawn. In fact, `cdf(p)` need not be a rational number at all! (For practical purposes it must be a computable real, though. Generating irrational probabilities can be a pain: valid sampling algorithms that produce irrational probabilities will terminate with probability one, but cannot have any guaranteed upper bound on their runtime.)

We are primarily (but not exclusively) interested in sampling from nonnegative floating point numbers that are smaller than 1. This is the unbreakable API promise of `rand()`, and it is equivalent to the constraints `cdf(0) = 0` and `cdfQ(1) = 1`.
-
I understand the relation between probability and `cdf` and `cdfQ` for floating point numbers. This all seems right to me.
-
I understand what you’re saying, but something is wrong or misleading or there is something else I disagree with.
-
Other / I do not understand what you’re saying. Please clarify!
The ideal distribution
Absent any other tradeoffs, we would be very happy if we could sample from the unit interval such that `cdf(p) = p` exactly for all floating point numbers.

For this distribution, `probability(X() == p) = cdfQ(p) - cdf(p) = nextfloat(p) - p` for all floats in the unit interval. Therefore this distribution also has the following interpretation: sample a random real number uniformly (i.e. according to `cdf(x) = x`), then round it towards zero to the next floating point number.
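To make the point masses concrete, here is what the ideal distribution assigns to individual `Float64` values, just evaluating the formula above (`prob` is an ad-hoc name):

```julia
# Under cdf(p) = p, each float in [0, 1) owns the real interval [p, nextfloat(p)).
prob(p) = nextfloat(p) - p
prob(0.0)  # 5.0e-324: exact zero occurs with the probability of the smallest subnormal
prob(0.5)  # 1.1102230246251565e-16, i.e. 2.0^-53
```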
-
I understand this, and I agree that `cdf(p) = p` for all floating point numbers would be ideal, absent other tradeoffs.
-
I understand this, and I disagree – I’d like something else. For example, a plausible distribution would be defined by sampling a uniform real number, and rounding it to the nearest floating point number in the `[0, 1)` unit interval. (Exercise: describe the `cdf` defined by this procedure.)
-
I understand this, and I disagree – something in this exposition is wrong!
-
Other / I do not understand what you’re saying. Please clarify!
The above ideal `cdf` can be implemented. An unoptimized sample implementation is the following:
```julia
function idealSample(::Type{T}) where T <: Union{Float64, Float32}
    # Widen the significand mask to UInt64 so that the bit fiddling below
    # does not silently truncate the high bits for Float32.
    mask = UInt64(Base.significand_mask(T))
    u = rand(UInt64)
    # Low significand bits plus an explicit leading one bit.
    # The following conversion is exact.
    mantissa = convert(T, (mask & u) | (UInt64(1) << Base.significand_bits(T)))
    shift = Base.significand_bits(T) + 1
    # Draw a geometric exponent by counting leading zeros among the high bits;
    # the mantissa bits are forced to one so that they never count.
    u = u | mask
    shift += leading_zeros(u)
    u = u & ~mask
    # If all high bits were zero, keep consuming fresh 64-bit words. The bound
    # of 1200 guarantees termination; it lies beyond the smallest Float64 exponent.
    while u == 0 && shift < 1200
        u = rand(UInt64)
        shift += leading_zeros(u)
    end
    ldexp(mantissa, -shift)
end
```
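A quick empirical spot check (not a proof, and with arbitrarily chosen cutoffs): the observed frequency of `X < p` should track the ideal `cdf(p) = p`:

```julia
# Compare empirical frequencies against the ideal cdf(p) = p.
xs = [idealSample(Float64) for _ in 1:10^7]
for p in (ldexp(1.0, -4), 0.25, 0.9)
    println("cdf(", p, ") ≈ ", count(x -> x < p, xs) / length(xs))
end
```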
For what it’s worth, this implementation is not absurdly slow:
```julia
julia> using BenchmarkTools

julia> @btime rand(Float32)
  2.243 ns (0 allocations: 0 bytes)
0.27027202f0

julia> @btime idealSample(Float32)
  4.023 ns (0 allocations: 0 bytes)
0.20275846f0

julia> @btime rand(Float64)
  2.231 ns (0 allocations: 0 bytes)
0.08395345902826745

julia> @btime idealSample(Float64)
  4.141 ns (0 allocations: 0 bytes)
0.8978442655101635
```
-
I understand this implementation, and am convinced that it generates the ideal distribution.
-
I understand this implementation, and it seems plausible that it generates the ideal distribution. But I am not convinced unless I get a mathematical proof.
-
I understand this implementation, and I think it is wrong.
-
Other / I do not understand the implementation. Please clarify!
Bernoulli in the coalmine
We are discussing the issue of sampling uniformly distributed random floating point numbers from the unit interval, excluding 1 and potentially including zero. An important construction is Bernoulli sampling – not because people are supposed to use this construction, but rather because it nicely illustrates the properties of various sampling schemes, and can serve as a canary in the coalmine:
```julia
julia> struct Bernoulli{T}; p::T; end

julia> struct BernoulliQ{T}; p::T; end

julia> struct Negate{F}; parent::F; end

julia> (b::Bernoulli)() = rand(typeof(b.p)) < b.p

julia> (b::BernoulliQ)() = rand(typeof(b.p)) <= b.p

julia> (n::Negate)() = !n.parent()
```
By construction, `Bernoulli(p)` will happen with probability `cdf(p)` and `BernoulliQ(p)` will happen with probability `cdfQ(p)`. Likewise, `Negate(Bernoulli(p))` will happen with probability `1 - cdf(p)`.
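As a quick Monte Carlo sanity check of that claim (a hypothetical snippet with a modest sample size):

```julia
using Statistics

# The event frequency should approach cdf(0.3), i.e. about 0.3, up to sampling noise.
b = Bernoulli(0.3)
mean(b() for _ in 1:10^6)
```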
-
I understand and this sounds correct.
-
I understand and something is wrong.
-
Other / I do not understand what you’re saying. Please clarify!
Now there is a common misapprehension. This misapprehension is that `Negate(Bernoulli(p))` should happen with the same probability as `BernoulliQ(one(typeof(p)) - p)`. This is a misapprehension because, if we negate that again, we would obtain that `Bernoulli(p)` and `Bernoulli(1 - (1 - p))` would happen with the same probability. However, floating point addition is not associative; in fact the operation `p -> one(T) - (one(T) - p)` is equivalent to rounding the floating point number `0 <= p <= 1` to a fixed point number with precision `eps(one(T))`.
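Two concrete data points for that round trip, in standard `Float64` arithmetic:

```julia
# p -> 1 - (1 - p) quantizes p to a multiple of eps(1.0) = 2^-52.
p = 1.0e-20
1 - (1 - p)        # 0.0: the tiny probability is rounded away entirely
q = 0.3
1 - (1 - q) == q   # false: we get 0.30000000000000004 instead
```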
-
I understand this, and agree that it is a misapprehension. `Negate(Bernoulli(p))` should not in general be equivalent to `BernoulliQ(1-p)`.
-
I understand this, and I disagree. We should guarantee that `Negate(Bernoulli(p))` happens with the same probability as `BernoulliQ(1-p)`. If this means that `Bernoulli(p)` will have a probability of `1-(1-p)`, i.e. that probabilities are effectively rounded to `eps(1.0)`, then this is a bullet I’m willing to bite.
-
Other / I do not understand what you’re saying. Please clarify!
Discretization of probability
Suppose that we decide on tradeoffs that limit the number of random bits we’re willing to consume, e.g. to 64 bits. Then all probabilities of events over our sampling must be integer multiples of `ldexp(1.0, -64) ~ 5e-20`, i.e. `2^(-64)`. This is by counting: we can simply count how many of the `2^64` possible bitpatterns map to our event.
Note that this is a separate thing from Shannon entropy. The Shannon entropy of a discrete distribution is defined as the expectation of `-log2(p)`. The Shannon entropy of our ideal `cdf` for `Float64` is almost 54 bits (52 bits for the mantissa, 2 bits from the exponent, by summing up the geometric series). This means that, if we plan to generate 1000 floats, we expect to consume about 54000 bits of randomness, and it is quite safe to assume that 60000 bits will suffice. It does not mean that 54 or 64 bits are enough to generate a single random `Float64`! (Analogy: `rand()` has a mean of `0.5`; that does not mean that every single sample will be equal to `0.5`.)
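The 54-bit figure can be reproduced with a one-line back-of-envelope computation (ignoring the subnormal tail, which only shaves off a negligible amount):

```julia
# A float in [2^-k, 2^(-k+1)) has pointwise probability 2^-(52+k) under the
# ideal cdf, and that whole bin has mass 2^-k, so the entropy in bits is:
entropy_bits = sum(2.0^-k * (52 + k) for k in 1:1000)  # ≈ 54.0
```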
This tells us immediately that such a tradeoff will adversely impact our capability of exactly resolving the small tail of our `cdf`. `cdf(p)` for `p > ldexp(1.0, -10) ~ 1e-3` can be implemented exactly, but e.g. `cdf_tradeoff(ldexp(1.0, -65))` can be either zero (if we decide not to emit smaller values) or `ldexp(1.0, -64)` (if we decide to emit this value). Overweighting something by a factor of 2 doesn’t sound so bad, but the same holds for e.g. `cdf_tradeoff(ldexp(1.0, -105))` – so if we overweight the tail, e.g. by including zero in the output, then such super-rare events are overweighted by a factor of `ldexp(1.0, 41)`, which is 12 orders of magnitude.
If we’re working in Float64
, then the ideal cdf would be exact at ldexp(1.0, -1000)
; overweighting that to ldexp(1.0, -64)
is quite a large mistake!
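For concreteness, the overweighting factors can be tabulated directly (`BigFloat` to avoid underflow; `overweight` is an ad-hoc name):

```julia
# Factor by which a 64-bit probability grid overweights an ideal tail
# probability of 2^-k, assuming the value is emitted at all.
overweight(k) = big(2.0)^-64 / big(2.0)^-k
overweight(65)    # 2.0: barely noticeable
overweight(105)   # ≈ 2.2e12: 12 orders of magnitude
overweight(1000)  # ≈ 5.8e281: absurd
```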
-
I understand this, and agree that this analysis of discretization of probability is basically correct.
-
I understand this, and I disagree. Something is wrong here.
-
Other / I do not understand what you’re saying. Please clarify!
Limits of accuracy in view of bounded sample sizes
In our context, probabilities and `cdf` are functions from floating point numbers (embodied by bit patterns on a computer) to probabilities, typically modelled by real numbers (a mathematical abstraction living in the Platonic realm).

As such, when discussing the acceptability of tradeoffs, we should treat the input values of `cdf` like we treat floating point numbers, including considerations of rounding modes, and absolute and relative tolerances; and we should treat the output values as probabilities.

Accuracy considerations on probabilities are in the realm of statistics, not in the realm of IEEE floating point numbers. We should use tools from statistics to reason about accuracy in the output of `cdf`.
-
I understand this, and agree that this kind of treatment is appropriate in general.
-
I understand this, and I disagree that the realm of statistics is appropriate to discuss accuracy of probability values in this context.
-
Other / I do not understand what you’re saying, even though I have read the following text that contains further explanations. Please clarify!
In order to ask whether a `cdf` is a good approximation to the ideal `cdf`, I propose the following criterion, formulated in crypto style:

Definition (adversarial distinguishability). An approximate distribution is adversarially `N`-indistinguishable from the ideal distribution if the following holds:

We play the following game. I sample `N` outputs of the approximate distribution, and `N` outputs of the ideal distribution. I give you both sample sets, but don’t tell you which is which. You now attempt to guess which is which. You are very clever. Regardless of how clever you are, you will not succeed with significant probability, e.g. better than 75%.
Definition (Hypothesis testing distinguishability). An approximate distribution is statistically `N`-indistinguishable from the ideal distribution if the following holds:

You draw up to `N` samples from the approximate distribution. Now you do reasonable statistical tests and hypothesis testing to attempt to reject the hypothesis that the approximate distribution is the same as the ideal distribution. Your statistical tests generate some `p`-value. Your chance of drawing a sample set that rejects the “identical” hypothesis with `p < 5%` is smaller than 10%.
I would be very happy if people could chime in with the correct textbook-stats definition of hypothesis testing indistinguishability. Note that it is trivial (by definition of `p`-value) to craft a test that rejects the “identical” hypothesis with `p = 5%` for 5% of the possible sample sets!
These two definitions differ in the following way: Adversarial distinguishability needs to assume that our sampling procedure has access to a random oracle (otherwise, you are very clever and recover the PRNG state!).

Hypothesis testing, on the other hand, is a very vague concept – it does not specify what kinds of tests are permissible, just that they are “reasonable”. This is the typical thing done in statistical evaluation of random number generators (e.g. BigCrush). There may be debate on e.g. whether it is reasonable to count the parity of the last mantissa bit. If this is a valid test, then our current `rand()` fails with flying colors! But standard testing suites accept our current distribution; hence the scientific consensus on this matter seems to be that this is not a reasonable test.
This kind of treatment implies that our distribution is only good up to some finite number `N` of samples. But this is true for any kind of pseudorandom number generator – they are all periodic, with a period length bounded above by the number of possible internal states.
Also note that I am not saying that our eventual `rand()` implementation must pass such stringent tests. I am saying that tradeoffs that are indistinguishable are free, i.e. that you cannot argue against them on grounds of accuracy.
-
I understand this, and agree these two kinds of goals are reasonable, and that “indistinguishable inaccuracies” are in actuality not inaccuracies at all.
-
I understand this, and agree these two kinds of goals are reasonable. But you, foobar_lv2, suck at statistics, and I’m gonna write in how it is done right.
-
I understand this, and disagree. Either something is factually wrong, or I have more philosophical objections.
-
Other / I do not understand what you’re saying, even though I have read the following text that contains further explanations. Please clarify!
Now we should ask what kind of `N` is reasonable. We do not need to converge on an exact number here, only on a range.

It is clearly reasonable to draw `2^32` samples. I can do this on my laptop in a handful of seconds.

It is clearly unreasonable to draw `2^120` samples. This should be common sense; but I will also offer an argument from authority: 128-bit AES is considered good enough for state secrets by the United States government / NIST.
-
I understand this, and agree.
-
I understand this, and agree, but you’re misinterpreting / misquoting the NIST recommendations.
-
I understand this, and disagree.
-
Other / I do not understand what you’re saying, even though I have read the following text that contains further explanations. Please clarify!
An alternative, much smaller upper bound on `N` than `2^120` can be achieved by considering the prevalence of soft faults, i.e. the reliability of silicon. It may appear unreasonable to demand much higher accuracy from our algorithms than from the hardware they run on!

Note that this kind of argument does not apply to cryptography / key-length considerations: if your CPU hallucinates that it has cracked the key, due to a failing transistor or an unlucky cosmic ray, then this will in fact not help you decrypt a message.
-
I understand this, and I think that the reliability of silicon should be an upper limit on the required `N`.
-
I understand this, but I think that the reliability of silicon should not be taken into consideration.
-
Other / I do not understand what you’re saying, even though I have read the following text that contains further explanations. Please clarify!
Now there was some kerfuffle regarding the smallest floating point number in our output distribution, especially with respect to exact zeros.

By the above argument, it is indistinguishable for `Float64`, `Float32`, and `BFloat16` if we decide to never emit a value smaller than `ldexp(1.0, -120)`. This is because `cdf(ldexp(1.0, -120)) = 2^(-120)` for these float types. This especially applies to the probability of emitting exact zero, but also includes all subnormal numbers (since `ldexp(1.0, -120)` is not subnormal for these float types).
Therefore, it is impossible to argue on accuracy grounds against exclusion of exact zero and subnormals for these float types. One may still make an argument on performance / implementation considerations, though! And note that this is not an argument *for* excluding exact zero from the output – it only says that doing so would be statistically free.
Our current `rand(Float32)` fails the test of indistinguishability from ideal, already on grounds of the probability of exact zeros (which should be of order `nextfloat(0.0f0)` but is in fact of order `eps(1.0f0)`, which is much larger than `2^-32`; and we established that `N = 2^32` is a reasonable sample size).
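Back-of-envelope numbers for that claim (the `2^-24` zero probability for the current fixed-point-style `rand(Float32)` is my assumption; the ideal figure is `nextfloat(0.0f0) = 2^-149`):

```julia
# Expected number of exact zeros among N = 2^32 draws of rand(Float32):
N = 2.0^32
N * 2.0^-24        # 256.0: plainly visible in a laptop-sized experiment
N * big(2.0)^-149  # ≈ 6.0e-36: essentially never
```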
Whether our current `rand(Float64)` fails the test of indistinguishability from ideal on grounds of exact zeros depends on what kind of `N` you believe is reasonable. It is unlikely that we can achieve consensus on this point.
This also implies that there would be a real accuracy cost to excluding exact zero from the `rand(Float16)` output. In this matter, `Float16` is qualitatively different from the other floating point types.
-
I understand this, and agree.
-
I understand this, and disagree.
-
Other / I do not understand what you’re saying, even though I have read the following text that contains further explanations. Please clarify!
The consideration of statistical distinguishability also allows us to study the kind of precision in probabilities / `cdf` that is reasonable, away from the very unlikely tails of the distribution.
Suppose that we are drawing `N` samples from `Bernoulli(p)`, and `p` is reasonably far away from `0` and `1`. Then the sample mean is approximately normally distributed, with mean `p` and standard deviation `sqrt(p(1-p)/N)`.
If we happen to discretize probabilities to multiples of `2^-64`, due to performance considerations, and if `0.1 < p < 0.9` and `N = 2^120`, then this tells us that we do not need to care about off-by-one errors in the number of `UInt64` bitpatterns that map to a specific result in this probability range.
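Plugging in the numbers makes this concrete (a sketch; `BigFloat` only to keep `2^120` exact):

```julia
# Sampling noise floor at N = 2^120 versus the 2^-64 probability grid:
p = 0.5
stddev = sqrt(p * (1 - p) / big(2.0)^120)  # ≈ 4.3e-19, i.e. 2^-61
big(2.0)^-64 < stddev                      # true: a one-step grid error drowns in the noise
```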
-
I understand this, and agree.
-
I understand this, and disagree.
-
Other / I do not understand what you’re saying. Please clarify!
Ok, so much for stats for now. There is more to come on that topic, but I’d be very happy if people with more than a faint clue about statistics could take over here!
Also, this already took a lot of my time to write, and will have taken a lot of your time to read. To be continued here…
Exact zeros and bugs
Another kerfuffle involving exact zeros was my claim that this predictably leads to bugs: things like people writing `1/rand()` or `log(rand())` and expecting a finite number. Bugs are bad, and we already established that we can prevent this class of bugs without any loss of accuracy (the performance impact might still be prohibitive, though!).
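For the record, the failure mode in question, assuming `rand()` can return exactly zero:

```julia
# Both "obviously finite" expressions blow up on a zero draw:
log(0.0)  # -Inf
1 / 0.0   # Inf
```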
-
I believe this claim, on strength of argument alone.
-
This is an empirical claim. Proper evidence of exact zeros leading to bugs would be bug reports involving improper zero-unsafe usage of `rand()` in high-quality code. Proper evidence against this claim would be a suspicious absence of such bugs.
-
This is a philosophical claim, to be evaluated on strength of argument alone. And I don’t buy it.
-
Other / I do not understand what you’re saying. Please clarify!
In order to make this point, I brought an example of such a bug in the implementation of `randexp` in the Julia stdlib.
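To illustrate the failure mode only (a minimal sketch, not the actual stdlib implementation): sampling an exponential by inversion returns `Inf` on a zero draw:

```julia
# Hypothetical inversion sampler; the stdlib code differs, but shares the hazard.
naive_randexp() = -log(rand())
-log(0.0)  # Inf: the value naive_randexp() produces whenever rand() hits 0.0
```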
-
This is a genuine bug in high-quality code and is sufficient to convince me of the claim.
-
This is a genuine bug in high-quality code. But only one such bug is insufficient to convince me. Bring it on!
-
This is not a bug. It is clearly intended: `randexp` is supposed to have unbounded range, and might return values larger than `floatmax` that correctly round to `Inf`.
-
This may or may not be a bug. It does not matter – this behavior is only visible for `N > 2^50`, and that would be an unreasonable sample size.
-
Lol, stdlib Random is not high-quality code.
-
Other