Output distribution of rand(Float32) and rand(Float64), thread 2

I don’t think appeals to small numbers and heuristics are useful when designing something this fundamental. I want to signal boost the words of @StefanKarpinski since I think he put it very well:

in fact, it seems to me that on the contrary, we should use Float16 and Float8 (which is even a thing people are starting to use thanks to neural nets) as reductio ad absurdum stress test cases for any proposal to see if it is a good idea. If it falls apart for Float16, then we shouldn’t do it for any float type

4 Likes

It’d be great if we could agree on our naming here — these recent spitballs are all attempts at “Linear N” algorithms that need to be optimizations of the correct but absurdly slow:

rand_linear_n(::Type{F}, N) where F <: AbstractFloat =
    F(round(rand(0:big(2)^N-1)//big(2)^N, RoundDown; base=2, digits=Base.significand_bits(F)+1))

@Dan, check out @. abs(ldexp(Float32($(2^24:2^24+20)),-63))

The smaller the Float, the smaller the rounding issue is.
And the numbers are indeed of order 2^(-40).

But I agree, it isn’t pretty, and thanks for reminding me of the start of the thread.

As for the 1.0 issue, that is real.

in that case, sanity check against Float128 or Float256. I think the point is that the implementation should make sense (and ideally be identical) no matter the bitsize

2 Likes

Consensus building poll

The discussion has run in circles, a lot. I’d like to establish whether we have reached consensus on some basic points, such that we don’t need to endlessly re-litigate them. As a tool for this, I will use public polls. Don’t fret, you can change your vote after thinking about it! Each of these basic points will have at least three options: to agree, to disagree, and to ask for clarification.

Ideally the consensus notes would be written by someone who disagrees with me. But someone has to start, so I’ll do that, and I will be happy to add your formulations. It is not the goal to reach consensus on everything – there are tradeoffs and valid different opinions. But let’s fix the non-contentious points first!

Does this procedure sound sane to you?

@mbauman Do we want to move this post to a separate thread that we try to keep low-noise and where we try to keep out too detailed implementation discussions?

  • I understand the goal of polling to build a minimal consensus, and I agree that this is a good idea.

  • I understand the goal of polling to build a minimal consensus, and I think this is a bad idea (for any reason – e.g. waste of time, or risk of leading questions warping the discussion, or privacy of the vote).

  • Other / I do not understand what these polls are about. Please clarify.

0 voters

Cumulative distribution functions

It is important to talk about probability distributions X of real numbers, extended to include -Inf and Inf. Such distributions are commonly described by their cumulative distribution function:


cdf(p) = probability(X() < p)

cdfQ(p) = probability(X() <= p)

By construction, cdf(-Inf) = 0 and cdfQ(Inf) = 1. By construction, both cdf and cdfQ are monotonically non-decreasing, and cdf <= cdfQ. These functions can be recovered from each other, via cdfQ(p) = infimum {cdf(q): q > p} and cdf(p) = supremum {cdfQ(q): q < p} for finite p, and the definitional constraints for infinite p. These functions allow us to recover probability(X() == p) = cdfQ(p) - cdf(p), and similarly the probabilities for all intervals. We will assume without further mentioning that our cdf is sane (corresponds to Lebesgue measurable distributions – bring your Cantor staircases, but no axiom of choice monstrosities please).

  • I understand the relation between probability distributions and cdf, and I agree that this is a good idea.

  • I understand the relation between probability distributions and cdf, and I either think that this modeling choice is a bad idea for this discussion, or that the above description is wrong or misleading.

  • Other / I do not understand what you’re saying. Please clarify!

0 voters

We ultimately want to talk about probability distributions X of floating point numbers, including -Inf and Inf, and excluding all NaN values and excluding -0.0, i.e. negative signed zero. For this, we can use the same construction to build cdf and cdfQ, which now are maps from floating point numbers to the unit interval of real numbers [0, 1]. Any probability distribution on a floating point type can be described exactly by its cdf, and the cdf is always defined on every floating point number in this type (excluding all NaN values and signed negative zero).

The relation between cdf and cdfQ now becomes cdf(x) = cdfQ(prevfloat(x)) and cdfQ(x) = cdf(nextfloat(x)), except for the special cases cdf(-Inf) = 0 and cdfQ(Inf) = 1.
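
As a brute-force illustration, both functions can be estimated for the current rand(Float32) by Monte Carlo (the helper names here are mine, not part of any proposal), and the prevfloat relation is visible in the estimates:

est_cdf(p; n = 10^7)  = count(_ -> rand(Float32) <  p, 1:n) / n
est_cdfQ(p; n = 10^7) = count(_ -> rand(Float32) <= p, 1:n) / n

est_cdf(0.25f0)               # ≈ 0.25
est_cdfQ(prevfloat(0.25f0))   # counts exactly the same outcomes, so also ≈ 0.25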

The values cdf(p) are mathematical real numbers, and are not necessarily exactly representable by the floating point number type from which p or X() is drawn. In fact, cdf(p) need not be a rational number at all! (for practical purposes, it must be a computable real, though. Generating irrational probabilities can be a pain! Valid sampling algorithms that produce irrational probabilities will terminate with probability one, but cannot have any guaranteed upper bound on their runtime)
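
A classic sketch of why "terminates with probability one" is the best we can do here: draw Bernoulli(p) by comparing random bits against the binary expansion of p until they first disagree. (This uses BigFloat as a stand-in for the reals, so really only precision(BigFloat) bits of p are honored.)

# returns true with probability (approximately) p; each iteration resolves with
# chance 1/2, so the expected runtime is finite, but there is no worst-case bound
function bernoulli_real(p::BigFloat)
    while true
        p *= 2
        bit = p >= 1                      # next bit of p's binary expansion
        bit && (p -= 1)
        rand(Bool) != bit && return bit   # first disagreement decides the draw
    end
end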

We are primarily (but not exclusively) interested in sampling from nonnegative floating point numbers that are smaller than 1. This is the unbreakable API promise of rand(), and it is equivalent to the constraints cdf(0) = 0 and cdfQ(1) = 1.

  • I understand the relation between probability and cdf and cdfQ for floating point numbers. This all seems right to me.

  • I understand what you’re saying, but something is wrong or misleading or there is something else I disagree with.

  • Other / I do not understand what you’re saying. Please clarify!

0 voters

The ideal distribution

Absent any other tradeoffs, we would be very happy if we could sample from the unit interval such that cdf(p) = p exactly for all floating point numbers.

For this distribution, probability(X() == p) = cdfQ(p) - cdf(p) = nextfloat(p) - p for all floats in the unit interval. Therefore this distribution also has the following interpretation: sample a random real number, uniformly (i.e. according to cdf(x) = x), then round it towards zero to the next floating point number.
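
This interpretation can be written down directly. A reference sketch (correct only down to the default 256-bit precision of BigFloat, and far too slow for real use):

# sample a high-precision uniform real, then round it toward zero onto T's grid
ideal_reference(::Type{T}) where {T<:AbstractFloat} = T(rand(BigFloat), RoundDown)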

  • I understand this, and I agree that cdf(p) = p for all floating point numbers would be ideal, absent other tradeoffs.

  • I understand this, and I disagree – I’d like something else. For example, a plausible distribution would be defined by sampling a uniform real number, and rounding it to the nearest floating point number in the [0,1) unit interval. (exercise: describe the cdf defined by this procedure)

  • I understand this, and I disagree – something in this exposition is wrong!

  • Other / I do not understand what you’re saying. Please clarify!

0 voters

The above ideal cdf can be implemented. An unoptimized sample implementation is the following:


function idealSample(::Type{T}) where T <: Union{Float64, Float32}
    u = rand(UInt64)
    # the following conversion is exact: the low significand bits of u,
    # with the implicit leading bit set, become an integer mantissa
    mantissa = convert(T, (Base.significand_mask(T) & u) | (UInt64(1) << Base.significand_bits(T)))
    shift = Base.significand_bits(T) + 1
    u = u | Base.significand_mask(T)   # set the used low bits, so that only the
    shift += leading_zeros(u)          # spare high bits can count as leading zeros
    u = u & ~Base.significand_mask(T)
    while u == 0 && shift < 1200       # all spare bits were zero: draw more entropy
        u = rand(UInt64)
        shift += leading_zeros(u)
    end
    ldexp(mantissa, -shift)
end

For what it’s worth, this implementation is not absurdly slow:


julia> @btime rand(Float32)
  2.243 ns (0 allocations: 0 bytes)
0.27027202f0

julia> @btime idealSample(Float32)
  4.023 ns (0 allocations: 0 bytes)
0.20275846f0

julia> @btime rand(Float64)
  2.231 ns (0 allocations: 0 bytes)
0.08395345902826745

julia> @btime idealSample(Float64)
  4.141 ns (0 allocations: 0 bytes)
0.8978442655101635

  • I understand this implementation, and am convinced that it generates the ideal distribution.

  • I understand this implementation, and it seems plausible that it generates the ideal distribution. But I am not convinced unless I get mathematical proof.

  • I understand this implementation, and I think it is wrong.

  • Other / I do not understand the implementation. Please clarify!

0 voters

Bernoulli in the coalmine

We are discussing the issue of sampling random uniformly distributed floating point numbers from the unit interval, excluding 1 and potentially including zero. An important construction is Bernoulli sampling – not because people are supposed to use this construction, but rather because it nicely illustrates the properties of various sampling schemes, and can serve as a canary in the coalmine:


julia> struct Bernoulli{T}; p::T; end

julia> struct BernoulliQ{T}; p::T; end

julia> struct Negate{F}; parent::F; end

julia> (b::Bernoulli)() = rand(typeof(b.p)) < b.p

julia> (b::BernoulliQ)() = rand(typeof(b.p)) <= b.p

julia> (n::Negate)() = !n.parent()

By construction, Bernoulli(p) will happen with probability cdf(p) and BernoulliQ(p) will happen with cdfQ(p). Likewise, Negate(Bernoulli(p)) will happen with probability 1 - cdf(p).

  • I understand and this sounds correct.

  • I understand and something is wrong.

  • Other / I do not understand what you’re saying. Please clarify!

0 voters

Now there is a common misapprehension. This misapprehension is that Negate(Bernoulli(p)) should happen with the same probability as BernoulliQ(one(typeof(p)) - p). This is a misapprehension because, if we negate that again, we would obtain that Bernoulli(p) and Bernoulli(1 - (1 - p)) would happen with the same probability. However, floating point subtraction does not cancel exactly; in fact the operation p -> one(T) - (one(T) - p) is equivalent to rounding the floating point number 0 <= p <= 1 to a fixed point number with precision eps(one(T)).
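
A two-line demonstration of that rounding:

p = 1.0f-10
1.0f0 - (1.0f0 - p)   # == 0.0f0: a probability far below eps(1f0) is rounded away entirely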

  • I understand this, and agree that it is a misapprehension. Negate(Bernoulli(p)) should not in general be equivalent to BernoulliQ(1-p).

  • I understand this, and I disagree. We should guarantee that Negate(Bernoulli(p)) happens with the same probability as BernoulliQ(1-p). If this means that Bernoulli(p) will have a probability of 1-(1-p), i.e. that probabilities are effectively rounded to eps(1.0), then this is a bullet I’m willing to bite.

  • Other / I do not understand what you’re saying. Please clarify!

0 voters

Discretization of probability

Suppose that we decide on tradeoffs that limit the number of random bits we’re willing to consume, e.g. to 64 bits. Then all probabilities of events over our sampling must be integer multiples of ldexp(1.0, -64) ~ 5e-20, i.e. 2^(-64). This is by counting: We can simply count how many out of the 2^64 many possible bitpatterns map to our event.

Note that this is a separate thing from Shannon entropy. Shannon entropy of a discrete distribution is defined as the expectation of -log(p). The Shannon entropy of our ideal cdf for Float64 is almost 54 bits (52 bits for the mantissa, 2 bits from the exponent by summing up the geometric series). This means that, if we plan to generate 1000 floats, we expect to consume about 5400 bits of randomness, and it is quite safe to assume that 6000 bits will suffice. It does not mean that 54 or 64 bits are enough to generate a single random Float64! (analogy: rand() has a mean of 0.5; that does not mean that every single sample will be equal to 0.5)

This tells us immediately that such a tradeoff will adversely impact our capability of exactly resolving the small tail of our cdf. cdf(p) for p > ldexp(1.0, -10) ~ 1e-3 can be implemented exactly, but e.g. cdf_tradeoff(ldexp(1.0, -65)) can be either zero (if we decide not to emit smaller values) or ldexp(1.0, -64) (if we decide to emit this value). Overweighting something by a factor of 2 doesn’t sound so bad, but the same holds for e.g. cdf_tradeoff(ldexp(1.0, -105)) – so if we overweight the tail by e.g. including zero in the output, then such super rare events are overweighted by a factor of ldexp(1.0, 41), which is 12 orders of magnitude.

If we’re working in Float64, then the ideal cdf would be exact at ldexp(1.0, -1000); overweighting that to ldexp(1.0, -64) is quite a large mistake!
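
(Side check of the ~54-bit entropy figure above: under the ideal cdf, each of the 2^52 Float64 values in the binade [2^(-e), 2^(-e+1)), for e >= 1, has probability 2^-(52+e), so that binade contributes 2^-e * (52 + e) bits of entropy. Summing:

sum(e -> 2.0^-e * (52 + e), 1:64)   # ≈ 54.0, i.e. 52 + 2

which recovers the total, up to negligible subnormal corrections.)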

  • I understand this, and agree that this analysis of discretization of probability is basically correct.

  • I understand this, and I disagree. Something is wrong here.

  • Other / I do not understand what you’re saying. Please clarify!

0 voters

Limits of accuracy in view of bounded sample sizes

In our context, probabilities and cdf are functions from floating point numbers (embodied by bit patterns on a computer) to probabilities, typically modelled by real numbers (a mathematical abstraction living in the Platonic realm).

As such, when discussing acceptability of tradeoffs, we should treat the input values of cdf like we treat floating point numbers, including considerations of rounding modes, and absolute and relative tolerances; and we should treat the output values as probabilities.

Accuracy considerations on probabilities are in the realm of statistics, not in the realm of IEEE floating point numbers. We should use tools from statistics to reason about accuracy in the output of cdf.

  • I understand this, and agree that this kind of treatment is appropriate in general.

  • I understand this, and I disagree that the realm of statistics is appropriate to discuss accuracy of probability values in this context.

  • Other / I do not understand what you’re saying, even though I have read the following text that contains further explanations. Please clarify!

0 voters

In order to ask whether a cdf is a good approximation to the ideal cdf, I propose the following criterion, formulated in crypto-style:

Definition (adversarial distinguishability). An approximate distribution is adversarially N-indistinguishable from the ideal distribution if the following holds:

We play the following game. I sample N outputs of the approximate distribution, and N outputs of the ideal distribution. I give you both sample sets, but don’t tell you which is which.

You now attempt to guess which is which. You are very clever.

Regardless of how clever you are, you will not succeed with significant probability, e.g. better than 75%.

Definition (hypothesis testing distinguishability). An approximate distribution is statistically N-indistinguishable from the ideal distribution if the following holds:

You draw up to N samples from the approximate distribution. Now you do reasonable statistical tests and hypothesis testing to attempt to reject the hypothesis that the approximate distribution is the same as the ideal distribution. Your statistical tests generate some p-value.

Your chance of drawing a sample that rejects the “identical”-hypothesis with p < 5% is smaller than 10%.

I would be very happy if people could chime in with the correct textbook-stats definition of hypothesis testing indistinguishability. Note that it is trivial (by definition of p-value) to craft a test that rejects the “identical”-hypothesis with p = 5% for 5% of the possible sample sets!

These two definitions differ in the following way: adversarial distinguishability will need to assume that our sampling procedure has access to a random oracle (otherwise, you are very clever and recover the PRNG state!).

Hypothesis testing, on the other hand, is a very vague concept – it does not specify what kind of tests are permissible, just that they are “reasonable”. This is the typical thing done in statistical evaluation of random number generators (e.g. BigCrush). There may be debate on e.g. whether it is reasonable to count the parity of the last mantissa bit. If this is a valid test, then our current rand() fails with flying colors! But standard testing suites accept our current distribution; hence the scientific consensus on this matter seems to be that this is not a reasonable test.

This kind of treatment implies that our distribution is only good up to some finite number N of samples. But this is true for any kind of pseudorandom number generator – they are all periodic, with a period length bounded above by the number of possible internal states.

Also note that I am not saying that our eventual rand() implementation must pass such stringent tests. I am saying that tradeoffs that are indistinguishable are free, i.e. that you cannot argue against them on grounds of accuracy.

  • I understand this, and agree these two kinds of goals are reasonable, and that “indistinguishable inaccuracies” are in actuality not inaccuracies at all.

  • I understand this, and agree these two kinds of goals are reasonable. But you, foobar_lv2, suck at statistics, and I’m gonna write in how it is done right.

  • I understand this, and disagree. Either something is factually wrong, or I have more philosophical objections.

  • Other / I do not understand what you’re saying, even though I have read the following text that contains further explanations. Please clarify!

0 voters

Now we should ask what kind of N is reasonable. We do not need to converge on an exact number here, only on a range.

It is clearly reasonable to draw 2^32 samples. I can do this on my laptop in a handful of seconds.

It is clearly unreasonable to draw 2^120 samples. This should be common sense; but I will also offer an argument from authority: 128-bit AES is considered good enough for state secrets by the United States government / NIST.

  • I understand this, and agree.

  • I understand this, and agree, but you’re misinterpreting / misquoting the NIST recommendations.

  • I understand this, and disagree.

  • Other / I do not understand what you’re saying, even though I have read the following text that contains further explanations. Please clarify!

0 voters

An alternative, much smaller upper bound on N than 2^120 can be achieved by considering the prevalence of soft faults, i.e. the reliability of silicon. It may appear unreasonable to demand much higher accuracy from our algorithms than from the hardware they run on!

Note that this kind of argument does not apply to cryptography / key-length considerations: If your CPU hallucinates that it has cracked the key, due to a failing transistor or unlucky cosmic ray, then this will in fact not help you decrypt a message.

  • I understand this, and I think that the reliability of silicon should be an upper limit on the required N.

  • I understand this, but I think that the reliability of silicon should not be taken into consideration.

  • Other / I do not understand what you’re saying, even though I have read the following text that contains further explanations. Please clarify!

0 voters

Now there was some kerfuffle regarding the smallest floating point number in our output distribution, especially with respect to exact zeros.

By the above argument, it is indistinguishable for Float64, Float32, and BFloat16 if we decide to never emit a value smaller than ldexp(1.0, -120). This is because cdf(ldexp(1.0, -120)) = 2^(-120) for these float types. This especially applies to the probability of emitting exact zero, but also includes all subnormal numbers (since ldexp(1.0, -120) is not subnormal for these float types).

Therefore, it is impossible to argue on accuracy grounds against exclusion of exact zero and subnormals for these float types. One may still make an argument on performance / implementation considerations, though! And this is not an argument for exclusion of exact zero in the output.

Our current rand(Float32) fails the test of indistinguishability from ideal, already on grounds of the probability of exact zeros (which should be on the order of nextfloat(0.0f0) but is in fact of order eps(1.0f0), which is much larger than 2^-32 – and we established that N = 2^32 is a reasonable sample size).
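
This failure is directly observable: counting exact zeros is itself a distinguishing test. A one-line sketch (a few seconds of runtime; the expected count of roughly 2^8 = 256 assumes the status-quo P(rand(Float32) == 0) = 2^-24):

count(_ -> rand(Float32) == 0f0, 1:2^32)   # a few hundred under the status quo;
                                           # the ideal expectation is 2^32 * 2^-149, i.e. none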

Whether our current rand(Float64) fails the test of indistinguishability from ideal on grounds of exact zeros depends on what kind of N you believe is reasonable. It is unlikely that we can achieve consensus on this point.

This also implies that there would be a real accuracy cost to excluding exact zero in the rand(Float16) output: under the ideal distribution, exact zero already has probability nextfloat(Float16(0.0)) ≈ 2^-24, which is visible at reasonable sample sizes. In this matter, Float16 is qualitatively different from the other floating point types.

  • I understand this, and agree.

  • I understand this, and disagree.

  • Other / I do not understand what you’re saying, even though I have read the following text that contains further explanations. Please clarify!

0 voters

The consideration of statistical distinguishability also allows us to study the kind of precision in probabilities / cdf that are reasonable, away from the very unlikely tails of the distribution.

Suppose that we are drawing N samples from Bernoulli(p), and p is reasonably far away from 0 and 1. Then the mean result is approximately normally distributed, with mean p and standard deviation sqrt(p(1-p)/N).

If we happen to discretize probabilities to multiples of 2^-64, due to performance considerations, and if 0.1 < p < 0.9 and N = 2^120, then this tells us that we do not need to care about off-by-one errors in the number of UInt64 that map to a specific result in this probability range.
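
Concretely, a back-of-envelope check of that claim:

p, N = 0.5, big(2)^120
se = sqrt(p * (1 - p) / N)   # standard error of the mean, ≈ 2^-61
big(2.0)^-64 < se            # true: an off-by-one error of 2^-64 in the
                             # discretization hides below the sampling noise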

  • I understand this, and agree.

  • I understand this, and disagree.

  • Other / I do not understand what you’re saying. Please clarify!

0 voters

Ok, so much for stats for now. There is more to come on that topic, but I’d be very happy if people with more than a faint clue about statistics could take over here!

Also this already took a lot of my time to write, and will have taken a lot of your time to read. To be continued here…

Exact zeros and bugs

Another kerfuffle involving exact zeros was my claim that this predictably leads to bugs. Things like people writing 1/rand() or log(rand()), and expecting a finite number. Bugs are bad, and we already established that we can prevent this class of bugs without any loss of accuracy (performance impact might still be prohibitive, though!).

  • I believe this claim, on strength of argument alone.

  • This is an empirical claim. Proper evidence of exact zeros leading to bugs is bug reports involving improper zero-unsafe usage of rand() in high-quality code. Proper evidence against this claim is a suspicious absence of such bugs.

  • This is a philosophical claim, to be evaluated on strength of argument alone. And I don’t buy it.

  • Other / I do not understand what you’re saying. Please clarify!

0 voters

In order to make this point, I brought an example of such a bug in the implementation of randexp in the julia stdlib.

  • This is a genuine bug in high-quality code and is sufficient to convince me of the claim.

  • This is a genuine bug in high-quality code. But only one such bug is insufficient to convince me. Bring it on!

  • This is not a bug. It is clearly intended, randexp is supposed to have unbounded range, and might return values larger than floatmax that correctly round to Inf.

  • This may or may not be a bug. It does not matter – this behavior is only visible for N > 2^50, and that would be an unreasonable sample size.

  • Lol, stdlib Random is not high-quality code.

  • Other

0 voters
7 Likes

I think that all one can expect from a generic (non-crypto), practical implementation of rand(::Type{T}) where T <: AbstractFloat is that it is

  1. reasonably independent (the cycle is effectively infinite for practical purposes, and there is no statistical pattern between subsequent draws, or it is so slight that it can be ignored), and that

  2. when x = rand(T), |P(x \le y) - y| < \varepsilon(y) with some tolerance that is practically reasonable.

The argument in this thread is about whether that tolerance \varepsilon(y) is uniform (ie independent of y), and how large/small it should be.

What is pretty much irrelevant for an RNG like this:

  1. whether 0 is in the output, or not. We are talking about a very low probability event. If an algorithm depends on this in any practical way, it is flawed. Documenting whether 0 and/or 1 is possible makes sense so that a user can check for them, but relying on either is not advisable.

  2. Properties of f(y), y \sim [0,1] when f is a nonlinear function, perhaps with singularities etc. There are no guarantees about these things because the bound \varepsilon(y) is not preserved by nonlinear transformations, so if one wants to generate random numbers like this, extra care is needed. This should be understood by anyone who is programming computers. It is not more of a footgun than any ill-conditioned numerical problem resulting from translating algebra to code.

So, the bottom line for practical purposes is that a low \varepsilon(y) is kind of nice, and so is a uniform \varepsilon(y) = \varepsilon. FWIW, I care more about uniformity than small error bounds, but I would not sacrifice a lot of speed for either.

As a solution, all we should do is document our \varepsilon(y), or at least an upper bound. But at the same time mention that changing this is not a breaking change.

5 Likes

Holy wall of text, @foobar_lv2.

I just have a few comments on your philosophy where I disagreed:

  • I find the cdf and cdfQ treatment a little messy because it’s already a floating point thing that embeds an assumption about how you’re rounding the real numbers. If it were a real number thing, there’d be no need for the distinction. Case in point: cdfQ(prevfloat(1.0)) == 1. I don’t think cdfQ is really a sensible measure.
  • For your Bernoulli arguments, I would say that BernoulliQ is a bug when used with a round-down rand as you presented it.
  • I don’t find the statistical arguments — by themselves — compelling. Case in point: “de-biasing” rand was motivated by a statistical argument and I think it’s a very good thing that we didn’t do that. I think we need a firm grounding on the principle of what a rand() means. This is particularly important because:
  • I agree people are using rand() in ways that cause bugs. I believe this is because programming languages have never given sufficient guidance on what rand() means and how you should work with it. This is where I agree most whole-heartedly with @Tamas_Papp — first and foremost we need to be able to describe what rand() means and how folks should safely perform arithmetic on it.
5 Likes

I agree (as the person who suggested it at one point; I changed my mind since).

Floating point random numbers are imperfect approximations of a process that generates uniform real draws from [0, 1]. Reasonable choices can be made with respect to various trade-offs, but there isn’t a single “best” one, and performance concerns are also highly relevant.

This will be particularly apparent in nonlinear transformations which require specialized methods, see eg randexp. Often these can be implemented in a generic way using a bunch of random bits (an Int64, etc), depending on what properties are desired.

2 Likes

one takeaway I have from this thread is that basically each distribution needs to be implemented thoughtfully, and rand() should not be considered suitable to be naively used as a subroutine for randn, randexp, etc…

so with that being said, it could be useful to add a method

julia> rand(Float64, 1.5:2.5)
ERROR: MethodError: no method matching rand(::Type{Float64}, ::StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64})

which will draw from uniform [a, b) with the same precision / approximation / uniformity properties as rand() has on [0, 1), given that the naive (rand() + a) / (b-a) may not be ideal
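
For a taste of why naive affine maps are subtle, here is one concrete failure mode (a sketch using the textbook map a + rand() * (b - a)): rounding can leak the supposedly excluded right endpoint.

a, b = 0.5, 1.0
r = prevfloat(1.0)    # the largest value rand(Float64) can currently return
a + r * (b - a) == b  # true: the exact result 1 - 2^-54 rounds (ties-to-even) up to 1.0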

1 Like

As to the randexp-returns-+Inf concern, I opened a PR for that late last week. It’s since been merged and should appear by v1.11.

4 Likes

So this is interesting — I’m pretty sure that the naive thing would be ideal with rounding-down floating point arithmetic (if you distribute the division to prevent losing precision). If only our processors could easily do such a thing.

It’s also worth pointing out that log(rand()) isn’t necessarily a bug with our current rand(). There is a sensible meaning to that — you’re just now in a right-closed, rounding-up floating point arithmetic world! This means that the cdf of a log(rand()) should be tested with a <= like log(rand()) <= value!

1 Like

Not for 32 bit floats generated like ldexp(Float32(rand(UInt32)),-32). I mean 1/4B or so is small, but you can generate 4B floats in a few seconds on a desktop computer.

For example:

@time let x = 0.0
    while !isinf(x)
        x += -log(rand(Float32))
    end
end
  0.042344 seconds

If I use Float64 I kill this after 20 seconds and I’m guessing it takes decades or millions of years or whatever to overflow.

But Float32 can represent numbers as small as nextfloat(0.0f0) ~ 1.0f-45, so with a good implementation the code above shouldn’t overflow in any reasonable amount of computing time: on the order of 10^28 years.

That’s the main discussion here I think, try to make the probability of rare events have a better relative error so that people doing simulations who want to observe far tail behavior CAN get far tail behavior even if using Float32

julia> function rand32_for()
           local exponent, a
           for outer exponent in -32:-32:-196
               (a = rand(UInt32)) == 0 || break
           end
           return ldexp(Float32(a), exponent)
       end
rand32_for (generic function with 1 method)

julia> @time let x = 0.0
           while !isinf(x)
               x += -log(rand32_for())
           end
       end

... runs for a long time, then I kill it; it should take trillions of years
1 Like

Aren’t you risking conflating this issue with

julia> mapfoldl(Returns(1f0),+,1:1f10)
1.6777216f7

? Also, I don’t think anyone disagreed that log(rand(T)) was going to grossly over-represent +Inf. Your loop will exit as soon as this occurs, but probably no sooner due to the floating point accumulation errors (I used Float32 to demonstrate this because I didn’t have the patience to add up enough Float64 values to manifest it, but it will). I think most of us concurred that log1p(-rand(T)) was much more appropriate.


I’m starting to believe that a satisfactory exit from this debate will not happen. Eventually it will come down to someone (whether someone here or reviewers of an eventual PR) “pulling rank” and saying “we are[n’t] doing this.”

Is the solution to provide a randf function that permits one to specify the desired precision manually? It will necessarily be slower than rand, but can be used when high dynamic range and/or precision are required.

With such a function in-hand, I think we might be able to agree to leave the current rand untouched or, at most, extend it to consume the full bitwidth of a matching UIntXX. In either case, we should better document rand to denote the output distribution. I’d also suggest that we try to mark the precise set of output values (i.e., anything more specific than [0,1)) as non-stable and direct people to randf for fine control of the output set.

I think the main reason we’re talking past each other here is that everyone has a different view on what P(\texttt{rand(T)} = 0) currently means for the floating point representation of the theoretically ideal uniform real-numbered [0, 1) distribution, what it should mean, and what it could mean (and we’re constantly shifting between these three views without stating it).

  • Status quo: it currently means that a real number in [0, \texttt{eps(T)/2}) was drawn from a theoretically ideal [0,1) real number uniform distribution. This is meaningful and can be fully described, documented, advised upon, and defensively programmed against.
  • Floating point ideal #1: it should mean that a real number in [0, \texttt{floatmin(T)}) was drawn.
  • Floating point ideal #2: or it should mean that a real number in [0, \texttt{nextfloat(zero(T))}) was drawn.
  • Gradient-of-compromises: it could mean that a real number in [0, 2^{-N}) was drawn. The status quo lives on this gradient at N = Base.significand_bits(T)+1. The ideals also live on this gradient. (A sketch of this gradient follows the list.)
  • Definitionally impossible: it could be defined to never happen by messing about with what the probability P(\texttt{rand(T)} = \varepsilon) is, where \varepsilon is the next-biggest output of rand(T), or by very slightly biasing all outputs.
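
A minimal sketch of that gradient for Float64 (rand_gridded is my name for it, not a proposal; it is only exact for N <= 53, since Float64(::UInt64) rounds to nearest):

rand_gridded(N) = ldexp(Float64(rand(UInt64) >>> (64 - N)), -N)
# N = 53 reproduces the status-quo Float64 grid; P(rand_gridded(N) == 0) is 2^-N
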
2 Likes

My post was just supposed to be an example of how “rand(Float32) never produces values smaller than ldexp(1.0,-32), other than 0.0”.

It’s all the values between 0 and ldexp(1.0,-32) that are missing that are the real issue. If I want to observe something that should happen around once every 15 billion random numbers, the only event I can observe is “random Float32 equal to 0”.

Another way to say this is.

let a = rand(Float32)
    while a < 1e-11 || a > 1e-10
        a = rand(Float32)
    end
end

is an infinite loop when it need not be, because Float32 can represent numbers in that range but rand doesn’t produce them.

Consider rand32_for() instead:

julia> function rand32_for()
                  local exponent, a
                  for outer exponent in -32:-32:-196
                      (a = rand(UInt32)) == 0 || break
                  end
                  return ldexp(Float32(a), exponent)
              end
rand32_for (generic function with 1 method)

julia> @time let a = rand32_for()
          while a < 1e-11 || a > 1e-10
              a = rand32_for()
          end
       end
 43.391817 seconds

This is well explained and insightful and makes the case for not emitting zero better than any argument I’ve seen so far. I don’t really buy the “zero is a dangerous value” arguments. We’re used to that and can and should program around it. This argument is subtler: the key point is that if we emit zero, then that imposes a non-zero floor of 1/2^N on the cdf and unless we are sampling from the ideal distribution, this can have exponentially large relative cdf error for small values so P(\mathrm{rand}() < p) for very small p can be way too large. On the flip side, however, if we choose not to emit zero, then for any p less than 1/2^N we have P(\mathrm{rand}() < p) = 0 which is also wrong by an equally large scale. It’s a little unclear to me whether oversampling is really so much worse than undersampling, so it’s not a slam dunk, but it is compelling and a really good observation.

Minor point, but this isn’t true anymore—the last bit isn’t always zero anymore, you can try doing bitstring(rand()) a few times and you’ll see different last bit values.

3 Likes

This is what rand32_for() tries to avoid: if it gets a zero, it recursively generates in the interval 0 to 1/2^32, and if it gets a zero again it recurses further, until it hits some limit. So it fully fills in the lower left tail of the uniform distribution. Ultimately it doesn’t exclude zero, but it makes zero so astronomically unlikely to observe that it might as well be excluded.

As far as I’m concerned it’s fast enough, and it’s a job well done: it seems to solve all of the major problems, and I think we should use the same algorithm for 64-bit. Also, this loop algorithm is somewhat faster than the “perfect” algorithm listed above, and somewhat easier to explain and verify.

My randf proposal was only 4% slower on my system and could potentially be optimized further, so I have some hope that this can actually be done without much slowdown. I still need to do vectorized benchmarks, if I can figure out how to plug this into the XoshiroSimd infrastructure…

I’m actually more inclined to use all the bits of a UInt64 regardless of the bit size of the output float type. That is how I designed my randf, and it means that the chance of emitting zero is 1/2^{64} for Float64 or Float32, which seems like it might be a good thing—the tail behavior doesn’t depend on the type; the type only affects the precision of values produced. For Float16 it’s a bit different, since eps(Float16(0)) is bigger than 1/2^{64}.

3 Likes

I agree that this is maybe too much text for too fine a point. But there were discussions on the interpretation up-thread, whether we want cdf(p) = p or cdfQ(p) = p, and I wanted to at least establish consensus on the language we use to describe this distinction.

Regarding this not making a difference on the real numbers: Maybe not for the uniform distribution, but e.g. mixtures of uniform and Dirac delta are perfectly valid and important probability distributions!

I have chosen not to view floating point rand() as a linear combination of Dirac deltas; but you could, and excluding that modeling a priori seemed presumptuous to me.

Can you explain where I do or appear to do that? This was not my intention. I wanted to start with probability-101 language for describing probability distributions, both on the real line and on linearly ordered finite sets. Then I wanted to move on to normative questions (what should our output distribution be?) and floating point issues.

Sweet, sounds like an interminable debate running in circles and devolving into a flame-war, years ago. Do you have a link?

Yes! This discussion is really not the job of some programmer forum, it’s the job of IEEE. The scientific consensus has really dropped the ball on this! (which doesn’t help; we still need to pick up that ball)

This is a good point. We can completely separate both arguments:

There is a world where we exclude zero, and instead place our smallest 64-bit quantum of probability at e.g. ldexp(1.0, -120). This would be neutral on the oversampling/undersampling issue, but it would fix the Inf / NaN problem. On statistical grounds I’d still argue for placing the smallest quantum at roughly ldexp(1.0, -64), but I would be fully mollified regarding the (perceived) footgun.
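
To make that world concrete, a hypothetical sketch (not anyone's tuned proposal): a 64-bit linear grid, rounded down, with the single all-zeros bit pattern redirected.

function rand_no_zero()
    u = rand(UInt64)
    u == 0 && return ldexp(1.0, -120)   # park zero's 2^-64 probability quantum here
    f = Float64(u)                      # rounds to nearest...
    f > u && (f = prevfloat(f))         # ...so force the conversion to round down
    return ldexp(f, -64)                # in (0, 1), never exactly zero
end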

julia> function foo(N)
           h = 0
           for i = 1:N
               (reinterpret(UInt64, rand()) & 0x01) == 0 && (h += 1)
           end
           @show N, h, h/N
       end
foo (generic function with 1 method)

julia> foo(1<<30)
(N, h, h / N) = (1073741824, 805275439, 0.7499711951240897)
(1073741824, 805275439, 0.7499711951240897)

Oh no, my arguments are self-defeating :wink:

Yes! I am arguing that it should be roughly eps(y) instead of eps(1) – we’re generating floating point not fixed point numbers, and therefore we should try to offer floating point accuracy, not fixed point accuracy.

Absolutely. However, I’d argue that it is reasonable for people who understand in detail how their specific singular nonlinear function f maps floating point numbers to expect that this understanding translates to the output of rand(). And with the “ideal” distribution, it would translate perfectly!

The way the float-rand() works now, every user has to start anew with rand(UInt64) :frowning:

Is this an empirical claim that programmers understand this, open to refutation-by-bug-report? Is this a claim that it is impossible to fix this specific class of bugs on the language level, open to refutation by a proposed fix?

Or are we missing each others points?

I mean, a person who samples a Pareto distribution via (1 - rand(Float32))^-2 cannot be helped by the language, that’s a computer literacy problem. A person who samples the same via rand(Float32)^-2 failed to read the source code of rand(Float32) and could be helped by changing the source code (not even the documentation).

If you do figure this out, can you post the modifications so that we can try alternatives? The annoying part of plugging this into the Simd infra is that we need to change all the pointer loop stuff. This is because the Float32 simd code consumes only 32 bits of rand for a Float32 (the sequential code consumes 64 bits from the random stream and throws away half of it, so there is no perf or infrastructure impact from that).

Please also consider my proposal. I have not yet analyzed its output distribution; but it should vectorize well on AVX2 machines, while your proposal needs AVX512. (we also need somebody with an Apple device)

I still think that analyzing output distributions or performance of proposed compromise algorithms is premature – we’re still in the process of building consensus on the proper language and benchmarks for describing the desiderata and quality of hypothetical distributions.

If we don’t even have consensus on the yardsticks, then it is too early to run around with a measuring tape, or even to heatedly argue whether a specific piece of wood is good enough to build a house.

Here’s the magic to jam @mikmoore’s latest Float32 N=32 version into XoshiroSimd, using a very naive copy-the-code_llvm approach:

Here’s the complete repl session:

▶ ./julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.11.0-DEV.595 (2023-10-03)
 _/ |\__'_|_|_|\__'_|  |  Commit b790cf8f0c* (26 days old master)
|__/                   |

julia> using Random, BenchmarkTools

julia> A = Array{Float32}(undef, 2^24);

julia> @benchmark rand!(A)
BenchmarkTools.Trial: 949 samples with 1 evaluation.
 Range (min … max):  5.200 ms …  6.274 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.240 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.268 ms ± 83.680 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▃█▅▃▁▂
  ▄▅▅███████▆▇▆▆▆▇▅▆▆▆▄▆▅▇▆▆▅▄▄▅▁▅▄▄▅▄▁▄▄▁▁▄▄▄▁▄▁▁▁▁▄▁▄▁▁▄▁▄ ▇
  5.2 ms       Histogram: log(frequency) by time      5.7 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @eval Random.XoshiroSimd begin
       @inline _bits2float(x::UInt64, ::Type{Float32}) = _f32bits(x)

       @inline function _f32bits(u::UInt32)
               # map UIntXX linearly into FloatXX on range [0,1)
               U = UInt32
               F = Base.floattype(U)
               bitwidth = 8 * sizeof(U) # bits in type U
               lomask = typemax(U) >>> (bitwidth ÷ 2) # mask of low bits of u
               scale = F(exp2(-bitwidth))
               lo = F(u &  lomask) * scale # must scale here for UInt16, because Float16(typemax(UInt16)) overflows
               hi = F(u & ~lomask) * scale # must scale here for UInt16, because Float16(typemax(UInt16)) overflows
               f0 = lo + hi # full-width value
               efflo = f0 - hi # get effective value of lo used
               roundedup = efflo > lo # determine whether f0 was rounded upward
               fu = reinterpret(U, f0)
               fu -= roundedup # if we rounded up, decrement to round down instead
               return fu
       end
       @inline function _f32bits(x::UInt64)
           ui = (x>>>32) % UInt32
           li = x % UInt32
           return (UInt64(_f32bits(ui)) << 32) | UInt64(_f32bits(li))
       end
       @inline _f32bits(x::VecElement{UInt64}) = _f32bits(x.value)
       end
       for N in [4,8,16]
           VT = :(NTuple{$N, VecElement{UInt64}})
           code = """
               %a = bitcast <$N x i64> %0 to <$(2N) x i32>
               %lowmask = shufflevector <1 x i32> <i32 65535>, <1 x i32> undef, <$(2N) x i32> zeroinitializer
               %b = and <$(2N) x i32> %a, %lowmask
               %c = uitofp <$(2N) x i32> %b to <$(2N) x float>
               %scale = shufflevector <1 x float> <float 0x3DF0000000000000>, <1 x float> undef, <$(2N) x i32> zeroinitializer
               %d = fmul <$(2N) x float> %c, %scale
               %himask = shufflevector <1 x i32> <i32 -65536>, <1 x i32> undef, <$(2N) x i32> zeroinitializer
               %e = and <$(2N) x i32> %a, %himask
               %f = uitofp <$(2N) x i32> %e to <$(2N) x float>
               %g = fmul <$(2N) x float> %f, %scale
               %h = fadd <$(2N) x float> %d, %g
               %i = fsub <$(2N) x float> %h, %g
               %j = fcmp olt <$(2N) x float> %d, %i
               %k = bitcast <$(2N) x float> %h to <$(2N) x i32>
               %l = sext <$(2N) x i1> %j to <$(2N) x i32>
               %m = add <$(2N) x i32> %l, %k
               %n = bitcast <$(2N) x i32> %m to <$N x i64>
               ret <$N x i64> %n
           """
           @eval Random.XoshiroSimd @inline _bits2float(x::$VT, ::Type{Float32}) = llvmcall($code, $VT, Tuple{$VT}, x)
       end

julia> @benchmark rand!(A)
BenchmarkTools.Trial: 665 samples with 1 evaluation.
 Range (min … max):  7.472 ms …   9.067 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.492 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.523 ms ± 140.133 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▇▄▁
  █████▇█▇▆▆▄▅▆▅▅▁▅▅▅▄▁▄▆▁▁▄▁▅▁▄▄▄▄▁▄▁▁▁▁▄▄▄▄▁▅▁▁▁▁▁▁▁▁▁▄▁▄▄▅ ▇
  7.47 ms      Histogram: log(frequency) by time      8.14 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

4 Likes