Output distribution of rand(Float32) and rand(Float64), thread 2

Sure, but my implementation was just meant to be indicative of the type of thing to be done. Instead of literally calling rand(Float64) we could call rand(UInt64) and work with that directly, something like:

i = rand(UInt64)
ex = -32                    # scale for keeping the top 32 bits
if i < UInt64(1) << 32      # the top 32 bits are all zero
    ex -= 32                # use the low 32 bits at a finer scale instead
else
    i >>= 32                # keep only the top 32 bits
end
r = ldexp(Float32(i), ex)

The bigger point is my last two paragraphs — it’s hard to preserve such an open-open interval through any transformation, even trivial fully linear monotonic ones. And I’d say it’s impossible to give general guidance on how to do so.

Half-open intervals are much easier to preserve and I think it’s possible to describe a general algorithm to do so for a limited set of well-behaved functions f.

1 Like

The use of transformations that take numbers near 0 to large numbers is common though (specifically, the log of a probability is a really common transform, such as when calculating entropy), and the main thing there is to avoid blowing up to Inf. I mean, Inf is a lot bigger than, for example,

julia> -log(Float64(nextfloat(0.0f0)))
103.27892990343184
1 Like

Inf is not a problem if you know what it means. With our current rand(), getting -Inf from log(rand()) means you may have drawn a value in [-Inf, -36.7368005696771). With an “ideal” [0, 1) rand, that’d be [-Inf, -744.4400719213812).
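For reference, those two magic numbers are just the logs of the smallest nonzero value each generator can produce (assuming the current rand(Float64) lives on a 2^-53 grid, which is where the 36.7 comes from):

log(0x1p-53)         # ≈ -36.7368, smallest nonzero output of the status-quo rand(Float64)
log(nextfloat(0.0))  # ≈ -744.44, smallest positive (subnormal) Float64, the “ideal” limit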

The fact that I can tell you this is valuable.

5 Likes

If we have an ideal RNG you should never get Inf ever ever ever ever ever no matter how big your cluster. (Or rather, the real cause of it would be hardware failure or a bug) Right now you get it in less than a minute on a desktop. That’s the real concern.

You’re misinterpreting both me and the Inf.

1 Like

That’s been patched; it’s no longer a concern. Once it hits release, you won’t get values bigger than about 44 out of -log with the current rand(Float64). If we went all the way down into subnormal Float64s we could go as high as about 744.
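(Back-of-envelope check, assuming the patched generator’s smallest nonzero output is 2^-64; the subnormal limit is the same 744.44 from nextfloat(0.0) quoted earlier.)

-log(0x1p-64)  # ≈ 44.36, hence “about 44”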

If someone goes around randexp to write -log(rand(T)) and rand is documented to return 0.0 and that user doesn’t want +Inf, it’s bad usage on the user’s part. We can’t protect everyone from every floating point pitfall that exists.

3 Likes

I mean, maybe, but as a guy who might want to take the log of a probability coming from a “perfect Float32” generator, I don’t want it to ever return Inf, because that would mean observing something so unlikely that it’s more likely, by many orders of magnitude, that my computer will catch fire and kill me in my sleep. Inf is not something I should ever get. It’s vastly more likely that a correct implementation of rand(Float32) would return 84.3 because of bit-flip errors.

Calculating the log of a probability is one of the most common things you could do; it’s not just for generating exponentials, it’s inherent in entropy calculations and in working with posterior distributions of Bayesian models. Admittedly you don’t necessarily calculate it by this method, but my point is more about the general issue of nonlinear transforms with a singularity near 0.
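(For concreteness, a minimal illustration of where the log of a near-zero probability shows up; this entropy helper is my own, not anything from Base.)

# Shannon entropy in nats; every nonzero pᵢ contributes a log(pᵢ), so tiny
# probabilities produce large-magnitude logs, and an exact 0 would give
# 0 * -Inf = NaN without the guard.
entropy(p) = -sum(x -> iszero(x) ? zero(x) : x * log(x), p)
entropy([0.5, 0.25, 0.25])  # ≈ 1.04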

1 Like

You’re losing the forest for the trees. What should the probability that -log(rand(Float64)) == -log(0.5) be? It can’t be the statistically correct answer of 0.

3 Likes

Yeah, the float generation code assumes that it has access to a kind of random oracle producing uniform random bits (or rather, an infinite read-only extra tape filled with random bits, when viewed through the lens of Turing machines).

Most considerations should not attempt to peek into the internals and should instead be studied in such an oracle model.

You don’t get to make any choice here. Once you have decided that cdf(p) = p, this already implies that probability(rand() == p) = cdf(nextfloat(p)) - cdf(p) = nextfloat(p) - p. Luckily this coincides with the choice you wanted to make anyway!
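(A quick numeric illustration of that implication, using p = 0.5; these are the probabilities the cdf(p) = p convention dictates, not anything the current implementation does.)

# Under cdf(p) = p, the output p stands in for the real interval [p, nextfloat(p)),
# so its probability is the width of that interval:
nextfloat(0.5) - 0.5      # = 2^-53 ≈ 1.1e-16 for Float64
nextfloat(0.5f0) - 0.5f0  # = 2^-24 ≈ 6.0e-8 for Float32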

Correct for Float16. As I repeatedly said, I refuse to have any opinions on Float16.

For BFloat16, Float32 and Float64 we can either redistribute the probability mass to e.g. the smallest non-subnormal number, or we could do println("Your RNG is ill-seeded!"); exit(1), since this will only happen with a probability < ldexp(1.0, -120), which I consider negligible, and which is also considered negligible by the United States government with respect to the encryption of state secrets.
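(A minimal sketch of the first option; the name and the exact cutoff are mine, purely illustrative.)

# Fold the tiny probability mass below floatmin(T) (zero and subnormals) onto the
# smallest non-subnormal value. For these types that mass is well below 2^-120.
function rand_no_subnormals(::Type{T}) where {T<:AbstractFloat}
    x = rand(T)
    return x < floatmin(T) ? floatmin(T) : x
end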

1 Like

This isn’t the kind of question that concerns people who do floating point computations, but “how often should -sum(log.([rand(Float32) for i in 1:1000000])) return Inf?” is the kind of thing you might care about. If the rand() implementation is “perfect”, this will never happen in my lifetime, or indeed before the heat death of the universe, so I’m OK with it.
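(Rough numbers for that, assuming the status-quo rand(Float32) returns multiples of 2^-24, so P(exactly 0) = 2^-24, versus a “full-range” generator where P(0) would be at most about 2^-126.)

p_now   = 2.0^-24             # status quo: chance a single draw is exactly 0.0f0
p_ideal = 2.0^-126            # roughly the largest it could be for a full-range generator
n = 1_000_000
1 - (1 - p_now)^n             # ≈ 0.058, i.e. roughly a 6% chance per million-draw sum today
n * p_ideal                   # ≈ 1.2e-32: effectively never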

(fun fact: this thread is very nearly the longest non-locked thread of all time!)

2 Likes

Let me give a more meaningful applied example… Suppose you have a potential energy function with a singularity near 0, and someone generates millions of random particle locations on [0,1), hands them to your potential energy function, and then wants to calculate ensemble sums or averages… You wind up with this same kind of calculation, except it’s a user-defined weird function instead of log.

Once again this thread has exploded because we’re arguing past each other on descriptive vs. normative behaviors. I’m trying to describe what it truly means to perform log(rand()) with the status quo rand() and how you can actually interpret its output, and then projecting that forward to see how it might apply if we change rand() to output more values. You’re arguing about what its output should mean, but I don’t think that’s a sensible thing to do, because this isn’t a symbolic computation and you do need to be aware of the particulars. I get it: I would love to assume that any transformation of rand() is exactly equivalent to sampling from the transformed uniform distribution.

The very same arguments apply to perfect_open_open_rand() + 1. Theoretically, this should never return 2, but even with a theoretically ideal (0, 1) rand, it will with great regularity!
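(A concrete instance of that, no randomness needed: the largest value an open-open Float64 rand could return already rounds up to 2 when you add 1.)

julia> prevfloat(1.0) + 1 == 2.0  # (1 - 2^-53) + 1 lands exactly halfway and rounds to even, i.e. to 2
true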

My argument is that we need to be able to meaningfully and concretely describe why both of these things happen, beyond hand-wavy “floating point isn’t accurate” mumbo jumbo.

7 Likes

Maybe I never made it clear, but I’m 100% OK with your cdf(p) = p at-the-generated-quantities criterion, including 0.0f0, when applied to any rand(Float32) implementation that makes 0.0f0 have a probability low enough that a million-core cluster can’t observe it in at least a week of computing, and preferably closer to a year. It’s possible in Float32 to make this probability unobservable in the lifetime of the universe. But we might not have to go that far.
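(Rough arithmetic behind that claim; the per-core rate is my own guess, not a number from anyone above.)

draws = 1e6 * 1e9 * 7 * 86400  # a million cores × ~1e9 draws/s × one week ≈ 6e20 draws
draws * 2.0^-149               # expected zeros if P(0) is pushed to the Float32 floor of 2^-149: ≈ 8e-25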

This criterion makes it irrelevant for almost everyone in the universe except people with tremendous resources, who can take care of the question themselves.

I also think it would be good to put explicitly in the docs for rand(Float32) something like: “note that this generator can return 0.0f0 with probability about X, and if that’s a problem we suggest wrapping it with something like

randno0(::Type{Float32}) = (f = rand(Float32); f == 0.0f0 ? nextfloat(0.0f0) : f)

which is the smallest possible perturbation to the cdf, affecting only the one possible value 0.0f0.”

What I don’t think is ideal is a physics grad student having to deal with running his particle simulation for a weekend and being virtually guaranteed to get Inf as the ensemble average, because there’s about a 1e-9 probability of getting exactly 0.0f0 in a calculation where 0.0 makes no sense (such as a potential energy calculation) and multiple billions of random numbers will be generated (this is a weekend project for grad students these days).

Here’s a very rough concept of a SIMD-able full-representation (i.e., with values approaching floatmin(T) or nextfloat(zero(T))) implementation. Even if the concept is valid, THIS PROBABLY CONTAINS SOME ERROR. In particular, it probably doesn’t handle the bottom scale correctly. It also wastes some entropy, only observing the bottom Base.significand_bits(T)+1 bits of each generated UIntXX, but this might be faster anyway due to the resulting simplicity. The chances of re-drawing are SIMDWIDTH * 2^-(Base.significand_bits(T)+1), so even with 512-bit registers this is 2^-6 for Float16 and 2^-20 for Float32. Float16 should probably use something slightly different anyway, since at most 0 or 1 redraws are required given UInt16s (depending on whether we emit subnormals).

It does appear to SIMD except for the generation of rand(UIntXX). But we already have a solution for that somewhere…

using Random  # for Random.default_rng and Random.AbstractRNG

block_randf(t::Type{T}, n::Val) where {T} = block_randf(Random.default_rng(), t, n)
function block_randf(rng::Random.AbstractRNG, ::Type{T}, N::Val) where {T<:Base.IEEEFloat}
	U = Base.uinttype(T) # same-width Unsigned
	maxint = U(maxintfloat(T)) # largest representable integer T
	scaleincr = inv(maxintfloat(T)) # if we try to return a zero, drop by this factor and try again
	smallesteps = eps(zero(T)) # minimum spacing between values that we want to achieve
	
	scale = one(scaleincr) # initialize
	scalepow = Int(-log2(scaleincr))
	todrop = Int(-log2(smallesteps) + log2(scale))
	numscales = fld(todrop, scalepow) # how many times we can apply scaleincr while staying above smallesteps
	maxint_bottom = U(exp2(todrop - numscales*scalepow)) # at bottom scale, this number may be different

	f = ntuple(Returns(zero(T)), N) # the values we plan to return
	good = ntuple(Returns(false), N) # track which lanes are valid
	for _ in 1:numscales
		scale *= scaleincr # prepare to evaluate this scale
		u = ntuple(_->rand(rng, U), N) # TODO: need a SIMD version of this
		u = u .% maxint # mask to representable values in T
		fnew = T.(u) .* scale # lossless conversion to float at appropriate scale
		f = ifelse.(good, f, fnew) # replace bad values from previous iterations
		good = good .| .!iszero.(u) # mark values as good or bad
		all(good) && return f # all values are valid
	end
	# handle bottom scale
	u = ntuple(_->rand(rng, U), N) # TODO: need a SIMD version of this
	u = u .% maxint_bottom # mask to representable values in T
	fnew = T.(u) .* smallesteps # lossless conversion to float at appropriate scale
	f = ifelse.(good, f, fnew) # replace bad values from previous iterations
	return f
end

EDIT: improved handling of bottom range, although this still misses many representable values in the lower ranges. See following post by mbauman.

I think something like this could be performant enough for general use. It’s probably even competitive (average case) with an emulated-rounding same-width int->float conversion scheme like the one I presented above.
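If it helps, here’s how I’d quickly exercise the sketch above (nothing beyond what’s already defined there):

# Draw a pile of 8-wide blocks and sanity-check the output range.
vals = Float32[]
for _ in 1:10_000
    append!(vals, block_randf(Float32, Val(8)))
end
extrema(vals)        # should lie within [0, 1)
count(iszero, vals)  # almost certainly 0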

1 Like

That’s clever — and a promising direction — but I don’t think it’s quite doing what we want because it returns values from the Set([...; 2^-48:2^-48:2^-24-2^-48; 2^-24:2^-24:1-2^-24]) for Float32. It’s not “fully saturating” the floating point values around the beginning of each of the ranges.

It’ll also eventually round down its scale to 0 (with Float16s, anyhow)… so perhaps the while loop should just be a for loop over the maximum possible iterations, which would allow it to theoretically return 0s and terminate (which might also make the compiler happier).

But the key insight for me is that we actually can (of course) SIMD with the possibility of redraws and we can structure the probabilities such that it’s rare (even with 4x or 8x at a time).

The same concept should apply to the multiplicative UIntXX->FloatXX version you already posted, where we would only need to re-sample when we dive below the fully-sampled region of the floating point numbers. I think.
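To make that concrete, here is a rough sketch of the Float64 case (my own construction, not the implementation referenced above): a full-width multiply densely covers every float in [2^-12, 1), so we only re-sample, at a finer scale, when a draw lands below that.

using Random

# Rough sketch: Float64(u) * 2^-64 reaches every float in [2^-12, 1), so accept those
# draws directly and only “zoom in” by 2^-12 when a draw lands in the sparsely-covered
# bottom of the block. Note that the rounding in Float64(u) means a block’s top draw
# can (with probability ~2^-54) round up to the block’s upper endpoint (e.g. exactly
# 1.0); the emulated-rounding trick discussed earlier is what would fix that.
function randf64_dense(rng::Random.AbstractRNG=Random.default_rng())
    scale = 0x1p-64                  # the current block covers [0, scale * 2^64)
    while true
        u = rand(rng, UInt64)
        u >= UInt64(1) << 52 && return Float64(u) * scale
        scale *= 0x1p-12             # fell into the sparse bottom 2^-12: zoom in and redraw
        iszero(scale) && return 0.0  # only reachable with probability ~2^-1020
    end
end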

2 Likes

Indeed. I was worried about this and should have given it more thought, but I’m a touch fuzzy in the mind today. You’re right that it definitely has sparse support, and some more thought is necessary to get that right.

To your point, it appears the missing piece here is probably resampling small values where the support becomes incomplete. My mind isn’t up to the task today (I already got it wrong once), but someone else can give it a crack. I’m a tad concerned that we’ll only be extracting roughly an exponent_bits amount of dynamic range per step. This would mean a more substantial re-draw frequency, but perhaps still small enough to carry a low cost. And maybe we get more dynamic range than I’m thinking.

I was already working out the finite-number-of-loops thing (I updated the above post) so that I could handle the bottom scale properly. I’m not sure the old scheme got the 0 frequency correct, but hopefully now it’s better on that front.

2 Likes

Yeah, that’s why my mind immediately jumped to the full-bit-width versions, because then you get a dense sampling of all values greater than 2^(23-32) = 0.001953125 and 2^(52-64) = 0.000244140625. So even at 8x generation, less than a 1% chance of a re-draw per block. In other words, this shouldn’t be a meaningful hit to the performance numbers I already posted.

2 Likes