Interface for a distribution defined by its cdf

Hi,

I have an issue where I need to sample from a distribution that is defined by its cumulative distribution function (CDF):

```julia
function unknown_cdf(x)
    if x < 0
        return 0
    else
        # do stuff...
    end
end
```

The only information I have is that the random variable X is non-negative, but it could be continuous, discrete, or a mixture of the two.

I want to plug this distribution into Distributions.jl's interface, and in particular sample from it. Is there an existing, simple way to do that?


You mean empirical CDFs?

We provide a Distributions.jl-compatible version in our TableTransforms.jl package here:

Well, no, that is not what I mean. I do not have access to a sample, only to the (black-box) CDF function.

Edit: what I need is more of a numerical inversion method to sample from the CDF. This should not be too hard to do, but reimplementing the Distributions.jl interface on top of it would take some time, and I was wondering if someone had already done it.


I believe that is what the `quantile` implementation above does in TableTransforms.jl? You can sample `p` in `[0, 1]` and then call `quantile(d, p)` to get a value from the CDF.

EDIT: now I see that you don't have samples.

Pardon me, but I think you are mistaken: what you propose goes from a dataset to the distribution through a (rather standard) estimator, the empirical distribution function.

What I want is to sample from a distribution defined by its CDF, which is completely different.

As a MWE, my goal is to sample from the following CDF (which is the cdf of an `Exponential(1)` distribution, but we are not supposed to know that):

```julia
F(x) = (1 - exp(-x)) * (x > 0)
```

I donβt know of any specific method in this case. Can only think of the naive approach where you sample `p in [0,1]` and then solve the root problem `F(x) = p` with some simple optimization package, assuming that you know the support of the function to some extent.

Yes, this is what I want to do. I was asking whether there is a package that does it "correctly", or if I should write a version of it myself.

The support is the non-negative reals.

I donβt know whether the functionality is implemented in some package, but this is the relevant Wikipedia page, perhaps it makes your search easier:

Inverse transform sampling (also known as inversion sampling, the inverse probability integral transform, the inverse transformation method, Smirnov transform, or the golden rule) is a basic method for pseudo-random number sampling, i.e., for generating sample numbers at random from any probability distribution given its cumulative distribution function.
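For the `Exponential(1)` example from earlier in the thread, the inverse CDF is available in closed form, so inverse transform sampling is a one-liner. A minimal sketch (the names `Finv` and `samples` are my own):

```julia
# Inverse transform sampling: if U ~ Uniform(0, 1) and F is a continuous,
# strictly increasing CDF, then F⁻¹(U) is distributed with CDF F.
# For F(x) = 1 - exp(-x) on x > 0, the inverse is F⁻¹(u) = -log(1 - u).
Finv(u) = -log(1 - u)

samples = [Finv(rand()) for _ in 1:100_000]
sum(samples) / length(samples)  # should be close to 1, the mean of Exponential(1)
```

The numerical root-finding discussed in this thread is exactly this procedure, for the case where `F⁻¹` has no closed form.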


The following seems to work for sampling from the CDF:

```julia
using Roots

F(x) = (1 - exp(-x)) * (x > 0)
```

To sample a single value:

```julia
julia> t = rand(); find_zero(x->F(x)-t, (0.0, 1e8))
1.4652071325835407
```

To sample many values and plot them:

```julia
julia> using UnicodePlots

julia> S = [(t = rand(); find_zero(x->F(x)-t, (0.0, 1e8))) for _ in 1:1000];

julia> histogram(S)
           ┌                                        ┐
[0.0, 0.5) ┤███████████████████████████████████  383
[0.5, 1.0) ┤██████████████████████ 237
[1.0, 1.5) ┤███████████████ 158
[1.5, 2.0) ┤████████ 86
[2.0, 2.5) ┤█████ 52
[2.5, 3.0) ┤████ 37
[3.0, 3.5) ┤██ 17
[3.5, 4.0) ┤██ 12
[4.0, 4.5) ┤█ 9
[4.5, 5.0) ┤█ 4
[5.0, 5.5) ┤█ 2
[5.5, 6.0) ┤█ 2
[6.0, 6.5) ┤█ 1
           └                                        ┘
                            Frequency
```

`find_zero` uses bisection appropriately, but needs an initial interval to kick it off.

As for the discrete bits, the tolerance of `find_zero` is `0.0`, which would perhaps make it catch those discrete jumps (I didn't check).


For discrete cases there are jumps in the CDF, say from 0.4 directly to 0.6, so if the target value is 0.5 then `find_zero` might not be able to find a zero. I will investigate a bit more to find a suitable solution for discrete cases.
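A standard workaround for jumps is the generalized inverse, F⁻¹(u) = inf{x : F(x) ≥ u}, which is well defined even at discontinuities. A minimal sketch (the function name, the default bracket `[0, 1e8]`, and the tolerance are my own choices, not from Roots.jl):

```julia
# Generalized inverse: smallest x with F(x) >= u, found by bisection.
# This works for CDFs with jumps because we bisect on the monotone
# predicate F(x) >= u rather than looking for an exact root of F(x) - u.
function generalized_quantile(F, u; lo=0.0, hi=1e8, tol=1e-12)
    while hi - lo > tol
        mid = (lo + hi) / 2
        if F(mid) >= u
            hi = mid   # mid satisfies the predicate: shrink from above
        else
            lo = mid   # mid fails: the quantile is to the right
        end
    end
    return hi
end

# Dirac(1): the CDF jumps from 0 to 1 at x = 1, so every u in (0, 1] maps to 1.
F_dirac(x) = float(x >= 1)
generalized_quantile(F_dirac, 0.5)  # returns ≈ 1.0
```

Sampling then reduces to drawing `u = rand()` and returning `generalized_quantile(F, u)`, with no special-casing for the discrete parts.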

Thanks btw


So I ended up with this interface:

```julia
import Distributions
import Random
import Roots
import StatsBase
import Plots

struct FromCDF{TF} <: Distributions.ContinuousUnivariateDistribution
    F::TF
end

function Distributions.rand(rng::Random.AbstractRNG, d::FromCDF)
    u = rand(rng)
    Roots.find_zero(x -> d.F(x) - u, 1.0)
end

# An easy case:
F(x) = (1 - exp(-x)) * (x > 0)
X = FromCDF(F)
x = rand(X, 10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F)

# A more involved one: Dirac(1)
F(x) = 1.0 * (x > 1)
X = FromCDF(F)
x = rand(X, 10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F)

# A more involved one: equal mixture of Dirac(0.5) and Dirac(1.5)
F(x) = (x > 0.5) / 2 + (x > 1.5) / 2
X = FromCDF(F)
x = rand(X, 10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F)
```

While the first example works perfectly, the second and third do not, which is problematic…

There are still two issues:

• I need a working solution for discrete or even mixed cases.
• This method does not cache computations and is thus very slow when sampling, say, 100000 random variables. Since we have the random-variable object `X`, we could store more state in it (like the steps of the root-finding algorithm) to avoid starting over from scratch at every sample…

There is still something to be done here…

• Did you try using an interval (like `(0.2, 0.5)`) as the initial value for `find_zero`? It might improve convergence.
• Keeping a vector of memoized values can allow an initial coarse search to zoom into a smaller interval. The maximum size of such a cache needs to be decided, and there is some cleverness required in the eviction strategy: we would like the cached points to be roughly uniformly distributed along the support of the univariate CDF. One might also consider keeping the cache in a tree data structure.
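As a rough sketch of the memoization idea (all names and the cache layout are mine; a sorted vector of previously evaluated `(x, F(x))` pairs stands in for the proposed tree structure):

```julia
# Narrow the initial bracket for each quantile solve using previously
# evaluated (x, F(x)) pairs, kept sorted by x. Since F is nondecreasing,
# the cached F values are sorted too, so both sides can be binary-searched.
struct CachedCDF{TF}
    F::TF
    xs::Vector{Float64}   # cached abscissae, sorted ascending
    Fs::Vector{Float64}   # F evaluated at xs, nondecreasing
end
CachedCDF(F) = CachedCDF(F, Float64[], Float64[])

# Find a bracket [lo, hi] with F(lo) < u <= F(hi) from the cache,
# falling back to the full support when the cache gives no bound.
function bracket(c::CachedCDF, u; hi0=1e8)
    i = searchsortedfirst(c.Fs, u)           # first cached value with F(x) >= u
    lo = i > 1 ? c.xs[i-1] : 0.0
    hi = i <= length(c.xs) ? c.xs[i] : hi0
    return lo, hi
end

# Bisect inside the (possibly much smaller) bracket, caching new evaluations.
function sample_quantile(c::CachedCDF, u; tol=1e-10)
    lo, hi = bracket(c, u)
    while hi - lo > tol
        mid = (lo + hi) / 2
        Fm = c.F(mid)
        j = searchsortedfirst(c.xs, mid)     # keep the cache sorted on insert
        insert!(c.xs, j, mid); insert!(c.Fs, j, Fm)
        Fm >= u ? (hi = mid) : (lo = mid)
    end
    return hi
end

c = CachedCDF(x -> (x > 0) * (1 - exp(-x)))
xs = [sample_quantile(c, rand()) for _ in 1:1000]  # later draws reuse earlier work
```

In practice the cache would need the eviction strategy discussed above to bound its size, but this shows how earlier bisection steps shrink the brackets of later samples.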

I just discovered that the `(0, Inf)` interval does the trick indeed:

```julia
import Distributions
import Random
import Roots
import StatsBase
import Plots

struct FromCDF{TF} <: Distributions.ContinuousUnivariateDistribution
    F::TF
end

function Distributions.rand(rng::Random.AbstractRNG, d::FromCDF)
    u = rand(rng)
    Roots.find_zero(x -> d.F(x) - u, (0, Inf))
end

# An easy case:
F1(x) = (1 - exp(-x)) * (x > 0)
X = FromCDF(F1)
x = rand(X, 10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F1)

# A more involved one: Dirac(2)
F2(x) = 1.0 * (x >= 2)
X = FromCDF(F2)
x = rand(X, 10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 4)
Plots.plot!(F2)

# A more involved one: Binomial(10, 0.7)
F3(x) = Distributions.cdf(Distributions.Binomial(10, 0.7), x)
X = FromCDF(F3)
x = rand(X, 10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F3)

# Final try: an equal mixture of the three previous distributions
F4(x) = (F1(x) + F2(x) + F3(x)) / 3
X = FromCDF(F4)
x = rand(X, 10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F4)
```

At least this is working and giving correct results, yay

The cache in a tree data structure might indeed be the right way to go, but I do not know how to do it yet.


I guess you could try using the ApproxFun package or my FindMinimaxPolynomial package to approximate the quantile function with a polynomial. So you would presumably find the polynomial approximation before constructing `FromCDF` and use it in the `rand` method.

Personally I donβt have any experience with ApproxFun, but I suppose itβd be a less laborious solution than FindMinimaxPolynomial. To use FindMinimaxPolynomial, youβd presumably want to do domain splitting on your support, so split it into a number of subintervals and then find a different polynomial for each subinterval.

Since the function (and its inverse) might not be continuous, and the discontinuities must be located correctly to sample correctly, approximating with polynomials does not seem like a good idea…


If I understand correctly, our call to `Roots.find_zero(x -> d.F(x) - u, (0, Inf))` performs a bisection as defined here: https://github.com/JuliaMath/Roots.jl/blob/master/src/Bracketing/bisection.jl

So I think the right way forward would be to somehow rewrite this bisection to keep its history and reuse or extend it on each sample, in a tree-like format.
