Interface for a distribution defined by its cdf

Hi,

I have an issue where I need to sample from a distribution that is defined only by its cumulative distribution function:

function unknown_cdf(x)
    if x < 0
        return 0
    else
        # do stuff...
    end
end

The only information I have is that the random variable X is non-negative, but it could be continuous, discrete, or a mix of the two.

I want to plug this distribution into Distributions.jl’s interface, and in particular sample from it. Is there an existing simple way to do that?


You mean empirical CDFs?

We provide a Distributions.jl-compatible version in our TableTransforms.jl package here:

Well, no, this is not what I mean. I do not have access to a sample, only to the (black-box) CDF function.

Edit: What I need is more of a numerical inversion method to sample from the CDF. This should not be too hard to do, but reimplementing the Distributions.jl interface on top of it would take some time, and I was wondering if someone had already done it.


I believe that is what the quantile implementation above does in TableTransforms.jl? You can sample p in [0,1] and then call quantile(d, p) to get a value from the cdf.

EDIT: now I see that you don’t have samples.

Pardon me, but I think you are mistaken: what you propose goes from a dataset to a distribution through a (rather standard) estimator, the empirical distribution function.

What I want is to sample from a distribution defined by its CDF, which is completely different.

As an MWE, my goal is to sample from the following CDF (which is the CDF of an Exponential(1) distribution, but we are not supposed to know that):

F(x) = (1 - exp(-x)) * (x > 0)

I don’t know of any specific method in this case. I can only think of the naive approach: sample p in [0,1] and then solve the root-finding problem F(x) = p with some root-finding package, assuming you know the support of the function to some extent.

Yes, this is what I want to do. I was asking if there is a package that does it “correctly”, or if I have to write a version of it myself.

The support is non-negative reals.

I don’t know whether the functionality is implemented in some package, but this is the relevant Wikipedia page, perhaps it makes your search easier:

Inverse transform sampling (also known as inversion sampling, the inverse probability integral transform, the inverse transformation method, Smirnov transform, or the golden rule) is a basic method for pseudo-random number sampling, i.e., for generating sample numbers at random from any probability distribution given its cumulative distribution function.
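To illustrate the method described there, here is a minimal sketch in Julia, using the analytically known quantile of Exponential(1) as a stand-in (in the black-box setting of this thread, that inverse would of course have to be computed numerically):

```julia
using Random, Statistics

# Quantile (inverse CDF) of Exponential(1): solves 1 - exp(-x) = u for x.
Finv(u) = -log(1 - u)

# Inverse transform sampling: U ~ Uniform(0,1)  =>  Finv(U) ~ Exponential(1).
rng = Random.MersenneTwister(2023)
xs = [Finv(rand(rng)) for _ in 1:100_000]

mean(xs)   # close to 1, the mean of Exponential(1)
```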


The following seems to work for sampling the CDF:

using Roots

F(x) = (1 - exp(-x)) * (x > 0)

To sample a single value:

julia> t = rand(); find_zero(x->F(x)-t, (0.0, 1e8))
1.4652071325835407

To sample many values and plot them:

julia> using UnicodePlots

julia> S = [(t = rand(); find_zero(x->F(x)-t, (0.0, 1e8))) for _ in 1:1000];

julia> histogram(S)
              β”Œ                                        ┐ 
   [0.0, 0.5) β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  383  
   [0.5, 1.0) β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ 237                
   [1.0, 1.5) β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ 158                       
   [1.5, 2.0) β”€β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š 86                               
   [2.0, 2.5) β”€β–ˆβ–ˆβ–ˆβ–ˆβ–Š 52                                  
   [2.5, 3.0) β”€β–ˆβ–ˆβ–ˆβ– 37                                   
   [3.0, 3.5) β”€β–ˆβ–Œ 17                                     
   [3.5, 4.0) β”€β–ˆβ–Ž 12                                     
   [4.0, 4.5) β”€β–Š 9                                       
   [4.5, 5.0) ─▍ 4                                       
   [5.0, 5.5) β”€β–Ž 2                                       
   [5.5, 6.0) β”€β–Ž 2                                       
   [6.0, 6.5) β”€β–Ž 1                                       
              β””                                        β”˜ 
                               Frequency                 

find_zero uses bisection appropriately, but needs an initial interval to kick it off.

As for the discrete bits, the default tolerance of find_zero is 0.0, which might make it land on those discrete jumps (I didn’t check).


For discrete cases, there are jumps in the CDF, say from 0.4 directly to 0.6; if the wanted value is 0.5, then find_zero might not be able to find a zero, since F(x) - 0.5 never vanishes. I will investigate a bit more to find a suitable solution for discrete cases.
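One way around the jump problem is to bisect on the predicate F(x) >= u instead of solving F(x) - u = 0: this converges to the generalized inverse F⁻(u) = inf{x : F(x) >= u} even when u falls inside a jump. A hypothetical sketch (the function name is made up, and the support is assumed to be the non-negative reals):

```julia
# Bisect on the predicate F(x) >= u rather than root-finding on F(x) - u,
# which has no root when u lies inside a jump of F.
function generalized_quantile(F, u; hi = 1.0)
    while F(hi) < u          # grow the upper bracket; F(x) -> 1 as x -> Inf
        hi *= 2
    end
    lo = 0.0                 # the support is the non-negative reals
    while hi - lo > 1e-12 * max(1.0, hi)
        mid = (lo + hi) / 2
        F(mid) >= u ? (hi = mid) : (lo = mid)
    end
    return hi
end

F(x) = (x >= 0.5)/2 + (x >= 1.5)/2   # two atoms, at 0.5 and 1.5
generalized_quantile(F, 0.3)          # ≈ 0.5 (first atom)
generalized_quantile(F, 0.7)          # ≈ 1.5 (second atom)
```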

Thanks btw :slight_smile:


So I ended up with this interface:


import Distributions
import Roots
import StatsBase
import Plots

struct FromCDF{TF} <: Distributions.ContinuousUnivariateDistribution
    F::TF
end
function Distributions.rand(rng::Distributions.AbstractRNG, d::FromCDF)
    u = rand(rng)
    Roots.find_zero(x -> d.F(x) - u, 1.0)
end

# An easy case:

F(x) = (1 - exp(-x)) * (x > 0)
X = FromCDF(F)
x = rand(X,10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F)

# A more involved one: 
F(x) = 1*(x > 1) # Dirac(1)
X = FromCDF(F)
x = rand(X,10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F)

# A more involved one: 
F(x) = (x > 0.5)/2 + (x > 1.5)/2 # mixture of Dirac(0.5) and Dirac(1.5)
X = FromCDF(F)
x = rand(X,10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F)

While the first example works perfectly, the second and third do not, which is problematic…

There are still two issues:

  • I need a working solution for discrete or even mixed cases.
  • This method does not cache computations and is thus very slow when sampling, say, 100,000 random variables. Since we have the random variable object X, we could store more things in it (like the steps of the root-finding algorithm) to avoid starting over from scratch at every sample…

There is still something to be done here…

  • Did you try using an interval (like (0.2,0.5)) as an initial value for find_zero? It might improve the convergence.
  • Keeping a vector of memoized values can allow an initial simple search to zoom into a smaller interval. The maximum size of such a cache should be decided, and there is a bit of cleverness needed in the cache-eviction strategy: we would like the cached points to be roughly uniformly distributed along the support of the univariate CDF. Keeping the cache in a tree data structure might also be considered.
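The memoization idea could be sketched as follows (all names here are hypothetical, cache eviction is omitted, and the support is assumed to be the non-negative reals): every (x, F(x)) evaluation is stored in sorted vectors and used to pre-narrow the bisection bracket for the next draw.

```julia
# Hypothetical sketch (CachedQuantile is not an existing package type).
# Every evaluation (x, F(x)) is stored in sorted vectors; since F is a CDF,
# Fs is nondecreasing whenever xs is sorted, so binary search on Fs gives
# the tightest known bracket around the quantile before bisection starts.
struct CachedQuantile{TF}
    F::TF
    xs::Vector{Float64}   # sorted evaluation points
    Fs::Vector{Float64}   # F.(xs), nondecreasing
end
CachedQuantile(F) = CachedQuantile(F, Float64[], Float64[])

function evaluate!(c::CachedQuantile, x)
    Fx = c.F(x)
    i = searchsortedfirst(c.xs, x)
    insert!(c.xs, i, x)   # keep the cache sorted (no eviction here)
    insert!(c.Fs, i, Fx)
    return Fx
end

function quantile!(c::CachedQuantile, u; tol = 1e-10)
    # Pre-narrow the bracket using cached evaluations.
    i = searchsortedfirst(c.Fs, u)
    lo = i == 1 ? 0.0 : c.xs[i-1]                    # support starts at 0
    hi = i > length(c.xs) ? max(2 * lo, 1.0) : c.xs[i]
    while evaluate!(c, hi) < u                       # grow bracket if needed
        lo, hi = hi, 2 * hi
    end
    # Bisect on the predicate F(x) >= u (robust to jumps in F).
    while hi - lo > tol * max(1.0, hi)
        mid = (lo + hi) / 2
        evaluate!(c, mid) >= u ? (hi = mid) : (lo = mid)
    end
    return hi
end

F(x) = x > 0 ? 1 - exp(-x) : 0.0   # Exponential(1) CDF
c = CachedQuantile(F)
quantile!(c, 0.5)                   # ≈ log(2)
```

Later draws reuse all earlier evaluations, so repeated quantile! calls start from ever-tighter brackets; a real implementation would bound the cache size.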

I just discovered that the (0, Inf) interval does the trick indeed:

import Distributions
import Roots
import StatsBase
import Plots
struct FromCDF{TF} <: Distributions.ContinuousUnivariateDistribution
    F::TF
end
function Distributions.rand(rng::Distributions.AbstractRNG, d::FromCDF)
    u = rand(rng)
    Roots.find_zero(x -> (d.F(x) - u), (0, Inf))
end

# An easy case:

F1(x) = (1 - exp(-x)) * (x > 0)
X = FromCDF(F1)
x = rand(X,10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F1)

# A more involved one: 
F2(x) = 1*(x >= 2) # Dirac(2)
X = FromCDF(F2)
x = rand(X,10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 4)
Plots.plot!(F2)

# A more involved one: 
F3(x) = Distributions.cdf(Distributions.Binomial(10,0.7),x)
X = FromCDF(F3)
x = rand(X,10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F3)


# Final try: 
F4(x) = (F1(x)+F2(x)+F3(x))/3
X = FromCDF(F4)
x = rand(X,10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F4)

At least this is working and giving correct results, yay :smiley:

The cache in a tree data structure might be the right way to go indeed, but I do not know how to do it yet.


I guess you could try using the ApproxFun package or my FindMinimaxPolynomial package to approximate the quantile function with a polynomial. So you would presumably find the polynomial approximation before constructing FromCDF and use it in the rand method.

Personally I don’t have any experience with ApproxFun, but I suppose it’d be a less laborious solution than FindMinimaxPolynomial. To use FindMinimaxPolynomial, you’d presumably want to do domain splitting on your support, so split it into a number of subintervals and then find a different polynomial for each subinterval.

Since the function (and its inverse) might not be continuous (and the discontinuities must be located correctly to sample correctly), approximating with polynomials does not seem like a good idea…


If I understand correctly, our call to Roots.find_zero(x -> (d.F(x) - u), (0, Inf)) is doing a bisection as defined here: https://github.com/JuliaMath/Roots.jl/blob/master/src/Bracketing/bisection.jl

So I think the right way would be to somehow rewrite this bisection to keep its history and reuse it / complete it on each sample, in a tree-like format.
