Interface for a distribution defined by its cdf

lrnv · October 25, 2023, 6:40pm

Hi,

I do have an issue where I need to sample from a distribution that is defined by its cumulative distribution function:

function unknown_cdf(x)
    if x < 0
        return 0
    else
        # do stuff...
    end
end

The only information that i have is that the random variabl’e X is non-negative, but it could be Continuous, discrete, or a mix of the two.

I want to plug this distribution through Distributions.jl’s interface, and in particular sample from it. Is there an already-existing simple way to do that ?

juliohm · October 25, 2023, 6:49pm

You mean empirical CDFs?

juliohm · October 25, 2023, 6:50pm

We provide a Distributions.jl-compatible version in our TableTransforms.jl package here:

github.com

JuliaML/TableTransforms.jl/blob/2b628ad7f8485dcf2a7768309062083e5788a35b/src/distributions.jl

# ------------------------------------------------------------------
# Licensed under the MIT License. See LICENSE in the project root.
# ------------------------------------------------------------------

"""
    EmpiricalDistribution(values)

An empirical distribution holding continuous values.
"""
struct EmpiricalDistribution{T<:Real} <: ContinuousUnivariateDistribution
  values::Vector{T}

  function EmpiricalDistribution{T}(values) where {T<:Real}
    @assert !isempty(values) "values must be provided"
    new(sort(values))
  end
end

EmpiricalDistribution(values) = EmpiricalDistribution{eltype(values)}(values)

This file has been truncated. show original

lrnv · October 25, 2023, 6:52pm

Well, no, this is not what i mean. I do not have access to a sample, but only to the (black-box) cdf function.

Edit: What i need is more of a numerical inversion method to sample from the cdf. This should not be too hard to do, but reimplementing the Distributions.jl interface on top of itwould take some time and i was wandering if someone already did it.

juliohm · October 25, 2023, 6:59pm

I believe that is what the quantile implementation above does in TableTransforms.jl? You can sample p in [0,1] and then call quantile(d, p) to get a value from the cdf.

EDIT: now I see that you don’t have samples.

lrnv · October 25, 2023, 7:02pm

Pardon me, but i think you are mistaking: what you propose goes from a dataset to the distribution through a --rather standard-- estimator that is the empirical distribution function.

What i want is to sample from a distribution defined by its cdf, this is completely different.

As a MWE, my goal is to sample from the following CDF (which is the cdf of an Exponential(1) distribution, but we are not supposed to know that):

F(x) = (1 - exp(-x)) * (x > 0)

juliohm · October 25, 2023, 7:05pm

I don’t know of any specific method in this case. Can only think of the naive approach where you sample p in [0,1] and then solve the root problem F(x) = p with some simple optimization package, assuming that you know the support of the function to some extent.

lrnv · October 25, 2023, 7:06pm

Yes this is what i want to do, I was asking if there was a package that does it “correctly”, or if i am to write a version of it myself.

The support is non-negative reals.

nsajko · October 25, 2023, 8:36pm

I don’t know whether the functionality is implemented in some package, but this is the relevant Wikipedia page, perhaps it makes your search easier:

Inverse transform sampling (also known as inversion sampling, the inverse probability integral transform, the inverse transformation method, Smirnov transform, or the golden rule) is a basic method for pseudo-random number sampling, i.e., for generating sample numbers at random from any probability distribution given its cumulative distribution function.

Dan · October 25, 2023, 8:57pm

The following seems to work for sampling the CDF:

using Roots

F(x) = (1 - exp(-x)) * (x > 0)

To sample a single value:

julia> t = rand(); find_zero(x->F(x)-t, (0.0, 1e8))
1.4652071325835407

To sample many values and plot them:

julia> using UnicodePlots

julia> S = [(t = rand(); find_zero(x->F(x)-t, (0.0, 1e8))) for _ in 1:1000];

julia> histogram(S)
              ┌                                        ┐ 
   [0.0, 0.5) ┤███████████████████████████████████  383  
   [0.5, 1.0) ┤█████████████████████▋ 237                
   [1.0, 1.5) ┤██████████████▌ 158                       
   [1.5, 2.0) ┤███████▊ 86                               
   [2.0, 2.5) ┤████▊ 52                                  
   [2.5, 3.0) ┤███▍ 37                                   
   [3.0, 3.5) ┤█▌ 17                                     
   [3.5, 4.0) ┤█▎ 12                                     
   [4.0, 4.5) ┤▊ 9                                       
   [4.5, 5.0) ┤▍ 4                                       
   [5.0, 5.5) ┤▎ 2                                       
   [5.5, 6.0) ┤▎ 2                                       
   [6.0, 6.5) ┤▎ 1                                       
              └                                        ┘ 
                               Frequency

find_zero uses bisection appropriately, but needs an initial interval to kick it off.

As for the discrete bits, the tolerance of find_zero is 0.0 which would make it catch those discrete jumps perhaps (didn’t check).

lrnv · October 25, 2023, 9:21pm

For discrete cases, there are jumps in the cdf. Say from 0.4 to 0.6 directly, so if the wanted value is 0.5 then find_zeros might not be able to find a zero. I will investigate a bit more to find a suitable solution for discrete cases.

Thanks btw

lrnv · October 26, 2023, 8:06am

So I ended up with this interface:


import Distributions
import Roots
import StatsBase
import Plots

struct FromCDF{TF} <: Distributions.ContinuousUnivariateDistribution
    F::TF
end
function Distributions.rand(rng::Distributions.AbstractRNG, d::FromCDF)
    u = rand(rng)
    Roots.find_zero(x -> d.F(x) - u, 1.0)
end

# An easy case:

F(x) = (1 - exp(-x)) * (x > 0)
X = FromCDF(F)
x = rand(X,10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F)

# A more involved one: 
F(x) = 1*(x > 1) # Dirac(1)
X = FromCDF(F)
x = rand(X,10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F)

# A more involved one: 
F(x) = (x > 0.5)/2 + (x > 1.5)/2 # Dirac(0.5) + Dirac(1.5) distribution
X = FromCDF(F)
x = rand(X,10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F)

If the first example works perfectly, the second and third do not, which is problematic…

There is still two issues:

I need a working solution for discrete or even mixed cases.
This method does not cache computations and thus is very very slow to sample say 100000 random variables. Since we have the random variable object X, we could store more things in it (like the steps of the root finding algorithm) to avoid starting everything over at every sample…

There is still something to be done here…

Dan · October 26, 2023, 8:34am

Did you try using an interval (like (0.2,0.5)) as an initial value for find_zero? It might improve the convergence.
Keeping a vector of memoized values, can allow an initial simple search to zoom into a smaller interval. The maximum size of such a cache should be decided. There is a bit of cleverness in the cache eviction strategy: we would like it to be uniformly distributed along the support of the univariate CDF. It might also be considered to keep cache in a tree data-structure.

lrnv · October 26, 2023, 8:35am

I was just discovering that the (0,Inf) interval did the trick indeed :

import Distributions
import Roots
import StatsBase
import Plots
struct FromCDF{TF} <: Distributions.ContinuousUnivariateDistribution
    F::TF
end
function Distributions.rand(rng::Distributions.AbstractRNG, d::FromCDF)
    u = rand(rng)
    Roots.find_zero(x -> (d.F(x) - u), (0, Inf))
end

# An easy case:

F1(x) = (1 - exp(-x)) * (x > 0)
X = FromCDF(F1)
x = rand(X,10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F1)

# A more involved one: 
F2(x) = 1*(x >= 2) # Dirac(1)
X = FromCDF(F2)
x = rand(X,10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 4)
Plots.plot!(F2)

# A more involved one: 
F3(x) = Distributions.cdf(Distributions.Binomial(10,0.7),x)
X = FromCDF(F3)
x = rand(X,10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F3)


# Final try: 
F4(x) = (F1(x)+F2(x)+F3(x))/3
X = FromCDF(F4)
x = rand(X,10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F4)

At least this is working and giving correct results, yay

The cache in a tree-data structure might be the right way to go indeed, but I do not know how to do it yet.

nsajko · October 26, 2023, 8:41am

I guess you could try using the ApproxFun package or my FindMinimaxPolynomial package to approximate the quantile function with a polynomial. So you would presumably find the polynomial approximation before constructing FromCDF and use it in the rand method.

Personally I don’t have any experience with ApproxFun, but I suppose it’d be a less laborious solution than FindMinimaxPolynomial. To use FindMinimaxPolynomial, you’d presumably want to do domain splitting on your support, so split it into a number of subintervals and then find a different polynomial for each subinterval.

lrnv · October 26, 2023, 8:43am

Due to the fact that the functions (and its inverse) might not be continuous (and discontinuities must be found correctly to sample correctly), approximating with a polynomials does not seems like a good idea…

lrnv · October 26, 2023, 8:48am

If I understand correctly, our call to Roots.find_zero(x -> (d.F(x) - u), (0, Inf)) is doing a bisection as defined there : https://github.com/JuliaMath/Roots.jl/blob/master/src/Bracketing/bisection.jl

So I think the right way would be to somehow rewrite this bisection to keep its history and reuse it / complete it on each sample, in a tree-like format.

Topic		Replies	Views
Sampler for arbitrary univariate distribution? General Usage	8	614	April 23, 2022
Empirical distribution type for continuous variables Statistics question , proposal	18	4849	February 14, 2023
User-defined distribution function example Statistics question	14	4888	March 21, 2019
Inverse transform sampling (discrete distributions sampling)? Performance question	4	849	March 1, 2021
Using a (normalized) Histogram as a Distribution General Usage	29	3328	September 3, 2019

Interface for a distribution defined by its cdf

Related topics