I do have an issue where I need to sample from a distribution that is defined by its cumulative distribution function:

function unknown_cdf(x)
if x < 0
return 0
else
# do stuff...
end
end

The only information that i have is that the random variablβe X is non-negative, but it could be Continuous, discrete, or a mix of the two.

I want to plug this distribution through Distributions.jlβs interface, and in particular sample from it. Is there an already-existing simple way to do that ?

Well, no, this is not what i mean. I do not have access to a sample, but only to the (black-box) cdf function.

Edit: What i need is more of a numerical inversion method to sample from the cdf. This should not be too hard to do, but reimplementing the Distributions.jl interface on top of itwould take some time and i was wandering if someone already did it.

I believe that is what the quantile implementation above does in TableTransforms.jl? You can sample p in [0,1] and then call quantile(d, p) to get a value from the cdf.

Pardon me, but i think you are mistaking: what you propose goes from a dataset to the distribution through a --rather standard-- estimator that is the empirical distribution function.

What i want is to sample from a distribution defined by its cdf, this is completely different.

As a MWE, my goal is to sample from the following CDF (which is the cdf of an Exponential(1) distribution, but we are not supposed to know that):

I donβt know of any specific method in this case. Can only think of the naive approach where you sample p in [0,1] and then solve the root problem F(x) = p with some simple optimization package, assuming that you know the support of the function to some extent.

I donβt know whether the functionality is implemented in some package, but this is the relevant Wikipedia page, perhaps it makes your search easier:

Inverse transform sampling (also known as inversion sampling, the inverse probability integral transform, the inverse transformation method, Smirnov transform, or the golden rule) is a basic method for pseudo-random number sampling, i.e., for generating sample numbers at random from any probability distribution given its cumulative distribution function.

For discrete cases, there are jumps in the cdf. Say from 0.4 to 0.6 directly, so if the wanted value is 0.5 then find_zeros might not be able to find a zero. I will investigate a bit more to find a suitable solution for discrete cases.

import Distributions
import Roots
import StatsBase
import Plots
struct FromCDF{TF} <: Distributions.ContinuousUnivariateDistribution
F::TF
end
function Distributions.rand(rng::Distributions.AbstractRNG, d::FromCDF)
u = rand(rng)
Roots.find_zero(x -> d.F(x) - u, 1.0)
end
# An easy case:
F(x) = (1 - exp(-x)) * (x > 0)
X = FromCDF(F)
x = rand(X,10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F)
# A more involved one:
F(x) = 1*(x > 1) # Dirac(1)
X = FromCDF(F)
x = rand(X,10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F)
# A more involved one:
F(x) = (x > 0.5)/2 + (x > 1.5)/2 # Dirac(0.5) + Dirac(1.5) distribution
X = FromCDF(F)
x = rand(X,10000)
Plots.plot(t -> StatsBase.ecdf(x)(t), 0, 10)
Plots.plot!(F)

If the first example works perfectly, the second and third do not, which is problematicβ¦

There is still two issues:

I need a working solution for discrete or even mixed cases.

This method does not cache computations and thus is very very slow to sample say 100000 random variables. Since we have the random variable object X, we could store more things in it (like the steps of the root finding algorithm) to avoid starting everything over at every sampleβ¦

Did you try using an interval (like (0.2,0.5)) as an initial value for find_zero? It might improve the convergence.

Keeping a vector of memoized values, can allow an initial simple search to zoom into a smaller interval. The maximum size of such a cache should be decided. There is a bit of cleverness in the cache eviction strategy: we would like it to be uniformly distributed along the support of the univariate CDF. It might also be considered to keep cache in a tree data-structure.

I guess you could try using the ApproxFun package or my FindMinimaxPolynomial package to approximate the quantile function with a polynomial. So you would presumably find the polynomial approximation before constructing FromCDF and use it in the rand method.

Personally I donβt have any experience with ApproxFun, but I suppose itβd be a less laborious solution than FindMinimaxPolynomial. To use FindMinimaxPolynomial, youβd presumably want to do domain splitting on your support, so split it into a number of subintervals and then find a different polynomial for each subinterval.

Due to the fact that the functions (and its inverse) might not be continuous (and discontinuities must be found correctly to sample correctly), approximating with a polynomials does not seems like a good ideaβ¦

So I think the right way would be to somehow rewrite this bisection to keep its history and reuse it / complete it on each sample, in a tree-like format.