Empirical distribution type for continuous variables

Is there a functionality somewhere in StatsBase.jl or Distributions.jl for computing statistics from an empirical distribution type? Specifically, suppose I have samples in the real line, and that I want to compute the quantile at another location. An empirical distribution type would interpolate the inverse CDF between the sample points.

StatsBase.jl provides ecdf for empirical CDF, would it be interesting to have a distribution type on top of it? Please let me know if something is already available.

Distributions.EmpiricalUnivariateDistribution might be what you are looking for.

2 Likes

Took a closer look at the EmpiricalUnivariateDistribution and I might be a little careful with that distribution. There seems to be a couple of issues with it.

1 Like

@andreasnoack on the spot, exactly what I need. What are the issues that you are seeing with it?

The quantile is not doing interpolation, that is the issue you meant, right? It is just picking the closest point in the samples.

Also, the rand is incorrect, it is not doing the inverse sampling, I will submit a PR.

3 Likes

https://github.com/JuliaStats/Distributions.jl/pull/662

I have changed the EmpiricalUnivariteDistribution to be discrete instead of continuous because it relies on ecdf from StatsBase.jl, which doesn’t perform interpolation. Please let me know if something is missing or incorrect.

What do you think of defining two empirical distribution types? One for continuous and one for discrete variables. The former would use some interpolation model (e.g. piecewise linear) to get the CDF and inverse CDF. Please let me know if that would be interesting, I need this functionality in my GeoStats.jl package, but I will implement it in Distributions.jl if it is useful to others.

I agree that the ecdf based empirical distribution is a discrete distribution. However, I don’t think interpolating the ecdf is the better way to produce a continuous version. I think it would be better to define distributions based on density estimates, i.e. Histogram and KDE.

1 Like

That is a very nice idea @andreasnoack, could you please add this functionality with KDEs? It sounds great.

I am particularly interested in the quantile method for empirical distributions, I need to transform samples from a distribution type A to a distribution type B. The way I do it is as follows:

"""
    transform(x, dist)

Transform the empirical distribution of samples `x` into `dist`.
"""
function transform(x::Vector, dist::ContinuousUnivariateDistribution)
    idx = sortperm(x)
    
    N = length(x)
    xcdf = [1:N-1; .99N] / N
   
    quantile(dist, xcdf[idx])
end

I can currently use this function with parametric distributions like Normal(), Gamma(), etc. But because we don’t have a EmpiricalContinuousDistribution yet, I cannot transform back. As you can see, the transformation only relies on the quantile.

For others who come along this path…
Distributions.EmpiricalUnivariateDistribution has been removed (Dec 2018).

It appears that Distributions.DiscreteNonParametric might be closest available, with a caveat.

Where you get documentation for the Distributions.DiscreteNonParametric ?

It should have a docstring, eg

https://github.com/JuliaStats/Distributions.jl/blob/f0cf26385ae47ab411a6e99d34cf49f2c651e440/src/univariate/discrete/discretenonparametric.jl#L1

Thank you. I was looking in the package documentation instead of using the Julia integrated documentation system… my error…

If anyone is interested… how to create the distribution object from a sample of data:

using StatsBase, StatsPlots, Distributions
# Creating original samples...
N = Normal(0,1)
obs = rand(N,10000)
# Cdf..
myCdf = ecdf(obs)
cdf_obs = map(x -> myCdf(x), sort(obs))
# Cdf --> Pmf..
pmf_obs = Float64[]
for i in 1:length(obs)
    if i == 1 
        append!(pmf_obs,cdf_obs[1])
    else
        append!(pmf_obs,cdf_obs[i]-cdf_obs[i-1])
    end
end
# Creating a full distribution object..
d = DiscreteNonParametric(sort(obs),pmf_obs)
# Sampling from the distribution object..
out = rand(d,100)
# Test it...
plot(sort(out))
quantile(d,0.5)
quantile(d,1-0.025)
cdf(d,0)
2 Likes

It seems to be documented in the generated docs, too:

https://juliastats.org/Distributions.jl/latest/univariate/#Distributions.DiscreteNonParametric

thanks for sharing @sylvaticus. I’ve cleaned it up a bit and wrapped it in a function in case any body needs it:

using StatsBase, Distributions
"""
    EmpiricalDistribution(data::Vector{T} where T <: Real)

Create a discrete empirical distribution based on observations.
"""
function EmpiricalDistribution(data::Vector{T} where T <: Real)
    sort!(data) #sort the observations
    empirical_cdf = ecdf(data) #create empirical cdf
    data_clean = unique(data) #remove duplicates to avoid allunique error
    cdf_data = empirical_cdf.(data_clean) #apply ecdf to data
    pmf_data = vcat(cdf_data[1],diff(cdf_data)) #create pmf from the cdf
    DiscreteNonParametric(data_clean,pmf_data) #define distribution
end
4 Likes

Maybe we could add something like that to EmpiricalDistributions.jl? Currently, it only has binned distributions, but it doesn’t have to stay limited to that. EmpiricalDistributions could be moved into the JuliaStatistics org.

Hi @oschulz , if you are referring to the previous post, that is what I did in the PR: Add empirical discrete distribution by hdavid16 · Pull Request #16 · oschulz/EmpiricalDistributions.jl · GitHub

Ah, right, sorry, I forgot - you showed that DiscreteNonParametric in Distributions might do the trick already (but should support more functions).

Sorry to revive an old topic. I was looking for a way to get an approximate interpolation of an empirical histogram, as a continuous density (PDF) that I can evaluate on new points.

Was a replacement of Distributions.EmpiricalUnivariateDistribution added later on to Distributions? Or is EmpiricalDistributions.UvBinnedDist the current recommended way to do this?

Unfortunately EmpiricalDistributions.UvBinnedDist does not do any kind of smoothing. Is there another method I can use that does some smoothed interpolation?

Thanks.