Empirical distribution type for continuous variables

juliohm · September 1, 2017, 11:34pm

Is there a functionality somewhere in StatsBase.jl or Distributions.jl for computing statistics from an empirical distribution type? Specifically, suppose I have samples in the real line, and that I want to compute the quantile at another location. An empirical distribution type would interpolate the inverse CDF between the sample points.

StatsBase.jl provides ecdf for empirical CDF, would it be interesting to have a distribution type on top of it? Please let me know if something is already available.

andreasnoack · September 2, 2017, 1:29am

Distributions.EmpiricalUnivariateDistribution might be what you are looking for.

andreasnoack · September 2, 2017, 1:38am

Took a closer look at the EmpiricalUnivariateDistribution and I might be a little careful with that distribution. There seems to be a couple of issues with it.

juliohm · September 2, 2017, 2:30am

@andreasnoack on the spot, exactly what I need. What are the issues that you are seeing with it?

juliohm · September 2, 2017, 2:39am

The quantile is not doing interpolation, that is the issue you meant, right? It is just picking the closest point in the samples.

juliohm · September 2, 2017, 2:41am

Also, the rand is incorrect, it is not doing the inverse sampling, I will submit a PR.

juliohm · September 2, 2017, 3:55pm

https://github.com/JuliaStats/Distributions.jl/pull/662

I have changed the EmpiricalUnivariteDistribution to be discrete instead of continuous because it relies on ecdf from StatsBase.jl, which doesn’t perform interpolation. Please let me know if something is missing or incorrect.

What do you think of defining two empirical distribution types? One for continuous and one for discrete variables. The former would use some interpolation model (e.g. piecewise linear) to get the CDF and inverse CDF. Please let me know if that would be interesting, I need this functionality in my GeoStats.jl package, but I will implement it in Distributions.jl if it is useful to others.

andreasnoack · September 2, 2017, 5:37pm

I agree that the ecdf based empirical distribution is a discrete distribution. However, I don’t think interpolating the ecdf is the better way to produce a continuous version. I think it would be better to define distributions based on density estimates, i.e. Histogram and KDE.

juliohm · September 2, 2017, 5:47pm

That is a very nice idea @andreasnoack, could you please add this functionality with KDEs? It sounds great.

I am particularly interested in the quantile method for empirical distributions, I need to transform samples from a distribution type A to a distribution type B. The way I do it is as follows:

"""
    transform(x, dist)

Transform the empirical distribution of samples `x` into `dist`.
"""
function transform(x::Vector, dist::ContinuousUnivariateDistribution)
    idx = sortperm(x)
    
    N = length(x)
    xcdf = [1:N-1; .99N] / N
   
    quantile(dist, xcdf[idx])
end

I can currently use this function with parametric distributions like Normal(), Gamma(), etc. But because we don’t have a EmpiricalContinuousDistribution yet, I cannot transform back. As you can see, the transformation only relies on the quantile.

taqtiqa-mark · April 28, 2019, 8:53am

For others who come along this path…
Distributions.EmpiricalUnivariateDistribution has been removed (Dec 2018).

It appears that Distributions.DiscreteNonParametric might be closest available, with a caveat.

sylvaticus · November 13, 2019, 12:57pm

Where you get documentation for the Distributions.DiscreteNonParametric ?

Tamas_Papp · November 13, 2019, 1:01pm

It should have a docstring, eg

https://github.com/JuliaStats/Distributions.jl/blob/f0cf26385ae47ab411a6e99d34cf49f2c651e440/src/univariate/discrete/discretenonparametric.jl#L1

sylvaticus · November 13, 2019, 1:45pm

Thank you. I was looking in the package documentation instead of using the Julia integrated documentation system… my error…

If anyone is interested… how to create the distribution object from a sample of data:

using StatsBase, StatsPlots, Distributions
# Creating original samples...
N = Normal(0,1)
obs = rand(N,10000)
# Cdf..
myCdf = ecdf(obs)
cdf_obs = map(x -> myCdf(x), sort(obs))
# Cdf --> Pmf..
pmf_obs = Float64[]
for i in 1:length(obs)
    if i == 1 
        append!(pmf_obs,cdf_obs[1])
    else
        append!(pmf_obs,cdf_obs[i]-cdf_obs[i-1])
    end
end
# Creating a full distribution object..
d = DiscreteNonParametric(sort(obs),pmf_obs)
# Sampling from the distribution object..
out = rand(d,100)
# Test it...
plot(sort(out))
quantile(d,0.5)
quantile(d,1-0.025)
cdf(d,0)

Tamas_Papp · November 13, 2019, 2:49pm

It seems to be documented in the generated docs, too:

https://juliastats.org/Distributions.jl/latest/univariate/#Distributions.DiscreteNonParametric

hdavid16 · August 11, 2020, 5:49pm

thanks for sharing @sylvaticus. I’ve cleaned it up a bit and wrapped it in a function in case any body needs it:

using StatsBase, Distributions
"""
    EmpiricalDistribution(data::Vector{T} where T <: Real)

Create a discrete empirical distribution based on observations.
"""
function EmpiricalDistribution(data::Vector{T} where T <: Real)
    sort!(data) #sort the observations
    empirical_cdf = ecdf(data) #create empirical cdf
    data_clean = unique(data) #remove duplicates to avoid allunique error
    cdf_data = empirical_cdf.(data_clean) #apply ecdf to data
    pmf_data = vcat(cdf_data[1],diff(cdf_data)) #create pmf from the cdf
    DiscreteNonParametric(data_clean,pmf_data) #define distribution
end

oschulz · May 1, 2021, 7:37pm

Maybe we could add something like that to EmpiricalDistributions.jl? Currently, it only has binned distributions, but it doesn’t have to stay limited to that. EmpiricalDistributions could be moved into the JuliaStatistics org.

hdavid16 · May 1, 2021, 8:13pm

Hi @oschulz , if you are referring to the previous post, that is what I did in the PR: Add empirical discrete distribution by hdavid16 · Pull Request #16 · oschulz/EmpiricalDistributions.jl · GitHub

oschulz · May 2, 2021, 12:48am

Ah, right, sorry, I forgot - you showed that DiscreteNonParametric in Distributions might do the trick already (but should support more functions).

e3c6 · February 14, 2023, 2:48pm

Sorry to revive an old topic. I was looking for a way to get an approximate interpolation of an empirical histogram, as a continuous density (PDF) that I can evaluate on new points.

Was a replacement of Distributions.EmpiricalUnivariateDistribution added later on to Distributions? Or is EmpiricalDistributions.UvBinnedDist the current recommended way to do this?

Unfortunately EmpiricalDistributions.UvBinnedDist does not do any kind of smoothing. Is there another method I can use that does some smoothed interpolation?

Thanks.

Topic		Replies	Views
Using a (normalized) Histogram as a Distribution General Usage	29	3324	September 3, 2019
Interface for a distribution defined by its cdf Statistics distributions	16	920	October 26, 2023
How does one use empirical distributions? Statistics	1	855	December 6, 2018
Discrete empirical distribution General Usage	3	295	May 14, 2023
Distributions.jl structure Statistics distributions	5	1304	May 5, 2020

Empirical distribution type for continuous variables

Related topics