Empirical distribution type for continuous variables

question
proposal

#1

Is there functionality somewhere in StatsBase.jl or Distributions.jl for computing statistics from an empirical distribution type? Specifically, suppose I have samples on the real line and want to compute the quantile at a probability level that falls between the sample points. An empirical distribution type would interpolate the inverse CDF between the sample points.

StatsBase.jl provides ecdf for the empirical CDF. Would it be interesting to have a distribution type on top of it? Please let me know if something is already available.
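
For reference, here is a minimal sketch of the two behaviors being contrasted, assuming a recent Julia with StatsBase.jl installed. The sample values are made up for illustration. ecdf returns a step function, while quantile from the Statistics standard library interpolates linearly between order statistics by default.

```julia
using Statistics   # quantile on a sample vector (linear interpolation by default)
using StatsBase    # ecdf

x = [1.0, 2.0, 4.0, 8.0]   # illustrative samples

F = ecdf(x)                # step function: fraction of samples <= t
F(3.0)                     # 0.5, two of the four samples are <= 3

quantile(x, 0.5)           # 3.0, halfway between the order statistics 2.0 and 4.0
```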


#2

Distributions.EmpiricalUnivariateDistribution might be what you are looking for.


#3

I took a closer look at EmpiricalUnivariateDistribution, and I would be a little careful with that distribution. There seem to be a couple of issues with it.


#4

@andreasnoack on the spot, exactly what I need. What are the issues that you are seeing with it?


#5

The quantile is not doing interpolation; that is the issue you meant, right? It just picks the closest point in the samples.


#6

Also, rand is incorrect: it is not doing inverse transform sampling. I will submit a PR.
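
As a rough illustration of what inverse transform sampling means for a discrete empirical distribution, here is a minimal sketch using only Base and Random. The names empirical_quantile and empirical_rand are hypothetical and this is not the code from Distributions.jl.

```julia
using Random

# Hypothetical helper: smallest sample whose empirical CDF value is >= p.
# `sorted` must be sorted in increasing order.
function empirical_quantile(sorted::AbstractVector, p::Real)
    n = length(sorted)
    k = clamp(ceil(Int, p * n), 1, n)   # clamp handles p == 0
    return sorted[k]
end

# Hypothetical helper: draw u ~ Uniform(0, 1) and push it through the inverse CDF.
empirical_rand(rng::AbstractRNG, sorted::AbstractVector) =
    empirical_quantile(sorted, rand(rng))

sorted = sort([3.0, 1.0, 2.0, 5.0])   # illustrative samples
empirical_quantile(sorted, 0.5)       # 2.0
```

Each sample is returned with probability 1/N, which is what sampling from the step-function ecdf should give.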


#7

https://github.com/JuliaStats/Distributions.jl/pull/662

I have changed the EmpiricalUnivariateDistribution to be discrete instead of continuous because it relies on ecdf from StatsBase.jl, which doesn’t perform interpolation. Please let me know if something is missing or incorrect.

What do you think of defining two empirical distribution types, one for continuous and one for discrete variables? The former would use some interpolation model (e.g. piecewise linear) to get the CDF and inverse CDF. Please let me know if that would be interesting. I need this functionality in my GeoStats.jl package, but I will implement it in Distributions.jl if it is useful to others.
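
To make the piecewise linear idea concrete, here is a minimal sketch. The type PiecewiseLinearEmpirical and the functions inverse_cdf and linear_cdf are hypothetical names, not part of Distributions.jl. The sketch assumes at least two samples and assigns CDF level (i - 1) / (n - 1) to the i-th order statistic, which is only one of several possible conventions.

```julia
# Hypothetical type, not part of Distributions.jl: stores the sorted samples.
struct PiecewiseLinearEmpirical
    xs::Vector{Float64}
    function PiecewiseLinearEmpirical(x::AbstractVector{<:Real})
        length(x) >= 2 || throw(ArgumentError("need at least two samples"))
        new(sort(Float64.(x)))
    end
end

# Inverse CDF: linear interpolation between neighbouring order statistics.
function inverse_cdf(d::PiecewiseLinearEmpirical, p::Real)
    0 <= p <= 1 || throw(DomainError(p, "p must lie in [0, 1]"))
    n = length(d.xs)
    h = p * (n - 1) + 1                    # fractional index into the sorted samples
    lo = clamp(floor(Int, h), 1, n - 1)
    t = h - lo
    return d.xs[lo] + t * (d.xs[lo + 1] - d.xs[lo])
end

# CDF: the same interpolation read in the other direction; 0 or 1 outside the sample range.
function linear_cdf(d::PiecewiseLinearEmpirical, x::Real)
    xs, n = d.xs, length(d.xs)
    x <= xs[1] && return 0.0
    x >= xs[end] && return 1.0
    hi = searchsortedfirst(xs, x)          # first index with xs[hi] >= x
    lo = hi - 1                            # xs[lo] < x, so the interval has positive width
    t = (x - xs[lo]) / (xs[hi] - xs[lo])
    return (lo - 1 + t) / (n - 1)
end

d = PiecewiseLinearEmpirical([1.0, 2.0, 4.0, 8.0])   # illustrative samples
inverse_cdf(d, 0.5)    # 3.0
linear_cdf(d, 3.0)     # 0.5
```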


#8

I agree that the ecdf-based empirical distribution is a discrete distribution. However, I don’t think interpolating the ecdf is the best way to produce a continuous version. I think it would be better to define distributions based on density estimates, i.e. Histogram and KDE.
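
As a rough sketch of the KDE route, a Gaussian kernel density estimate with bandwidth h is an equal-weight mixture of Normal(xi, h) components, so its CDF is the mean of the component CDFs and its quantile can be found by bisection. The helper names kde_cdf and kde_quantile are hypothetical, the bandwidth is passed in by the caller, and the bracketing interval assumes p is not extremely close to 0 or 1.

```julia
using Distributions, Statistics

# Hypothetical helper: CDF of a Gaussian KDE with bandwidth h, evaluated at t.
kde_cdf(x::AbstractVector, h::Real, t::Real) =
    mean(cdf(Normal(xi, h), t) for xi in x)

# Hypothetical helper: invert the strictly increasing CDF by bisection.
function kde_quantile(x::AbstractVector, h::Real, p::Real; tol=1e-8)
    0 < p < 1 || throw(DomainError(p, "p must lie in (0, 1)"))
    # Bracket: ten bandwidths beyond the extreme samples (assumed wide enough for p).
    lo, hi = minimum(x) - 10h, maximum(x) + 10h
    while hi - lo > tol
        mid = (lo + hi) / 2
        if kde_cdf(x, h, mid) < p
            lo = mid
        else
            hi = mid
        end
    end
    return (lo + hi) / 2
end

x = [-1.0, 0.0, 1.0]      # illustrative samples, symmetric about zero
kde_cdf(x, 0.5, 0.0)      # 0.5 by symmetry
kde_quantile(x, 0.5, 0.5) # close to 0.0
```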


#9

That is a very nice idea @andreasnoack, could you please add this functionality with KDEs? It sounds great.

I am particularly interested in the quantile method for empirical distributions because I need to transform samples from a distribution type A to a distribution type B. The way I do it is as follows:

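using Distributions

"""
    transform(x, dist)

Transform the empirical distribution of samples `x` into `dist`.
"""
function transform(x::Vector, dist::ContinuousUnivariateDistribution)
    # ranks[i] is the position of x[i] in sorted order
    ranks = invperm(sortperm(x))

    # empirical CDF levels; the last level is 0.99 instead of 1 so that
    # the quantile stays finite for distributions with unbounded support
    N = length(x)
    xcdf = [1:N-1; .99N] / N

    # each sample gets the quantile at the CDF level of its own rank
    quantile.(dist, xcdf[ranks])
end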

I can currently use this function with parametric distributions like Normal(), Gamma(), etc. But because we don’t have an EmpiricalContinuousDistribution yet, I cannot transform back. As you can see, the transformation only relies on the quantile.
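
Until such a type exists, one possible stopgap for the back-transformation is the sample quantile from the Statistics standard library, which interpolates linearly between order statistics by default. This is only a sketch, and backtransform is a hypothetical name: it takes values y assumed to follow dist and maps them onto the empirical distribution of the samples x.

```julia
using Distributions, Statistics

# Hypothetical helper: map values `y`, assumed to follow `dist`, onto the
# empirical distribution of the samples `x` via the interpolated sample quantile.
backtransform(y::AbstractVector, dist::ContinuousUnivariateDistribution, x::AbstractVector) =
    quantile(x, cdf.(dist, y))

x = [1.0, 2.0, 4.0, 8.0]            # illustrative samples
backtransform([0.0], Normal(), x)   # [3.0], since cdf(Normal(), 0.0) == 0.5
```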