Is there a functionality somewhere in StatsBase.jl or Distributions.jl for computing statistics from an empirical distribution type? Specifically, suppose I have samples in the real line, and that I want to compute the quantile at another location. An empirical distribution type would interpolate the inverse CDF between the sample points.

StatsBase.jl provides ecdf for empirical CDF, would it be interesting to have a distribution type on top of it? Please let me know if something is already available.

Took a closer look at the EmpiricalUnivariateDistribution and I might be a little careful with that distribution. There seems to be a couple of issues with it.

I have changed the EmpiricalUnivariteDistribution to be discrete instead of continuous because it relies on ecdf from StatsBase.jl, which doesnâ€™t perform interpolation. Please let me know if something is missing or incorrect.

What do you think of defining two empirical distribution types? One for continuous and one for discrete variables. The former would use some interpolation model (e.g. piecewise linear) to get the CDF and inverse CDF. Please let me know if that would be interesting, I need this functionality in my GeoStats.jl package, but I will implement it in Distributions.jl if it is useful to others.

I agree that the ecdf based empirical distribution is a discrete distribution. However, I donâ€™t think interpolating the ecdf is the better way to produce a continuous version. I think it would be better to define distributions based on density estimates, i.e. Histogram and KDE.

That is a very nice idea @andreasnoack, could you please add this functionality with KDEs? It sounds great.

I am particularly interested in the quantile method for empirical distributions, I need to transform samples from a distribution type A to a distribution type B. The way I do it is as follows:

"""
transform(x, dist)
Transform the empirical distribution of samples `x` into `dist`.
"""
function transform(x::Vector, dist::ContinuousUnivariateDistribution)
idx = sortperm(x)
N = length(x)
xcdf = [1:N-1; .99N] / N
quantile(dist, xcdf[idx])
end

I can currently use this function with parametric distributions like Normal(), Gamma(), etc. But because we donâ€™t have a EmpiricalContinuousDistribution yet, I cannot transform back. As you can see, the transformation only relies on the quantile.

Thank you. I was looking in the package documentation instead of using the Julia integrated documentation systemâ€¦ my errorâ€¦

If anyone is interestedâ€¦ how to create the distribution object from a sample of data:

using StatsBase, StatsPlots, Distributions
# Creating original samples...
N = Normal(0,1)
obs = rand(N,10000)
# Cdf..
myCdf = ecdf(obs)
cdf_obs = map(x -> myCdf(x), sort(obs))
# Cdf --> Pmf..
pmf_obs = Float64[]
for i in 1:length(obs)
if i == 1
append!(pmf_obs,cdf_obs[1])
else
append!(pmf_obs,cdf_obs[i]-cdf_obs[i-1])
end
end
# Creating a full distribution object..
d = DiscreteNonParametric(sort(obs),pmf_obs)
# Sampling from the distribution object..
out = rand(d,100)
# Test it...
plot(sort(out))
quantile(d,0.5)
quantile(d,1-0.025)
cdf(d,0)

thanks for sharing @sylvaticus. Iâ€™ve cleaned it up a bit and wrapped it in a function in case any body needs it:

using StatsBase, Distributions
"""
EmpiricalDistribution(data::Vector{T} where T <: Real)
Create a discrete empirical distribution based on observations.
"""
function EmpiricalDistribution(data::Vector{T} where T <: Real)
sort!(data) #sort the observations
empirical_cdf = ecdf(data) #create empirical cdf
data_clean = unique(data) #remove duplicates to avoid allunique error
cdf_data = empirical_cdf.(data_clean) #apply ecdf to data
pmf_data = vcat(cdf_data[1],diff(cdf_data)) #create pmf from the cdf
DiscreteNonParametric(data_clean,pmf_data) #define distribution
end

Maybe we could add something like that to EmpiricalDistributions.jl? Currently, it only has binned distributions, but it doesnâ€™t have to stay limited to that. EmpiricalDistributions could be moved into the JuliaStatistics org.