Using a (normalized) Histogram as a Distribution

Do we have a wrapper somewhere that turns a StatsBase.Histogram into a Distributions.Distribution?

Sampling from a Histogram is basically just randomly sampling from the underlying data, no? Why not do that?

I’ve not used either, but StatsBase contains an ecdf (empirical CDF) function, and there is also EmpiricalCDFs.jl.

After reviewing EmpiricalCDFs.jl, I think the current implementation of random sampling is unnecessarily complicated. I think tbeason is correct.

Sampling from a Histogram is basically just randomly sampling from the underlying data, no?

EDIT: I just did a benchmark and, complicated or no, the current implementation is 10% faster in one test than the straightforward and naive way: data[rand(1:length(data))].

But, the OP may have other reasons for wanting to use the interface in Distributions.

DiscreteNonParametric does not do what the OP asks for.

1 Like

A good reason for not doing that is if you have a lot of data initially and you have a large number of ties in your data. Then a simple empirical distribution will greatly reduce what needs to saved and can greatly speed up various operations on the distribution.

Before thinking about the implementation, you should tackle the conceptual question of what you want to do exactly, why you would want to go through a histogram, and what it means to sample from one. Eg one could imagine a histogram as a mixture distribution of uniforms (for each bin). Is this what you need?

Nonparametric methods all have various trade-offs you should consider, ie you get some nice theoretical properties when you can assume some regularity/smoothness about the distribution. Bootstrap (treating the sample as the distribution) assumes little but one can usually do better.

1 Like

Thank you. I was looking for the ecdf function in the distribution package, but could not find it (as it is in StatsBase). I must have misinterpreted the meaning of DiscreteNonParametric

Before thinking about the implementation, you should tackle the conceptual question of what you want to do exactly

What I want to do is using a histogram as a prior, in a context where priors are represented by Distribution. So I need a histogram with a Distribution API, mostly for logpdf and rand.

why you would want to go through a histogram

As for why I’m using a histogram, it’s because in some cases that’s just the prior information we have. Of course having the original information, before histogramming, may be preferable - alas, that’s not always readily available.

and what it means to sample from one.

Well, in this case it need to sample to draw starting values for MCMC chains according to my prior (histogram). But in general, interpreting a histogram as an ecdf and sampling from it - while maybe lacking in statistical purity - is quite a common use case, right?

one could imagine a histogram as a mixture distribution of uniforms (for each bin). Is this what you need?

Basically, yes. It’s not hard to implement, of course - I was just curious if we already had that available in some package.

Sampling from a Histogram is basically just randomly sampling from the underlying data, no? Why not do that?

I usually encounter two reasons:

  • The underlying data is not available

  • The underlying data is very large (GB to TB) . Using a decently fine-binned histogram is basically just as precise, but comes with a very significant gain in performance.

First, there is no reason to go through the ECDF, you can sample from a histogram directly (once you decide how you interpret it, eg a mixture of uniforms).

Second, statistical interpretation (not necessarily “purity”) is key here. If the procedure is ad hoc, it can of course be useful in some contexts, but I would not expect a readily available implementation, nor would I consider it a common use case. YMMV.

First, there is no reason to go through the ECDF, you can sample from a histogram directly

Yes, that was my intention. Sorry, my wording was a bit sloppy, of course a histogram isn’t really an ECDF. In any case, in many use cases I’ll probably have 2D (or sometimes even 3D) histograms, so probably best not to talk about this in term of CDFs.

Second, statistical interpretation (not necessarily “purity”) is key here.

Well, the statistical interpretation would be that the histogram represents a step-wise approximation to the pdf of the underlying (prior) distribution. Though I agree that in some cases it may not be that simple - a step-wise function wouldn’t be very nice for HMC, for example, so maybe one my need to smoothe/approximate the histogram. That would be purely empirical, of course, but that doesn’t mean it wouldn’t work. :slight_smile:

Perhaps it would be better if, instead of focusing on this specific question, you explained the underlying problem.

I don’t fully get the context here: why do you need to use data as a prior, is the issue Bayesian inference on large data (there are methods for that), what the model is, what kind of MCMC you want to use, etc.

I think the discussion has veered a bit off-topic - maybe I wasn’t specific enough about what I was looking for in the first place. So, using the interpretation of a histogramm as a mixture distribution of uniforms (for each bin), as suggested by @Tamas_Papp: Do we have a Distributions compatible wrapper somewhere that does it already? If not, I’ll make one, I just wanted to avoid duplicating existing work.

The “data” may, e.g., be the marginalized posterior of a prior analysis, to be combined with new measurement data in a knowledge update process. You can of course also make it part of the likelihood (depending on your definition of prior and likelihood in the specific case). Or sometimes you do get non-marginalized MCMC output as a data source (e.g. cosmic parameter MC chains based on cosmic microwave background sets from Planck) - call that your prior or data - that you want to combine with other data, and one may decide to use histograms of the raw MC chains for performance reasons (e.g. to fit into CPU cache).

FWIW, I am not aware of any. But this should not be that complicated, eg

using Distributions, StatsBase, IterTools, LinearAlgebra

function make_mixture(h)
    MixtureModel([Uniform(e...) for e in partition(h.edges[1], 2, 1)],
                 normalize(h.weights, 1))
end

should do it,

julia> h = fit(Histogram, randn(1000))
Histogram{Int64,1,Tuple{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}
edges:
  -4.0:1.0:4.0
weights: [1, 17, 145, 346, 330, 143, 17, 1]
closed: left
isdensity: false

julia> m = make_mixture(h)
MixtureModel{Uniform{Float64}}(K = 8)
components[1] (prior = 0.0010): Uniform{Float64}(a=-4.0, b=-3.0)
components[2] (prior = 0.0170): Uniform{Float64}(a=-3.0, b=-2.0)
components[3] (prior = 0.1450): Uniform{Float64}(a=-2.0, b=-1.0)
components[4] (prior = 0.3460): Uniform{Float64}(a=-1.0, b=0.0)
components[5] (prior = 0.3300): Uniform{Float64}(a=0.0, b=1.0)
components[6] (prior = 0.1430): Uniform{Float64}(a=1.0, b=2.0)
components[7] (prior = 0.0170): Uniform{Float64}(a=2.0, b=3.0)
components[8] (prior = 0.0010): Uniform{Float64}(a=3.0, b=4.0)


julia> mean(m)
-0.010999999999999937

for univariate and it should be easy to extend to multivariate.

FWIW, I am not aware of any.

Good to know, thanks, Tamas!

But this should not be that complicated, eg […]

Yep, something like that. I’ll code up something a bit more customized than using MixtureModel (logpdf would be too slow), but sure, it’s pretty straightforward.

In case someone else is interested in this: Our Phd student @lmh91 implemented this now as part of BAT.jl: https://github.com/bat/BAT.jl/blob/master/src/distributions/HistogramAsUvDistribution.jl .

We’ll let this mature and extend it further (n-dim histograms as multivariate distributions), then we’ll see if we can find a home for it in a more general package than BAT.jl.

1 Like

Taking this opportunity to brainstorm empirical CDFs in the Julia Stats ecosystem.

There are multiple implementations:

  • StatsBase.jl: ecdf
  • EmpiricalCDFs.jl: EmpiricalCDF
  • GeoStatsBase.jl: EmpiricalDistribution

Would you like to join efforts to create a single high-performance, well-tested, widely-available implementation? Would it make sense to have it in Distributions.jl?

2 Likes

I’ve been thinking whether these things (ECDDF’s, histograms as distributions, etc.) could be combined in a new package “EmpiricalDistributions” or so.

I think it would be nice if ECDFs would implement the Distributions API, in general (as far as possible).