Using a (normalized) Histogram as a Distribution

oschulz · August 18, 2019, 4:02pm

Do we have a wrapper somewhere that turns a StatsBase.Histogram into a Distributions.Distribution?

tbeason · August 18, 2019, 7:00pm

Sampling from a Histogram is basically just randomly sampling from the underlying data, no? Why not do that?

jkbest2 · August 18, 2019, 7:01pm

I’ve not used either, but StatsBase contains an ecdf (empirical CDF) function, and there is also EmpiricalCDFs.jl.

jlapeyre · August 19, 2019, 1:43am

After reviewing EmpiricalCDFs.jl, I think the current implementation of random sampling is unnecessarily complicated. I think tbeason is correct.

Sampling from a Histogram is basically just randomly sampling from the underlying data, no?

EDIT: I just did a benchmark and, complicated or no, the current implementation is 10% faster in one test than the straightforward and naive way: data[rand(1:length(data))].

But, the OP may have other reasons for wanting to use the interface in Distributions.

DiscreteNonParametric does not do what the OP asks for.

pistacliffcho · August 19, 2019, 4:09am

A good reason for not doing that is if you have a lot of data initially and you have a large number of ties in your data. Then a simple empirical distribution will greatly reduce what needs to saved and can greatly speed up various operations on the distribution.

Tamas_Papp · August 19, 2019, 4:51am

Before thinking about the implementation, you should tackle the conceptual question of what you want to do exactly, why you would want to go through a histogram, and what it means to sample from one. Eg one could imagine a histogram as a mixture distribution of uniforms (for each bin). Is this what you need?

Nonparametric methods all have various trade-offs you should consider, ie you get some nice theoretical properties when you can assume some regularity/smoothness about the distribution. Bootstrap (treating the sample as the distribution) assumes little but one can usually do better.

bernhard · August 19, 2019, 4:51am

Thank you. I was looking for the ecdf function in the distribution package, but could not find it (as it is in StatsBase). I must have misinterpreted the meaning of DiscreteNonParametric

oschulz · August 19, 2019, 7:59am

Before thinking about the implementation, you should tackle the conceptual question of what you want to do exactly

What I want to do is using a histogram as a prior, in a context where priors are represented by Distribution. So I need a histogram with a Distribution API, mostly for logpdf and rand.

why you would want to go through a histogram

As for why I’m using a histogram, it’s because in some cases that’s just the prior information we have. Of course having the original information, before histogramming, may be preferable - alas, that’s not always readily available.

and what it means to sample from one.

Well, in this case it need to sample to draw starting values for MCMC chains according to my prior (histogram). But in general, interpreting a histogram as an ecdf and sampling from it - while maybe lacking in statistical purity - is quite a common use case, right?

one could imagine a histogram as a mixture distribution of uniforms (for each bin). Is this what you need?

Basically, yes. It’s not hard to implement, of course - I was just curious if we already had that available in some package.

oschulz · August 19, 2019, 8:04am

Sampling from a Histogram is basically just randomly sampling from the underlying data, no? Why not do that?

I usually encounter two reasons:

The underlying data is not available
The underlying data is very large (GB to TB) . Using a decently fine-binned histogram is basically just as precise, but comes with a very significant gain in performance.

Tamas_Papp · August 19, 2019, 8:08am

First, there is no reason to go through the ECDF, you can sample from a histogram directly (once you decide how you interpret it, eg a mixture of uniforms).

Second, statistical interpretation (not necessarily “purity”) is key here. If the procedure is ad hoc, it can of course be useful in some contexts, but I would not expect a readily available implementation, nor would I consider it a common use case. YMMV.

oschulz · August 19, 2019, 9:18am

First, there is no reason to go through the ECDF, you can sample from a histogram directly

Yes, that was my intention. Sorry, my wording was a bit sloppy, of course a histogram isn’t really an ECDF. In any case, in many use cases I’ll probably have 2D (or sometimes even 3D) histograms, so probably best not to talk about this in term of CDFs.

Second, statistical interpretation (not necessarily “purity”) is key here.

Well, the statistical interpretation would be that the histogram represents a step-wise approximation to the pdf of the underlying (prior) distribution. Though I agree that in some cases it may not be that simple - a step-wise function wouldn’t be very nice for HMC, for example, so maybe one my need to smoothe/approximate the histogram. That would be purely empirical, of course, but that doesn’t mean it wouldn’t work.

Tamas_Papp · August 19, 2019, 9:22am

Perhaps it would be better if, instead of focusing on this specific question, you explained the underlying problem.

I don’t fully get the context here: why do you need to use data as a prior, is the issue Bayesian inference on large data (there are methods for that), what the model is, what kind of MCMC you want to use, etc.

oschulz · August 19, 2019, 9:47am

I think the discussion has veered a bit off-topic - maybe I wasn’t specific enough about what I was looking for in the first place. So, using the interpretation of a histogramm as a mixture distribution of uniforms (for each bin), as suggested by @Tamas_Papp: Do we have a Distributions compatible wrapper somewhere that does it already? If not, I’ll make one, I just wanted to avoid duplicating existing work.

oschulz · August 19, 2019, 9:59am

The “data” may, e.g., be the marginalized posterior of a prior analysis, to be combined with new measurement data in a knowledge update process. You can of course also make it part of the likelihood (depending on your definition of prior and likelihood in the specific case). Or sometimes you do get non-marginalized MCMC output as a data source (e.g. cosmic parameter MC chains based on cosmic microwave background sets from Planck) - call that your prior or data - that you want to combine with other data, and one may decide to use histograms of the raw MC chains for performance reasons (e.g. to fit into CPU cache).

Tamas_Papp · August 19, 2019, 11:37am

FWIW, I am not aware of any. But this should not be that complicated, eg

using Distributions, StatsBase, IterTools, LinearAlgebra

function make_mixture(h)
    MixtureModel([Uniform(e...) for e in partition(h.edges[1], 2, 1)],
                 normalize(h.weights, 1))
end

should do it,

julia> h = fit(Histogram, randn(1000))
Histogram{Int64,1,Tuple{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}
edges:
  -4.0:1.0:4.0
weights: [1, 17, 145, 346, 330, 143, 17, 1]
closed: left
isdensity: false

julia> m = make_mixture(h)
MixtureModel{Uniform{Float64}}(K = 8)
components[1] (prior = 0.0010): Uniform{Float64}(a=-4.0, b=-3.0)
components[2] (prior = 0.0170): Uniform{Float64}(a=-3.0, b=-2.0)
components[3] (prior = 0.1450): Uniform{Float64}(a=-2.0, b=-1.0)
components[4] (prior = 0.3460): Uniform{Float64}(a=-1.0, b=0.0)
components[5] (prior = 0.3300): Uniform{Float64}(a=0.0, b=1.0)
components[6] (prior = 0.1430): Uniform{Float64}(a=1.0, b=2.0)
components[7] (prior = 0.0170): Uniform{Float64}(a=2.0, b=3.0)
components[8] (prior = 0.0010): Uniform{Float64}(a=3.0, b=4.0)


julia> mean(m)
-0.010999999999999937

for univariate and it should be easy to extend to multivariate.

oschulz · August 19, 2019, 3:03pm

FWIW, I am not aware of any.

Good to know, thanks, Tamas!

But this should not be that complicated, eg […]

Yep, something like that. I’ll code up something a bit more customized than using MixtureModel (logpdf would be too slow), but sure, it’s pretty straightforward.

oschulz · August 31, 2019, 8:37pm

In case someone else is interested in this: Our Phd student @lmh91 implemented this now as part of BAT.jl: https://github.com/bat/BAT.jl/blob/master/src/distributions/HistogramAsUvDistribution.jl .

We’ll let this mature and extend it further (n-dim histograms as multivariate distributions), then we’ll see if we can find a home for it in a more general package than BAT.jl.

juliohm · August 31, 2019, 10:57pm

Taking this opportunity to brainstorm empirical CDFs in the Julia Stats ecosystem.

There are multiple implementations:

StatsBase.jl: ecdf
EmpiricalCDFs.jl: EmpiricalCDF
GeoStatsBase.jl: EmpiricalDistribution

Would you like to join efforts to create a single high-performance, well-tested, widely-available implementation? Would it make sense to have it in Distributions.jl?

oschulz · September 1, 2019, 6:58am

I’ve been thinking whether these things (ECDDF’s, histograms as distributions, etc.) could be combined in a new package “EmpiricalDistributions” or so.

oschulz · September 1, 2019, 6:59am

I think it would be nice if ECDFs would implement the Distributions API, in general (as far as possible).

Topic		Replies	Views
Empirical distribution type for continuous variables Statistics question , proposal	18	4845	February 14, 2023
Random variables in Julia (working list) Statistics distributions	36	6105	November 27, 2022
Interface for a distribution defined by its cdf Statistics distributions	16	920	October 26, 2023
Questions about contributing to Distributions.jl Statistics question	41	2712	June 19, 2020
Use a Histogram as a function General Usage	4	2185	October 21, 2019

Using a (normalized) Histogram as a Distribution

Related topics