Do we have a wrapper somewhere that turns a StatsBase.Histogram
into a Distributions.Distribution?
Sampling from a Histogram is basically just randomly sampling from the underlying data, no? Why not do that?
I’ve not used either, but StatsBase
contains an ecdf
(empirical CDF) function, and there is also EmpiricalCDFs.jl.
After reviewing EmpiricalCDFs.jl
, I think the current implementation of random sampling is unnecessarily complicated. I think tbeason is correct.
Sampling from a Histogram is basically just randomly sampling from the underlying data, no?
EDIT: I just did a benchmark and, complicated or no, the current implementation is 10% faster in one test than the straightforward and naive way: data[rand(1:length(data))]
.
But, the OP may have other reasons for wanting to use the interface in Distributions
.
DiscreteNonParametric
does not do what the OP asks for.
A good reason for not doing that is if you have a lot of data initially and you have a large number of ties in your data. Then a simple empirical distribution will greatly reduce what needs to saved and can greatly speed up various operations on the distribution.
Before thinking about the implementation, you should tackle the conceptual question of what you want to do exactly, why you would want to go through a histogram, and what it means to sample from one. Eg one could imagine a histogram as a mixture distribution of uniforms (for each bin). Is this what you need?
Nonparametric methods all have various trade-offs you should consider, ie you get some nice theoretical properties when you can assume some regularity/smoothness about the distribution. Bootstrap (treating the sample as the distribution) assumes little but one can usually do better.
Thank you. I was looking for the ecdf function in the distribution package, but could not find it (as it is in StatsBase). I must have misinterpreted the meaning of DiscreteNonParametric
Before thinking about the implementation, you should tackle the conceptual question of what you want to do exactly
What I want to do is using a histogram as a prior, in a context where priors are represented by Distribution
. So I need a histogram with a Distribution
API, mostly for logpdf
and rand
.
why you would want to go through a histogram
As for why I’m using a histogram, it’s because in some cases that’s just the prior information we have. Of course having the original information, before histogramming, may be preferable - alas, that’s not always readily available.
and what it means to sample from one.
Well, in this case it need to sample to draw starting values for MCMC chains according to my prior (histogram). But in general, interpreting a histogram as an ecdf and sampling from it - while maybe lacking in statistical purity - is quite a common use case, right?
one could imagine a histogram as a mixture distribution of uniforms (for each bin). Is this what you need?
Basically, yes. It’s not hard to implement, of course - I was just curious if we already had that available in some package.
Sampling from a Histogram is basically just randomly sampling from the underlying data, no? Why not do that?
I usually encounter two reasons:
-
The underlying data is not available
-
The underlying data is very large (GB to TB) . Using a decently fine-binned histogram is basically just as precise, but comes with a very significant gain in performance.
First, there is no reason to go through the ECDF, you can sample from a histogram directly (once you decide how you interpret it, eg a mixture of uniforms).
Second, statistical interpretation (not necessarily “purity”) is key here. If the procedure is ad hoc, it can of course be useful in some contexts, but I would not expect a readily available implementation, nor would I consider it a common use case. YMMV.
First, there is no reason to go through the ECDF, you can sample from a histogram directly
Yes, that was my intention. Sorry, my wording was a bit sloppy, of course a histogram isn’t really an ECDF. In any case, in many use cases I’ll probably have 2D (or sometimes even 3D) histograms, so probably best not to talk about this in term of CDFs.
Second, statistical interpretation (not necessarily “purity”) is key here.
Well, the statistical interpretation would be that the histogram represents a step-wise approximation to the pdf of the underlying (prior) distribution. Though I agree that in some cases it may not be that simple - a step-wise function wouldn’t be very nice for HMC, for example, so maybe one my need to smoothe/approximate the histogram. That would be purely empirical, of course, but that doesn’t mean it wouldn’t work.
Perhaps it would be better if, instead of focusing on this specific question, you explained the underlying problem.
I don’t fully get the context here: why do you need to use data as a prior, is the issue Bayesian inference on large data (there are methods for that), what the model is, what kind of MCMC you want to use, etc.
I think the discussion has veered a bit off-topic - maybe I wasn’t specific enough about what I was looking for in the first place. So, using the interpretation of a histogramm as a mixture distribution of uniforms (for each bin), as suggested by @Tamas_Papp: Do we have a Distributions
compatible wrapper somewhere that does it already? If not, I’ll make one, I just wanted to avoid duplicating existing work.
The “data” may, e.g., be the marginalized posterior of a prior analysis, to be combined with new measurement data in a knowledge update process. You can of course also make it part of the likelihood (depending on your definition of prior and likelihood in the specific case). Or sometimes you do get non-marginalized MCMC output as a data source (e.g. cosmic parameter MC chains based on cosmic microwave background sets from Planck) - call that your prior or data - that you want to combine with other data, and one may decide to use histograms of the raw MC chains for performance reasons (e.g. to fit into CPU cache).
FWIW, I am not aware of any. But this should not be that complicated, eg
using Distributions, StatsBase, IterTools, LinearAlgebra
function make_mixture(h)
MixtureModel([Uniform(e...) for e in partition(h.edges[1], 2, 1)],
normalize(h.weights, 1))
end
should do it,
julia> h = fit(Histogram, randn(1000))
Histogram{Int64,1,Tuple{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}
edges:
-4.0:1.0:4.0
weights: [1, 17, 145, 346, 330, 143, 17, 1]
closed: left
isdensity: false
julia> m = make_mixture(h)
MixtureModel{Uniform{Float64}}(K = 8)
components[1] (prior = 0.0010): Uniform{Float64}(a=-4.0, b=-3.0)
components[2] (prior = 0.0170): Uniform{Float64}(a=-3.0, b=-2.0)
components[3] (prior = 0.1450): Uniform{Float64}(a=-2.0, b=-1.0)
components[4] (prior = 0.3460): Uniform{Float64}(a=-1.0, b=0.0)
components[5] (prior = 0.3300): Uniform{Float64}(a=0.0, b=1.0)
components[6] (prior = 0.1430): Uniform{Float64}(a=1.0, b=2.0)
components[7] (prior = 0.0170): Uniform{Float64}(a=2.0, b=3.0)
components[8] (prior = 0.0010): Uniform{Float64}(a=3.0, b=4.0)
julia> mean(m)
-0.010999999999999937
for univariate and it should be easy to extend to multivariate.
FWIW, I am not aware of any.
Good to know, thanks, Tamas!
But this should not be that complicated, eg […]
Yep, something like that. I’ll code up something a bit more customized than using MixtureModel
(logpdf
would be too slow), but sure, it’s pretty straightforward.
In case someone else is interested in this: Our Phd student @lmh91 implemented this now as part of BAT.jl: https://github.com/bat/BAT.jl/blob/master/src/distributions/HistogramAsUvDistribution.jl .
We’ll let this mature and extend it further (n-dim histograms as multivariate distributions), then we’ll see if we can find a home for it in a more general package than BAT.jl.
Taking this opportunity to brainstorm empirical CDFs in the Julia Stats ecosystem.
There are multiple implementations:
- StatsBase.jl:
ecdf
- EmpiricalCDFs.jl:
EmpiricalCDF
- GeoStatsBase.jl:
EmpiricalDistribution
Would you like to join efforts to create a single high-performance, well-tested, widely-available implementation? Would it make sense to have it in Distributions.jl?
I’ve been thinking whether these things (ECDDF’s, histograms as distributions, etc.) could be combined in a new package “EmpiricalDistributions” or so.
I think it would be nice if ECDFs would implement the Distributions
API, in general (as far as possible).