Using a (normalized) Histogram as a Distribution

… could be combined in a new package “EmpiricalDistributions” or so.

I think it would be nice if ECDFs implemented the Distributions API in general (as far as possible).

More precisely, I think instead of an ECDF-type we may want to have a distribution type based on observed samples. It would use an ECDF internally, but it would be a subtype of Distribution. It would only implement cdf(), though, not pdf(), at least not without some form of kernel density estimation.
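A minimal sketch of that idea, in plain Python for illustration rather than against the Distributions.jl API. The class name `EmpiricalCDFDistribution` and its methods are hypothetical; the point is only that a sample-based type can offer cdf() while declining pdf().

```python
from bisect import bisect_right


class EmpiricalCDFDistribution:
    """Hypothetical sample-based distribution: cdf() only, no pdf()."""

    def __init__(self, samples):
        self._sorted = sorted(samples)
        if not self._sorted:
            raise ValueError("need at least one sample")

    def cdf(self, x):
        # Fraction of samples <= x (a right-continuous step function).
        return bisect_right(self._sorted, x) / len(self._sorted)

    def pdf(self, x):
        # A bare ECDF has no density; that would need e.g. a KDE on top.
        raise NotImplementedError("no density without density estimation")


d = EmpiricalCDFDistribution([3.0, 1.0, 2.0, 4.0])
print(d.cdf(2.0))
```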

We could have other types of empirical distributions, with support for weighted samples (that a high-performance ECDF-based one may not provide), possibly based on kernel density estimation, etc.

The “histogram as distribution” would also be a type of empirical distribution. It would provide pdf() since a histogram is effectively a (simple) kernel density estimation.
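As a rough sketch of what "histogram as distribution" could mean in one dimension: normalize the counts so the piecewise-constant density integrates to 1, then pdf() is a bin lookup and cdf() is linear within each bin. This is plain Python for illustration; `HistogramDistribution` is a hypothetical name, not an existing type.

```python
from bisect import bisect_right


class HistogramDistribution:
    """Hypothetical piecewise-constant density from bin edges and counts."""

    def __init__(self, edges, counts):
        self.edges = list(edges)
        counts = list(counts)
        if len(self.edges) != len(counts) + 1:
            raise ValueError("need len(edges) == len(counts) + 1")
        total = float(sum(counts))
        widths = [b - a for a, b in zip(self.edges[:-1], self.edges[1:])]
        # Normalize so that the density integrates to 1 over all bins.
        self.density = [c / (total * w) for c, w in zip(counts, widths)]
        # Cumulative probability at each bin edge.
        self.cum = [0.0]
        for c in counts:
            self.cum.append(self.cum[-1] + c / total)

    def _bin(self, x):
        # Index of the bin [edges[k], edges[k+1]) that contains x.
        return bisect_right(self.edges, x) - 1

    def pdf(self, x):
        if x < self.edges[0] or x >= self.edges[-1]:
            return 0.0
        return self.density[self._bin(x)]

    def cdf(self, x):
        if x < self.edges[0]:
            return 0.0
        if x >= self.edges[-1]:
            return 1.0
        k = self._bin(x)
        # Probability of all earlier bins plus the linear part in bin k.
        return self.cum[k] + self.density[k] * (x - self.edges[k])


h = HistogramDistribution([0.0, 1.0, 3.0], [1, 3])
print(h.pdf(0.5), h.pdf(2.0), h.cdf(2.0))
```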

This way, different approaches could be used in a generic fashion, though each might come with specific limitations.

For example, I implemented the Distributions.jl API in GeoStatsBase.jl: https://github.com/juliohm/GeoStatsBase.jl/blob/master/src/distributions.jl

Of course this implementation is hanging there just because at that time I couldn’t find a better place. I should definitely migrate this implementation to somewhere else, or use it as a baseline along with other implementations to design a best of all worlds implementation.

Someone with more experience in empirical distribution APIs could lead here.

While ad hoc methods have their place, ultimately I think it is better to go with nonparametric methods that have proven themselves to be robust and preferably have some theory behind them.

E.g., if a sample realization is used as a “distribution”, it can be understood as a kernel density estimator with bandwidth going to 0, but for most purposes one can do much better. Similarly, binning into a histogram can be understood as estimating a mixture distribution of uniforms, but given some basic smoothness assumptions I think it is possible to dominate this very easily with rather cheap nonparametric methods.
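For reference, the two readings can be written out in their standard textbook forms. The symbols are notation introduced here, not taken from the posts: samples $x_1, \dots, x_N$, kernel $K$, bandwidth $h$, bin edges $b_1 < \dots < b_{M+1}$, and bin counts $n_k$.

$$
\hat f_h(x) = \frac{1}{N h} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right),
\qquad
\hat f_{\mathrm{hist}}(x) = \sum_{k=1}^{M} \frac{n_k}{N} \cdot \frac{\mathbf{1}[\,b_k \le x < b_{k+1}\,]}{b_{k+1} - b_k}
$$

As $h \to 0$ the first estimator collapses onto point masses at the samples; the second is a mixture of $M$ uniform densities with weights $n_k / N$.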

Because of this, I don’t think that including ad hoc methods in packages intended for a general audience is a good idea.

> While ad hoc methods have their place,

Exactly.

> I think it is better to go with nonparametric methods that have proven themselves to be robust and preferably have some theory behind them.
> […]
> but for most purposes one can do much better […] I think it is possible to dominate this very easily with rather cheap nonparametric methods.

Could you make a few suggestions? Let’s say I want to use some MCMC output (weighted samples, 2 relevant parameters) as a prior in a different MCMC analysis - which method would you recommend to estimate the log-density? I’m aware that there is a wide variety of methods to estimate point-cloud densities, but you seem to be very experienced in this area, so I’d be glad to get some recommendations regarding fast and robust methods (ideally methods for which Julia implementations exist already).

> Because of this, I don’t think that including ad hoc methods in packages intended for a general audience is a good idea.

I strongly disagree with that statement - ad hoc methods are not that uncommon, and there are valid use cases for them. Saying that they have no place in packages in the general registry seems to be - sorry to be so blunt - a bit narrow-minded and incompatible with the open spirit of the Julia community. Of course an “EmpiricalDistributions” package should also offer more advanced methods, but saying that ad-hoc methods must not be in there doesn’t make sense to me.

AFAIK importance sampling is the generic solution for this (with SMC methods as kind of a special case that can handle this in a very natural way). But this of course can decimate your effective sample size. In some cases it may be more efficient to just rerun the analysis with the new model.
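To make the "decimate your effective sample size" point concrete, here is a small sketch using the common Kish approximation, ESS = (Σw)² / Σw². That formula is my assumption of what is meant; the post does not name one. The function name is hypothetical.

```python
def effective_sample_size(weights):
    """Kish effective sample size of a set of importance weights."""
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return s * s / s2


# Equal weights keep every sample; one dominant weight leaves almost none.
print(effective_sample_size([1.0, 1.0, 1.0, 1.0]))
print(effective_sample_size([1.0, 0.01, 0.01, 0.01]))
```

When the old posterior and the new target differ a lot, a few samples carry most of the weight and the ESS drops toward 1, which is when rerunning the analysis becomes the cheaper option.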

In a pinch, I would consider approximating the previous posterior with a KDE, which is nice and smooth. A histogram could have all sorts of discontinuities. YMMV.
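A minimal sketch of the KDE route for weighted samples, in one dimension with a Gaussian kernel. The function name and the fixed `bandwidth` argument are assumptions for illustration; bandwidth selection is left to the caller, and a real analysis would probably use an existing KDE package instead.

```python
import math


def weighted_kde_logpdf(x, samples, weights, bandwidth):
    """Log-density at x of a Gaussian KDE over weighted 1-D samples."""
    total = sum(weights)
    log_norm = -math.log(bandwidth) - 0.5 * math.log(2.0 * math.pi)
    # Log of each weighted kernel term, kept in log space for stability.
    terms = [
        math.log(w / total) + log_norm - 0.5 * ((x - s) / bandwidth) ** 2
        for s, w in zip(samples, weights)
        if w > 0
    ]
    # Log-sum-exp so that far-away samples do not underflow to log(0).
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


print(weighted_kde_logpdf(0.3, [0.0, 1.0, 2.5], [0.5, 0.3, 0.2], 0.4))
```

Unlike a histogram, this log-density is smooth and finite everywhere, which matters if it is used as a prior inside another sampler.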

I am afraid you misunderstood what I was saying: I meant that one should be careful about putting ad hoc methods without theoretical foundations in packages intended for a general audience (e.g. something like Distributions.jl).

I don’t know why you understood this as referring to General Registry: of course there are no restrictions on that. Anyone can register a package with whatever methods they like.

Ah, sorry for the misunderstanding! I misinterpreted what you meant by “general audience”. And I do agree with you that Distributions may be too general a package for something like this.

I think a less general package EmpiricalDistributions or so could (maybe should) include both ad-hoc and other methods.

> I think it is better to go with nonparametric methods

I’m curious though - why would using a histogram not count as a non-parametric method? Isn’t a histogram a (fairly primitive) nonparametric estimate of a probability distribution? Why would it have less “theoretical foundations” than a kernel density estimation with some arbitrarily chosen kernel?

Sure, the histogram isn’t smooth, but if you use gradient-free methods (e.g. because you have few parameters and the gradient is difficult to obtain) that won’t matter so much. I guess we agree that it depends very much on the use case.

My main concern would be that for $\mathbb{R}^n$ with $n \gg 1$ (say, 5 :wink:) the histogram will be either very crude or very sparse. For univariate it should not matter much, but for that you need very simple methods anyway.
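A quick back-of-the-envelope sketch of why: with a fixed number of bins per axis, the bin count grows exponentially with the dimension while the sample count stays put. The numbers (20 bins per axis, 100,000 samples) are made up for illustration.

```python
def mean_count_per_bin(n_samples, bins_per_axis, n_dims):
    """Average number of samples per bin on a regular n-dimensional grid."""
    return n_samples / bins_per_axis ** n_dims


# Print dimension, total number of bins, and average samples per bin.
for n_dims in range(1, 6):
    print(n_dims, 20 ** n_dims, mean_count_per_bin(100_000, 20, n_dims))
```

At five dimensions this grid already has 3.2 million bins for 100,000 samples, so most bins are empty unless the binning is made much coarser.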

> My main concern would be that for $\mathbb{R}^n$ with $n \gg 1$ […]

Oh, sure, this is only workable for 1 to 3 (4 at the very extreme) parameters. But often, that’s enough, because this may be the result of something that had a lot of nuisance parameters (that can be marginalized away) and only very few parameters that are relevant or that will enter the current problem.

> For univariate it should not matter much, but for that you need very simple methods anyway.

Actually, we encounter that case (a univariate empirical distribution that becomes a prior for one parameter in an analysis with many other parameters) quite often. Ideally one would use a density estimation or some other kind of fit - but sometimes, only the histogram is available (though it’s often both well-filled and finely binned, so quite usable), or a histogram is preferred for other reasons.

I think the discussion of what is ad-hoc versus what is not is not relevant. Many ad-hoc methods are widely used in many fields; take neural networks as an example. Just because something is not yet derived from axioms or proved somehow doesn’t mean it is not useful or widely used.

I don’t see any issue in including empirical distributions in Distributions.jl. If users want to use them, they can. They are not forced to use an ad-hoc method, or a specific feature. If they want just analytical distributions, fine. Let them use what they want. The point I am trying to make is that empirical distributions are everywhere in some domains, please don’t diminish their importance.
