# Estimating a Multivariate Distribution

Is there a package that can estimate a multivariate distribution? The input to this estimation isn’t really set in stone right now, but preferably it would be something that Optim.jl could work with. Essentially, I want something similar to KernelDensity.jl or AverageShiftedHistograms.jl, but for a multivariate distribution. My specific project has only two dimensions, but they’re dependent, so I think the bivariate capabilities of the two aforementioned packages don’t apply. Is defining a custom MultivariateDistribution the way to go here?

Any multivariate distribution? Such a problem is not well-specified. E.g., the discrete (empirical) distribution of the sample points could be a valid answer.

You would need to add some details. E.g., multivariate normal or t distributions are quite easy to estimate using likelihood-based methods, but may not fit your data well (this depends on your data, which we do not know). Mixtures or nonparametric methods would handle more general shapes, as would other, more advanced methods.

3 Likes

It seems you should try the bivariate KDE first. If you’re just trying to make plots or evaluate a density at some points, this might be enough. The dependency should be evident if the data are sufficient.
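For concreteness, a minimal sketch of what that might look like with KernelDensity.jl (the data here are synthetic placeholders; the `kde`/`InterpKDE`/`pdf` calls follow that package’s documented API):

```julia
using KernelDensity, Random

Random.seed!(1)
# Two dependent columns: y is correlated with x
x = randn(1000)
y = 0.5 .* x .+ 0.3 .* randn(1000)

# Bivariate KDE: pass the two samples as a tuple
k = kde((x, y))
# k.x and k.y hold the grid coordinates, k.density the estimated density

# To evaluate the density at arbitrary points, wrap it in an InterpKDE
ik = InterpKDE(k)
d = pdf(ik, 0.0, 0.0)  # density estimate at the point (0, 0)
```

Plotting `k.density` as a heatmap over `k.x`/`k.y` is usually the quickest way to see whether the dependency shows up.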

1 Like

Sorry, I think the word “fit” was the wrong choice here. What I’m looking for is a way to convert a two-dimensional histogram (or similar data, like a two-dimensional matrix) into a continuous multivariate distribution or its pdf, so that I can sample randomly from said distribution. In that sense it is “any multivariate distribution,” but not in the sense of trying to fit against an infinite number of candidate distributions. I believe the two dimensions are dependent, but I’ll try the bivariate KDE as @dlakelan suggested.

In 1D the bandwidth parameter controls the “width” of the kernel. But in 2D, not only is there a width in each dimension, there’s also a covariance structure to the kernel. If you can adapt the covariance structure to the data, your kernel can do a better job. In general, though, that’s a hard problem, and the covariance can change from place to place. Imagine a distribution shaped like a banana: in one region of x,y space the data stretch out vertically; as you move along the banana they may stretch diagonally; and further along they may be horizontal. In N dimensions this only gets worse.

But for 2D with sufficient data, you can get a smooth KDE with a bivariate independent kernel that works well enough for many purposes. Give it a try. Better yet, come back and give us a plot of what you got and how well it worked!

1 Like

I don’t know what I was thinking, but I now understand that there is no distinction between a bivariate distribution and a two-dimensional multivariate distribution, dependence included. I think I was just confused by the inputs to KernelDensity.jl’s bivariate kde being two vectors. Thank you, and sorry for the misunderstanding.

If you don’t need the pdf but only need to sample from the distribution, you can add Gaussian noise to a point sampled at random from your data. This is equivalent to sampling from the KDE estimate (with a Gaussian kernel whose covariance matches that of the noise).
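A sketch of that jittering approach in plain Julia, assuming an isotropic Gaussian kernel; the bandwidth `h` and the `data` matrix are placeholders you’d substitute for your own:

```julia
using Random, Statistics

# data: n×2 matrix, one observation per row (synthetic here)
data = randn(500, 2)

# Draw one sample from the KDE with Gaussian kernel of bandwidth h:
# pick a data point uniformly at random, then add N(0, h²I) noise.
function sample_kde(data::AbstractMatrix, h::Real)
    i = rand(1:size(data, 1))
    return data[i, :] .+ h .* randn(size(data, 2))
end

samples = [sample_kde(data, 0.2) for _ in 1:1000]
```

An anisotropic kernel would just mean replacing `h .* randn(2)` with a draw from a 2D normal with the kernel’s covariance matrix.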

If you only have a histogram, you could sample a bin according to its mass, sample a point uniformly from within that bin, and then add Gaussian noise.
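Those steps might look like this in base Julia; `counts`, `xedges`, and `yedges` are hypothetical names for your histogram’s bin counts and bin edges:

```julia
using Random

# A tiny 2×2 histogram: counts[i, j] is the mass of bin (i, j),
# with x-edges xedges[i]..xedges[i+1] and y-edges yedges[j]..yedges[j+1].
counts = [1 2; 3 4]
xedges = [0.0, 1.0, 2.0]
yedges = [0.0, 1.0, 2.0]

function sample_hist(counts, xedges, yedges; h = 0.0)
    # 1. pick a bin with probability proportional to its count
    w = vec(counts) ./ sum(counts)
    idx = findfirst(cumsum(w) .>= rand())
    i, j = Tuple(CartesianIndices(counts)[idx])
    # 2. sample a point uniformly within that bin
    x = xedges[i] + rand() * (xedges[i + 1] - xedges[i])
    y = yedges[j] + rand() * (yedges[j + 1] - yedges[j])
    # 3. optionally smooth by adding Gaussian noise of scale h
    return (x + h * randn(), y + h * randn())
end

p = sample_hist(counts, xedges, yedges; h = 0.1)
```

With `h = 0` this samples exactly from the piecewise-uniform density implied by the histogram; a positive `h` smooths it, much like a KDE.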

2 Likes