I am not an expert in statistics, so maybe this question is a bit naive. I have a data vector with ~10^6 entries in [0,1] and want to identify a distribution that best fits this data. I wonder whether it is possible to do this automatically, i.e., to pass the data to a function which returns the best-fitting distribution and its parameters. I’m aware of the Distributions.jl package but have not found such an option there.
The difficulty here is how one defines the word “best”.
For some distributions/cases you want to remove the first bin of your data, for others not. Sometimes you can fit the bulk of a PDF but miss a critical effect, while a runner-up may do worse on the bulk and better on that effect. Which one is the best? It depends on the use case.
In my eyes it’s really best not to automate this decision. If you are fitting a distribution, you are likely making assumptions about your data which should go further than “gives the lowest figure of merit”. I’m sure others would disagree.
In addition to @anon92994695’s excellent answer, also note that there is an infinite number of distributions; it’s just that some of them have a name and a parametric family for historical/convenience reasons. Any increasing function $F: \mathbb{R} \to [0,1]$ (with some technical properties, such as right-continuity and the appropriate limits) defines a distribution. People usually use “named” distributions/families as building blocks.
Depending on your problem, I would recommend either
a simple transformation to the real line (eg with the logit) and fitting a normal (or a family of normals): this you can do directly with Distributions.jl, but you need to make sure there is no mass near the endpoints (a minimal sketch follows this list), or
a Bayesian approach.
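A minimal sketch of the first option, assuming the data lies strictly inside (0, 1); the `logit`/`logistic` helpers come from StatsFuns, and the simulated `x` below is just a stand-in for the real data:

```julia
using Distributions, StatsFuns

x = rand(Beta(2, 5), 10^6)      # stand-in for the real data, values in (0, 1)

z = logit.(x)                    # map (0, 1) to the real line
d = fit(Normal, z)               # MLE fit of a normal on the transformed scale

# to draw new values back on the original (0, 1) scale:
x_new = logistic.(rand(d, 1000))
```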
What you end up doing will depend on how much you care about modeling your data correctly. If you are working at a company, I would suggest hiring a statistician to do this; otherwise, read up on some methodology. Some recommendations in an earlier topic starting here:
Generally yes, but it all depends. Eg if the tails are relevant, KDE may be biased. Contrived example:
```julia
using KernelDensity, StatsFuns

z = randn(1_000_000)
x = @. logistic(z * 10) * 0.8 + 0.1  # 0.1 ≤ x ≤ 0.9
k = kde(x)
pdf(k, 0.05)                         # outside the support of the data
```
There is no free lunch with nonparametric methods, which is not surprising: what @mike_k is asking for is actually the fundamental question of statistics.
You may find the work on normalizing flows (NFs) by @torfjelde useful (https://github.com/TuringLang/Bijectors.jl#normalizing-flows). A lot of distributions can be described in terms of NFs of each other, e.g. uniform, normal, lognormal, gamma, exponential, etc. So a universal approximator NF like a neural network can give you a lot of fitting power. @torfjelde knows a lot more about this than I do.
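As a toy illustration of the “distributions as transforms of each other” idea (not an actual normalizing flow, just plain Distributions.jl with a fixed transform and made-up numbers):

```julia
using Distributions, Statistics

z = rand(Normal(0, 1), 10^6)   # samples from a base distribution
x = exp.(z)                     # push them through a fixed transform

# the transformed samples follow a named family: LogNormal(0, 1)
d = LogNormal(0, 1)
isapprox(median(x), median(d); rtol = 0.01)   # true (up to sampling noise)
```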
Thank you for all your answers. There is seemingly a lot to consider. I was hoping that one could simply compute some R^2-like measure describing how much of the variance in my data the estimated (parametric) distribution explains. I will give your suggestions a try.
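One ad hoc way to get such a number is to compare the fitted CDF against the empirical CDF. A rough sketch, where the Beta candidate and the simulated `x` are placeholders (and it assumes your Distributions.jl version provides an MLE fit for Beta); whether this score means anything for the application is exactly the caveat raised above:

```julia
using Distributions, StatsBase, Statistics

x = rand(Beta(2, 5), 10^6)         # placeholder for the real data in [0, 1]

d = fit(Beta, x)                    # candidate parametric fit (MLE)
F = ecdf(x)                         # empirical CDF

grid = range(0.001, 0.999; length = 1_000)
e = F.(grid)                        # empirical CDF on a grid
m = cdf.(d, grid)                   # fitted CDF on the same grid

# R²-style score on the CDF scale: 1 - SS_res / SS_tot
r2 = 1 - sum(abs2, e .- m) / sum(abs2, e .- mean(e))
```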
Maybe it also helps if I roughly explain the application: I have some Twitter graphs for which I computed the probabilities that a node j retweets something from, or replies to, node i. I wanted to take these probabilities as a guess of how messages propagate through social networks. I have several other social-network graphs (without propagation probabilities), so I wanted to randomly generate them according to the distribution of the ones I observed/computed.
I’m not a statistician, but I wrestled a lot with this same idea during my PhD and basically came to the realization that what matters most is understanding the process that produced the data. Once you understand this, you fit the distributions that make sense and see how that works. Simulation can help a lot with this. I understand this is not always easy or known, but when you do just random distribution fitting, it is hard to understand why a distribution fits and the parameters don’t always make sense; if you do it the other way around, things are clearer, IMO.
One approach which has been proposed for deciding which distribution to fit to 1D data is to use the Cullen and Frey graph (Cullen AC, Frey HC, 1999). It represents distributions in terms of their typical skewness and kurtosis and shows you which distributions your data may be close to.
There’s hardly a magical answer to your problem, though, all the more so since your data lies in [0,1]. You might go for a Beta distribution or a Beta kernel density estimate (I think you can do it with the package suggested by @mcreel).
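The coordinates used by the Cullen and Frey graph are just the sample skewness and kurtosis, which you can compute with StatsBase; a small sketch on simulated stand-in data (a Beta fit itself was sketched a couple of posts above):

```julia
using Distributions, StatsBase

x = rand(Beta(2, 5), 10^6)   # stand-in for the real data in [0, 1]

# coordinates used by the Cullen and Frey graph
skewness(x)                  # sample skewness
kurtosis(x) + 3              # StatsBase reports excess kurtosis; add 3 for raw kurtosis
```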
AutoFit is to be used with care and with a presumption about what the distributions should be ahead of time. That being said, your algorithm is very clean and works better on the generated trials than, say, the AD or KS goodness-of-fit tests. Thanks for sharing.
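For reference, the KS and AD goodness-of-fit tests mentioned here are available in HypothesisTests.jl; a minimal sketch against a fitted candidate, with simulated stand-in data (the approximate KS variant is used because of the sample size, and the Beta fit is just one possible candidate):

```julia
using Distributions, HypothesisTests

x = rand(Beta(2, 5), 10^6)             # stand-in for the real data
d = fit(Beta, x)                        # candidate distribution under test

ks = ApproximateOneSampleKSTest(x, d)   # Kolmogorov-Smirnov
ad = OneSampleADTest(x, d)              # Anderson-Darling

pvalue(ks), pvalue(ad)
```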