More flexible Distributions package

Dear friends, I am a statistician and I have all my training in this area. In my PhD I worked with probability distribution generators and recently I became interested in the Julia language. I come from the R language and sometimes I “dirty” my hands with C++ when I need, in R, a little more performance.

I came across, in Julia, the package Distributions that even allows us to extend the universe of distributions, based on their abstract types and internal functions .

Currently, since the early 2000s, probability distributions have expanded very widely. Today we have several probability distribution generators, which generate new families of much more flexible probability distributions, however, with properties that need to be calculated, in general, numerically.

For example, if G is a probability distribution, then G^a, with a > 0 is a new probability distribution with one more shape parameter, with G as a special case. This generator is called Exp-G. If G is Weibul, we can generate the Exp-Weibul(alpha, beta, a), and so on. Other than that, there are thousands of other probability distribution generators and a new one comes out every day.

In these new complex distributions, there is no need to have the maximum likelihood estimators in closed form, moments in closed form, etc. In these cases, there is no need to obtain the log-likelihood function, since it can be obtained numerically.

For example, many probability distributions do not have a quantile function defined in closed form. This would prevent, for example, generating observations of random variables that follow these distributions. But we have numerical methods to generate observations of a distribution, for example, the accept and reject method, for univariate distributions that are widely used.

Direct questions: wouldn’t it be interesting for the Distributions package to allow the inclusion of generic distributions so that these quantities are estimated numerically? For example, in the case of univariate distributions, wouldn’t it be interesting to have numerical ways of generating observations using the method of acceptance and rejection? Wouldn’t obliging the implementation of new functions in the package only for distributions that have exact moments and maximum likelihood estimators be limiting?

3 Likes

You might be interested in

Which is an attempt at a more flexible package for probability (and other) distributions

5 Likes

And as for your other questions, there are plenty of packages in Julia related to MCMC methods or probabilistic programming. I guess these would fit the criterion of “numerical” computation or moments and other quantities?

Indeed; I also think @cscherrer 's package is worth looking at.

1 Like

Thanks for the library suggestion. It really seems to be very generic and therefore very flexible. As I work daily with probability distributions, it is very tedious to use the Distributions package. There’s tremendous overhead that makes experimenting tiring.

A data scientist, before he cares about performance, he cares about modeling and wants speed in experimentation, too. In fact, I believe that 90% of the time what you want is speed and practicality in experimentation. There are languages ​​like R and Python being widely used and solving many problems until today.

From what I’ve seen, the MeasureTheory library weighs this, which is pretty good.

I will try to persevere in my studies of Julia to come up with libraries like this and thus convince more data scientists to use Julia.

Yours sincerely.

1 Like

It’s true, I thought that the Distributions package was the most famous and that I would like to get closer to the reality of use that statisticians and data scientists do on a daily basis.

Perhaps having an abstract NumericalDist type allowing more flexibility of use and some packed numeric routines to work with the generic distribution, making it easier to experiment:

1 - Introduce a new distribution easily, without great demands;
2 - If the type is NumericalDist, the basic properties can be estimated numerically;
3 - Generate observations by alternative methods, such as the acceptance and rejection method. In many practical situations, we don’t have F^{-1};
4 - Easily estimate the maximum likelihood estimators, returning whether or not there was convergence. This would allow within a Monte Carlo simulation, an iteration to be easily discarded. Of course, the tuning could be done separately by packages like Optim.jl and the discard could be done easily in case of non-convergence or less in case of error, as we can use error handling. However, I think it’s positive to make things easy to use. Companies like Rstudio Ltda, which will now be called Posit Ltda, have been very successful in their tools, as they value the ease of experimentation.

Best regards.

I’m not an expert in distributions so I’m not sure I understand exactly the needs discussed here, but do you really need Distributions.jl to change in order to do what you mention? Can’t that be implemented in a separate package which should just define new distributions that implement the Distributions.jl interfaces? The strength of Julia is that there’s nothing special about Base or classic packages, you can extend them with the same performance from other packages.

1 Like

The reason for my speech is that it would be very convenient to have this functionality in Distributions.jl, as it would fit well in the package that is perhaps Julia’s best known in this topic.

As for the possibility of implementing and doing it outside the package, or even creating another package, of course it is possible. What I’m arguing is that Distributions.jl could incorporate the current designer and allow a more flexible designer with some numerical computation routines on probability distributions. This would most likely duplicate the use of the package, allowing, for example, the generic implementation of a probability distribution that could be handled numerically, as well as allowing rapid experimentation with new distribution families.

1 Like

I really do think MeasureTheory is what you’re looking for. If not, maybe take a look at the packages that import GitHub - JuliaMath/DensityInterface.jl: Interface for mathematical/statistical densities in Julia

I don’t quite understand your concerns. As far as I know, Distributions.jl does not require or enforce that closed formulas are available. In the end, each distribution just has to implement a small interface for sampling, density evaluations and the like.
Indeed, rejection sampling or numerical methods are already used in some/many distributions. Just check for instance the implementation of the von Mises distribution or even the standard normal. Even if the quantile function happens to be known analytically it is often more efficient to sample differently, e.g. using rejection sampling. For the standard normal, the quantile function can be approximated numerically - it is not analytic as far as I know - and sampling is often done via the Box-Muller transform or the Marsaglia polar method - a form of rejection sampling.

Guess you could just implement (parts of) the interface for the distributions you mentioned in a similar fashion … and maybe open a pull request.

Thanks for the feedback.

I’m thinking, at some point propose something. As soon as I have more time, and I finish the classes I’m teaching and the theses I’m supervising, I want to see something along these lines. Maybe even in partnership with a student.

Just to be clear, I think I was, but I can try to be a little more. I think the Distributions.jl package could allow embedding a generic distribution, so that I can calculate quantities, numerically, without needing to specify the function that generates observations from that distribution, without needing to specify the log-likelihood function, etc.

For example, one could easily use the package to obtain the maximum likelihood estimates of any distribution. It is possible to implement, using closures and varargs functions, a generic fit function that takes any probability density function. A tuple of arbitrary size would be useful, as we would not know how many parameters of the probability density function that the function user would pass as an argument to the fit function.

Hence, regardless of what the probability density function is, fit could always find the maximum likelihood estimates. It is not necessary to analytically unravel the log-likelihood function, as it can be easily implemented in a generic way for any probability density function, including those that have not yet been created. At first, only for the univariate case.

The Box-Muller and Polar method are great for generating normal observations. They are exact methods, which make use of transformations of random variables. Random variable functions are random variables with a specific probability distribution. In the case of Box-Muller and the Polar method, the transformations lead to functions of random variables that have a Normal distribution. It would make no sense, computationally, for someone to use the method of acceptance and rejection to generate observations of a random variable with Normal distribution. The accept and reject method works for any distribution, but would be costly (unnecessary) to generate normal observations.

However, it is possible, regardless of who the distribution is, to implement a function that takes any probability function or probability density function, and thus generates observations from that distribution. You can do this, generalizing even to receive functions from univariate distributions that are yet to be discovered. Who would use this function would only specify the density or probability function and would pass to the function that would implement the method of acceptance and rejection.

Hence, if the Distributions.jl library also had some more generic method to generate observations (for the univariate case it would be very useful) for any distribution, such as the accept and reject method, it would be something very interesting, as it would be possible to generate for any distribution of univariate probability.

I will end this thread with @baggepinnen’s suggestion which seems to be the most convenient for my question. If eventually I come to do something along these lines, I will return here at some point.