Distributions.loglikelihood

Can someone please explain Distributions.loglikelihood? A likelihood is a function of the parameters, but here it seems to be just another name for a broadcasted logpdf, a function of the data. Maybe there’s a history behind this? The standard definition is very useful, and the current use makes things confusing and prevents future implementation of an actual loglikelihood. Could there be a path to renaming this?

3 Likes

You can compute the likelihood of different samples at a given parameter value, but you can also compute the likelihood of a given sample at different parameter values, which is what you need to do for MLE or Bayesian analysis. For example,

julia> using Distributions

julia> y = rand(Exponential(3.),100);

julia> avglnℒ = θ -> Distributions.loglikelihood(Exponential(θ), y)/100
#15 (generic function with 1 method)

julia> avglnℒ(5.)
-2.2716712727549306

julia> avglnℒ(4.)
-2.214086061520928

julia> avglnℒ(3.)
-2.2023345558694927

julia> 

Ok, but you unpacked it manually. So you used the function to build a log-likelihood, but the function itself is not one.

A proper implementation of this loglikelihood would be really useful. It would start with e.g.

julia> suffstats(Normal, randn(1000000))
NormalStats(2025.52, 0.00202552, 1.0019e6, 1.0e6)

Given the family and data, you get sufficient statistics that give you a way to efficiently evaluate on different parameters.

1 Like

The log likelihood function is a function of both the parameters and the data, which is what my example used. I don’t understand what you mean by unpacking it, I just called it with two specific argument values, the parameter value, and the sample data. I’m sorry, but I don’t understand your criticism of the function as is. To me, it seems perfectly correct and useful as it is.

Likelihoods are different from densities in two ways:

  1. Conceptually, they’re a function of the parameters instead of the data.
  2. Computationally, they’re optimized for computations with fixed data and changing parameters.

The current loglikelihood approach doesn’t “know” it’s a likelihood. It has no concept of the data being fixed. So in your example, each evaluation traverse the data. In your case, that’s just 100 values, but it could have been millions or billions.

Compare with

julia> suff = suffstats(Exponential, x)
ExponentialStats(44.7615, 100.0)

The family and the data together “know” they don’t need 100 values (or a million or a billion), but just two. The “right” way to build likelihoods treats the data as fixed, allowing optimizations like this.

And if you don’t treat logpdf and loglikelihood as different things, there’s really no reason to have two different names.

2 Likes

Looks interesting but could you give a complete example of what it would look like to calculate a likelihood with your idea?

So your point is basically that to compute the loglikelihood of a Normal sample, I only need to know the sample mean and variance. Is that what you are getting at?

I disagree that the likelihood is a function only of the parameters. The log likelihood is the log of the joint density of the sample, given a parameter vector. To evaluate it, you need both the entire data set and the parameter values, in general. So, the implementation in Distributions.jl is correct, I believe. There’s no need for a distinction between logpdf and loglikelihood, because the family already tells us whether the sample is of discrete or continuous R.V.s.

In the case where sufficient statistics exist, then knowing the family and the statistics means that the whole data set is not need, and that the likelihood function could be computed without knowing the entire data set. But there are many cases where sufficient statistics don’t exist. But whether this is possible or not doesn’t really have implications about the correctness of the implementation of loglikelihood in Distributions.jl. It just means that there might be a faster way to compute the likelihood function at different parameter values, given the data, through using sufficient statistics.

1 Like

Yes, and what you need to compute depends on the family.

julia> using Distributions

julia> x = rand(1000000);

julia> suff = suffstats(Exponential, x)
ExponentialStats(5.00513e5, 1.0e6)

julia> loglik(ss::Distributions.ExponentialStats, r) = -ss.sw * log(r) - ss.sx / r
loglik

julia> using BenchmarkTools

julia> @btime loglikelihood(Exponential(0.3), x)

  7.021 ms (1 allocation: 16 bytes)
-4.64405e5

julia> @btime loglik(suff, 0.3)
  3.286 ns (0 allocations: 0 bytes)
-4.64405e5
3 Likes

I seem to recall, but am not sure and can’t find a reference for it, that the original intention was that loglikelihood would differ in that:

  • it could act on arrays, and would reduce via sum (i.e. would be the log-likelihood of iid samples)
  • it could drop any terms that don’t depend on the parameters (e.g. the factorial in the Poisson distribution)
1 Like

Thanks @simonbyrne, this is helpful. It makes sense to have this function available, it’s just unfortunate it prevents defining loglikelihood in the usual sense.

I certainly agree with that, I’m not sure how widely it is used.

Personally, I think something like:

loglikelihood(D, data, params...)

where data is can be either an array or a sufficient statistic, or even

LogLikelihood(D, data)(params...)

which would make it easy to fit with optimization routines.

Here is some very old discussion of the related issues:

3 Likes

This sounds similar to what we’re doing in MeasureTheory. For us, a Likelihood is a struct taking a Kernel and an observation (which might be a vector or matrix, etc). We want to specialize this to different kinds of kernels, using a suffstats-like approach.

3 Likes

Using sufficient statistics, when possible, makes sense, to save having to work with a potentially large sample. So, perhaps a method could be added to loglikelihood to take sufficient statistics as an argument, instead of the whole sample.

An example of a case where there are no sufficient statistics is the independent, non-identically distributed case. For example,

using Distributions, Optim
n = 100
x = [ones(n) randn(n)]
θ = [1., 2.]
y = rand.(Exponential.(exp.(x*θ)))
ℒ = θ -> -mean(loglikelihood.(Exponential.(exp.(x*θ)), y))
optimize(ℒ, zeros(2)).minimizer 

You need the full sample, y and x, to get the ML estimates, and the present form of loglikelihood allows for working with this case. So, extending the function to accept sufficient statistics seems useful, but I think that the present form is also useful.

Regarding allowing loglikelihood to drop contant parts, that could be problematic if one wanted to use information criteria to compare models. Retaining the constant parts is very low cost.

1 Like

Imagine loglikelihood was defined in this way, optimized for “fixed data”, and taking parameters as a final argument. Would you be arguing to add methods that treat loglikelihood and logpdf as the same thing? To me this just adds confusion, and I don’t see any benefit.

That’s sometimes true; the canonical exception to this is Gaussian linear regression.

I think this is only true because Distributions doesn’t have the machinery you’d need for making likelihoods as expressive as they can be. Ideally, there should be a general way of specifying likelihoods (function of the parameter, with fixed data), and special cases with convenient sufficient statistics would “just work” through dispatch, with no special attention needed from the end user.

I don’t see any reason we can’t do this, but the current “likelihood is just a pdf” approach is definitely an obstacle, at least for doing this in Distributions.

3 Likes

I’m not really sure why both loglikelihood and logpdf are defined in Distributions.jl, I haven’t ever studied the package carefully enough to know why that was done. That last example I posted works identically if loglikelihood is replaced with logpdf. That is good, I would say, because that is what would be expected using standard definitions from theory. However, I am confused as to why both are defined in the package, as, to my mind, they are the same thing. Maybe it has to do with broadcasting…?

Some people are very careful to distinguish these, and the distinction is consistent. If these were the same thing, the decision of which to use in a given context would just be a coin flip. And again, allowing these terms to be swapped arbitrarily just leads to lots of confusion.

  • The probability density is a function of the data, given the parameters. It integrates to one.
  • The likelihood is a function of the parameters, for fixed data. In general, it does not integrate to one.

As you showed in your first example, it’s possible to compute either of these from the other. But that doesn’t make them conceptually or computationally identical.

2 Likes