Can someone please explain Distributions.loglikelihood? A likelihood is a function of the parameters, but here it seems to be just another name for a broadcasted logpdf, a function of the data. Maybe there’s a history behind this? The standard definition is very useful, and the current use makes things confusing and prevents future implementation of an actual loglikelihood. Could there be a path to renaming this?
You can compute the likelihood of different samples at a given parameter value, but you can also compute the likelihood of a given sample at different parameter values, which is what you need to do for MLE or Bayesian analysis. For example,
julia> using Distributions
julia> y = rand(Exponential(3.),100);
julia> avglnℒ = θ -> Distributions.loglikelihood(Exponential(θ), y)/100
#15 (generic function with 1 method)
julia> avglnℒ(5.)
-2.2716712727549306
julia> avglnℒ(4.)
-2.214086061520928
julia> avglnℒ(3.)
-2.2023345558694927
Ok, but you unpacked it manually. So you used the function to build a log-likelihood, but the function itself is not one.
A proper implementation of this loglikelihood would be really useful. It would start with e.g.
julia> suffstats(Normal, randn(1000000))
NormalStats(2025.52, 0.00202552, 1.0019e6, 1.0e6)
Given the family and the data, you get sufficient statistics that let you efficiently evaluate the log-likelihood at different parameter values.
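For example, here’s a rough sketch of what that could look like for the Normal family, using the NormalStats fields m, s2, and tw (mean, sum of squared deviations from the mean, and total weight):
using Distributions
x = randn(1_000_000)
ss = suffstats(Normal, x)
# Sum of logpdf over the sample, written in terms of the sufficient statistics,
# using Σᵢ (xᵢ - μ)² = s2 + tw * (m - μ)²:
loglik(ss::Distributions.NormalStats, μ, σ) =
    -ss.tw * log(σ) - ss.tw * log(2π) / 2 - (ss.s2 + ss.tw * (ss.m - μ)^2) / (2 * σ^2)
loglik(ss, 0.0, 1.0) ≈ loglikelihood(Normal(0.0, 1.0), x)   # should be true
Each evaluation of loglik only touches the stored summaries, not the million observations.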
The log likelihood function is a function of both the parameters and the data, which is what my example used. I don’t understand what you mean by unpacking it; I just called it with two specific arguments: the parameter value and the sample data. I’m sorry, but I don’t understand your criticism of the function as it is. To me, it seems perfectly correct and useful as it stands.
Likelihoods are different from densities in two ways:
- Conceptually, they’re a function of the parameters instead of the data.
- Computationally, they’re optimized for computations with fixed data and changing parameters.
The current loglikelihood approach doesn’t “know” it’s a likelihood. It has no concept of the data being fixed. So in your example, each evaluation traverses the data. In your case, that’s just 100 values, but it could have been millions or billions.
Compare with
julia> suff = suffstats(Exponential, x)
ExponentialStats(44.7615, 100.0)
The family and the data together “know” they don’t need 100 values (or a million or a billion), but just two. The “right” way to build likelihoods treats the data as fixed, allowing optimizations like this.
And if you don’t treat logpdf and loglikelihood as different things, there’s really no reason to have two different names.
Looks interesting but could you give a complete example of what it would look like to calculate a likelihood with your idea?
So your point is basically that to compute the loglikelihood of a Normal sample, I only need to know the sample mean and variance. Is that what you are getting at?
I disagree that the likelihood is a function only of the parameters. The log likelihood is the log of the joint density of the sample, given a parameter vector. To evaluate it, you need both the entire data set and the parameter values, in general. So, the implementation in Distributions.jl is correct, I believe. There’s no need for a distinction between logpdf and loglikelihood, because the family already tells us whether the sample is of discrete or continuous R.V.s.
In the case where sufficient statistics exist, knowing the family and the statistics means that the whole data set is not needed, and the likelihood function can be computed without it. But there are many cases where sufficient statistics don’t exist. Either way, this doesn’t really have implications for the correctness of the implementation of loglikelihood in Distributions.jl; it just means that there might be a faster way to compute the likelihood function at different parameter values, given the data, by using sufficient statistics.
Yes, and what you need to compute depends on the family.
julia> using Distributions
julia> x = rand(1000000);
julia> suff = suffstats(Exponential, x)
ExponentialStats(5.00513e5, 1.0e6)
julia> loglik(ss::Distributions.ExponentialStats, r) = -ss.sw * log(r) - ss.sx / r
loglik (generic function with 1 method)
julia> using BenchmarkTools
julia> @btime loglikelihood(Exponential(0.3), x)
7.021 ms (1 allocation: 16 bytes)
-4.64405e5
julia> @btime loglik(suff, 0.3)
3.286 ns (0 allocations: 0 bytes)
-4.64405e5
I seem to recall, but am not sure and can’t find a reference for it, that the original intention was that loglikelihood would differ in that:
- it could act on arrays, and would reduce via sum (i.e. it would be the log-likelihood of iid samples)
- it could drop any terms that don’t depend on the parameters (e.g. the factorial in the Poisson distribution; see the sketch below)
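For instance, a rough sketch (not the actual implementation) of what the second point means for the Poisson case:
using Distributions
x = rand(Poisson(4.0), 1_000)
# Full iid log-likelihood: reduce logpdf over the sample via sum.
full(λ) = sum(logpdf.(Poisson(λ), x))
# Hypothetical "dropped constants" version: the -log(x!) terms don't depend on λ,
# so they can be omitted without changing where the maximum is.
kernel(λ) = sum(x) * log(λ) - length(x) * λ
# The two differ only by a λ-independent constant:
full(4.0) - kernel(4.0) ≈ full(5.0) - kernel(5.0)   # true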
Thanks @simonbyrne, this is helpful. It makes sense to have this function available, it’s just unfortunate it prevents defining loglikelihood in the usual sense.
I certainly agree with that, but I’m not sure how widely it is used.
Personally, I think something like:
loglikelihood(D, data, params...)
where data can be either an array or a sufficient statistic, or even
LogLikelihood(D, data)(params...)
which would make it easy to fit with optimization routines.
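A rough sketch of that callable-struct idea (LogLikelihood here is hypothetical, not an existing Distributions.jl type):
using Distributions
struct LogLikelihood{D,T}
    data::T    # raw data, or a sufficient statistic
end
LogLikelihood(::Type{D}, data) where {D<:Distribution} = LogLikelihood{D,typeof(data)}(data)
# Generic fallback: rebuild the distribution at each parameter value.
(ll::LogLikelihood{D})(params...) where {D} = loglikelihood(D(params...), ll.data)
# Specialization: if the stored data is an ExponentialStats, evaluate from the two
# summaries alone (scale parameterization: -n*log(θ) - Σx/θ).
(ll::LogLikelihood{Exponential,Distributions.ExponentialStats})(θ) =
    -ll.data.sw * log(θ) - ll.data.sx / θ
y = rand(Exponential(3.0), 100)
ℓ  = LogLikelihood(Exponential, y)                          # data = the raw sample
ℓ2 = LogLikelihood(Exponential, suffstats(Exponential, y))  # data = sufficient statistics
ℓ(3.0) ≈ ℓ2(3.0)   # same value, but ℓ2 never touches the raw sample again
An object like ℓ is exactly the fixed-data, function-of-the-parameters thing that could be handed to an optimization routine.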
Here is some very old discussion of the related issues:
https://github.com/JuliaStats/Distributions.jl/pull/112
This sounds similar to what we’re doing in MeasureTheory. For us, a Likelihood is a struct taking a Kernel and an observation (which might be a vector or matrix, etc). We want to specialize this to different kinds of kernels, using a suffstats-like approach.
Using sufficient statistics, when possible, makes sense, to save having to work with a potentially large sample. So, perhaps a method could be added to loglikelihood to take sufficient statistics as an argument, instead of the whole sample.
An example of a case where there are no sufficient statistics is the independent, non-identically distributed case. For example,
using Distributions, Optim
n = 100
x = [ones(n) randn(n)]
θ = [1., 2.]
y = rand.(Exponential.(exp.(x*θ)))
ℒ = θ -> -mean(loglikelihood.(Exponential.(exp.(x*θ)), y))
optimize(ℒ, zeros(2)).minimizer
You need the full sample, y and x, to get the ML estimates, and the present form of loglikelihood allows for working with this case. So, extending the function to accept sufficient statistics seems useful, but I think that the present form is also useful.
Regarding allowing loglikelihood to drop constant parts: that could be problematic if one wanted to use information criteria to compare models. Retaining the constant parts is very low cost.
Imagine loglikelihood were defined in this way, optimized for “fixed data”, and taking parameters as a final argument. Would you be arguing to add methods that treat loglikelihood and logpdf as the same thing? To me this just adds confusion, and I don’t see any benefit.
That’s sometimes true; the canonical exception to this is Gaussian linear regression.
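A rough sketch of why (nothing Distributions-specific here; the Gaussian regression log-likelihood depends on the data only through X'X, X'y, y'y, and n):
using Distributions, LinearAlgebra
n = 1_000
X = [ones(n) randn(n)]
β = [1.0, 2.0]
y = X * β .+ randn(n)
# Sufficient statistics for Gaussian linear regression:
XtX, Xty, yty = X' * X, X' * y, dot(y, y)
# Residual sum of squares from the summaries alone: ‖y - Xb‖² = y'y - 2b'X'y + b'X'Xb
rss(b) = yty - 2 * dot(b, Xty) + dot(b, XtX * b)
loglik(b, σ) = -n * log(σ) - n * log(2π) / 2 - rss(b) / (2 * σ^2)
loglik(β, 1.0) ≈ sum(logpdf.(Normal.(X * β, 1.0), y))   # true: the raw data are no longer needed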
I think this is only true because Distributions doesn’t have the machinery you’d need for making likelihoods as expressive as they can be. Ideally, there should be a general way of specifying likelihoods (function of the parameter, with fixed data), and special cases with convenient sufficient statistics would “just work” through dispatch, with no special attention needed from the end user.
I don’t see any reason we can’t do this, but the current “likelihood is just a pdf” approach is definitely an obstacle, at least for doing this in Distributions.
I’m not really sure why both loglikelihood and logpdf are defined in Distributions.jl; I haven’t ever studied the package carefully enough to know why that was done. The last example I posted works identically if loglikelihood is replaced with logpdf. That is good, I would say, because that is what would be expected using standard definitions from theory. However, I am confused as to why both are defined in the package, as, to my mind, they are the same thing. Maybe it has to do with broadcasting…?
Some people are very careful to distinguish these, and the distinction is consistent. If these were the same thing, the decision of which to use in a given context would just be a coin flip. And again, allowing these terms to be swapped arbitrarily just leads to lots of confusion.
- The probability density is a function of the data, given the parameters. It integrates to one.
- The likelihood is a function of the parameters, for fixed data. In general, it does not integrate to one.
As you showed in your first example, it’s possible to compute either of these from the other. But that doesn’t make them conceptually or computationally identical.
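A quick numeric illustration of the integration point, as a crude sketch with Riemann sums:
using Distributions
x0, θ0 = 2.0, 3.0
xs = 0.0:0.001:100.0     # grid over the data, θ fixed
θs = 0.01:0.001:100.0    # grid over the parameter, data fixed
# Density: integrating over the data for fixed θ gives ≈ 1.
sum(pdf.(Exponential(θ0), xs)) * 0.001
# Likelihood: integrating over θ for fixed data does not give 1
# (here it even keeps growing as the θ grid is widened).
sum(pdf.(Exponential.(θs), x0)) * 0.001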