Maximum likelihood for t-distribution by gradient descent

I started fiddling around a bit (first with the Euclidean gradients) with the formulas given in the paper above, but got a bit stuck.

  1. They suddenly take the gradient with respect to \Sigma^{-1}. Which is the better variable to optimize over, \Sigma or \Sigma^{-1}? Both are SPD anyway.
  2. In their gradient with respect to \nu they suddenly use a function \psi, but they either never define it or I missed it (or is it an obvious function in statistics?). edit: Ah, I missed it, it's the digamma function (luckily available in SpecialFunctions).

edit2: besides the question of which variable to use for optimisation, \Sigma or its inverse, I at least managed to implement the three separate gradients. Will continue fiddling around over the next few days :slight_smile:
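
For concreteness, here is roughly what the log-likelihood and the \nu-gradient look like in my sketch (standard multivariate t log-likelihood in my own notation, parameterized directly in \Sigma, so not the paper's exact formulas; treat the signs with suspicion):

```julia
using LinearAlgebra, SpecialFunctions

# log-likelihood of n observations (columns of X) under a p-variate t(μ, Σ, ν)
function t_loglik(X, μ, Σ, ν)
    p, n = size(X)
    C = cholesky(Symmetric(Σ))
    δ = [sum(abs2, C.L \ (X[:, i] - μ)) for i in 1:n]   # Mahalanobis distances
    n * (loggamma((ν + p) / 2) - loggamma(ν / 2) - p / 2 * log(ν * π) - logdet(C) / 2) -
        (ν + p) / 2 * sum(log1p.(δ ./ ν))
end

# ∂/∂ν of the log-likelihood above; ψ is the digamma function
function t_dν(X, μ, Σ, ν)
    p, n = size(X)
    C = cholesky(Symmetric(Σ))
    δ = [sum(abs2, C.L \ (X[:, i] - μ)) for i in 1:n]
    n / 2 * (digamma((ν + p) / 2) - digamma(ν / 2) - p / ν) -
        sum(log1p.(δ ./ ν)) / 2 + (ν + p) / (2ν) * sum(δ ./ (ν .+ δ))
end

X = randn(3, 200)
t_loglik(X, zeros(3), Matrix(1.0I, 3, 3), 5.0), t_dν(X, zeros(3), Matrix(1.0I, 3, 3), 5.0)
```

The \mu and \Sigma gradients follow the same pattern with the weights (\nu + p)/(\nu + \delta_i); checking everything against ForwardDiff on random data is a cheap sanity check.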

Another possibility is the Distributions.LKJ or Distributions.LKJCholesky parameterization.
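
For illustration (numbers made up): a draw from LKJCholesky is the Cholesky factor of a correlation matrix, which combined with a vector of standard deviations gives a covariance matrix.

```julia
using Distributions, LinearAlgebra

d = 3
F = rand(LKJCholesky(d, 2.0))   # Cholesky factor of a random correlation matrix
σ = [0.5, 1.0, 2.0]             # standard deviations, the other free parameters
L = Diagonal(σ) * F.L           # lower-triangular factor of the covariance
Σ = Symmetric(L * L')           # the corresponding positive definite covariance
```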

Many years ago we used the Cholesky factor with the log of the diagonal elements in the nlme package for R (Pinheiro and Bates, “Mixed-Effects Models in S and S-PLUS”, Springer) to express a positive definite covariance matrix. But in mixed-effects models it is not uncommon to converge to a singular covariance matrix, or to need to evaluate the log-likelihood at a singular covariance matrix during the iterative optimization. This causes problems when using the logarithm of the diagonal elements. In the lme4 package for R and in MixedModels.jl we switched back to the Cholesky factor itself, optimizing with box constraints using Powell’s BOBYQA algorithm as implemented in NLopt.jl.
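
A sketch of that log-Cholesky parameterization (helper name is mine): an unconstrained vector fills a lower triangle, with the diagonal entries exponentiated so the product is positive definite.

```julia
using LinearAlgebra

# θ has length p(p+1)/2; diagonal entries of the factor are exp(θ[k]) > 0
function spd_from_logchol(θ, p)
    L = zeros(eltype(θ), p, p)
    k = 1
    for j in 1:p, i in j:p          # fill the lower triangle column by column
        L[i, j] = i == j ? exp(θ[k]) : θ[k]
        k += 1
    end
    Symmetric(L * L')
end

spd_from_logchol(randn(6), 3)       # a random 3×3 positive definite matrix
```

The boundary problem mentioned above is that a singular covariance matrix corresponds to a diagonal element of the factor going to zero, i.e. its logarithm going to -∞.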

The behavior at the boundary can be an important consideration. Often the expressions for the log-likelihood of a mixed-effects model are written in terms of the precision matrix, which is the inverse of the covariance matrix, and many implementations are therefore written in terms of the precision matrix. But the precision matrix doesn’t exist if the covariance matrix is singular, which can and does happen. It can be shown that the other limiting case, a singular precision matrix, can’t be the estimate for these models.


I found it easier to read the published version of this paper: https://www.tandfonline.com/doi/full/10.1080/03610926.2018.1445861


They still take the gradient with respect to the inverse of \Sigma, and the typo in Equation (5) is still there.


Metida.jl works with maximum likelihood estimation (optimizing the log-likelihood) for MvNormal distributions, see details.

You can use gradient descent, Newton, or any other method (NelderMead). First of all, you should think about how to construct the Σ matrix in terms of variances and correlation coefficients. Second, think about how to make the optimization boundaries smooth: you can use an exp/log transformation for the variances and a sigmoid (or arctan) function for rho (the correlation coefficient, which can only lie in the range -1 to 1).
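
A minimal sketch of those transformations for a 2×2 Σ (function name is mine; tanh is used as the rescaled sigmoid):

```julia
# exp keeps the variances positive, tanh keeps ρ in (-1, 1)
function Σ_from_free(θ)             # θ = (log of var₁, log of var₂, unconstrained ρ)
    v1, v2 = exp(θ[1]), exp(θ[2])
    ρ = tanh(θ[3])
    c = ρ * sqrt(v1 * v2)
    [v1 c; c v2]
end

Σ_from_free([0.0, 0.5, -1.0])       # always a valid 2×2 covariance matrix
```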

Inverting Σ is more of a performance issue; you can use a specialized approach if your Σ matrix is diagonal or block-diagonal, and the sweep operator can also be useful.
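
For example, if Σ is block-diagonal, the quadratic form and log-determinant can be accumulated block by block without ever forming the full inverse (a sketch, assuming the blocks are available separately):

```julia
using LinearAlgebra

# accumulate r' Σ⁻¹ r and logdet(Σ) over the diagonal blocks of Σ
function quadform_and_logdet(blocks, r)
    q, ld, offset = 0.0, 0.0, 0
    for B in blocks
        k = size(B, 1)
        C = cholesky(Symmetric(B))
        q += sum(abs2, C.L \ r[offset+1:offset+k])   # this block's part of r'Σ⁻¹r
        ld += logdet(C)
        offset += k
    end
    q, ld
end

quadform_and_logdet([[2.0 0.5; 0.5 1.0], fill(3.0, 1, 1)], randn(3))
```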

I made an analytical gradient function for the log-likelihood, but it was not much faster than the gradient from ForwardDiff. The original paper can be found here.
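
For comparison, the ForwardDiff version is essentially a one-liner. A toy example with a 2×2 Gaussian log-likelihood in the transformed parameters (objective and names are only for illustration, substitute your own):

```julia
using ForwardDiff

# toy objective: negative log-likelihood of zero-mean Gaussian data (columns of X)
# in the free parameters θ of a 2×2 Σ
function negloglik(θ, X)
    v1, v2, ρ = exp(θ[1]), exp(θ[2]), tanh(θ[3])
    Σ = [v1 ρ*sqrt(v1*v2); ρ*sqrt(v1*v2) v2]
    n = size(X, 2)
    (n * log(v1 * v2 * (1 - ρ^2)) + sum(X .* (Σ \ X))) / 2   # logdet(Σ) = log(v1·v2·(1-ρ²))
end

X = randn(2, 100)
g = ForwardDiff.gradient(θ -> negloglik(θ, X), zeros(3))     # forward-mode gradient
```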

Upd:

So, your task is about the t-distribution and not everything above carries over directly, but maybe you can find some insight in these materials.
