I started fiddling around a bit (first with the Euclidean gradients) with the formulas given in the paper above, but got a bit stuck.
- They suddenly take the gradient with respect to \Sigma^{-1}. Which is the better variable to optimize over, \Sigma or \Sigma^{-1}? Both are SPD anyway.
- In their gradient with respect to \nu they suddenly use a function \psi, but they either never define it or I missed it (or is it an obvious function in statistics?). edit: Ah, I missed it, it's the digamma function (luckily available in SpecialFunctions).
edit2: Besides the question of which variable to take as the optimisation variable, \Sigma or its inverse, I at least managed to implement the three separate gradients. I will continue fiddling around further over the next few days.
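For reference, here is a minimal sketch of what those three gradients can look like under the standard multivariate t log-density (not necessarily the paper's exact parameterization; `mvt_logpdf` and `mvt_grads` are names made up for this sketch), with the \Sigma-gradient taken entry-wise and cross-checked against ForwardDiff:

```julia
using LinearAlgebra, SpecialFunctions, ForwardDiff

# Log-density of a d-dimensional multivariate t with location μ, scale matrix Σ,
# and ν degrees of freedom.
function mvt_logpdf(x, μ, Σ, ν)
    d = length(x)
    r = x .- μ
    δ = dot(r, Σ \ r)                       # squared Mahalanobis distance
    loggamma((ν + d) / 2) - loggamma(ν / 2) - d * log(ν * π) / 2 -
        logdet(Σ) / 2 - (ν + d) / 2 * log1p(δ / ν)
end

# The three gradients, with the Σ-gradient taken entry-wise (Σ treated as unconstrained).
function mvt_grads(x, μ, Σ, ν)
    d = length(x)
    Σinv = inv(Σ)
    r = x .- μ
    w = Σinv * r
    δ = dot(r, w)
    c = (ν + d) / (ν + δ)
    ∇μ = c .* w
    ∇Σ = -Σinv / 2 + c / 2 .* (w * w')
    ∇ν = (digamma((ν + d) / 2) - digamma(ν / 2)) / 2 - d / (2ν) -
         log1p(δ / ν) / 2 + c * δ / (2ν)
    return ∇μ, ∇Σ, ∇ν
end

# Cross-check against ForwardDiff on a small example
x, μ, ν = [1.0, -0.5], [0.2, 0.1], 4.0
Σ = [1.0 0.3; 0.3 2.0]
∇μ, ∇Σ, ∇ν = mvt_grads(x, μ, Σ, ν)
@assert ∇μ ≈ ForwardDiff.gradient(m -> mvt_logpdf(x, m, Σ, ν), μ)
@assert ∇Σ ≈ ForwardDiff.gradient(S -> mvt_logpdf(x, μ, S, ν), Σ)
@assert ∇ν ≈ ForwardDiff.derivative(n -> mvt_logpdf(x, μ, Σ, n), ν)
```

If \Sigma^{-1} is preferred as the optimization variable, the entry-wise gradients are related by \nabla_{\Sigma^{-1}}\ell = -\Sigma\,(\nabla_\Sigma \ell)\,\Sigma, which for the sketch above works out to \Sigma/2 - \frac{\nu+d}{2(\nu+\delta)} r r^\top.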
Another possibility is the Distributions.LKJ or Distributions.LKJCholesky parameterization.
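As a quick sketch of how that can be used: LKJCholesky(d, η) is a distribution over Cholesky factors of d×d correlation matrices, and a draw can be combined with a vector of standard deviations to build a covariance matrix (the σ values below are just made-up numbers):

```julia
using Distributions, LinearAlgebra

d = 3
F = rand(LKJCholesky(d, 1.0))   # Cholesky factor of a random d×d correlation matrix
σ = [0.5, 1.0, 2.0]             # example standard deviations
L = Diagonal(σ) * F.L           # lower-triangular factor of the covariance
Σ = L * L'                      # Σ = Diag(σ) * C * Diag(σ), with C the correlation matrix
isposdef(Σ)                     # true
```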
Many years ago we used the Cholesky with the log of the diagonal elements in the nlme package for R (Pinheiro and Bates, "Mixed-Effects Models in S and S-PLUS", Springer) to express a positive definite covariance matrix. But in mixed-effects models it is not uncommon to converge to a singular covariance matrix, or to need to evaluate the log-likelihood at a singular covariance matrix during the iterative optimization. This causes problems when using the logarithm of the diagonal elements. In the lme4 package for R and in MixedModels.jl we switched back to the Cholesky factor itself, optimizing with box constraints using Powell's BOBYQA algorithm as implemented in NLopt.jl.
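Not the actual nlme/MixedModels code, just a minimal sketch of the log-Cholesky idea described above (the θ layout and function names are made up here): store the lower triangle of L with the diagonal on the log scale, so any unconstrained θ maps to an SPD Σ = LL'.

```julia
using LinearAlgebra

# θ packs the lower triangle of L column by column, with the diagonal entries
# stored on the log scale, so Σ = L*L' is SPD for any θ ∈ ℝ^{d(d+1)/2}.
function theta_to_sigma(θ, d)
    L = zeros(eltype(θ), d, d)
    k = 1
    for j in 1:d, i in j:d
        L[i, j] = i == j ? exp(θ[k]) : θ[k]
        k += 1
    end
    L * L'
end

# Inverse map: from an SPD Σ back to the unconstrained vector θ.
function sigma_to_theta(Σ)
    d = size(Σ, 1)
    L = cholesky(Symmetric(Σ)).L
    θ = Vector{eltype(Σ)}(undef, d * (d + 1) ÷ 2)
    k = 1
    for j in 1:d, i in j:d
        θ[k] = i == j ? log(L[i, j]) : L[i, j]
        k += 1
    end
    θ
end

Σ = [2.0 0.5; 0.5 1.0]
theta_to_sigma(sigma_to_theta(Σ), 2) ≈ Σ   # round-trips
```

The log of the diagonal is exactly what breaks down at a singular covariance matrix (a zero diagonal entry of L has no finite logarithm), which is why lme4 and MixedModels.jl optimize the Cholesky entries directly with box constraints instead.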
The behavior at the boundary can be an important consideration. Often the expressions of the log-likelihood for a mixed-effects model are written in terms of the precision matrix, which is the inverse of the covariance matrix, and many implementations are therefore written in terms of the precision matrix. But the precision doesn't exist if the covariance matrix is singular, as can and does happen. It can be shown that the other limiting case, a singular precision matrix, can't be the estimate for these models.
They still take the gradient with respect to the inverse of \Sigma, and the typo in Equation (5) is still there.
Metida.jl works by optimizing the log-likelihood (maximum likelihood estimation) for multivariate normal distributions; see the details.
You can use gradient descent, Newton, or any other method (e.g. Nelder-Mead). First of all, you should think about how to construct the Σ matrix in terms of the variances and correlation coefficients. Second, think about how to make the optimization bounds smooth: you can use an exp/log transformation for the variances and a sigmoid (or arctan) function for ρ (the correlation coefficient, which can only lie in the range -1 to 1).
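A minimal sketch of that idea for a 2×2 Σ (`build_sigma` is a made-up helper name, not part of Metida.jl); tanh is used here as the sigmoid-like map onto (-1, 1):

```julia
# Map an unconstrained θ = (log σ₁, log σ₂, z) to a valid 2×2 covariance matrix:
# exp keeps the standard deviations positive, tanh keeps ρ in (-1, 1),
# so any optimizer can work on θ without explicit bounds.
function build_sigma(θ)
    σ1, σ2 = exp(θ[1]), exp(θ[2])
    ρ = tanh(θ[3])
    [σ1^2       ρ*σ1*σ2;
     ρ*σ1*σ2    σ2^2]
end

build_sigma([0.0, 0.5, 0.3])   # a positive-definite 2×2 Σ
```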
Inverting Σ is more of a performance issue; you can write your own approach if your Σ matrix is diagonal or block-diagonal, and the sweep operator can also be useful.
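For the block-diagonal case, a small sketch of the idea (a hypothetical helper, not Metida.jl's actual implementation): invert and accumulate the log-determinant block by block from per-block Cholesky factorizations instead of factorizing the full Σ.

```julia
using LinearAlgebra

# For block-diagonal Σ, invert and take logdet block by block.
function blockwise_inv_logdet(blocks)
    chols = [cholesky(Symmetric(Matrix(B))) for B in blocks]
    invs  = [inv(C) for C in chols]         # inverse of each diagonal block
    ld    = sum(logdet(C) for C in chols)   # logdet(Σ) is the sum of the block logdets
    return invs, ld
end

blocks = [[2.0 0.3; 0.3 1.0], fill(4.0, 1, 1)]   # Σ = blockdiag(2×2, 1×1)
invs, ld = blockwise_inv_logdet(blocks)
```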
I wrote an analytical gradient function for the log-likelihood, but it was not much faster than the gradient computed with ForwardDiff. The original paper can be found here.
Upd:
So, your task is about the t-distribution, and all of the above can't be applied directly, but maybe you can find some insight in these materials.