Maximum likelihood for t-distribution by gradient descent

I started fiddling around a bit (first with the Euclidean gradients) with the formulas given in the paper above, but got a bit stuck.

  1. They suddenly take the gradient with respect to \Sigma^{-1}. Which is the better variable to optimize over, \Sigma or \Sigma^{-1}? Both are SPD anyway.
  2. In their gradient with respect to \nu they suddenly use a function \psi, but they either never define it or I missed it (or is it an obvious function in statistics?). edit: Ah, I missed it, it's the digamma function (luckily available in SpecialFunctions).

edit2: besides the question of which variable to use for optimisation, \Sigma or its inverse, I at least managed to implement the three separate gradients. Will continue fiddling around over the next few days :slight_smile:
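
For concreteness, here is roughly what the log-likelihood and the \nu-gradient look like in my sketch (standard multivariate t log-likelihood in my own notation, parameterized directly in \Sigma, so not the paper's exact formulas; treat the signs with suspicion):

```julia
using LinearAlgebra, SpecialFunctions

# log-likelihood of n observations (columns of X) under a p-variate t(μ, Σ, ν)
function t_loglik(X, μ, Σ, ν)
    p, n = size(X)
    C = cholesky(Symmetric(Σ))
    δ = [sum(abs2, C.L \ (X[:, i] - μ)) for i in 1:n]   # Mahalanobis distances
    n * (loggamma((ν + p) / 2) - loggamma(ν / 2) - p / 2 * log(ν * π) - logdet(C) / 2) -
        (ν + p) / 2 * sum(log1p.(δ ./ ν))
end

# ∂/∂ν of the log-likelihood above; ψ is the digamma function
function t_dν(X, μ, Σ, ν)
    p, n = size(X)
    C = cholesky(Symmetric(Σ))
    δ = [sum(abs2, C.L \ (X[:, i] - μ)) for i in 1:n]
    n / 2 * (digamma((ν + p) / 2) - digamma(ν / 2) - p / ν) -
        sum(log1p.(δ ./ ν)) / 2 + (ν + p) / (2ν) * sum(δ ./ (ν .+ δ))
end

X = randn(3, 200)
t_loglik(X, zeros(3), Matrix(1.0I, 3, 3), 5.0), t_dν(X, zeros(3), Matrix(1.0I, 3, 3), 5.0)
```

The \mu and \Sigma gradients follow the same pattern with the weights (\nu + p)/(\nu + \delta_i); checking everything against ForwardDiff on random data is a cheap sanity check.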

Another possibility is the Distributions.LKJ or Distributions.LKJCholesky parameterization.
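
For illustration (numbers made up): a draw from LKJCholesky is the Cholesky factor of a correlation matrix, which combined with a vector of standard deviations gives a covariance matrix.

```julia
using Distributions, LinearAlgebra

d = 3
F = rand(LKJCholesky(d, 2.0))   # Cholesky factor of a random correlation matrix
σ = [0.5, 1.0, 2.0]             # standard deviations, the other free parameters
L = Diagonal(σ) * F.L           # lower-triangular factor of the covariance
Σ = Symmetric(L * L')           # the corresponding positive definite covariance
```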

Many years ago we used the Cholesky factor with the log of the diagonal elements in the nlme package for R (Pinheiro and Bates, “Mixed-Effects Models in S and S-PLUS”, Springer) to express a positive definite covariance matrix. But in mixed-effects models it is not uncommon to converge to a singular covariance matrix, or to need to evaluate the log-likelihood at a singular covariance matrix during the iterative optimization. This causes problems when using the logarithm of the diagonal elements. In the lme4 package for R and in MixedModels.jl we switched back to the Cholesky factor itself, optimizing with box constraints using Powell’s BOBYQA algorithm as implemented in NLopt.jl.
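
A sketch of that log-Cholesky parameterization (helper name is mine): an unconstrained vector fills a lower triangle, with the diagonal entries exponentiated so the product is positive definite.

```julia
using LinearAlgebra

# θ has length p(p+1)/2; diagonal entries of the factor are exp(θ[k]) > 0
function spd_from_logchol(θ, p)
    L = zeros(eltype(θ), p, p)
    k = 1
    for j in 1:p, i in j:p          # fill the lower triangle column by column
        L[i, j] = i == j ? exp(θ[k]) : θ[k]
        k += 1
    end
    Symmetric(L * L')
end

spd_from_logchol(randn(6), 3)       # a random 3×3 positive definite matrix
```

The boundary problem mentioned above is that a singular covariance matrix corresponds to a diagonal element of the factor going to zero, i.e. its logarithm going to -∞.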

The behavior at the boundary can be an important consideration. Often the expressions for the log-likelihood of a mixed-effects model are written in terms of the precision matrix, which is the inverse of the covariance matrix, and many implementations are therefore written in terms of the precision matrix. But the precision matrix doesn’t exist if the covariance matrix is singular, which can and does happen. It can be shown that the other limiting case, a singular precision matrix, can’t be the estimate for these models.


I found it easier to read the published version of this paper: https://www.tandfonline.com/doi/full/10.1080/03610926.2018.1445861


They still take the gradient with respect to the inverse of \Sigma, and the typo in Equation (5) is still there.


Metida.jl works with maximum likelihood estimation (optimizing the log-likelihood) for MvNormal distributions, see details.

You can use gradient descent, Newton, or any other method (NelderMead). First of all, you should think about how to construct the Σ matrix in terms of variances and correlation coefficients. Second, think about how to make the optimization boundaries smooth: you can use an exp/log transformation for the variances and a sigmoid (or arctan) function for rho (the correlation coefficient, which can only lie in the range -1 to 1).
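
A minimal sketch of those transformations for a 2×2 Σ (function name is mine; tanh is used as the rescaled sigmoid):

```julia
# exp keeps the variances positive, tanh keeps ρ in (-1, 1)
function Σ_from_free(θ)             # θ = (log of var₁, log of var₂, unconstrained ρ)
    v1, v2 = exp(θ[1]), exp(θ[2])
    ρ = tanh(θ[3])
    c = ρ * sqrt(v1 * v2)
    [v1 c; c v2]
end

Σ_from_free([0.0, 0.5, -1.0])       # always a valid 2×2 covariance matrix
```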

Inverting Σ is more of a performance issue; you can use a specialized approach if your Σ matrix is diagonal or block-diagonal, and the sweep operator can also be useful.
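
For example, if Σ is block-diagonal, the quadratic form and log-determinant can be accumulated block by block without ever forming the full inverse (a sketch, assuming the blocks are available separately):

```julia
using LinearAlgebra

# accumulate r' Σ⁻¹ r and logdet(Σ) over the diagonal blocks of Σ
function quadform_and_logdet(blocks, r)
    q, ld, offset = 0.0, 0.0, 0
    for B in blocks
        k = size(B, 1)
        C = cholesky(Symmetric(B))
        q += sum(abs2, C.L \ r[offset+1:offset+k])   # this block's part of r'Σ⁻¹r
        ld += logdet(C)
        offset += k
    end
    q, ld
end

quadform_and_logdet([[2.0 0.5; 0.5 1.0], fill(3.0, 1, 1)], randn(3))
```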

I made an analytical gradient function for the log-likelihood, but it was not much faster than the gradient from ForwardDiff. The original paper can be found here.
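
For comparison, the ForwardDiff version is essentially a one-liner. A toy example with a 2×2 Gaussian log-likelihood in the transformed parameters (objective and names are only for illustration, substitute your own):

```julia
using ForwardDiff

# toy objective: negative log-likelihood of zero-mean Gaussian data (columns of X)
# in the free parameters θ of a 2×2 Σ
function negloglik(θ, X)
    v1, v2, ρ = exp(θ[1]), exp(θ[2]), tanh(θ[3])
    Σ = [v1 ρ*sqrt(v1*v2); ρ*sqrt(v1*v2) v2]
    n = size(X, 2)
    (n * log(v1 * v2 * (1 - ρ^2)) + sum(X .* (Σ \ X))) / 2   # logdet(Σ) = log(v1·v2·(1-ρ²))
end

X = randn(2, 100)
g = ForwardDiff.gradient(θ -> negloglik(θ, X), zeros(3))     # forward-mode gradient
```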

Upd:

So, your task is about the t-distribution and not everything above carries over directly, but maybe you can find some insight in these materials.
