Maximum likelihood for t-distribution by gradient descent

Hello,

I am interested in the estimation of parameters of a multivariate t-distribution from samples.

A multivariate random variable X \in \mathbb{R}^n with density \pi_X is characterized by a mean vector \mu \in \mathbb{R}^n, a positive definite matrix \Sigma \in \mathbb{R}^{n \times n} and degrees of freedom \nu > 1, see Multivariate t-distribution - Wikipedia for more details. The pdf is given by \pi_X(x) = \frac{\Gamma\left[(\nu+n)/2\right]}{\Gamma(\nu/2)\,\nu^{n/2}\,\pi^{n/2}\,\left|\boldsymbol\Sigma\right|^{1/2}}\left[1+\frac{1}{\nu}(\mathbf{x}-\boldsymbol\mu)^{\top}\boldsymbol\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\right]^{-(\nu+n)/2}

I am trying to estimate the parameters (\mu, \Sigma, \nu) given M samples \{ x^1, \ldots, x^M\} by maximizing the log likelihood. Do you think that this problem can be tackled using gradient descent while enforcing that \Sigma is a PSD matrix?

As a warm-up problem, I will try to recover the sample mean and covariance of a Gaussian distribution.

Thank you for your help,

1 Like

To me this looks like a classical problem for Optimization on Manifolds, but since that is my area of research, I might be very biased here.

Can you maybe elaborate on what your cost function f(\mu,\Sigma, \nu) would look like (that of course depends on the fixed data samples)?
I do not see that from just looking at the multivariate t-distribution (I am not a statistician), but if you have it, then you could write a constrained optimisation (with constraints on \nu and \Sigma). Alternatively, you could introduce \eta = \nu - 1 > 0 (to work with a positive number) and use

using Manifolds
M = ℝ^n × SymmetricPositiveDefinite(n) × PositiveNumbers()

If you have the Euclidean gradient, the Riemannian one can be computed by an easy conversion (change_representer, see Manifolds · ManifoldsBase.jl), and the advantage is that you can do an unconstrained (!) gradient descent on the manifold M.

If you have the (classical) gradient and do not want to do the manifold stuff, then you have to look for methods where you can enforce the constraints – maybe something like an Augmented Lagrangian.

Finally – are you sure they are positive definite and not semi-definite (eigenvalue zero allowed)?

4 Likes

Thank you so much @kellertuer for your answer.

The loss function l(\mu, \Sigma, \nu) is given by the negative log-likelihood of X for the M samples \{ x^1, \ldots, x^M\}:

l(\mu, \Sigma, \nu) = - \log \pi_{X}(x^1, \ldots, x^M) = -M \log\Gamma\!\left(\tfrac{\nu + n}{2}\right) + M \log\Gamma\!\left(\tfrac{\nu}{2}\right) + \frac{M n}{2} \log(\pi \nu) + \frac{M}{2} \log \det \Sigma \\ - \frac{(\nu + n)M}{2} \log \nu + \frac{\nu + n}{2}\sum_{i=1}^M \log\left(\nu + (x^i - \mu)^\top \Sigma^{-1}(x^i - \mu)\right), where \Gamma is the Gamma function.

Note, we can minimize this loss function with respect to \Sigma^{-1} if needed, by noting that \log\det C = -\log\det \Sigma^{-1}. The gradient of l(\mu, \Sigma, \nu) can be computed analytically.
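
For reference, a rough Julia implementation of this loss could look like the following (the function name and the convention of storing the samples as columns of a matrix X are just illustrative):

using LinearAlgebra, SpecialFunctions

# Negative log-likelihood of the M samples (columns of X) under a multivariate
# t-distribution with location μ, scale matrix Σ and ν degrees of freedom.
function t_nll(X, μ, Σ, ν)
    n, M = size(X)
    C = cholesky(Symmetric(Σ))                     # Σ = L Lᵀ
    l = -M * loggamma((ν + n) / 2) + M * loggamma(ν / 2) +
        M * n / 2 * log(π * ν) +
        M * sum(log, diag(C.L)) -                  # = (M/2) log det Σ
        (ν + n) * M / 2 * log(ν)
    for i in 1:M
        d = C.L \ (X[:, i] - μ)                    # whitened residual
        l += (ν + n) / 2 * log(ν + dot(d, d))      # dot(d, d) = (xⁱ-μ)ᵀ Σ⁻¹ (xⁱ-μ)
    end
    return l
end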

Concerning your final question, \Sigma is positive definite, and not positive semi-definite.

Thanks – I am a bit confused, but just on two – ah, only one – point:

  • you switched from \Sigma to C? Besides that, the cost (that is optimiser lingo – others call it a loss :wink:) does not look nice, but is implementable.

Cool, if you have the gradient (with respect to all 3 variables) then Manopt (https://manoptjl.org) can be used (after the gradient conversion trick, but that is one line of code).

I am not saying this is the only solution, just one nice approach where you can use really classical (unconstrained) gradient descent, just on a manifold (and the SPD-ness then automatically holds, as well as the positiveness of \nu-1).

2 Likes

Thank you for spotting this, I have switched to \Sigma everywhere.

Is there any example in Manopt.jl that I can use to get started with the code? It's mostly about how to deal with the product structure of the problem.

If you have a loss (cost) defined on a manifold, this tutorial is a good place to start:

https://manoptjl.org/stable/tutorials/Optimize!/

One trick for you: with the manifold above, a point p usually consists of 3 parts:

julia> using Manifolds, Manopt
julia> n=2
2

julia> M = ℝ^n × SymmetricPositiveDefinite(n) × PositiveNumbers()
ProductManifold with 3 submanifolds:
 Euclidean(2; field = ℝ)
 SymmetricPositiveDefinite(2)
 PositiveNumbers()

julia> p = rand(M)
ProductRepr with 3 submanifold components:
 Component 1 =
  2-element Vector{Float64}:
   -0.5632918864354372
    0.7307605106092392
 Component 2 =
  2×2 Matrix{Float64}:
    1.63286   -0.368659
   -0.368659   1.58408
 Component 3 =
  0-dimensional Array{Float64, 0}:
  2.405536968972573

that you would access with e.g.

julia> μ, Σ, ν = p[M,1], p[M,2], p[M,3]

(the tutorial linked above is just on one manifold, not a product of a few). Then we only have to take care of change_representer – in short: the Euclidean gradient (the classical one you can compute analytically) and the Riemannian one are a bit different but can be converted, see Use Automatic Differentiation · Manopt.jl. The reason is that you have a different metric (a way to measure angles) on the tangent spaces (the sets of directions at every point).

But again, if you implement some ∇l that returns the three gradient parts (again for μ, Σ, ν), I can help with getting the Riemannian one.
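
Roughly, the glue code could look something like this (just a sketch I have not run; l and ∇l stand for your loss and its Euclidean gradient, and I assume change_representer dispatches component-wise through the product manifold):

using Manifolds, Manopt

# Wrap the loss for Manopt: cost functions take the manifold and a point.
cost(M, p) = l(p[M, 1], p[M, 2], p[M, 3])

# Riemannian gradient: convert the three Euclidean gradient parts
# into the metric of the product manifold M.
function grad_cost(M, p)
    gμ, gΣ, gν = ∇l(p[M, 1], p[M, 2], p[M, 3])
    return change_representer(M, EuclideanMetric(), p, ProductRepr(gμ, gΣ, gν))
end

p_opt = gradient_descent(M, cost, grad_cost, rand(M))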

1 Like

Let me work on those things, and I will reach back out to you when I get to the Riemannian aspect.

1 Like

You could look at TransformVariables, which has a transformation to construct the Cholesky factor of a positive definite correlation matrix from a flat vector:

julia> using TransformVariables

julia> tchol = corr_cholesky_factor(3)
3×3 correlation cholesky factor (dimension 3)

julia> U = transform(tchol, randn(3))
3×3 LinearAlgebra.UpperTriangular{Float64, Matrix{Float64}}:
 1.0  0.45283   0.453308
  ⋅   0.891597  0.51193
  ⋅    ⋅        0.729685

julia> Σ = U'U
3×3 Matrix{Float64}:
 1.0       0.45283   0.453308
 0.45283   1.0       0.661706
 0.453308  0.661706  1.0

2 Likes

Is there a way to replace the space of positive definite matrices by the space of Cholesky factors, i.e. lower triangular matrices with strictly positive diagonal entries?

That would make the calculation of \log\det(\Sigma^{-1}) trivial.

There are two things I am not sure about:

  1. On SPDs, given a gradient direction X, you can not just do plus (something like p_k + X); formally you need the exponential map, in practice a retraction.
  2. There is a similar problem on Cholesky factors if you want to keep every diagonal entry positive. Just imagine the 1D case: you are at p=1 and the gradient says “do a step X=-2”; then p+X would “kick you out” (see the small example after this list).
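
For illustration, the exponential map on PositiveNumbers() handles exactly that 1D situation (a quick REPL check; exp here is the manifold exponential):

julia> using Manifolds

julia> exp(PositiveNumbers(), 1.0, -2.0)  # start at p = 1, "walk" along X = -2
0.1353352832366127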

In principle both ideas are related (since it is a variable transform, bijective and such), but I do not know whether the exponential map (= walking around) on Cholesky factors has been considered somewhere; we for sure do not have that implemented.

But \log\det(\Sigma^{-1}) should not be complicated either way; within your loss implementation you can do whatever you want – feel free to use cholesky from LinearAlgebra.
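
For instance, something along these lines (a minimal sketch, assuming Σ is the current SPD iterate):

using LinearAlgebra

C = cholesky(Symmetric(Σ))
logdetΣ = 2 * sum(log, diag(C.L))  # log det Σ from the Cholesky diagonal
logdetΣinv = -logdetΣ              # log det Σ⁻¹ = -log det Σ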

By a simple Google search on “how to estimate a multivariate Student distribution”, I stumbled upon this discussion: r - Maximum likelihood of multivariate t-distributed variable with scaled covariance - Cross Validated
and then a link to this article: https://arxiv.org/ftp/arxiv/papers/1707/1707.01130.pdf

From section 3 of the article, it looks like maximum likelihood produces unbounded derivatives with respect to the degrees of freedom \nu and thus is not reliable. They advocate in section 4 for an expectation-maximization method, which is in my honest opinion the best shot at a performant estimation procedure!

Since the model is fairly simple (just a Gaussian conditionally on a gamma), maybe GitHub - dmetivie/ExpectationMaximization.jl: A simple but generic implementation of Expectation Maximization algorithms to fit mixture models. from @dmetivie could help?

5 Likes

Here’s a sketch of how I might approach the problem (not complete and not tested, may not be totally correct):

using TransformVariables
using Optim
using ForwardDiff

function mvt_logdensity(x, μ, Σ, ν)
    # implement the log-pdf from the Wikipedia equation here (one possible sketch is given below)
end

x = get_data_from_somewhere()
p = length(x)
t = as((μ = as(Array, p), U = corr_cholesky_factor(p)))
init_params = rand(dimension(t)) # or whatever initial values you want

function cost(x, t, ν, params)
    μ, U = transform(t, params)
    Σ = U'U 
    return -mvt_logdensity(x, μ, Σ, ν)
end

# iterate through some range of integer ν values and optimize the other
# parameters conditional on each one.
ν_optima = map(1:10) do ν
    f(params) = cost(x, t, ν, params)
    optimize(f, init_params, GradientDescent(), autodiff=:forward)
end

# which degree of freedom produced the best fit?
findmin(opt -> opt.minimum, ν_optima)
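
If it helps, the missing log-pdf could be filled in roughly like this (assuming x is a single observation vector; untested, so treat it as a sketch too):

using LinearAlgebra, SpecialFunctions

# Log-pdf of a single observation x under a multivariate t-distribution with
# location μ, scale matrix Σ and ν degrees of freedom.
function mvt_logdensity(x, μ, Σ, ν)
    n = length(x)
    C = cholesky(Symmetric(Σ))
    d = C.L \ (x - μ)                          # whitened residual
    return loggamma((ν + n) / 2) - loggamma(ν / 2) -
           n / 2 * log(ν * π) - sum(log, diag(C.L)) -
           (ν + n) / 2 * log1p(dot(d, d) / ν)
end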

1 Like

Cool, that paper does give a gradient.

Concerning performance – sure – my approach is just an optimisation view on the whole thing; there are for sure other methods available as well. I am not claiming my method is fast, but I like the idea and might give it a try as a nice toy problem :slight_smile:

And sure, their derivatives (= the gradient in the Euclidean case) might yield steps to an unwanted \nu < 1.

3 Likes

Hello,

Thank you both for your feedback!

I have actually implemented the regularized EM algorithm, called EMq, presented in this paper (https://arxiv.org/ftp/arxiv/papers/1707/1707.01130.pdf) by Dogru et al. As you mentioned, their regularized likelihood function has a bounded gradient with respect to \nu.

However, I experience some issues with the root-finding problem for \nu with a limited number of samples M \sim 30 in a state space of dimension n > 40. Over the iterations of the EMq algorithm, the degree of freedom decreases to values smaller than one. I am still trying to diagnose this unexpected behavior.

In the meantime, I was curious whether gradient methods could be competitive and not suffer from this issue.

1 Like

There is a trick that could allow one to drop that constraint when working with Cholesky factors.

It is called the log-Cholesky factorization.

Let \Sigma = L L^\top be a Cholesky factorization of the positive definite matrix \Sigma.

We define the log-Cholesky factor L^\prime such that L^\prime_{i,j} = L_{i,j} if i \neq j and L^\prime_{i,i} = \log(L_{i,i}). You can then compute gradients with respect to L^\prime, which lives in the unconstrained space of lower triangular matrices, and avoid the constraint of strictly positive diagonal entries.
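
A minimal sketch of the mapping back from L^\prime to \Sigma (the function name is just illustrative):

using LinearAlgebra

# Map an unconstrained lower-triangular log-Cholesky factor L′ to an SPD Σ:
# exponentiate the diagonal to recover L, then Σ = L Lᵀ.
function spd_from_logchol(L′::AbstractMatrix)
    L = LowerTriangular(L′) - Diagonal(diag(L′)) + Diagonal(exp.(diag(L′)))
    return L * L'          # note: log det Σ = 2 * sum(diag(L′))
end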

2 Likes

Well, the Euclidean gradient would suffer from the same problems – the unconstrained one, that is.
An Augmented Lagrangian would respect that but might be slower – the Riemannian gradient (on all of that) does respect it, but whether it's faster I don't know (it is my preferred method, but it is a bit more costly).

And sure, some Cholesky ideas can be used instead of working on the SPD manifold, with the small disadvantage that, again, a classical Euclidean gradient step might destroy your positive diagonal.

This idea of an exponentiated Cholesky diagonal looks smart; I never thought of that.

Ah, and I missed that yesterday – that actually does keep the diagonal positive, sure. Quite a transformation overall, though :wink:

Is it correct that the gradient on the space of SPD matrices S^{++}(\mathbb{R}^n) has an intrinsic dimension of n(n+1)/2?

Yes, it's the same as their manifold dimension, see here.
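
For example, a quick REPL check (n = 3, so 3·4/2 = 6):

julia> using Manifolds

julia> manifold_dimension(SymmetricPositiveDefinite(3))
6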

1 Like