I am interested in the estimation of parameters of a multivariate t-distribution from samples.

A multivariate random variable X \in \mathbb{R}^n with density \pi_X is characterized by a mean vector \mu \in \mathbb{R}^n, a positive definite matrix \Sigma \in \mathbb{R}^{n \times n}, and a degrees-of-freedom parameter \nu > 1; see the Wikipedia article on the multivariate t-distribution for more details. The pdf is given by \pi_X(x) = \frac{\Gamma\left[(\nu+n)/2\right]}{\Gamma(\nu/2)\,\nu^{n/2}\pi^{n/2}\left|{\boldsymbol\Sigma}\right|^{1/2}}\left[1+\frac{1}{\nu}({\mathbf x}-{\boldsymbol\mu})^{\rm T}{\boldsymbol\Sigma}^{-1}({\mathbf x}-{\boldsymbol\mu})\right]^{-(\nu+n)/2}

I am trying to estimate the parameters (\mu, \Sigma, \nu) given M samples \{ x^1, \ldots, x^M\} by maximizing the log-likelihood. Do you think that this problem can be tackled with gradient descent while enforcing that \Sigma is a positive definite matrix?

As a warm-up problem, I will try to recover the sample mean and covariance of a Gaussian distribution.
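For that warm-up, the Gaussian maximum-likelihood estimates are closed-form, so they give a ground truth to check any gradient-based estimator against. A minimal sketch (the true mean and the sample size here are made up):

```julia
using LinearAlgebra, Random, Statistics

Random.seed!(42)

# draw M samples from N(μ_true, I) and compute the Gaussian MLEs
μ_true = [1.0, 2.0, 3.0]
xs = [randn(3) .+ μ_true for _ in 1:10_000]

μ̂ = mean(xs)
# MLE of the covariance uses 1/M (not the unbiased 1/(M-1))
Σ̂ = sum(x -> (x - μ̂) * (x - μ̂)', xs) / length(xs)
```

A gradient-based minimizer of the Gaussian negative log-likelihood should converge to these same values.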

For me this looks like a classical problem for Optimization on Manifolds, but since that is my area of research, I might be very biased here.

Can you maybe elaborate on what your cost function f(\mu,\Sigma, \nu) would look like (it of course depends on the fixed data samples)?
I cannot read that off from the multivariate t-distribution alone (I am not a statistician), but if you have it, then you could write a constrained optimisation problem (with constraints on \nu and \Sigma), or you could introduce \eta = \nu - 1 > 0 (to work with a positive number) and use

using Manifolds
M = ℝ^n × SymmetricPositiveDefinite(n) × PositiveNumbers()

If you have the Euclidean gradient, the Riemannian one can be computed by an easy conversion (`change_representer`, see ManifoldsBase.jl), and the advantage is that you can do an unconstrained (!) gradient descent on the manifold M.

If you have the (classical) gradient and do not want to do manifold stuff, then you have to look for methods where you can enforce the constraints – maybe something like an Augmented Lagrangian.
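A third, fully Euclidean option for the \nu > 1 constraint alone is a bijective reparametrization, so that plain gradient descent never leaves the feasible set. A minimal sketch (my own function names):

```julia
# optimize over an unconstrained ξ ∈ ℝ and map to ν = 1 + exp(ξ) > 1
ν_of(ξ) = 1 + exp(ξ)
ξ_of(ν) = log(ν - 1)   # inverse, defined for ν > 1

# any real ξ yields a feasible ν, e.g. a very negative step is harmless:
ν = ν_of(-10.0)        # slightly above 1, never at or below it
```

The same idea applied to \Sigma leads to the Cholesky-based transforms discussed further down in the thread.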

Finally – are you sure they are positive definite and not semi-definite (Eigenvalue zero allowed)?

The loss function l(\mu, \Sigma, \nu) is given by the negative log-likelihood of X for the M samples \{ x^1, \ldots, x^M\}:

l(\mu, \Sigma, \nu) = - \log \pi_{X}(x^1, \ldots, x^M) = -M \log \Gamma\left(\frac{\nu + n}{2}\right) + M \log \Gamma\left(\frac{\nu}{2}\right) + \frac{M}{2} \log \det \Sigma + \frac{M n}{2} \log(\pi \nu) +\\
\frac{\nu + n}{2} \left[\sum_{i=1}^M \log\left(\nu + (x^i - \mu)^\top \Sigma^{-1}(x^i - \mu)\right) - M \log \nu\right], where \Gamma is the Gamma function.
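The data-dependent terms of this loss are cheap to evaluate with one Cholesky factorization per candidate \Sigma. A minimal sketch of just those terms (the \Gamma-terms and the M \log \nu term depend only on (\nu, n, M) and would need e.g. SpecialFunctions.loggamma, which I leave out here; the function name is my own):

```julia
using LinearAlgebra

# data- and Σ-dependent part of the negative log-likelihood:
# (M/2) logdet Σ + ((ν+n)/2) Σᵢ log(ν + δᵢ),  δᵢ = (xⁱ-μ)' Σ⁻¹ (xⁱ-μ)
function nll_core(xs::Vector{<:AbstractVector}, μ, Σ, ν)
    n, M = length(μ), length(xs)
    F = cholesky(Symmetric(Σ))        # Σ = L L'
    δ(x) = sum(abs2, F.L \ (x - μ))   # squared Mahalanobis distance
    return (M / 2) * logdet(F) + ((ν + n) / 2) * sum(x -> log(ν + δ(x)), xs)
end
```

Solving the triangular system `F.L \ (x - μ)` avoids forming \Sigma^{-1} explicitly.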

Note, we can minimize this loss function with respect to C = \Sigma^{-1} instead, if needed, by noting that \log\det \Sigma = -\log\det C. The gradient of l(\mu, \Sigma, \nu) can be computed analytically.

Concerning your final question, \Sigma is positive definite, and not positive semi-definite.

Thanks – I am a bit confused, but just on two – ah, only one – points:

you switched from \Sigma to C? Besides that, the cost (as optimisers call it; others say loss) does not look nice, but it is implementable.

Cool, if you have the gradient (with respect to all 3 variables) then Manopt (https://manoptjl.org) can be used (after the gradient conversion trick, but that is one line of code).

I am not saying this is the only solution, just one nice approach where you can use really classical gradient descent (unconstrained), just on a manifold (and the SPD-ness always holds automatically, as does the positivity of \nu - 1).

One hint for you: with the manifold above you usually have a point p consisting of 3 parts:

julia> using Manifolds, Manopt
julia> n=2
2
julia> M = ℝ^n × SymmetricPositiveDefinite(n) × PositiveNumbers()
ProductManifold with 3 submanifolds:
Euclidean(2; field = ℝ)
SymmetricPositiveDefinite(2)
PositiveNumbers()
julia> p = rand(M)
ProductRepr with 3 submanifold components:
Component 1 =
2-element Vector{Float64}:
-0.5632918864354372
0.7307605106092392
Component 2 =
2×2 Matrix{Float64}:
1.63286 -0.368659
-0.368659 1.58408
Component 3 =
0-dimensional Array{Float64, 0}:
2.405536968972573

that you would access with e.g.

julia> μ, Σ, ν = p[M,1], p[M,2], p[M,3]

(the example in the linked docs is just on one manifold, not a product of a few). Then we only have to check `change_representer` – in short: the Euclidean gradient (the classical one you can compute analytically) and the Riemannian one are a bit different, but they can be converted, see the tutorial Use Automatic Differentiation · Manopt.jl. In short: you have a different metric (way to measure angles) on the tangent spaces (the sets of directions at every point).
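To make the "different metric" concrete on the SPD part: if I am not mistaken, the default metric there is the affine-invariant one, ⟨X,Y⟩_p = tr(p⁻¹ X p⁻¹ Y), under which the conversion turns a Euclidean gradient G into p·sym(G)·p. A hand-rolled sketch of that conversion (assumption on my side; `change_representer` would do this for you):

```julia
using LinearAlgebra

# Euclidean → Riemannian gradient on SPD matrices, assuming the
# affine-invariant metric ⟨X,Y⟩_p = tr(p⁻¹ X p⁻¹ Y):
egrad_to_rgrad(p, G) = p * ((G + G') / 2) * p

p = [2.0 0.5; 0.5 1.0]    # an SPD point
G = [1.0 0.2; 0.0 3.0]    # some Euclidean gradient
X = egrad_to_rgrad(p, G)  # symmetric, i.e. tangent at p
```

The result is symmetric by construction, and unwinding the metric gives back the symmetrized Euclidean gradient.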

But again, if you implement some ∇l that returns three gradient parts (again μ, Σ, ν) I can help getting the Riemannian one.

You could look at TransformVariables, which has a transformation to construct the Cholesky factor of a positive definite correlation matrix from a flat vector:

Is there a way to replace the space of positive definite matrices by the space of Cholesky factors, i.e. lower triangular matrices with strictly positive entries?

That will make the calculation of \log\det(\Sigma^{-1}) trivial.

on SPDs, given a gradient direction, you cannot just add (something like p_k + X); formally you need the exponential map, in practice a retraction

There is a similar problem on Cholesky factors if you want to keep every diagonal entry positive. Just imagine the 1D case: you are at p = 1 and the gradient says “do a step X = -2” – just p + X would “kick you out”.
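That 1D picture can be made concrete: assuming the exponential map on PositiveNumbers is exp_p(X) = p·exp(X/p) (which it is for the usual metric ⟨X,Y⟩_p = XY/p² on the positive reals), the step that would kick you out stays safely positive:

```julia
# exponential map on the positive reals with the metric ⟨X,Y⟩ₚ = XY/p²
exp_pos(p, X) = p * exp(X / p)

p, X = 1.0, -2.0
p + X              # = -1.0, leaves the positive numbers
q = exp_pos(p, X)  # = exp(-2) ≈ 0.135, still positive
```

However large the negative step, the exponential map only ever approaches 0 asymptotically.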

In principle both ideas are related (since it is a variable transform, bijective and such), but I do not know whether the exponential map (= walking around) on Cholesky factors has been considered somewhere; we for sure do not have that implemented.

But \log\det(\Sigma^{-1}) should otherwise not be complicated either; within your loss implementation, do whatever you want – feel free to use cholesky from LinearAlgebra.
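For instance, with cholesky from LinearAlgebra that term reduces to a sum of logs of the diagonal of the factor (the example matrix is made up):

```julia
using LinearAlgebra

# for Σ = L L':  logdet(Σ⁻¹) = -logdet(Σ) = -2 Σᵢ log Lᵢᵢ
Σ = [4.0 1.0; 1.0 3.0]
L = cholesky(Symmetric(Σ)).L
logdet_inv = -2 * sum(log, diag(L))
```

No explicit inverse or determinant is ever formed.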

From section 3 of the article, it looks like maximum likelihood produces unbounded derivatives with respect to the degrees of freedom \nu and thus is not reliable. They advocate in section 4 for an expectation-maximization method, which is in my honest opinion the best shot at a performant estimation procedure!

Here’s a sketch of how I might approach the problem (not complete and not tested, may not be totally correct):

using TransformVariables
using Optim
using ForwardDiff

function mvt_logdensity(x, μ, Σ, ν)
    # implement log-pdf from Wikipedia equation here
end

x = get_data_from_somewhere()
p = length(x)
# note: corr_cholesky_factor yields the factor of a *correlation* matrix;
# for a general covariance you would add a positive scale vector.
t = as((μ = as(Array, p), U = corr_cholesky_factor(p)))
init_params = rand(dimension(t)) # or whatever initial values you want

function cost(x, t, ν, params)
    μ, U = transform(t, params)
    Σ = U'U
    return -mvt_logdensity(x, μ, Σ, ν)
end

# iterate through some range of integer ν values and optimize the other
# parameters conditional on each one.
ν_optima = map(1:10) do ν
    f(params) = cost(x, t, ν, params)
    optimize(f, init_params, GradientDescent(), autodiff = :forward)
end

# which degree of freedom produced the best fit?
findmin(opt -> opt.minimum, ν_optima)

Concerning performance – sure, my approach is just an optimisation view on the whole thing; there are surely other methods available as well. I am not claiming my method is fast, but I like the idea and might give it a try as a nice toy problem.

Since, sure, their derivatives (= the gradient in the Euclidean case) might yield steps to unwanted \nu < 1.

I have actually implemented the regularized EM algorithm, called EMq, presented in this paper (https://arxiv.org/ftp/arxiv/papers/1707/1707.01130.pdf) by Dogru et al. As you mentioned, their regularized likelihood function has a bounded gradient with respect to \nu.

However, I experience some issues with the root-finding problem for \nu with limited samples (M \sim 30) in a state space of dimension n > 40. Over the iterations of the EMq algorithm, the degrees of freedom decrease to values smaller than one. I am still trying to diagnose this unexpected behavior.

In the meantime, I was curious if gradient methods can be competitive methods that don’t suffer from this issue.

There is a trick that could allow one to loosen the constraint of working on Cholesky factors.

It is called the log-Cholesky factorization.

Let \Sigma = L L^\top be a Cholesky factorization of the positive definite matrix \Sigma.

We define the log-Cholesky factor L^\prime such that L^\prime_{i,j} = L_{i,j} if i \neq j and L^\prime_{i,i} = \log(L_{i,i}). Then you can compute gradients with respect to L^\prime, which belongs to the unconstrained space of lower triangular matrices, and avoid the constraint of strictly positive diagonal entries.
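A minimal sketch of the inverse map (function name is my own): exponentiating the diagonal turns an arbitrary lower-triangular matrix into a valid Cholesky factor, so the resulting \Sigma is positive definite for any real entries of L^\prime.

```julia
using LinearAlgebra

# map an unconstrained lower-triangular L′ to an SPD matrix:
# keep the off-diagonal entries, exponentiate the diagonal, form L L'
function spd_from_logchol(L′::LowerTriangular)
    L = copy(L′)
    for i in axes(L, 1)
        L[i, i] = exp(L′[i, i])
    end
    return L * L'
end

L′ = LowerTriangular([-0.5 0.0; 0.3 0.0])  # diagonal entries may be ≤ 0
Σ = spd_from_logchol(L′)                    # positive definite by construction
```

Since the map is bijective onto the SPD matrices, the chain rule transports gradients from \Sigma to the free entries of L^\prime.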

Well, the Euclidean gradient would suffer from the same problems – the unconstrained one, that is.
An Augmented Lagrangian would respect the constraint but might be slower – the Riemannian gradient (on all of that) does respect it, but whether it's faster I don't know (it is my preferred method, but it is a bit more costly).

And sure, some Cholesky ideas can be used instead of working on the SPD manifold, with the small disadvantage that, again, a classical Euclidean gradient might destroy your positive diagonal.