Maximum likelihood for t-distribution by gradient descent

Hello,

I am interested in the estimation of parameters of a multivariate t-distribution from samples.

A multivariate random variable X \in \mathbb{R}^n with density \pi_X is characterized by a mean vector \mu \in \mathbb{R}^n, a positive definite matrix \Sigma \in \mathbb{R}^{n \times n} and a degrees-of-freedom parameter \nu > 1, see Multivariate t-distribution - Wikipedia for more details. The pdf is given by \pi_X(x) = \frac{\Gamma\left[(\nu+n)/2\right]}{\Gamma(\nu/2)\,\nu^{n/2}\pi^{n/2}\left|{\boldsymbol\Sigma}\right|^{1/2}}\left[1+\frac{1}{\nu}({\mathbf x}-{\boldsymbol\mu})^{\rm T}{\boldsymbol\Sigma}^{-1}({\mathbf x}-{\boldsymbol\mu})\right]^{-(\nu+n)/2}

I am trying to estimate the parameters (\mu, \Sigma, \nu) given M samples \{ x^1, \ldots, x^M\} by maximizing the log-likelihood. Do you think this problem can be tackled using gradient descent while enforcing that \Sigma is a PSD matrix?

As a warm-up problem, I will try to recover the sample mean and covariance of a Gaussian distribution.

Thank you for your help,

1 Like

To me this looks like a classical problem for optimization on manifolds, but since that is my area of research, I might be very biased here.

Can you maybe elaborate on what your cost function f(\mu,\Sigma, \nu) would look like (it of course depends on the fixed data samples)?
I do not see that from the multivariate t-distribution alone (I am not a statistician), but if you have it, you could write a constrained optimisation (with constraints on \nu and \Sigma). Alternatively, you could introduce \eta = \nu - 1 > 0 (to work with a positive number) and use

using Manifolds
M = ℝ^n × SymmetricPositiveDefinite(n) × PositiveNumbers()

If you have the Euclidean gradient, the Riemannian one can be computed by an easy conversion (change_representer, see Manifolds · ManifoldsBase.jl), and the advantage is that you can do an unconstrained (!) gradient descent on the manifold M.

If you have the (classical) gradient and do not want to do the manifold-stuff, then you have to look for methods where you can enforce the constraints – maybe something like an Augmented Lagrangian.

Finally – are you sure they are positive definite and not semi-definite (Eigenvalue zero allowed)?

4 Likes

Thank you so much @kellertuer for your answer.

The loss function l(\mu, \Sigma, \nu) is given by the negative log-likelihood of X for the M samples \{ x^1, \ldots, x^M\}:

l(\mu, \Sigma, \nu) = - \log \pi_{X}(x^1, \ldots, x^M) = -M \log\Gamma\left(\frac{\nu + n}{2}\right) + M \log\Gamma\left(\frac{\nu}{2}\right) + \frac{M}{2} \log \det \Sigma + \frac{M n}{2} \log(\pi \nu) - \frac{\nu + n}{2} M \log \nu + \frac{\nu + n}{2}\sum_{i=1}^M \log\left(\nu + (x^i - \mu)^\top \Sigma^{-1}(x^i - \mu)\right), where \Gamma is the Gamma function.

Note, we can minimize this loss function with respect to \Sigma^{-1} if needed, by noting that \log\det C = -\log\det \Sigma^{-1}. The gradient of l(\mu, \Sigma, \nu) can be computed analytically.

Concerning your final question, \Sigma is positive definite, and not positive semi-definite.
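
If it helps later, here is a minimal, untested sketch of this loss in Julia; the name neg_loglik and the storage of the samples as a vector of vectors are just choices made here:

using LinearAlgebra
using SpecialFunctions: loggamma

# Negative log-likelihood l(μ, Σ, ν) for samples xs = [x¹, …, xᴹ],
# written exactly as the sum above.
function neg_loglik(μ, Σ, ν, xs)
    n, M = length(μ), length(xs)
    quad(x) = dot(x - μ, Σ \ (x - μ))   # (x - μ)ᵀ Σ⁻¹ (x - μ)
    return -M * loggamma((ν + n) / 2) + M * loggamma(ν / 2) +
           M * n / 2 * log(π * ν) + M / 2 * logdet(Σ) -
           M * (ν + n) / 2 * log(ν) +
           (ν + n) / 2 * sum(log(ν + quad(x)) for x in xs)
end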

Thanks – I am a bit confused, but just on two – ah, only one – point:

  • you switched from \Sigma to C? Besides that, the cost (what we call it in optimisation, also called loss by others :wink: ) does not look nice, but it is implementable.

Cool, if you have the gradient (with respect to all 3 variables) then Manopt (https://manoptjl.org) can be used (after the gradient conversion trick, but that is one line of code).

I am not saying this is the only solution, just a nice one where you can use really classical, unconstrained gradient descent, just on a manifold (and the SPD-ness always holds automatically, as well as the positivity of \nu-1).

2 Likes

Thank you for spotting this; I have switched to \Sigma everywhere.

Is there any example in Manopt.jl that I can use to get started with the code? It’s mostly about how to deal with the product structure of the problem.

If you have a loss (cost) defined on a manifold, a good starting point is

https://manoptjl.org/stable/tutorials/Optimize!/

A tip for you: with the manifold above, you usually have a point p consisting of 3 parts:

julia> using Manifolds, Manopt
julia> n=2
2

julia> M = ℝ^n × SymmetricPositiveDefinite(n) × PositiveNumbers()
ProductManifold with 3 submanifolds:
 Euclidean(2; field = ℝ)
 SymmetricPositiveDefinite(2)
 PositiveNumbers()

julia> p = rand(M)
ProductRepr with 3 submanifold components:
 Component 1 =
  2-element Vector{Float64}:
   -0.5632918864354372
    0.7307605106092392
 Component 2 =
  2×2 Matrix{Float64}:
    1.63286   -0.368659
   -0.368659   1.58408
 Component 3 =
  0-dimensional Array{Float64, 0}:
  2.405536968972573

that you would access with e.g.

julia> μ, Σ, ν = p[M,1], p[M,2], p[M,3]

(the example in the tutorial linked above is just on one manifold, not a product of a few). Then we only have to take care of change_representer – in short: the Euclidean gradient (the classical one you can compute analytically) and the Riemannian one are a bit different, but can be converted, see Use Automatic Differentiation · Manopt.jl. The reason is that you have a different metric (way to measure angles) on the tangent spaces (the sets of directions at every point).

But again, if you implement some ∇l that returns the three gradient parts (again μ, Σ, ν), I can help with getting the Riemannian one.
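
For concreteness, a rough and untested sketch of how the pieces could fit together once such a ∇l exists; negloglik and egrad_negloglik are placeholders (not existing code), and I have not checked that change_representer dispatches directly on the product manifold in the installed version:

using Manifolds, Manopt

n = 2
M = ℝ^n × SymmetricPositiveDefinite(n) × PositiveNumbers()

# Cost in Manopt's (M, p) signature; the PositiveNumbers component stores η = ν - 1.
f(M, p) = negloglik(p[M, 1], p[M, 2], p[M, 3][] + 1)

# Euclidean gradient (placeholder egrad_negloglik returning the three parts; note
# ∂l/∂η = ∂l/∂ν since ν = η + 1), converted to the Riemannian gradient.
# If change_representer does not accept the product manifold directly,
# apply it to each component M[1], M[2], M[3] separately.
function grad_f(M, p)
    gμ, gΣ, gη = egrad_negloglik(p[M, 1], p[M, 2], p[M, 3][] + 1)
    return change_representer(M, EuclideanMetric(), p, ProductRepr(gμ, gΣ, gη))
end

p0 = rand(M)
p_opt = gradient_descent(M, f, grad_f, p0)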

1 Like

Let me work on those things, and I will reach back to you when I get to the Riemannian aspect.

1 Like

You could look at TransformVariables, which has a transformation to construct the Cholesky factor of a positive definite correlation matrix from a flat vector:

julia> using TransformVariables

julia> tchol = corr_cholesky_factor(3)
3×3 correlation cholesky factor (dimension 3)

julia> U = transform(tchol, randn(3))
3×3 LinearAlgebra.UpperTriangular{Float64, Matrix{Float64}}:
 1.0  0.45283   0.453308
  ⋅   0.891597  0.51193
  ⋅    ⋅        0.729685

julia> Σ = U'U
3×3 Matrix{Float64}:
 1.0       0.45283   0.453308
 0.45283   1.0       0.661706
 0.453308  0.661706  1.0
3 Likes

Is there a way to replace the space of positive definite matrices by the space of Cholesky factors, i.e. lower triangular matrices with strictly positive diagonal entries?

That would make the calculation of \log\det(\Sigma^{-1}) trivial.
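
For example (a tiny illustration with plain LinearAlgebra):

using LinearAlgebra

Σ = [2.0 0.5; 0.5 1.0]
L = cholesky(Σ).L                  # lower-triangular factor with Σ = L * L'
2 * sum(log, diag(L))              # log det Σ
-2 * sum(log, diag(L))             # log det Σ⁻¹ = -log det Σ
2 * sum(log, diag(L)) ≈ logdet(Σ)  # true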

There are two things I am not sure about:

  1. On SPD matrices, given a gradient direction X, you cannot just add it (so something like p_k + X does not work); formally you need the exponential map, in practice a retraction.
  2. There is a similar problem on Cholesky factors if you want to keep every diagonal entry positive. Just imagine the 1D case: you are at p=1 and the gradient says “do a step X=-2” – then p+X would “kick you out” (see the small sketch below).

In principle both ideas are related (since it is a bijective variable transform), but I do not know whether the exponential map (= walking around on the manifold) on Cholesky factors has been considered somewhere; we for sure do not have that implemented.

But \log\det(\Sigma^{-1}) should not be complicated otherwise either; within your loss implementation, do whatever you want – feel free to use cholesky from LinearAlgebra.
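
To make point 2 concrete, a tiny untested illustration of the 1D example, assuming the PositiveNumbers exponential map p·exp(X/p) from Manifolds.jl:

using Manifolds

P = PositiveNumbers()
p, X = 1.0, -2.0   # the 1D example above: at p = 1 the gradient says X = -2
p + X              # -1.0 – a plain step leaves the positive numbers
exp(P, p, X)       # ≈ 0.135 – the exponential map p * exp(X / p) stays positive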

From a simple Google search on “how to estimate a multivariate student distribution”, I stumbled upon this discussion: r - Maximum likelihood of multivariate t-distributed variable with scaled covariance - Cross Validated
and then a link to this article https://arxiv.org/ftp/arxiv/papers/1707/1707.01130.pdf

From section 3 of the article, it looks like maximum likelihood produces unbounded derivatives with respect to the degree of freedom \nu and thus is not reliable. They advocate in section 4 for an expectation-maximization method, which is in my honest opinion the best shot at a performant estimation procedure!

Since the model is fairly simple (just a Gaussian conditional on a gamma), maybe GitHub - dmetivie/ExpectationMaximization.jl: A simple but generic implementation of Expectation Maximization algorithms to fit mixture models. from @dmetivie could help?
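
For reference, a rough and untested sketch of one iteration of the classical (unregularized) EM updates for the multivariate t – not the paper’s EMq; the function name, the use of Roots.jl for the 1-D ν equation, and the details are choices made here:

using LinearAlgebra, Statistics
using SpecialFunctions: digamma
using Roots: find_zero

# One iteration of the classical EM for the multivariate t, given samples
# xs = [x¹, …, xᴹ] and current estimates (μ, Σ, ν).
function em_step(xs, μ, Σ, ν)
    n, M = length(μ), length(xs)
    # E-step: expected latent gamma weights w_i = (ν + n) / (ν + δ_i)
    δ = [dot(x - μ, Σ \ (x - μ)) for x in xs]   # Mahalanobis distances
    w = (ν + n) ./ (ν .+ δ)
    # M-step: weighted mean and scatter matrix
    μ_new = sum(w .* xs) / sum(w)
    Σ_new = sum(w[i] * (xs[i] - μ_new) * (xs[i] - μ_new)' for i in 1:M) / M
    # CM-step for ν: 1-D root finding (in practice this is the delicate part)
    c = 1 + mean(log.(w) .- w) + digamma((ν + n) / 2) - log((ν + n) / 2)
    ν_new = find_zero(t -> log(t / 2) - digamma(t / 2) + c, ν)
    return μ_new, Σ_new, ν_new
end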

5 Likes

Here’s a sketch of how I might approach the problem (not complete and not tested, may not be totally correct):

using TransformVariables
using Optim
using ForwardDiff   # used implicitly via autodiff = :forward
using LinearAlgebra
using SpecialFunctions: loggamma

# log-pdf of a single observation x (the formula quoted in the first post)
function mvt_logdensity(x, μ, Σ, ν)
    p = length(x)
    δ = dot(x - μ, Σ \ (x - μ))
    return loggamma((ν + p) / 2) - loggamma(ν / 2) -
           p / 2 * log(ν * π) - logdet(Σ) / 2 - (ν + p) / 2 * log(1 + δ / ν)
end

x = get_data_from_somewhere()  # assumed here to be a vector of observation vectors
p = length(first(x))
# note: corr_cholesky_factor yields a unit-diagonal (correlation) matrix;
# a general covariance would need an additional scale parameter.
t = as((μ = as(Array, p), U = corr_cholesky_factor(p)))
init_params = rand(dimension(t)) # or whatever initial values you want

function cost(x, t, ν, params)
    μ, U = transform(t, params)
    Σ = U'U
    return -sum(xi -> mvt_logdensity(xi, μ, Σ, ν), x)
end

# iterate through some range of integer ν values and optimize the other
# parameters conditional on each one.
ν_optima = map(1:10) do ν
    f(params) = cost(x, t, ν, params)
    optimize(f, init_params, GradientDescent(), autodiff=:forward)
end

# which degree of freedom produced the best fit?
findmin(opt -> opt.minimum, ν_optima)
1 Like

Cool, that paper does give a gradient.

Concerning performance – sure – my approach is just an optimisation view on the whole thing; there are for sure other methods available as well. I am not claiming my method is fast, but I like the idea and might give it a try as a nice toy problem :slight_smile:

And sure, their derivatives (= the gradient in the Euclidean case) might yield steps to unwanted \nu<1.

3 Likes

Hello,

Thank you both for your feedback!

I have actually implemented the regularized EM algorithm, called EMq, presented in this paper (https://arxiv.org/ftp/arxiv/papers/1707/1707.01130.pdf) by Dogru et al. As you mentioned, their regularized likelihood function has a bounded gradient with respect to \nu.

However, I am experiencing some issues with the root-finding problem for \nu with limited samples (M \sim 30) in a state space of dimension n > 40. Over the iterations of the EMq algorithm, the degree of freedom decreases to values smaller than one. I am still trying to diagnose this unexpected behavior.

In the meantime, I was curious whether gradient methods can be competitive alternatives that don’t suffer from this issue.

1 Like

There is a trick that could allow one to drop the positivity constraint when working with Cholesky factors.

It is called the log-Cholesky factorization.

Let \Sigma = L L^\top be the Cholesky factorization of the SPD matrix \Sigma.

We define the log-Cholesky factor L^\prime such that L^\prime_{i,j} = L_{i,j} for i \neq j and L^\prime_{i,i} = \log(L_{i,i}). Then you can compute gradients with respect to L^\prime, which lives in the (unconstrained) space of lower triangular matrices, and avoid the constraint of strictly positive diagonal entries.
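
A small untested sketch of that reparameterization (the helper name and the values are just for illustration):

using LinearAlgebra

# Map an unconstrained lower-triangular L′ to an SPD matrix Σ = L Lᵀ,
# exponentiating only the diagonal so that it stays strictly positive.
function from_log_cholesky(L′::LowerTriangular)
    L = LowerTriangular(copy(parent(L′)))
    for i in 1:size(L, 1)
        L[i, i] = exp(L′[i, i])
    end
    return L * L'
end

L′ = LowerTriangular([0.3 0.0; -1.2 -0.7])  # entries are free real numbers
Σ = from_log_cholesky(L′)
isposdef(Σ)                            # true
logdet(Σ) ≈ 2 * (L′[1, 1] + L′[2, 2])  # true: log det Σ = 2 Σᵢ L′ᵢᵢ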

2 Likes

Well, the Euclidean gradient would suffer from the same problems – the unconstrained one, that is.
An Augmented Lagrangian would respect that but might be slower – the Riemannian gradient (on all of that) does respect it, but whether it's faster – I don't know (it is my preferred method, but it is a bit more costly).

And sure, some Cholesky ideas can be used instead of working on the SPD manifold, with the small disadvantage that, again, a classical Euclidean gradient step might destroy your positive diagonal.

This idea of an exponentiated Cholesky diagonal looks smart; I never thought of that.

Ah, and I missed that yesterday: that actually does keep the diagonal positive, sure. Quite a transformation overall, though :wink:

Is it correct that the gradient on the space of SPD matrices S^{++}(\mathbb{R}^n) has an intrinsic dimension of n(n+1)/2?

Yes, it’s the same as the manifold dimension, see here.
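
For example:

using Manifolds

manifold_dimension(SymmetricPositiveDefinite(4))  # 10 == 4 * (4 + 1) ÷ 2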

1 Like