I am interested in the estimation of parameters of a multivariate t-distribution from samples.
A multivariate random variable X \in \mathbb{R}^n with density \pi_X is characterized by a mean vector \mu \in \mathbb{R}^n, a positive definite matrix \Sigma \in \mathbb{R}^{n \times n}, and a degree of freedom \nu > 1; see Multivariate t-distribution - Wikipedia for more details. The pdf is given by \pi_X(x) = \frac{\Gamma\left[(\nu+n)/2\right]}{\Gamma(\nu/2)\,\nu^{n/2}\pi^{n/2}\left|{\boldsymbol\Sigma}\right|^{1/2}}\left[1+\frac{1}{\nu}({\mathbf x}-{\boldsymbol\mu})^{\rm T}{\boldsymbol\Sigma}^{-1}({\mathbf x}-{\boldsymbol\mu})\right]^{-(\nu+n)/2}
I am trying to estimate the parameters (\mu, \Sigma, \nu) given M samples \{ x^1, \ldots, x^M\} by maximizing the log-likelihood. Do you think this problem can be tackled with gradient descent while enforcing that \Sigma is a PSD matrix?
As a warm-up problem, I will try to recover the sample mean and covariance of a Gaussian distribution.
For me this looks like a classical problem for Optimization on Manifolds, but since that is my area of research, I might be very biased here.
Can you maybe elaborate what your cost function f(\mu,\Sigma, \nu) would look like (that of course depends on the fixed data samples)?
I cannot see that directly from the multivariate t-distribution itself (I am not a statistician), but if you have the cost, you could write a constrained optimisation (constraints on \nu and \Sigma), or you could introduce \eta = \nu - 1 > 0 (to work with a positive number) and use
using Manifolds
M = ℝ^n × SymmetricPositiveDefinite(n) × PositiveNumbers()
If you have the Euclidean gradient, the Riemannian one can be computed by an easy conversion (see change_representer · ManifoldsBase.jl), and the advantage is that you can do an unconstrained (!) gradient descent on the manifold M.
If you have the (classical) gradient and do not want to do manifold stuff, then you have to look for methods where you can enforce the constraints – maybe something like an Augmented Lagrangian.
Finally – are you sure they are positive definite and not semi-definite (Eigenvalue zero allowed)?
The loss function l(\mu, \Sigma, \nu) is given by the negative log-likelihood of X for the M samples \{ x^1, \ldots, x^M\}:
l(\mu, \Sigma, \nu) = - \log \pi_{X}(x^1, \ldots, x^M) = -M \log\Gamma\left(\frac{\nu + n}{2}\right) + M \log\Gamma\left(\frac{\nu}{2}\right) + \frac{M n}{2} \log(\pi \nu) + \frac{M}{2} \log \det \Sigma \\
- \frac{(\nu + n) M}{2} \log \nu + \frac{\nu + n}{2} \sum_{i=1}^M \log\left(\nu + (x^i - \mu)^\top \Sigma^{-1}(x^i - \mu)\right), where \Gamma is the Gamma function.
Note, we can also minimize this loss function with respect to C = \Sigma^{-1} if needed, by noting that \log\det \Sigma = -\log\det C. The gradient of l(\mu, \Sigma, \nu) can be computed analytically.
Concerning your final question, \Sigma is positive definite, and not positive semi-definite.
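For concreteness, here is a minimal (untested) Julia sketch of this loss; the function name t_nll, the column-wise sample layout, and the use of loggamma from SpecialFunctions are just my choices.
using LinearAlgebra, SpecialFunctions
# Minimal, untested sketch: negative log-likelihood l(μ, Σ, ν) with the M
# samples stored as the columns of an n×M matrix X (layout is my choice).
function t_nll(μ, Σ, ν, X)
    n, M = size(X)
    C = cholesky(Symmetric(Σ))                     # logdet(C) equals log det Σ
    l = -M * loggamma((ν + n) / 2) + M * loggamma(ν / 2) +
        M * n / 2 * log(π * ν) + M / 2 * logdet(C) - (ν + n) / 2 * M * log(ν)
    for i in 1:M
        d = X[:, i] .- μ
        l += (ν + n) / 2 * log(ν + dot(d, C \ d))  # quadratic form via the factorization
    end
    return l
end
The gradient parts could then be derived analytically or checked with ForwardDiff.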
Thanks – I am a bit confused, but just on two – ah, only one – point:
You switched from \Sigma to C? Besides that, the cost (as we optimisers call it; others call it a loss) does not look nice, but it is implementable.
Cool, if you have the gradient (with respect to all 3 variables) then Manopt (https://manoptjl.org) can be used (after the gradient conversion trick, but that is one line of code).
I am not saying this is the only solution, just one nice approach where you can use really classical gradient descent (unconstrained), just on a manifold (and SPDness then automatically holds, as well as positivity of \nu - 1).
A hint for you: with the manifold above, a point p usually consists of 3 parts:
julia> using Manifolds, Manopt
julia> n=2
2
julia> M = ℝ^n × SymmetricPositiveDefinite(n) × PositiveNumbers()
ProductManifold with 3 submanifolds:
Euclidean(2; field = ℝ)
SymmetricPositiveDefinite(2)
PositiveNumbers()
julia> p = rand(M)
ProductRepr with 3 submanifold components:
Component 1 =
2-element Vector{Float64}:
-0.5632918864354372
0.7307605106092392
Component 2 =
2×2 Matrix{Float64}:
1.63286 -0.368659
-0.368659 1.58408
Component 3 =
0-dimensional Array{Float64, 0}:
2.405536968972573
that you would access with e.g.
julia> μ, Σ, ν = p[M,1], p[M,2], p[M,3]
(the example above is just on one manifold, not a product of a few). Then we only have to take care of change_representer – in short: the Euclidean gradient (the classical one you can compute analytically) and the Riemannian one are a bit different but can be converted, see Use Automatic Differentiation · Manopt.jl; you have a different metric (a way to measure angles) on the tangent spaces (the set of directions at every point).
But again, if you implement some ∇l that returns three gradient parts (again μ, Σ, ν) I can help getting the Riemannian one.
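To give a flavour of that conversion, here is a tiny toy (my own example, not your t-distribution loss, and untested): the cost f(\Sigma) = 0.5\|\Sigma - A\|^2 for a fixed matrix A, whose Euclidean gradient is simply \Sigma - A.
using Manifolds, LinearAlgebra
# Convert the Euclidean gradient of the toy cost into the Riemannian gradient
# on the SPD manifold via change_representer.
n = 2
N = SymmetricPositiveDefinite(n)
A = [2.0 0.5; 0.5 1.0]
Σ = rand(N)                                                  # some SPD point
egrad = Σ - A                                                # Euclidean gradient at Σ
rgrad = change_representer(N, EuclideanMetric(), Σ, egrad)   # Riemannian gradient
On the product manifold the same call can be applied to each component, and the resulting gradient function handed to gradient_descent from Manopt.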
You could look at TransformVariables, which has a transformation to construct the Cholesky factor of a positive definite correlation matrix from a flat vector:
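For example (a small toy of my own choosing, untested):
using TransformVariables
t = corr_cholesky_factor(3)      # transformation for a 3×3 Cholesky factor
x = randn(dimension(t))          # unconstrained flat vector
U = transform(t, x)              # upper triangular, U'U is a correlation matrix
Note that U'U has unit diagonal (it is a correlation matrix), so a separate scale would be needed to represent a general covariance.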
Is there a way to replace the space of positive definite matrices by the space of Cholesky factors, i.e. lower triangular matrices with strictly positive entries?
That will make the calculation of \log\det(\Sigma^{-1}) trivial.
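To spell this out: if \Sigma = L L^\top with L lower triangular, then \log\det(\Sigma^{-1}) = -\log\det(\Sigma) = -2 \sum_{i=1}^n \log(L_{i,i}), so only the diagonal entries of the factor are needed.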
On SPD matrices, given a gradient direction, you cannot just do plus (so something like p_k + X); formally you need the exponential map, in practice a retraction.
There is a similar problem on Cholesky factors if you want to keep every diagonal entry positive. Just imagine the 1D case: you are at p = 1 and the gradient says to do a step X = -2; then p + X would “kick you out”.
In principle both ideas are related (since it is a bijective variable transform and such), but I do not know whether the exponential map (= walking around) on Cholesky factors has been considered somewhere; we for sure do not have that implemented.
But \log\det(\Sigma^{-1}) should otherwise not be complicated either; within your loss implementation, do whatever you want – feel free to use cholesky from LinearAlgebra.
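For example (a minimal snippet, values of my own choosing):
using LinearAlgebra
Σ = [2.0 0.5; 0.5 1.0]       # any symmetric positive definite matrix
d = [1.0, -1.0]              # some residual x - μ
C = cholesky(Symmetric(Σ))
logdet_Σinv = -logdet(C)     # log det(Σ⁻¹) via the factorization
quad = dot(d, C \ d)         # quadratic form dᵀ Σ⁻¹ d, reusing the same factor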
From section 3 of the article, it looks like maximum likelihood produces unbounded derivatives with respect to the degree of freedom \nu and thus is not reliable. They advocate in section 4 for an expectation-maximization method, which is in my honest opinion the best shot for a performant estimation procedure!
Here’s a sketch of how I might approach the problem (not complete and not tested, may not be totally correct):
using TransformVariables
using Optim
using ForwardDiff
function mvt_logdensity(x, μ, Σ, ν)
# implement log-pdf from Wikipedia equation here
end
x = get_data_from_somewhere()
p = length(x)
t = as((μ = as(Array, p), U = corr_cholesky_factor(p)))
init_params = rand(dimension(t)) # or whatever initial values you want
function cost(x, t, ν, params)
μ, U = transform(t, params)
Σ = U'U
return -mvt_logdensity(x, μ, Σ, ν)
end
# iterate through some range of integer ν values and optimize the other
# parameters conditional on each one.
ν_optima = map(1:10) do ν
f(params) = cost(x, t, ν, params)
optimize(f, init_params, GradientDescent(), autodiff=:forward)
end
# which degree of freedom produced the best fit?
findmin(opt -> opt.minimum, ν_optima)
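In case it helps, the mvt_logdensity placeholder could be filled in roughly like this (untested; loggamma comes from SpecialFunctions):
using LinearAlgebra, SpecialFunctions
# Rough, untested sketch of the Wikipedia log-pdf for a single sample x
function mvt_logdensity(x, μ, Σ, ν)
    n = length(μ)
    C = cholesky(Symmetric(Σ))
    d = x .- μ
    return loggamma((ν + n) / 2) - loggamma(ν / 2) - n / 2 * log(ν * π) -
           logdet(C) / 2 - (ν + n) / 2 * log1p(dot(d, C \ d) / ν)
end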
Concerning performance – sure – my approach is just an optimisation view on the whole thing. There are for sure other methods available as well, and I am not claiming my method is fast, but I like the idea and might give it a try as a nice toy problem.
Since, sure, their derivatives (= the gradient in the Euclidean case) might yield steps to unwanted \nu < 1.
I have actually implemented the regularized EM algorithm, called EMq, presented in this paper (https://arxiv.org/ftp/arxiv/papers/1707/1707.01130.pdf) by Dogru et al. As you mentioned, their regularized likelihood function has a bounded gradient with respect to \nu.
However, I experience some issues with the root-finding problem for \nu with limited samples M \sim 30 in a state space of dimension n > 40. Over the iterations of the EMq algorithm, the degree of freedom decreases to values smaller than one. I am still trying to diagnose this unexpected behavior.
In the meantime, I was curious if gradient methods can be competitive methods that don’t suffer from this issue.
There is a trick that could allow one to drop the positivity constraint when working with Cholesky factors.
This is called the log-Cholesky factorization.
Let \Sigma = L L^\top be the Cholesky factorization of the positive definite matrix \Sigma.
We define the log-Cholesky factor L^\prime such that L^\prime_{i,j} = L_{i,j} if i \neq j and L^\prime_{i,i} = \log(L_{i,i}). Then you can compute gradients with respect to L^\prime, which belongs to the space of lower triangular matrices, and avoid the constraint of strictly positive diagonal entries.
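A minimal sketch of that mapping (untested; function and variable names are mine, with Lp playing the role of L^\prime):
using LinearAlgebra
# Map an unconstrained lower-triangular Lp back to the Cholesky factor L and Σ = L Lᵀ
function from_log_cholesky(Lp::LowerTriangular)
    L = copy(Lp)
    for i in 1:size(L, 1)
        L[i, i] = exp(Lp[i, i])   # only the diagonal is exponentiated
    end
    return L * L'                 # positive definite, since the diagonal of L is positive
end
Lp = LowerTriangular(randn(3, 3)) # unconstrained parameters
Σ = from_log_cholesky(Lp)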
Well, the Euclidean gradient would suffer from the same problems – the unconstrained one, that is.
Augmented Lagrangian would respect that but might be slower – the Riemannian gradient (on all of that) does respect that, but whether it's faster – I don't know (it is my preferred method, but it is a bit more costly).
And sure, some Cholesky ideas can be used instead of working on the SPD manifold, with the small disadvantage that, again, a classical Euclidean gradient might destroy your positive diagonal.