I need to calculate the inverse of a positive definite matrix H of the form H = (X' * X + Diagonal(d)). X is a wide matrix, with far more columns than rows, and d consists of large positive values.
The furthest I could get is to turn the problem into finding the inverse of (I + X * Diagonal(1 ./ d) * X') using the Woodbury formula. Is there a particular method to speed up that inversion, given its specific structure? cholesky!(I + X * Diagonal(1 ./ d) * X'), inv(I + X * Diagonal(1 ./ d) * X'), or anything that makes more sense?
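To make this concrete, here is roughly what I have in mind (toy sizes, not my real dimensions):

```julia
using LinearAlgebra

n, p = 100, 1_000            # toy sizes; in reality p >> n
X = randn(n, p)
d = 10 .+ rand(p)            # large positive values

Dinv = Diagonal(1 ./ d)
S = Symmetric(I + X * Dinv * X')     # the small n-by-n Woodbury matrix
F = cholesky(S)                      # or inv(S)? that is the question

# Woodbury: inv(X'X + Diagonal(d)) == Dinv - Dinv*X' * inv(S) * (X*Dinv)
Hinv = Dinv - Dinv * X' * (F \ (X * Dinv))
# sanity check at toy scale: Hinv ≈ inv(X'X + Diagonal(d))
```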
Thank you very much. Any suggestions would be much appreciated!
No, I was suggesting that you consider whether you can use the Cholesky factors without explicitly computing the inverse. That is, given C = cholesky!(X'X + Diagonal(d)), you can solve a linear system for any given right-hand side quickly, so in many cases you don't need the inverse matrix explicitly.
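For example, with stand-in data and some hypothetical right-hand side b:

```julia
using LinearAlgebra

X = randn(100, 1_000); d = 10 .+ rand(1_000)   # stand-ins for your data
b = randn(1_000)                               # whatever right-hand side you need

C = cholesky!(X'X + Diagonal(d))
x = C \ b    # solves (X'X + Diagonal(d)) * x = b, no explicit inverse needed
```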
If you really need the whole inverse matrix, I would suggest
LinearAlgebra.inv!(cholesky!(X'X + Diagonal(d)))
(I don’t see the point of your Diagonal(1 ./ d) reformulation.)
I do need the inverse (preferably), or at least the diagonal of the inverse…
What are you ultimately trying to compute, and what are your typical matrix sizes?
Thank you so much for your reply. I suppose I do need the inverse - it is an intermediate step in my iterations, and I need the diagonal of the inverse and its determinant to be fed into the next iteration.
Calculating the determinant does not require inversion, but I was not able to find how to extract the diagonal of the matrix inverse without cubic complexity…
Given the Cholesky decomposition, is there a faster way to compute only the diagonal of the inverse? Unfortunately, the matrix of interest is huge, at least 30,000 x 30,000… but it grows smaller over the iterations.
Thank you so much - my iteration tries to find the parameter value that minimises a function, argmin_{w} f(X, w, a). I need the negative inverse of the Hessian to approximate w's (posterior) distribution, at least in the final iteration. In the intermediate iterations, I only need the diagonal of the negative inverse of the Hessian, which is necessary to optimise the other parameter, a.
The Hessian of w has a particular structure, (X'X + Diagonal(d)). Unfortunately the dimensions of X and w are huge - X is 30,000 by 2,000,000 and w is 2,000,000 by 1 - which makes it impossible to compute the diagonal of the negative inverse Hessian directly. Thus I exploited the Woodbury identity (A + B D^-1 C)^-1 = A^-1 - A^-1 B (D + C A^-1 B)^-1 C A^-1, taking A to be the diagonal matrix Diagonal(d), B to be X', C to be X, and D to be the identity matrix, so that the inversion is performed on a matrix of much smaller scale, (I + X * Diagonal(1 ./ d) * X'). I only need the diagonal elements of the inverse Hessian.
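Concretely, pulling the diagonal out of the Woodbury identity looks roughly like this (toy sizes standing in for my real dimensions), so the 2,000,000-by-2,000,000 inverse is never formed:

```julia
using LinearAlgebra

n, p = 100, 1_000            # toy stand-ins for 30,000 and 2,000,000
X = randn(n, p)
d = 10 .+ rand(p)

dinv = 1 ./ d
S = Symmetric(I + X * Diagonal(dinv) * X')   # the small n-by-n matrix
F = cholesky(S)

# diag(inv(X'X + Diagonal(d)))[j] = 1/d[j] - (1/d[j])^2 * x_j' * inv(S) * x_j,
# and with S = F.L * F.L', x_j' * inv(S) * x_j = sum(abs2, F.L \ x_j), so:
Z = F.L \ X                                  # one triangular solve, n-by-p
diag_Hinv = dinv .- dinv .^ 2 .* vec(sum(abs2, Z; dims=1))
```

At full scale the triangular solve F.L \ X would be done over blocks of columns of X, since the whole n-by-p result would not fit in memory.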
In the final iteration, I need the entire negative inverse Hessian, but it would be of much smaller size.
Thank you so much for taking the time to read this - is there a better approach to do it? I considered BFGS, but storing the actual (approximated) inverse Hessian at each step is very expensive as well.
Thanks - there are two steps in my iteration. Suppose X is of dimension n by d, w and a are of d by 1.
1. Solve argmin_{w} f(X, t, w, a) = y'y + 0.5 * w' * Diagonal(a) * w, keeping a fixed. Here y = t .* log.(sigmoid.(X * w)) .+ (1 .- t) .* log.(1 .- sigmoid.(X * w)), and t is n by 1.
2. After finding the optimal value of w, keep it fixed and solve argmin_{a} f2(X, t, w, a) = y'y + 0.5 * w' * Diagonal(a) * w + 0.5 * sum(log.(a)) + 0.5 * logdet(Sigma). Here Sigma is the negative inverse Hessian of w at its optimal value, and the Hessian of w has the form (X' * Diagonal(y .* (1 .- y)) * X + Diagonal(a)); a code transcription follows below.
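In code, the pieces look roughly like this (toy sizes; sigmoid is just a local helper, and I read the y in the Hessian expression as sigmoid.(X * w)):

```julia
using LinearAlgebra

sigmoid(z) = 1 / (1 + exp(-z))        # local helper, not a library function

n, dim = 50, 20                       # toy sizes
X = randn(n, dim)
t = rand(0:1, n)                      # binary targets
a = fill(2.0, dim)
w = zeros(dim)

# Step 1 objective (a held fixed), transcribed literally from above:
function f(w, a)
    y = t .* log.(sigmoid.(X * w)) .+ (1 .- t) .* log.(1 .- sigmoid.(X * w))
    return y'y + 0.5 * w' * Diagonal(a) * w
end

# Hessian of w at a given point, in the form quoted above:
function hessian(w, a)
    s = sigmoid.(X * w)
    return X' * Diagonal(s .* (1 .- s)) * X + Diagonal(a)
end
```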
It is a relevance vector machine problem.
Actually I have been baffled by this problem for quite a while. f(X, t, w, a) is the posterior of w, and f2(X, t, w, a) is the function with w integrated out (i.e., the type-2 likelihood). As the exact integration is intractable, I used the Laplace approximation, as people usually do, computing f2(X, t, w, a) as an approximation of the type-2 likelihood. My friends suggest doing SGD on f(X, t, w, a) to find w and a jointly in order to avoid f2, but I haven't investigated whether this would yield good results. …
How close is the approximate inverse Hessian produced by algorithms like BFGS/L-BFGS to the actual inverse Hessian?
Okay, so this is why you think you need the full matrix inverse — you are computing a log determinant? There are lots of methods for estimating the log determinant, e.g. by stochastic Lanczos quadrature + Hutchinson trace estimators. Moreover, since Hutchinson trace estimators are precisely in the form of an expected-value calculation, they are compatible with stochastic gradient descent (so you don’t need to compute the exact log determinant at each step in order to minimize it). These are iterative methods, so you only need matrix–vector products, which is useful if your matrix is large and sparse (or structured).
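To illustrate just the Hutchinson part (the matrix log is formed explicitly here only to keep the toy self-contained; stochastic Lanczos quadrature would instead approximate z -> log(A)*z using only matrix–vector products with A):

```julia
using LinearAlgebra

# Hutchinson estimate of logdet(A) = tr(log(A)).
function hutchinson_logdet(A, nsamples)
    logA = log(Symmetric(Matrix(A)))        # dense matrix log, toy only
    est = 0.0
    for _ in 1:nsamples
        z = rand([-1.0, 1.0], size(A, 1))   # Rademacher probe vector
        est += dot(z, logA * z)
    end
    return est / nsamples
end

M = randn(200, 50)
A = M * M' + 5I                  # 200x200 SPD test matrix
hutchinson_logdet(A, 400)        # compare with logdet(Symmetric(A))
```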
Alternatively, if you have the explicit Cholesky factorization C of a matrix A, you can compute logdet(A) or logdet(A⁻¹)=-logdet(A) directly from the Cholesky factors (just call logdet(C)).
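For instance, with stand-in data:

```julia
using LinearAlgebra

X = randn(100, 1_000); d = 10 .+ rand(1_000)    # stand-ins for your data
C = cholesky(Symmetric(X'X + Diagonal(d)))
logdet(C)   # equals logdet(X'X + Diagonal(d)); logdet of the inverse is -logdet(C)
```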
I would definitely recommend a single minimization rather than two nested minimizations if possible.