Square root of the inverse of a matrix

Hello,

I am working on square root factorization of the matrix of the form A = I + U'U which is typically positive semi-definite of dimension N \sim 100.

I need to solve a least square problem A^{-1}Y and compute the product \sqrt{A^{-1}} R, where Y a non-square matrix and R an orthogonal matrix.

What is the best way to perform these operations? I think it might be advantageous to do a matrix factorization of A but unfortunately, sqrt doesn’t work with eigenfactorizations.

using Random
using RandomMatrices

N = 100
Nx = 50
U = randn(N, Nx)
A = I + U'*U

Y = randn(Nx,N)

# First operation
A\Y

# Generate Random Matrices
d = Haar(1)
R = rand(d, N)

# Second operation
real.(sqrt(inv(A))*R
1 Like

If the matrix is symmetric positive definite, how about pre computing the eigen decomposition and using that to compute both the inverse and square root?

Yes, that’s true. I wish I could just provide the Eigen object to the blackslash function\ and sqrt but it doesn’t work.
Instead, I think that I have to do it by hand, by looping over the eigenvalues, right?

  1. You should tell julia that A is hermitian to avoid generating a complex matrix and then dropping the imaginary values
  2. The inverse and square root commute here, so you should be able to instead calculate
sqrt(Hermitian(A))\R

which should be equivalent and faster than real.(sqrt(inv(A))*R. However, someone who knows more will likely have a better idea that will avoid repeating the factorization multiple times.

2 Likes

You can simply do this with E.vectors * Diagonal( 1 ./ sqrt.(E.values)) * E.vectors'. In my experience matrix square roots are rarely needed, and Cholesky factorizations are often just as good. Also, an SVD of U can be more stable.

5 Likes

So to put this all together, here’s how I understand this should be approached.

You can use your favorite factorization as it suits you. I tested svd and eigen and found svd was a bit slower, but as Antoine said, it can be more stable:

using Random, RandomMatrices
function f1(A, Y, R)
    (op1=A\Y, 
     op2=real.(sqrt(inv(A)))*R)
end
function f2(A::Hermitian, Y, R)
    (op1 = A\Y, 
     op2 = sqrt(A)\R)
end
function f3(A::Hermitian, Y, R)
    λ, ϕ = eigen(A)
    
   (op1 = ϕ * Diagonal(1 ./ λ) * ϕ' * Y,
    op2 = ϕ * Diagonal(1 ./ sqrt.(λ)) * ϕ' * R)
end
let N = 100, Nx = 50
    U = randn(Nx, N) #switched Nx and N so the code runs
    A = I + U'*U
    
    Y = randn(N, Nx)  #switched Nx and N so the code runs

    # Generate Random Matrices
    d = Haar(1)
    R = rand(d, N)

    r1 = @btime f1($A, $Y, $R)
    r2 = @btime f2($(Hermitian(A)), $Y, $R)
    r3 = @btime f3($(Hermitian(A)), $Y, $R)
    Tuple(r1) .≈ Tuple(r2) .≈ Tuple(r3) #test that all the outputs are the same
end

#+RESULTS:
:   9.683 ms (31 allocations: 1.38 MiB)
:   1.781 ms (33 allocations: 705.03 KiB)
:   1.149 ms (31 allocations: 704.16 KiB)
: (true, true)
2 Likes

Thank you @Mason, this is very nice of you!!

1 Like

What is the context where you need this? (In many cases where you might ostensibly use a matrix square root, you can use Cholesky factors instead.)

2 Likes

I am working on the filtering problem for nonlinear systems. I am using the ensemble transform Kalman filter which is an improved version of the ensemble Kalman filter to assimilate measurements in the state estimate. Bishop et al. https://journals.ametsoc.org/mwr/article/129/3/420/66417/Adaptive-Sampling-with-the-Ensemble-Transform uses a square-root factorization of the covariance propagation to produce a consistent analysis step.

The covariance matrix P^f, P^a before and after the assimilation of measurements are related by:

P^a = (I - KH) P^f, where K is the Kalman gain and H the observation matrix.

We can do the factorization P^a = X^a {X^a}^T where X^a = \frac{1}{\sqrt{N-1}}[x^1-\bar{x}, \ldots, x^N-\bar{x}], and the x^i are N samples from the state distribution and \bar{x} the sample mean.

When you do the algebra, you end up with an update equation for X^a that involves the square root of an expression with the Kalman gain.

As suggested by @antoine-levitt, the SVD is very nice for Kalman filtering. Maybe you can adapt the standard tricks from the plain vanilla KF, eg see and the papers that cite it.

@inproceedings{wang1992kalman,
  title={Kalman filter algorithm based on singular value decomposition},
  author={Wang, Liang and Libert, Ga{\"e}tan and Manneback, Pierre},
  booktitle={Decision and Control, 1992., Proceedings of the 31st IEEE Conference on},
  pages={1224--1229},
  year={1992},
  organization={IEEE}
}

If you look at their algebra:

this seems to be precisely one of those cases where they could instead have used a Cholesky factorization R = LL^T. In particular, if you follow the same algebra and let \hat{H} = L^{-1} H (via cholesky(R).L \ H in Julia), then you obtain:

PH^T(HPH^T + R)^{-1} HP = PH^T\left[ L \left( L^{-1} H P H^T L^{-T} + I\right)L^T\right]^{-1} H P \\ = P \hat{H}^T \left( \hat{H} P \hat{H}^T + I\right)^{-1} \hat{H} P

with no matrix square root but a very similar form to their expression that seems to work equally well for the subsequent steps in the derivation.

(Note that this is closely related to Tikhonov/ridge-regularized least-squares, and there are a variety of techniques such, as the GSVD, for attacking such problems accurately.)

6 Likes

Well, if it’s correct to use a Cholesky factorization here it’s by far the most performant option given in this thread yet:

using Random
using RandomMatrices
function f1(A, Y, R)
    (op1=A\Y, 
     op2=real.(sqrt(inv(A)))*R)
end
function f2(A::Hermitian, Y, R)
    (op1 = A\Y, 
     op2 = sqrt(A)\R)
end
function f3(A::Hermitian, Y, R)
    λ, ϕ = eigen(A)
    (op1 = ϕ * Diagonal(1 ./ λ) * ϕ' * Y,
     op2 = ϕ * Diagonal( 1 ./ sqrt.(λ)) * ϕ' * R)
end
function f4(A::Hermitian, Y, R)
    C = cholesky(A)
    (op1 = C \ Y,
     op2 = C.L \ R)
end
let N = 100, Nx = 50
    U = randn(Nx, N) #switched Nx and N so the code runs
    A = I + U'*U
    
    Y = randn(N, Nx)  #switched Nx and N so the code runs

    # Generate Random Matrices
    d = Haar(1)
    R = rand(d, N)

    r1 = @btime f1($A, $Y, $R)
    r2 = @btime f2($(Hermitian(A)), $Y, $R)
    r3 = @btime f3($(Hermitian(A)), $Y, $R)
    r4 = @btime f4($(Hermitian(A)), $Y, $R)

    (same_results = Tuple(r1) .≈ Tuple(r4), 
     same_eigvalues = eigvals(r4.op2) ≈ eigvals(r1.op2),
     same_singvalues = svd(r4.op2).S ≈ svd(r1.op2).S)  
end

:RESULTS:
:   9.786 ms (31 allocations: 1.38 MiB)
:   1.780 ms (33 allocations: 705.03 KiB)
:   1.173 ms (31 allocations: 704.16 KiB)
:   117.878 μs (10 allocations: 273.80 KiB)
: (same_results = (true, false), same_eigvalues = false, same_singvalues = true)
:END:

As you can see, you don’t get the same answer for the second operation when you use the Cholesky L factor, but that doesn’t mean it isn’t equivalent. The resulting matrix appears to have different eigenvalues, but the same singular values as the other methods.

Yes, it’s not the same matrix — the matrix \hat{H} in my post is different from the matrix \tilde{H} in the original paper, but if you modify the other algebra (the x update) to match, changing R^{1/2} to L or L^T consistently, then the final result (the x update) will be the same up to roundoff errors.

1 Like

Thank you so much for your help. In practice, we have a finite number of samples O(50-100) to estimate a covariance matrix of dimension 200-500 in my case. Therefore the term (\hat{H}P\hat{H}^T + I)^{-1} is usually ill-conditioned (positive semi-definite). Can I still use a Cholesky decomposition?

I don’t quite follow your question: the Cholesky factorization is applied to R, not this matrix. (And \hat{H}P\hat{H}^T + I should be strictly positive-definite anyway because of the I?)

1 Like

I need to check a few things before to answer. Let me get back to you tomorrow.

Hello,

Thank you @stevengj for your answers.
I wrote my code on the pseudo-code of the ETKF provided by Asch et al. https://my.siam.org/Store/Product/viewproduct/?ProductId=28457218 , chapter 6:

image

where E = [x^1, \ldots, x^m] with x^i the i-th sample.

In my case, R is a diagonal matrix (with entries of order 10^{-8}), so its inverse square root is straightforward to compute.
The slow operations are to compute w in l.8 (the least-square problem) and T^{1/2} in l.9.
In my case, the determinant of T is small ~10^{-12}, so I don’t know if it makes sense to do a Cholesky decomposition of T.

Determinants don’t exist, they’re a conspiracy propagated by linear algebra teachers. If you have 12 eigenvalues of 0.1, the determinant is 10^-12 which sounds scary, but it’s still perfectly reasonable to invert the matrix. More interesting is the condition number. In any case, either the thing is well-conditioned, in which case you can do what you want with it, or it’s ill-conditioned, in which case whatever you do with it will blow up, so it doesn’t really matter what you do (simplifying a full numerical analysis course ever so slightly).

It is likely that you can bypass the square root. Often square roots are made to change variables to ones where some overlap/Gram/covariance/whatever matrix is identity; this can also be done with Cholesky as @stevengj showed. You have to understand the derivation of the algorithm and see where the square root is there.

5 Likes