Resolve sign ambiguity in SVD and Eigendecomposition

Hello,

Given a matrix A \in \mathbb{R}^{m\times n}, my goal is to recover the SVD decomposition of A from the eigendecomposition of A^\top A and A^\top A.

A compact SVD decomposition of matrix A \in \mathbb{R}^{m\times n} gives:
A = U_{svd}\Sigma V_{svd}^\top, in which \Sigma is a r \times r diagonal matrix where r = min\{m, n\}. U_{svd} \in \mathbb{R}^{m \times r} and V_{svd} \in \mathbb{R}^{n \times r} verify U_{svd} U_{svd}^\top = V_{svd} V_{svd}^\top = I_r.

A is not directly accessible, I only know C_x = A^\top A \in \mathbb{R}^{m \times m} and
C_y = AA^\top \in \mathbb{R}^{n \times n}.

From the spectral theorem, we know that:
C_x = V \Lambda_x V^\top, C_y = U \Lambda_y U^\top.

If we assume that m \leq n, the eigenvalues \Lambda_x equate the first r values of \Lambda_y, which are the square of the singular values \Sigma.

However, we have a sign ambiguity in the columns of the different eigenvectors of C_x and C_y.

Therefore, U\sqrt{\Lambda_y}V[:,1:r]^\top does not give the SVD decomposition of A! The sign of the columns of U are V are not chosen such that AV = U and A^\top U = V, as in the SVD decomposition (Singular value decomposition - Wikipedia).

Is there a way to flip the sign of the columns of U and V to recover the SVD decomposition of A?

A simple example:

A norm of 2 for one of the columns means that they are opposite of each other.

m = 10
n = 20
r = min(m, n)
A = randn(m, n)

Cx = A'*A
Cy = A*A'

Ex = eigen(Cx; sortby = λ -> -λ)
Ey = eigen(Cy; sortby = λ -> -λ)
svdA = svd(A)

@show norm(svdA.S - sqrt.(Ex.values[1:r]))
@show norm(svdA.S - sqrt.(Ey.values[1:r]))

@show norm.(eachcol(Ex.vectors[:,1:r]-svdA.V))
@show norm.(eachcol(Ey.vectors-svdA.U));

The sign information that you need is lost when you start from A^TA instead of A. (For example, you cannot distinguish between A = [1] and A = [-1] if you only know that A^TA = [1].)

6 Likes

Apart from this issue, I would advise against using this method to compute an SVD in anything serious. There are other major issues:

  • worse accuracy (trouble now starts when \sigma_1 / \sigma_n \approx sqrt(eps) rather than \sigma_1 / \sigma_n \approx eps)
  • even more non-uniqueness issues if there are repeated singular values (take A orthogonal, for instance).

In general you should regard computing A^TA a bit like calling inv(A): 99% of the time it is a mistake, and if you shuffle your computations around there are better ways to solve the same problem.

4 Likes

Thank you for your feedback. Annoying this loss of phase. I totally agree for the conditioning number. Let me give you a bit more context.

I would like to find the important directions in the input and output spaces for a nonlinear function f: \mathbb{R}^m \xrightarrow{} \mathbb{R}^n.

For a linear function f(x) = Ax, we can just look at the SVD decomposition of A.

To generalize this to the nonlinear case, we can find these directions by looking at the eigenvalue decomposition of the integrated inner and outer product of the Jacobian of the function f integrated over the distribution \mu:

C_x = \int (\nabla_x f(x))^\top (\nabla_x f(x)) d\mu(x) \in \mathbb{R}^{m,m} and C_y = \int (\nabla_x f(x)) (\nabla_x f(x))^\top d\mu(x) \in \mathbb{R}^{n,n}.

In practice, we can perform a Monte Carlo appproximation of C_x, C_y given M samples x^i \sim \mu:

C_x \approx \frac{1}{M}\sum_{i=1}^M (\nabla_x f(x^i))^\top (\nabla_x f(x^i)) and similarly for C_y.

Since C_x is positive definite, an eigendecomposition gives us:

C_x = V \Lambda_x V^T with V\in \mathbb{R}^{m,m}, where V an orthonormal basis for the input space. We can truncate this basis to identify the most important directions based on the decay of energy spectrum of \Lambda_x.

Similarly, we write C_y = U \Lambda_y U^T with U\in \mathbb{R}^{n,n}. U is an orthonormal basis for the output space.

I don’t know how to avoid the construction and the eigendecomposition of the C_x, C_y matrices to find the bases U, V.

If f:\mathbb{R}^m \xrightarrow{} \mathbb{R}, then we can perform an SVD decomposition of the matrices \sqrt{C_y} = \frac{1}{\sqrt{M}} \left[ \nabla f(x^1), \ldots, \nabla f(x^M) \right] \in \mathbb{R}^{n,M} and \sqrt{C_x} = \frac{1}{\sqrt{M}} \left[ \nabla f(x^1)^\top, \ldots, \nabla f(x^M)^\top \right]\in \mathbb{R}^{m, M}, and extract the left singular vectors to get U and V.

We have C_x = \sqrt{C_x} \sqrt{C_x}^\top and C_y = \sqrt{C_y} \sqrt{C_y}^\top.

However, for f:\mathbb{R}^m \xrightarrow{} \mathbb{R}^n, \nabla f \in \mathbb{R}^{n,m}, and we would have to assemble matrices of (n, mM) and (m, nM), which requires too much storage.

Is there a better way to do this computation without assembling these large matrices? I can only afford finite-difference to compute the Jacobian of the function f.

Since you only want to know the “important directions”, maybe you only need the singular vectors corresponding to largest few singular values?

In that case you would be able to use iterative SVD methods (e.g. SVDL · IterativeSolvers.jl or GitHub - Jutho/KrylovKit.jl: Krylov methods for linear problems, eigenvalues, singular values and matrix functions or GitHub - JuliaLinearAlgebra/RandomizedLinAlg.jl: Randomized algorithms for numerical linear algebra in Julia …). Then you only need to provide a function to multiply the Jacobian by an arbitrary vector, which requires a single finite-difference operation.

3 Likes

To answer your original question, you only need to compute the eigen-decomposition of C_x or C_y, not both — you can get the eigenvectors of one (for nonzero eigenvalues) from the eigenvectors of the other, and the sign/phase is then consistent for the SVD. In fact, this is one way to derive the existence of the SVD from the spectral theorem for Hermitian matrices.

For example, given a nonzero eigenvalue \lambda_k = \sigma_k^2 > 0 of C_x = A^* A and a corresponding (orthonormal) eigenvector v_k, you can easily show that u_k = A v_k / \sigma_k is an (orthonormal) eigenvector of C_y = A A^* with the same eigenvalue.

3 Likes

Thank you @stevenj for your ideas!

I tried to implement it, for the linear case f(x) = Ax, so \nabla f(x) = A, but I get some issues with the svdl routine from ‘IterativeSolvers.jl’

using LinearAlgebra
using LinearMaps
using IterativeSolvers

Nx = 20
Ny = 10

A = randn(Ny, Nx)
Ne = 500
X = randn(Nx, Ne)


function mapsqrtCx!(Ny, Ne, vout, vin)
    fill!(vout, 0.0)
    
    @inbounds for j=1:Ne
        # chunk of dimension Ny
        vj = view(vin, (j-1)*Ny+1:j*Ny)
        # A will be the Jacobian of the function f evaluated at x^i
        mul!(vout, A', vj, 1.0, 1.0)
    end
    vout .*= 1/sqrt(Ne)
end

function mapsqrtCxtranspose!(Ny, Ne, vout, vin)
    fill!(vout, 0.0)

    @inbounds for j=1:Ne
        # chunk of dimension Ny
        vj = view(vout, (j-1)*Ny+1:j*Ny)
        # A will be the Jacobian of the function f evaluated at x^i
        mul!(vj, A, vin, 1.0, 1.0)
    end
    vout .*= 1/sqrt(Ne)
end

function mapsqrtCy!(Nx, Ne, vout, vin)
    fill!(vout, 0.0)
    
    @inbounds for j=1:Ne
        # chunk of dimension Nx
        vj = view(vin, (j-1)*Nx+1:j*Nx)
        # A will be the Jacobian of the function f evaluated at x^i
        mul!(vout, A, vj, 1.0, 1.0)
    end
    vout .*= 1/sqrt(Ne)
end

function mapsqrtCytranspose!(Nx, Ne, vout, vin)
    fill!(vout, 0.0)

    @inbounds for j=1:Ne
        # chunk of dimension Nx
        vj = view(vout, (j-1)*Nx+1:j*Nx)
        # A will be the Jacobian of the function f evaluated at x^i
        mul!(vj, A', vin, 1.0, 1.0)
    end
    vout .*= 1/sqrt(Ne)
end

sqrtCx = LinearMap((vout, vin) -> mapsqrtCx!(Ny, Ne, vout, vin), 
                   (vout, vin) -> mapsqrtCxtranspose!(Ny, Ne, vout, vin), 
                   Nx, Ny*Ne; ismutating = true)

sqrtCy = LinearMap((vout, vin) -> mapsqrtCy!(Nx, Ne, vout, vin), 
                   (vout, vin) -> mapsqrtCytranspose!(Nx, Ne, vout, vin), 
                   Ny, Nx*Ne; ismutating = true)

cache = ones(Ny, Ne)
@time 1/sqrt(Ne)*sum(A'*cache;dims = 2)

v = ones(Ny*Ne);
@time sqrtCx*v

@show norm(1/sqrt(Ne)*sum(A'*ones(Ny, Ne);dims = 2)-sqrtCx*v)

12.892 μs (10 allocations: 39.56 KiB)
40.500 μs (501 allocations: 23.59 KiB)
norm((1 / sqrt(Ne)) * sum(A * ones(Nx, Ne); dims = 2) - sqrtCx * v) = 7.149791431641379e-12

Any ideas to reduce the computational time for mapsqrtCx!?

Now, we can use the routine svdl from IterativeSolvers.jI:

maxrank = 5
@time SCx, errorCx  = svdl(sqrtCx; nsv = maxrank, vecs = :left);
@time SCy, errorCy  = svdl(sqrtCy; nsv = maxrank, vecs = :left);

svdA = svd(A)
# check singular values
@show norm(SCx.S[1:maxrank] - svdA.S[1:maxrank])
@show norm(SCy.S[1:maxrank] - svdA.S[1:maxrank])

matrix–matrix multiplications will almost always be much more efficient than performing the same operation as a loop of matrix–vector multiplications. A factor of 4 isn’t bad.

In any case, before you wild on micro-optimizations, in your real application won’t the dominant cost be in evaluating your nonlinear function f(x)?

1 Like

Yes, the dominant cost is clearly to evaluate/differentiate f. I think the problem is that we need multiple evaluation of sqrtCx v for this iterative solver, therefore we need to evaluate the Jacobian of the nonlinear function f for all the different samples for a single evaluation of the product sqrtCx v.

I forgot the transpose version of g to compute the left singular vectors of sqrtCx.

I am not very familiar with iterative SVD solvers, but I think this is working correctly as long as the spectral gap between two consecutive singular values is big enough. On this toy problem, I have found that the singular vectors are correct if I stop at the 4th value, but stopping at the 5th value gives a very poor prediction of the singular left vectors…

You are not evaluating the whole Jacobian matrix, right? For a Jacobian–vector product you only need a directional derivative, e.g. by a single finite difference.

2 Likes

I am probably missing something important here.

The input vector has a dimension m \sim 100, and the output has dimension n \sim 20-40, and I have about M \sim 100-500 samples.
So far, I was computing the entire Jacobian matrix for the different ensemble members and storing it.
Do you mean that we can compute the gradient of f(x)^\top a with respect to x for \sqrt{C_x} to get \nabla f(x)^\top a, where a \in \mathbb{R}^n?

For f:\mathbb{R}^m \xrightarrow{} \mathbb{R}^n, let us define the matrices \sqrt{C_x} = \frac{1}{\sqrt{M}} \left[ \nabla f(x^1)^\top, \ldots, \nabla f(x^M)^\top \right]\in \mathbb{R}^{m, nM} and \sqrt{C_y} = \frac{1}{\sqrt{M}} \left[ \nabla f(x^1), \ldots, \nabla f(x^M) \right] \in \mathbb{R}^{n,mM}.

We have C_x = \sqrt{C_x} \sqrt{C_x}^\top and C_y = \sqrt{C_y} \sqrt{C_y}^\top.

I don’t know how to do to get \nabla f(x)b, (with b \in \mathbb{R}^m) as the gradient of something with respect to x?
Do you have some references on these techniques?

For these iterative solvers, do we know how many iterations they need to converge for the leading singular values?

In the linear case, based on the characterization of the SVD that you wrote above, we have:

\sqrt{C_x} \begin{bmatrix} u_i\\ \vdots \\ u_i \end{bmatrix} = \sqrt{M} \sqrt{\sigma_i} v_i.

I believe this doesn’t hold when f is a nonlinear function? Meaning that we need to perform two SVD decompositions : one for \sqrt{C_x} and one for \sqrt{C_y}.

I’m not sure why you do (\nabla_x f(x))^\top (\nabla_x f(x)) and (\nabla_x f(x)) (\nabla_x f(x))^\top in these integrals. Wouldn’t an SVD of

\int \nabla f(x) \, d\mu(x)

give you similar information without the problem of squared condition numbers? (I’m no expert on this part, so this is a genuine question.)

If the gradient exists, then by definition we have

\nabla f(x) \, b = \lim_{\varepsilon \to 0} \frac{f(x + \varepsilon b) - f(x)}{\varepsilon}.

Thus, you can compute \nabla f(x) \, b with a single application of finite differences to \varepsilon \mapsto f(x + \varepsilon b).

3 Likes

That’s a good question, a proper justification of these Gramian matrices is described in this paper:

You can also look for active subspaces
https://epubs.siam.org/doi/book/10.1137/1.9781611973860?mobileUi=0&

For example in 2D, if f(x_1, x_2) = [x_1; x_2^2], so \partial f_2/\partial x_2 = 2 x_2. If you take the expectation of the Jacobian over a zero-mean symmetric distribution along the x_2 axis (e.g. a standard Gaussian distribution), then the expectation of this component is clearly zero, and the most informative direction in the input space is missed.

Thank you for your explanation, how would you compute \nabla f(x)^\top a with a single application of finite differences? We can say that this is the gradient of f(x)^\top a with respect to x, but I don’t think that we can use the trick of the directional derivative as for \nabla f(x) b?

Makes sense.

I believe the only way to evaluate gradients through finite differencing is to compute the derivatives with respect to each input dimension. If you are open to replacing finite differences with automatic differentiation, then what you would have to do is to replace forward mode AD with reverse mode AD.

1 Like