Resolve sign ambiguity in SVD and Eigendecomposition

mleprovost · May 19, 2021, 11:55pm

Hello,

Given a matrix A \in \mathbb{R}^{m\times n}, my goal is to recover the SVD decomposition of A from the eigendecomposition of A^\top A and A^\top A.

A compact SVD decomposition of matrix A \in \mathbb{R}^{m\times n} gives:
A = U_{svd}\Sigma V_{svd}^\top, in which \Sigma is a r \times r diagonal matrix where r = min\{m, n\}. U_{svd} \in \mathbb{R}^{m \times r} and V_{svd} \in \mathbb{R}^{n \times r} verify U_{svd} U_{svd}^\top = V_{svd} V_{svd}^\top = I_r.

A is not directly accessible, I only know C_x = A^\top A \in \mathbb{R}^{m \times m} and
C_y = AA^\top \in \mathbb{R}^{n \times n}.

From the spectral theorem, we know that:
C_x = V \Lambda_x V^\top, C_y = U \Lambda_y U^\top.

If we assume that m \leq n, the eigenvalues \Lambda_x equate the first r values of \Lambda_y, which are the square of the singular values \Sigma.

However, we have a sign ambiguity in the columns of the different eigenvectors of C_x and C_y.

Therefore, U\sqrt{\Lambda_y}V[:,1:r]^\top does not give the SVD decomposition of A! The sign of the columns of U are V are not chosen such that AV = U and A^\top U = V, as in the SVD decomposition (Singular value decomposition - Wikipedia).

Is there a way to flip the sign of the columns of U and V to recover the SVD decomposition of A?

A simple example:

A norm of 2 for one of the columns means that they are opposite of each other.

m = 10
n = 20
r = min(m, n)
A = randn(m, n)

Cx = A'*A
Cy = A*A'

Ex = eigen(Cx; sortby = λ -> -λ)
Ey = eigen(Cy; sortby = λ -> -λ)
svdA = svd(A)

@show norm(svdA.S - sqrt.(Ex.values[1:r]))
@show norm(svdA.S - sqrt.(Ey.values[1:r]))

@show norm.(eachcol(Ex.vectors[:,1:r]-svdA.V))
@show norm.(eachcol(Ey.vectors-svdA.U));

Per · May 20, 2021, 7:21am

The sign information that you need is lost when you start from A^TA instead of A. (For example, you cannot distinguish between A = [1] and A = [-1] if you only know that A^TA = [1].)

fph · May 20, 2021, 7:42am

Apart from this issue, I would advise against using this method to compute an SVD in anything serious. There are other major issues:

worse accuracy (trouble now starts when \sigma_1 / \sigma_n \approx sqrt(eps) rather than \sigma_1 / \sigma_n \approx eps)
even more non-uniqueness issues if there are repeated singular values (take A orthogonal, for instance).

In general you should regard computing A^TA a bit like calling inv(A): 99% of the time it is a mistake, and if you shuffle your computations around there are better ways to solve the same problem.

mleprovost · May 20, 2021, 4:00pm

Thank you for your feedback. Annoying this loss of phase. I totally agree for the conditioning number. Let me give you a bit more context.

I would like to find the important directions in the input and output spaces for a nonlinear function f: \mathbb{R}^m \xrightarrow{} \mathbb{R}^n.

For a linear function f(x) = Ax, we can just look at the SVD decomposition of A.

To generalize this to the nonlinear case, we can find these directions by looking at the eigenvalue decomposition of the integrated inner and outer product of the Jacobian of the function f integrated over the distribution \mu:

C_x = \int (\nabla_x f(x))^\top (\nabla_x f(x)) d\mu(x) \in \mathbb{R}^{m,m} and C_y = \int (\nabla_x f(x)) (\nabla_x f(x))^\top d\mu(x) \in \mathbb{R}^{n,n}.

In practice, we can perform a Monte Carlo appproximation of C_x, C_y given M samples x^i \sim \mu:

C_x \approx \frac{1}{M}\sum_{i=1}^M (\nabla_x f(x^i))^\top (\nabla_x f(x^i)) and similarly for C_y.

Since C_x is positive definite, an eigendecomposition gives us:

C_x = V \Lambda_x V^T with V\in \mathbb{R}^{m,m}, where V an orthonormal basis for the input space. We can truncate this basis to identify the most important directions based on the decay of energy spectrum of \Lambda_x.

Similarly, we write C_y = U \Lambda_y U^T with U\in \mathbb{R}^{n,n}. U is an orthonormal basis for the output space.

I don’t know how to avoid the construction and the eigendecomposition of the C_x, C_y matrices to find the bases U, V.

If f:\mathbb{R}^m \xrightarrow{} \mathbb{R}, then we can perform an SVD decomposition of the matrices \sqrt{C_y} = \frac{1}{\sqrt{M}} \left[ \nabla f(x^1), \ldots, \nabla f(x^M) \right] \in \mathbb{R}^{n,M} and \sqrt{C_x} = \frac{1}{\sqrt{M}} \left[ \nabla f(x^1)^\top, \ldots, \nabla f(x^M)^\top \right]\in \mathbb{R}^{m, M}, and extract the left singular vectors to get U and V.

We have C_x = \sqrt{C_x} \sqrt{C_x}^\top and C_y = \sqrt{C_y} \sqrt{C_y}^\top.

However, for f:\mathbb{R}^m \xrightarrow{} \mathbb{R}^n, \nabla f \in \mathbb{R}^{n,m}, and we would have to assemble matrices of (n, mM) and (m, nM), which requires too much storage.

Is there a better way to do this computation without assembling these large matrices? I can only afford finite-difference to compute the Jacobian of the function f.

stevengj · May 20, 2021, 5:15pm

Since you only want to know the “important directions”, maybe you only need the singular vectors corresponding to largest few singular values?

In that case you would be able to use iterative SVD methods (e.g. SVDL · IterativeSolvers.jl or GitHub - Jutho/KrylovKit.jl: Krylov methods for linear problems, eigenvalues, singular values and matrix functions or GitHub - JuliaLinearAlgebra/RandomizedLinAlg.jl: Randomized algorithms for numerical linear algebra in Julia …). Then you only need to provide a function to multiply the Jacobian by an arbitrary vector, which requires a single finite-difference operation.

stevengj · May 20, 2021, 5:21pm

To answer your original question, you only need to compute the eigen-decomposition of C_x or C_y, not both — you can get the eigenvectors of one (for nonzero eigenvalues) from the eigenvectors of the other, and the sign/phase is then consistent for the SVD. In fact, this is one way to derive the existence of the SVD from the spectral theorem for Hermitian matrices.

For example, given a nonzero eigenvalue \lambda_k = \sigma_k^2 > 0 of C_x = A^* A and a corresponding (orthonormal) eigenvector v_k, you can easily show that u_k = A v_k / \sigma_k is an (orthonormal) eigenvector of C_y = A A^* with the same eigenvalue.

mleprovost · May 20, 2021, 7:08pm

Thank you @stevenj for your ideas!

I tried to implement it, for the linear case f(x) = Ax, so \nabla f(x) = A, but I get some issues with the svdl routine from ‘IterativeSolvers.jl’

using LinearAlgebra
using LinearMaps
using IterativeSolvers

Nx = 20
Ny = 10

A = randn(Ny, Nx)
Ne = 500
X = randn(Nx, Ne)


function mapsqrtCx!(Ny, Ne, vout, vin)
    fill!(vout, 0.0)
    
    @inbounds for j=1:Ne
        # chunk of dimension Ny
        vj = view(vin, (j-1)*Ny+1:j*Ny)
        # A will be the Jacobian of the function f evaluated at x^i
        mul!(vout, A', vj, 1.0, 1.0)
    end
    vout .*= 1/sqrt(Ne)
end

function mapsqrtCxtranspose!(Ny, Ne, vout, vin)
    fill!(vout, 0.0)

    @inbounds for j=1:Ne
        # chunk of dimension Ny
        vj = view(vout, (j-1)*Ny+1:j*Ny)
        # A will be the Jacobian of the function f evaluated at x^i
        mul!(vj, A, vin, 1.0, 1.0)
    end
    vout .*= 1/sqrt(Ne)
end

function mapsqrtCy!(Nx, Ne, vout, vin)
    fill!(vout, 0.0)
    
    @inbounds for j=1:Ne
        # chunk of dimension Nx
        vj = view(vin, (j-1)*Nx+1:j*Nx)
        # A will be the Jacobian of the function f evaluated at x^i
        mul!(vout, A, vj, 1.0, 1.0)
    end
    vout .*= 1/sqrt(Ne)
end

function mapsqrtCytranspose!(Nx, Ne, vout, vin)
    fill!(vout, 0.0)

    @inbounds for j=1:Ne
        # chunk of dimension Nx
        vj = view(vout, (j-1)*Nx+1:j*Nx)
        # A will be the Jacobian of the function f evaluated at x^i
        mul!(vj, A', vin, 1.0, 1.0)
    end
    vout .*= 1/sqrt(Ne)
end

sqrtCx = LinearMap((vout, vin) -> mapsqrtCx!(Ny, Ne, vout, vin), 
                   (vout, vin) -> mapsqrtCxtranspose!(Ny, Ne, vout, vin), 
                   Nx, Ny*Ne; ismutating = true)

sqrtCy = LinearMap((vout, vin) -> mapsqrtCy!(Nx, Ne, vout, vin), 
                   (vout, vin) -> mapsqrtCytranspose!(Nx, Ne, vout, vin), 
                   Ny, Nx*Ne; ismutating = true)

cache = ones(Ny, Ne)
@time 1/sqrt(Ne)*sum(A'*cache;dims = 2)

v = ones(Ny*Ne);
@time sqrtCx*v

@show norm(1/sqrt(Ne)*sum(A'*ones(Ny, Ne);dims = 2)-sqrtCx*v)

12.892 μs (10 allocations: 39.56 KiB)
40.500 μs (501 allocations: 23.59 KiB)
norm((1 / sqrt(Ne)) * sum(A * ones(Nx, Ne); dims = 2) - sqrtCx * v) = 7.149791431641379e-12

Any ideas to reduce the computational time for mapsqrtCx!?

Now, we can use the routine svdl from IterativeSolvers.jI:

maxrank = 5
@time SCx, errorCx  = svdl(sqrtCx; nsv = maxrank, vecs = :left);
@time SCy, errorCy  = svdl(sqrtCy; nsv = maxrank, vecs = :left);

svdA = svd(A)
# check singular values
@show norm(SCx.S[1:maxrank] - svdA.S[1:maxrank])
@show norm(SCy.S[1:maxrank] - svdA.S[1:maxrank])

stevengj · May 20, 2021, 9:09pm

matrix–matrix multiplications will almost always be much more efficient than performing the same operation as a loop of matrix–vector multiplications. A factor of 4 isn’t bad.

In any case, before you wild on micro-optimizations, in your real application won’t the dominant cost be in evaluating your nonlinear function f(x)?

mleprovost · May 20, 2021, 9:21pm

Yes, the dominant cost is clearly to evaluate/differentiate f. I think the problem is that we need multiple evaluation of sqrtCx v for this iterative solver, therefore we need to evaluate the Jacobian of the nonlinear function f for all the different samples for a single evaluation of the product sqrtCx v.

I forgot the transpose version of g to compute the left singular vectors of sqrtCx.

I am not very familiar with iterative SVD solvers, but I think this is working correctly as long as the spectral gap between two consecutive singular values is big enough. On this toy problem, I have found that the singular vectors are correct if I stop at the 4th value, but stopping at the 5th value gives a very poor prediction of the singular left vectors…

stevengj · May 21, 2021, 1:17am

You are not evaluating the whole Jacobian matrix, right? For a Jacobian–vector product you only need a directional derivative, e.g. by a single finite difference.

mleprovost · May 21, 2021, 4:47am

I am probably missing something important here.

The input vector has a dimension m \sim 100, and the output has dimension n \sim 20-40, and I have about M \sim 100-500 samples.
So far, I was computing the entire Jacobian matrix for the different ensemble members and storing it.
Do you mean that we can compute the gradient of f(x)^\top a with respect to x for \sqrt{C_x} to get \nabla f(x)^\top a, where a \in \mathbb{R}^n?

For f:\mathbb{R}^m \xrightarrow{} \mathbb{R}^n, let us define the matrices \sqrt{C_x} = \frac{1}{\sqrt{M}} \left[ \nabla f(x^1)^\top, \ldots, \nabla f(x^M)^\top \right]\in \mathbb{R}^{m, nM} and \sqrt{C_y} = \frac{1}{\sqrt{M}} \left[ \nabla f(x^1), \ldots, \nabla f(x^M) \right] \in \mathbb{R}^{n,mM}.

We have C_x = \sqrt{C_x} \sqrt{C_x}^\top and C_y = \sqrt{C_y} \sqrt{C_y}^\top.

I don’t know how to do to get \nabla f(x)b, (with b \in \mathbb{R}^m) as the gradient of something with respect to x?
Do you have some references on these techniques?

For these iterative solvers, do we know how many iterations they need to converge for the leading singular values?

In the linear case, based on the characterization of the SVD that you wrote above, we have:

\sqrt{C_x} \begin{bmatrix} u_i\\ \vdots \\ u_i \end{bmatrix} = \sqrt{M} \sqrt{\sigma_i} v_i.

I believe this doesn’t hold when f is a nonlinear function? Meaning that we need to perform two SVD decompositions : one for \sqrt{C_x} and one for \sqrt{C_y}.

ettersi · May 21, 2021, 7:46am

I’m not sure why you do (\nabla_x f(x))^\top (\nabla_x f(x)) and (\nabla_x f(x)) (\nabla_x f(x))^\top in these integrals. Wouldn’t an SVD of

\int \nabla f(x) \, d\mu(x)

give you similar information without the problem of squared condition numbers? (I’m no expert on this part, so this is a genuine question.)

If the gradient exists, then by definition we have

\nabla f(x) \, b = \lim_{\varepsilon \to 0} \frac{f(x + \varepsilon b) - f(x)}{\varepsilon}.

Thus, you can compute \nabla f(x) \, b with a single application of finite differences to \varepsilon \mapsto f(x + \varepsilon b).

mleprovost · May 21, 2021, 5:35pm

That’s a good question, a proper justification of these Gramian matrices is described in this paper:

You can also look for active subspaces
https://epubs.siam.org/doi/book/10.1137/1.9781611973860?mobileUi=0&

For example in 2D, if f(x_1, x_2) = [x_1; x_2^2], so \partial f_2/\partial x_2 = 2 x_2. If you take the expectation of the Jacobian over a zero-mean symmetric distribution along the x_2 axis (e.g. a standard Gaussian distribution), then the expectation of this component is clearly zero, and the most informative direction in the input space is missed.

Thank you for your explanation, how would you compute \nabla f(x)^\top a with a single application of finite differences? We can say that this is the gradient of f(x)^\top a with respect to x, but I don’t think that we can use the trick of the directional derivative as for \nabla f(x) b?

ettersi · May 22, 2021, 11:46am

Makes sense.

I believe the only way to evaluate gradients through finite differencing is to compute the derivatives with respect to each input dimension. If you are open to replacing finite differences with automatic differentiation, then what you would have to do is to replace forward mode AD with reverse mode AD.

Topic		Replies	Views
Does the singular value degeneracy cause the gradient to be NAN？ General Usage question , zygote , svd	11	1024	August 29, 2022
Automatic differentiation of f(x) constructed by decomposition of a matrix M(x) Optimization (Mathematical) differentiation , flux , machine-learning , numerics	7	2867	January 8, 2020
SVD on a matrix that does not fit into memory Numerics memory , svd	20	2248	January 16, 2023
Identify important subspaces of a function without squaring the conditioning number Numerics	2	362	October 20, 2020
SVD gives different results from Python with MWE General Usage linearalgebra , python	15	2056	March 9, 2021

Resolve sign ambiguity in SVD and Eigendecomposition

Related topics