Solving optimization problems involving orthogonal matrices

I am working on an optimization problem over the set of orthogonal matrices O_n = \{A \in \mathbb{R}^{n\times n} : A A^T = I_n\}. It is a minimization problem of the form

\min_{(A,\mathbf{x})\in O_n\times \mathbb{R}^k} f(A,\mathbf{x}),

where f is a nice function (at worst a fourth-degree polynomial in the entries). Currently, I am using JuMP + Ipopt and enforcing the condition A^TA=I_n with n(n+1)/2 equality constraints, but this tends to be slow for large n and leads to constraint violations. I am wondering whether there is a package that optimizes directly over O_n (it would be even better if norm(A'A - I) came out as zero, or on the order of machine epsilon).

Toy example:

This code attempts to find the closest pair of commuting symmetric matrices to given matrices Z1 and Z2. The goal is to find an orthogonal A and x = [x1; x2] such that Z1 ≈ A'*Diagonal(x1)*A and Z2 ≈ A'*Diagonal(x2)*A.

using JuMP, Ipopt, LinearAlgebra, Random
d=30
Random.seed!(1234)
Z1, Z2 = randn(d,d), randn(d,d)
model = Model(Ipopt.Optimizer)
@variable(model, A[1:d, 1:d])
@variable(model, x[1:2d])

set_start_value.(A, I(d))                    # start from the identity
set_start_value.(x[1:d], diag(Z1))           # initialize the diagonals at diag(Z1), diag(Z2)
set_start_value.(x[d+1:end], diag(Z2))


AAt = A * A' - I(d)               # build the quadratic expressions once
for i = 1:d, j = 1:i              # enforce A*A' = I(d) only on the lower triangle
    @constraint(model, AAt[i, j] == 0)
end

@objective(model, Min, sum((x[1:d] .* A - A*Z1) .^ 2) + sum((x[d+1:end] .* A - A*Z2) .^ 2))
optimize!(model)

value.(A)
value.(x)
objective_value(model)
norm(value.(A'*A-I(d)))

This is fast for d<10 and leads to “almost feasible” solutions (norm(A*A'-I) < 1e-14). Do you have any recommendations on how to: 1) speed it up, and 2) find feasible solutions?

See Optimization on Stiefel manifold with auto-differentiation — two options:

  1. Use something like Manopt.jl, which knows how to optimize over O_n; it looks like it also supports Cartesian products of manifolds, so you can include x \in \mathbb{R}^k as well.
  2. Use a (differentiable) change of variables via the polar decomposition to unconstrained matrices X, which has the advantage of letting you use any ordinary optimization algorithm:
\min_{(X,x)\in \mathbb{R}^{n \times n} \times \mathbb{R}^k} f(X(X^T X)^{-1/2}, x)
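To make option 2 concrete, here is a small sketch of the change of variables itself (my addition, not from the linked thread; nothing beyond LinearAlgebra is assumed): the map X \mapsto X(X^T X)^{-1/2} is exactly the orthogonal polar factor of X, which can also be obtained from the SVD.

using LinearAlgebra
n = 5
X = randn(n, n)                  # unconstrained variable (assumed full rank)
A = X / sqrt(Hermitian(X'X))     # A = X*(X'X)^(-1/2), the polar factor
U, _, V = svd(X)
norm(A - U * V')                 # same matrix as the SVD-based polar factor
norm(A' * A - I)                 # orthogonal to machine precision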

Note that several AD systems can handle matrix square roots et cetera. But it is a nice exercise in matrix calculus to work out how to propagate the matrix gradient \nabla f of f(A) to the gradient \nabla g of g(X) = f(X (X^T X)^{-1/2}) (I might want to assign it as a homework problem at some point). I just worked it out and checked it, so for posterity I’ll include it below. Suppose that you have a function f_and_∇f(A) that returns f, ∇f. Then the chain rule (implemented in reverse mode) gives:

function g_and_∇g(X)
    B = sqrt(Hermitian(X'X))        # B = (XᵀX)^{1/2}
    P = inv(cholesky(B))            # P = B⁻¹
    A = X * P                       # A = X(XᵀX)^{-1/2}, the polar factor of X
    f, ∇f = f_and_∇f(A)
    ∇fP = ∇f * P
    ∇g = ∇fP + 2X * lyap(B, hermitianpart!(A'∇fP))  # chain rule via a Lyapunov solve
    return f, ∇g
end
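As a quick sanity check of this formula (my addition, not part of the post above), one can compare ∇g to a finite-difference directional derivative for a toy objective, say f(A) = \frac{1}{2}\Vert A - Z\Vert^2 with \nabla f = A - Z; the names Z, h, and dX below are arbitrary.

using LinearAlgebra
n = 5
Z = randn(n, n)
f_and_∇f(A) = (0.5 * norm(A - Z)^2, A - Z)   # toy objective and its matrix gradient

X, dX = randn(n, n), randn(n, n)
f0, ∇g = g_and_∇g(X)
h = 1e-6
f1, _ = g_and_∇g(X + h * dX)
(f1 - f0) / h                                # finite-difference directional derivative
dot(∇g, dX)                                  # should agree to roughly 1e-5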

(The function lyapc from MatrixEquations.jl can be used in place of lyap, and it seems to take better advantage of the fact that B and the right-hand side here are Hermitian. Even better, since both the Lyapunov solve and the matrix square root use the diagonalization of Hermitian(X'X), you can re-use that diagonalization for the square root, the inversion, and the Lyapunov solve. The possibility of this kind of optimization is one reason why it can be worthwhile to work out gradients by hand.)
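For concreteness, here is one way that re-use could look (my sketch, with a hypothetical name g_and_∇g_eig; it assumes X has full rank and the same f_and_∇f interface as above). Diagonalize Hermitian(X'X) = Q \Lambda Q^T once; then B = Q \Lambda^{1/2} Q^T, B^{-1} = Q \Lambda^{-1/2} Q^T, and the Lyapunov equation B W + W B = C reduces to elementwise division by \sigma_i + \sigma_j in the eigenbasis, where \sigma = \sqrt{\lambda}.

using LinearAlgebra

function g_and_∇g_eig(X)
    λ, Q = eigen(Hermitian(X'X))               # X'X = Q * Diagonal(λ) * Q'
    σ = sqrt.(λ)                               # eigenvalues of B = (X'X)^(1/2)
    P = Q * Diagonal(1 ./ σ) * Q'              # P = B⁻¹, reusing the same eigenvectors
    A = X * P                                  # polar factor of X
    f, ∇f = f_and_∇f(A)
    ∇fP = ∇f * P
    C = (A'∇fP + ∇fP'A) / 2                    # Hermitian part of A'∇fP
    W = Q * ((Q' * C * Q) ./ (σ .+ σ')) * Q'   # solves B*W + W*B = C in the eigenbasis
    return f, ∇fP - 2X * W                     # same ∇g as above (lyap returns the negated solution)
end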


@stevengj Thank you very much for the detailed response. Manopt.jl turned out to be easier to use than I initially thought. I will include a naive rewrite of my previous code using Manopt below in case someone stumbles upon this thread in the future (I am pretty sure that this can be improved greatly, but it is a starting point).

using Manopt, Manifolds, Random, LinearAlgebra, ManifoldDiff
using FiniteDifferences, ADTypes, RecursiveArrayTools
using ManifoldsBase: ProductManifold

d=30
Random.seed!(1234)
Z1, Z2 = rand(d, d), rand(d, d)
Mprod = ProductManifold(OrthogonalMatrices(d), Euclidean(d), Euclidean(d))   # O(d) × ℝᵈ × ℝᵈ
function f(x)
    A  = x.x[1]     # orthogonal matrix
    d1 = x.x[2]     # diagonal entries for Z1
    d2 = x.x[3]     # diagonal entries for Z2
    return norm(d1 .* A - A*Z1)^2 + norm(d2 .* A - A*Z2)^2
end

f(M, x) = f(x)   # add the (manifold, point) signature that Manopt expects
x0 = ArrayPartition(Matrix{Float64}(I, d, d), diag(Z1), diag(Z2))
r_backend = ManifoldDiff.TangentDiffBackend(
    AutoFiniteDifferences(central_fdm(5, 1))
)
gradf_FD(M, p) = ManifoldDiff.gradient(M, f, p, r_backend)   # Riemannian gradient via finite differences


xopt = gradient_descent(Mprod, f, gradf_FD, x0;
    debug = [:Iteration, (:Change, "|Δp|: %1.9f |"),
        (:Cost, " F(x): %1.11f | "), "\n", :Stop],
    stopping_criterion = StopAfterIteration(60),
);

A=xopt.x[1]
x1=xopt.x[2]
x2=xopt.x[3]
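One small follow-up check I would add here (my addition): since the iterates are retracted back onto OrthogonalMatrices(d) at every step, the returned A should satisfy the orthogonality constraint to machine precision, which was the sticking point with the constrained JuMP formulation.

norm(A' * A - I)     # expect roughly 1e-15
f(Mprod, xopt)       # final cost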

I have not tried the second approach yet, but it is a nice trick to keep in mind for the future. However, should I be worried about the condition number of X'*X?

No. As long as you start out with a well-conditioned X, it should stay that way. (For example, if your starting X is orthogonal.)

The basic reason for this is similar to optimizing on a sphere by changing variables to x/\Vert x \Vert: because \nabla_x f(x/\Vert x \Vert) is tangent to the sphere, gradient-based optimization should never make x stray very far towards the origin. In the same way, the gradient of g(X) = f(X(X^T X)^{-1/2}) is tangent to the manifold of orthogonal matrices, so optimization steps will keep X pretty close to that manifold (i.e., well-conditioned).

(In particular, you can easily show that if X^T X = I, then X^T \nabla g is anti-symmetric. It follows that if you take a small gradient step \delta X = \epsilon \nabla g, then the change in X^T X is O(\epsilon^2), i.e. it stays orthogonal to first order.)
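A quick numerical confirmation of that claim (my addition), reusing g_and_∇g from above together with the same kind of toy f_and_∇f as in the earlier check:

using LinearAlgebra
n = 5
Z = randn(n, n)
f_and_∇f(A) = (0.5 * norm(A - Z)^2, A - Z)    # toy objective again

X = Matrix(qr(randn(n, n)).Q)                 # a random orthogonal starting point
_, ∇g = g_and_∇g(X)
S = X' * ∇g
norm(S + S')                                                            # ≈ 0: X'∇g is anti-symmetric
[norm((X + ε * ∇g)' * (X + ε * ∇g) - I) for ε in (1e-2, 1e-3, 1e-4)]    # shrinks like ε², not ε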
