Issue: PDMats.jl matrices losing positive-definiteness with Optim.jl autodiff

I’m optimizing a covariance recovery problem and want to use PDMats.jl to ensure positive-definiteness. However, when using Optim.jl’s autodiff, the matrices created with PDMats aren’t maintaining their PD characteristics.

Has anyone encountered this? What’s the recommended approach for maintaining PD constraints during optimization with these libraries?

1 Like

Hi, do you have an MWE?

Without an MWE it is a bit hard to provide more insight, but the main reason is that SPD matrices do not form a vector space. In other words: the difference of two SPD matrices is not necessarily SPD; in particular, you do not have a minus that keeps you within SPD.
With minus the example is very easy to see: if A is SPD, then A - A is only semidefinite, because it has only zero eigenvalues.

This is, for example, taken into account by Manopt.jl, though two remarks are necessary:

  1. it does not use PDMats.jl but the SymmetricPositiveDefinite manifold from Manifolds.jl
  2. I am the developer of both mentioned packages, so slightly biased

An advantage of this approach is that you can suddenly use unconstrained optimization, like L-BFGS, on that manifold. The replacement for + (and -) is the exponential (and logarithmic) map; these are a bit more expensive than +/-, so there is no free lunch.
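For a flavor of this, here is a minimal sketch (assuming Manifolds.jl and Manopt.jl; the cost, the squared Riemannian distance to a fixed SPD matrix C, is just an illustration, not your problem):

using Manifolds, Manopt, LinearAlgebra

M = SymmetricPositiveDefinite(3)
C = [2.0 1.0 0.0; 1.0 2.0 1.0; 0.0 1.0 2.0]  # some fixed SPD matrix

f(M, p) = distance(M, p, C)^2      # squared Riemannian distance to C
grad_f(M, p) = -2 * log(M, p, C)   # its Riemannian gradient at p

p0 = Matrix{Float64}(I, 3, 3)            # start at the identity, which is SPD
p_opt = quasi_Newton(M, f, grad_f, p0)   # Riemannian L-BFGS; every iterate stays SPD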

If you want to stick to something constrained, I would really need an MWE, how far you “leave” the constraint, and which algorithm you use.

PS: :wave: Welcome to Julia discourse!

3 Likes

You could also just use a change of variables to do nonlinear optimization over semidefinite matrices (which you have to include in “SPD” for optimization since the set of strictly positive-definite matrices is not closed). That is, instead of optimizing f(A) over “SPD” matrices A, you optimize f(X^T X) over unconstrained m \times m matrices X, or better yet f(Hermitian(X'X)) to ensure that it is exactly Hermitian despite roundoff errors.
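For example, here is a minimal sketch of this with Optim.jl (the objective g is a made-up placeholder; substitute your own):

using Optim, LinearAlgebra

g(A) = tr(A) + sum(abs2, A - I)   # placeholder objective defined on SPD matrices

m = 4
f(x) = (X = reshape(x, m, m); g(Hermitian(X' * X)))  # unconstrained in the entries of X
result = optimize(f, randn(m * m), LBFGS(); autodiff = :forward)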

This has no loss of generality, allows you to use any nonlinear optimization algorithm, and is easy. (The downside is that you might lose some special structure in f that would otherwise let you use something specialized like an SDP solver, but I’m guessing you don’t have a straightforward SDP or you would be using such a solver already.)

You can also, of course, use specialized methods for semidefinite constraints. And, of course, it depends a lot on what you want to do with the SPD matrix in your function. If you want to solve linear systems, for example, it would be more efficient to optimize over the vector space of lower- or upper-triangular matrices X, since then you have the Cholesky factorization for free.

Note, however, that X^T X may be slightly indefinite due to roundoff errors, so your f(X^T X) computation needs to be robust to such things. This is a general issue to be aware of when working with semidefinite matrices; how to ensure robustness depends on what precise calculation you’re performing, but a general idea is to treat slightly negative eigenvalues as zeros. Alternatively you can add some regularization, like replacing f(A) with f(A + \varepsilon \Vert A \Vert_F I / m) for a small \varepsilon proportional to the precision. This has come up many times on this forum before.
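For instance, a minimal sketch of both recipes (project_psd and regularize are hypothetical helper names, not from any package):

using LinearAlgebra

# clamp slightly negative eigenvalues to zero, i.e. project onto the PSD cone
function project_psd(A)
    F = eigen(Hermitian(A))
    return Hermitian(F.vectors * Diagonal(max.(F.values, 0)) * F.vectors')
end

# or shift by a small multiple of the identity, scaled by norm(A) = \Vert A \Vert_F
regularize(A, ε = 1e-12) = Hermitian(A + (ε * norm(A) / size(A, 1)) * I)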

(But this may not be an issue if your optimization problem does not favor matrices that are close to singular, i.e. if the SDP constraint is inactive at the optimum — in that case, as long as you start with a random X so that X^T X is far from singular, probably the gradient steps will keep things that way.)

5 Likes

You propose that in response to nearly all of my answers, so I suppose you do not really like the idea of using manifolds. But sure, there are several ways to write most problems.

I came here to say this :sweat_smile: the surest way to summon @stevengj on this forum is to say you want to optimize a function on the sphere

5 Likes

I think it’s great to have optimization algorithms specifically designed for particular manifolds, and I’m not trying to disparage you. It just seems incomplete to me not to mention that you can use a change of variables too, which gives a lot more flexibility in the choice of algorithms. Which way is more efficient will depend on the problem (and the algorithm chosen), so it’s worth trying out (or at least being aware of) both ways when they are available.

(The flexibility of a change-of-variables approach is also useful if you want to impose additional nonlinear constraints, or if you want some variables to live on one manifold and other variables to live on another manifold, or other variations that might not yet have specialized algorithms.)

2 Likes

Thanks for the clarification; then I misread your tone (writing is not always unambiguous, and English is not my mother tongue).

You are right, both are valid approaches for sure, both with their advantages and disadvantages.

For your comment in brackets – you can still do constrained optimization on manifolds as well, so additional constraints are not a problem there, either.
For multiple variables on different manifolds one can use a product manifold (see the sketch below). The algorithms in Manopt.jl work on arbitrary manifolds, but you are right: the semidefinite case, for example (since it is not a manifold), is probably easier to handle with reparametrisation.
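For example, a sketch with Manifolds.jl:

using Manifolds

# one variable on the SPD manifold, another on the unit sphere
M = ProductManifold(SymmetricPositiveDefinite(3), Sphere(2))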

There might be a big disadvantage to the reparameterization approach.
If the optimization argument is a matrix \boldsymbol{Z} which is parameterized as \boldsymbol{Z} = \boldsymbol{X}^{\top} \boldsymbol{X}, you may have ruined the convexity of the problem.
The set of SPSD matrices is a convex set and the projection onto it is known.
So if the objective is also convex, the whole problem can be convex; it won't be after this parameterization. Even in one dimension, f(z) = (z - 1)^2 is convex, but the reparameterized g(x) = (x^2 - 1)^2 is a nonconvex double well.
If the problem is convex, the loss of convexity from the parameterization might be significant.

To be honest, the “type of convexity” also changes for the manifold approach.
The reason is that on manifolds you do not “check” convexity along straight lines (they do not exist) but along geodesics.

I would phrase that point differently: The type of convexity might be a reason to choose one approach over the other as well. Maybe the problem is convex in the Cholesky factor \mathbf{X} in your notation, maybe it is convex on the (Hadamard) manifold – and if it is “Euclidean convex” maybe also a purely constrained approach (without manifolds or reparametrization) is the way to go.

2 Likes

You can, but it seems like the only such methods currently implemented in your package are penalty methods like augmented Lagrangians, which have the downside that you have to perform a sequence of unconstrained optimization problems. It’s good to have the option to use NLP methods that directly handle constraints, without requiring you to converge a sequence of auxiliary optima.

Of course, there’s no intrinsic reason why one couldn’t implement more NLP algorithms for manifolds in the future, but in the meantime it’s nice to have the option to exploit the wider universe of Euclidean NLP methods.

Well, there is also an interior-point Newton method; beyond that, I am not sure how many have been considered or derived, and I am not aware of others. Someone would have to derive them first; I might do that for a few at some point, though I do not come from a background with much (or any) funding, so that will take a long time.

And sure, the Euclidean methods are much more mainstream, so in the meantime they should probably be used, as you propose.

Hi, thank you all for the responses and the welcome to the forums. I’m not sure how to reply to multiple people yet. I definitely should have included an MWE, but I was just getting so frustrated with this, as it is very simple to do in MATLAB (albeit without automatic differentiation).

Here is the example: dpaste: MWE

It is specifically example 2.

Essentially, I thought that PDMats.jl would handle simple constraints, like applying jitter during the optimization or maintaining Hermitian-ness, but it doesn’t seem to handle either? I also would occasionally get errors regarding dual numbers in other attempts to get the optimization to work. I understand that I can optimize through the Cholesky factor, add jitter, etc. to keep the argument in the PSD cone, and just use BFGS on my objective function, but it seems that I cannot do that with PDMats.jl.

Ultimately, I’m doing covariance estimation in a hierarchical model, but I wanted to try doing this without having to derive the analytic gradients, because that can be very difficult, despite being faster.

I think you have entirely the wrong impression about what PDMats.jl does. As I understand it, that package has nothing to do with enforcing PSD constraints. It is just about expressing the information that a matrix is already PSD in the type domain, so that other algorithms can take advantage of it.

That is,

Sigma = PDMat(Sigma)

assumes that Sigma is already PSD.

Yes, that is all it does. I now realize that is what is meant by “interface” in their summary on GitHub.

I’ve removed it from my code entirely, and I now enforce jitter and symmetry during the optimization itself, but I still want a way to use automatic differentiation and to avoid having to handle these cases for each objective function that I have.

Ideally, I would just write the objective function and the optimizer would handle the rest.

In general, optimization algorithms aren’t going to be able to deduce implicit constraints, like wanting a matrix to be symmetric. Even if the objective function is written in such a way that the gradient is symmetric for a symmetric input, roundoff errors can cause it to break the symmetry. Nor can they generally detect or enforce the implicit domain of your function, even if your function diverges or throws an exception outside the domain (as is the case here for non-SPD matrices): because optimization algorithms take discrete steps, they might accidentally take a step that is too large and lands outside the domain. You really should enforce the constraints yourself (by choosing an optimization algorithm that can enforce some constraints for you, and/or by using a change of variables that implicitly enforces the constraints).

Have you tried the change-of-variable suggestions above? i.e. just write something like:

using LinearAlgebra

function log_normal_pdf(X, data)
    L = LowerTriangular(X) # ignore upper triangle
    # Sigma = Hermitian(L * L')    # need not be explicitly computed
    logdeterminant = sum(log ∘ abs2, diag(L))  # = logdet(Sigma)
    trace = tr(L' \ (L \ data)) # = tr(Sigma \ data) = tr(inv(Sigma) * data)
    return logdeterminant + trace
end

where X is an unconstrained d \times d matrix (or, if you want, you could constrain the upper triangle to be zero, since it is not used anyway, or even omit the upper triangle from your optimization variables to begin with, as sketched below).
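For example, a hypothetical unpack helper for that last variant (d(d+1)/2 optimization variables instead of d^2):

using LinearAlgebra

# unpack a vector of length d(d+1)/2 into a lower-triangular matrix
function unpack_lower(x, d)
    L = zeros(eltype(x), d, d)   # eltype(x) keeps this AD-friendly
    k = 0
    for j in 1:d, i in j:d       # fill the lower triangle column by column
        L[i, j] = x[k += 1]
    end
    return LowerTriangular(L)
end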

In addition to enforcing PSD matrices Sigma, this should be significantly cheaper to compute than your previous version, because it avoids the need to re-compute the Cholesky factor L for the logdet or the inversion/solve — you have it by construction.

(I assume that your real problem is more complicated, though, since this one has an analytical solution.)

PS. I would strongly consider using reverse-mode AD if your matrix gets much bigger than 10\times 10. The cost of forward-mode AD scales linearly with the number of parameters, whereas reverse-mode should cost roughly one additional objective-function evaluation. For that matter, it’s not too difficult to write down the analytical reverse-mode gradient of this function if you know a little matrix calculus. To use AD effectively, it’s important to have some idea of how it works.
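For instance, here is a sketch of that analytic gradient for the parameterization above (derived by differentiating through \Sigma = L L^T; treat it as an unverified illustration):

using LinearAlgebra

# with Sigma = L * L' and G = inv(Sigma) - inv(Sigma) * data * inv(Sigma),
# the chain rule through Sigma gives the gradient 2 * tril(G * L)
function log_normal_pdf_grad(X, data)
    L = LowerTriangular(X)
    Sinv = L' \ (L \ Matrix{eltype(X)}(I, size(X)...))  # inv(Sigma)
    G = Sinv - Sinv * data * Sinv
    return 2 * tril(G * L)  # the unused upper triangle of X gets zero gradient
end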

PPS. You can speed this function up by at least another factor of two, I think. As I understand it, your data matrix here is itself an SPD covariance matrix. If you Cholesky factorize it to U = cholesky(data).U before doing optimization, then you can use the identity \text{tr}(\Sigma^{-1} \text{data}) = \text{tr}(L^{-T} L^{-1} U^{T} U) = \text{tr}(L^{-1} U^{T} U L^{-T}) = \Vert L^{-1} U^{T} \Vert_F^2, which is simply sum(abs2, L \ U'), saving you a triangular solve.
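That is, something like this sketch, with U precomputed once outside the optimization loop:

using LinearAlgebra

# precompute U = cholesky(Hermitian(data)).U once, then:
function log_normal_pdf_chol(X, U)
    L = LowerTriangular(X)
    logdeterminant = sum(log ∘ abs2, diag(L))   # = logdet(Sigma)
    return logdeterminant + sum(abs2, L \ U')   # adds tr(Sigma \ data)
end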

PPPS. Even with the above transformation, if your optimization algorithm takes a step that makes a diagonal entry of X exactly zero, you will get a singularity. You can reduce the chances of this by replacing L with something like L += eps(norm(L)) * I. But honestly this problem seems unlikely to occur for unconstrained diagonal entries.

Hi, I am using the change of variables by optimizing through the Cholesky factor and applying jitter, essentially in the same way as your example, and this is what I was doing before in MATLAB. And I still lose positive definiteness and get the same PosDefException error (factorization failed).

I’ve tried adding jitter at each function call, and I tried computing the Cholesky both inside and outside of the objective function, with the same result. This does not happen with MATLAB for the same objective without AD.

The code I posted cannot throw that exception — it never needs to perform any factorization — so you cannot be doing the same thing as what I suggested.

I’ve tried adding jitter at each function call

I don’t know what that means, or why you expect it to help.

This does not happen with MATLAB for the same objective without AD.

Matlab just ignores a slight asymmetry of the matrix when computing a Cholesky factorization (its chol documentation explicitly says “If A is nonsymmetric, then chol treats the matrix as symmetric and uses only the diagonal and upper triangle of A.”), whereas Julia is more picky and throws an error.

If you want Julia to do the same thing as Matlab, just set Sigma_sym = Hermitian(Sigma), which uses only the diagonal and upper triangle of Sigma.