Issue: PDMats.jl matrices losing positive-definiteness with Optim.jl autodiff

I’m optimizing a covariance recovery problem and want to use PDMats.jl to ensure positive-definiteness. However, when using Optim.jl’s autodiff, the matrices created with PDMats aren’t maintaining their PD characteristics.

Has anyone encountered this? What’s the recommended approach for maintaining PD constraints during optimization with these libraries?

1 Like

Hi, do you have an MWE?

Without an MWE it is a bit hard to provide more insight, but the main reason is that SPD matrices do not form a vector space. In other words: the difference of two SPD matrices is not necessarily SPD; in particular, you do not have a minus that keeps you within SPD.
With minus the example is very easy to see: if A is SPD, then A - A is only semidefinite, because it has only zero eigenvalues.

This is, for example, taken into account by Manopt.jl, though two remarks are necessary:

  1. it does not use PDMats.jl but the SymmetricPositiveDefinite manifold from Manifolds.jl
  2. I am the developer of both mentioned packages, so slightly biased

An advantage of this approach is that you can suddenly use unconstrained optimization, like L-BFGS, on that manifold. The replacement for + (and -) is the exponential (and logarithmic) map; these are a bit more expensive than +/-, so there is no free lunch.
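For a flavor of this, here is a minimal sketch (assuming Manifolds.jl and Manopt.jl; the cost, the squared Riemannian distance to a fixed SPD matrix C, is just an illustration, not your problem):

using Manifolds, Manopt, LinearAlgebra

M = SymmetricPositiveDefinite(3)
C = [2.0 1.0 0.0; 1.0 2.0 1.0; 0.0 1.0 2.0]  # some fixed SPD matrix

f(M, p) = distance(M, p, C)^2      # squared Riemannian distance to C
grad_f(M, p) = -2 * log(M, p, C)   # its Riemannian gradient at p

p0 = Matrix{Float64}(I, 3, 3)            # start at the identity, which is SPD
p_opt = quasi_Newton(M, f, grad_f, p0)   # Riemannian L-BFGS; every iterate stays SPD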

If you want to stick to something constrained, I would really need an MWE, how far you “leave” the constraint, and which algorithm you use.

PS: :wave: Welcome to Julia discourse!

3 Likes

You could also just use a change of variables to do nonlinear optimization over semidefinite matrices (which you have to include in “SPD” for optimization since the set of strictly positive-definite matrices is not closed). That is, instead of optimizing f(A) over “SPD” matrices A, you optimize f(X^T X) over unconstrained m \times m matrices X, or better yet f(Hermitian(X'X)) to ensure that it is exactly Hermitian despite roundoff errors.
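For example, here is a minimal sketch of this with Optim.jl (the objective g is a made-up placeholder; substitute your own):

using Optim, LinearAlgebra

g(A) = tr(A) + sum(abs2, A - I)   # placeholder objective defined on SPD matrices

m = 4
f(x) = (X = reshape(x, m, m); g(Hermitian(X' * X)))  # unconstrained in the entries of X
result = optimize(f, randn(m * m), LBFGS(); autodiff = :forward)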

This has no loss of generality, allows you to use any nonlinear optimization algorithm, and is easy. (The downside is that you might lose some special structure in f that would otherwise let you use something specialized like an SDP solver, but I’m guessing you don’t have a straightforward SDP or you would be using such a solver already.)

You can also, of course, use specialized methods for semidefinite constraints. And, of course, it depends a lot on what you want to do with the SPD matrix in your function. If you want to solve linear systems, for example, it would be more efficient to optimize over the vector space of lower- or upper-triangular matrices X, since then you have the Cholesky factorization for free.

Note, however, that X^T X may be slightly indefinite due to roundoff errors, so your f(X^T X) computation needs to be robust to such things. This is a general issue to be aware of when working with semidefinite matrices; how to ensure robustness depends on what precise calculation you’re performing, but a general idea is to treat slightly negative eigenvalues as zeros. Alternatively you can add some regularization, like replacing f(A) with f(A + \varepsilon \Vert A \Vert_F I / m) for a small \varepsilon proportional to the precision. This has come up many times on this forum before.
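For instance, a minimal sketch of both recipes (project_psd and regularize are hypothetical helper names, not from any package):

using LinearAlgebra

# clamp slightly negative eigenvalues to zero, i.e. project onto the PSD cone
function project_psd(A)
    F = eigen(Hermitian(A))
    return Hermitian(F.vectors * Diagonal(max.(F.values, 0)) * F.vectors')
end

# or shift by a small multiple of the identity, scaled by norm(A) = \Vert A \Vert_F
regularize(A, ε = 1e-12) = Hermitian(A + (ε * norm(A) / size(A, 1)) * I)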

(But this may not be an issue if your optimization problem does not favor matrices that are close to singular, i.e. if the SDP constraint is inactive at the optimum — in that case, as long as you start with a random X so that X^T X is far from singular, probably the gradient steps will keep things that way.)

5 Likes

You propose that in response to nearly all of my answers, so I suppose you do not really like the idea of using manifolds. But sure, there are several ways to write most problems.

I came here to say this :sweat_smile: the surest way to summon @stevengj on this forum is to say you want to optimize a function on the sphere

5 Likes

I think it’s great to have optimization algorithms specifically designed for particular manifolds, and I’m not trying to disparage you. It just seems incomplete to me not to mention that you can use a change of variables too, which gives a lot more flexibility in the choice of algorithms. Which way is more efficient will depend on the problem (and the algorithm chosen), so it’s worth trying out (or at least being aware of) both ways when they are available.

(The flexibility of a change-of-variables approach is also useful if you want to impose additional nonlinear constraints, or if you want some variables to live on one manifold and other variables to live on another manifold, or other variations that might not yet have specialized algorithms.)

2 Likes

Thanks for the clarification; then I misread your tone (writing is not always unambiguous, and English is not my mother tongue).

You are right, both are valid approaches for sure, both with their advantages and disadvantages.

For your comment in brackets – you can still do constrained optimization on manifolds as well, so additional constraints are not a problem there, either.
For multiple variables on different manifolds one can use a product manifold (see the sketch below). The algorithms in Manopt.jl work on arbitrary manifolds, but you are right: the semidefinite case, for example (since it is not a manifold), is probably easier to handle with reparametrisation.
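For example, a sketch with Manifolds.jl:

using Manifolds

# one variable on the SPD manifold, another on the unit sphere
M = ProductManifold(SymmetricPositiveDefinite(3), Sphere(2))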

There might be a big disadvantage to the reparameterization approach.
If the optimization argument is a matrix \boldsymbol{Z} which is parameterized as \boldsymbol{Z} = \boldsymbol{X}^{\top} \boldsymbol{X}, you may have ruined the convexity of the problem.
The set of SPSD matrices is a convex set and the projection onto it is known.
So if the objective is also convex, the whole problem can be convex; it won't be after this parameterization. Even in one dimension, f(z) = (z - 1)^2 is convex, but the reparameterized g(x) = (x^2 - 1)^2 is a nonconvex double well.
If the problem is convex, the loss of convexity from the parameterization might be significant.

To be honest, the “type of convexity” also changes for the manifold approach.
The reason is that on manifolds you do not “check” convexity along straight lines (they do not exist) but along geodesics.

I would phrase that point differently: The type of convexity might be a reason to choose one approach over the other as well. Maybe the problem is convex in the Cholesky factor \mathbf{X} in your notation, maybe it is convex on the (Hadamard) manifold – and if it is “Euclidean convex” maybe also a purely constrained approach (without manifolds or reparametrization) is the way to go.

2 Likes

You can, but it seems like the only such methods currently implemented in your package are penalty methods like augmented Lagrangians, which have the downside that you have to perform a sequence of unconstrained optimization problems. It’s good to have the option to use NLP methods that directly handle constraints, without requiring you to converge a sequence of auxiliary optima.

Of course, there’s no intrinsic reason why one couldn’t implement more NLP algorithms for manifolds in the future, but in the meantime it’s nice to have the option to exploit the wider universe of Euclidean NLP methods.

Well, there is also an interior-point Newton method; beyond that, I am not sure how many have been considered or derived, and I am not aware of others. Someone would have to derive them first; I might do that for a few at some point, though I do not come from a background with much (or any) funding, so that will take a long time.

And sure, the Euclidean methods are much more mainstream, so in the meantime they should probably be used, as you propose.

Hi, thank you all for the responses and the welcome to the forums. I’m not sure how to reply to multiple people yet. I definitely should have included an MWE, but I was just getting so frustrated with this, as it is very simple to do in MATLAB (albeit without automatic differentiation).

Here is the example: dpaste: MWE

It is specifically example 2.

Essentially, I thought that PDMats.jl would handle simple constraints, like applying jitter during the optimization or maintaining Hermitian-ness, but it doesn’t seem to handle either? I also would occasionally get errors regarding dual numbers in other attempts to get the optimization to work. I understand that I can optimize through the Cholesky factor, add jitter, etc. to keep the argument in the PSD cone, and just use BFGS on my objective function, but it seems that I cannot do that with PDMats.jl.

Ultimately, I’m doing covariance estimation in a hierarchical model, but I wanted to try doing this without having to derive the analytic gradients, because that can be very difficult, despite being faster.

I think you have entirely the wrong impression about what PDMats.jl does. As I understand it, that package has nothing to do with enforcing PSD constraints. It is just about expressing the information that a matrix is already PSD in the type domain, so that other algorithms can take advantage of it.

That is,

Sigma = PDMat(Sigma)

assumes that Sigma is already PSD.

Yes, that is all it does. I now realize that is what is meant by “interface” in their summary on GitHub.

I’ve removed it from my code entirely, and I now enforce jitter and symmetry during the optimization itself, but I still want a way to use automatic differentiation and to avoid having to handle these cases for each objective function that I have.

Ideally, I would just write the objective function and the optimizer would handle the rest.

In general, optimization algorithms aren’t going to be able to deduce implicit constraints, like wanting a matrix to be symmetric. Even if the objective function is written in such a way that the gradient is symmetric for a symmetric input, roundoff errors can cause it to break the symmetry. Nor can they generally detect or enforce the implicit domain of your function, even if your function diverges or throws an exception outside the domain (as is the case here for non-SPD matrices): because optimization algorithms take discrete steps, they might accidentally take a step that is too large and lands outside the domain. You really should enforce the constraints yourself (by choosing an optimization algorithm that can enforce some constraints for you, and/or by using a change of variables that implicitly enforces the constraints).

Have you tried the change-of-variable suggestions above? i.e. just write something like:

using LinearAlgebra

function log_normal_pdf(X, data)
    L = LowerTriangular(X) # ignore upper triangle
    # Sigma = Hermitian(L * L')    # need not be explicitly computed
    logdeterminant = sum(log ∘ abs2, diag(L))  # = logdet(Sigma)
    trace = tr(L' \ (L \ data)) # = tr(Sigma \ data) = tr(inv(Sigma) * data)
    return logdeterminant + trace
end

where X is an unconstrained d \times d matrix (or, if you want, you could constrain the upper triangle to be zero, since it is not used anyway, or even omit the upper triangle from your optimization variables to begin with, as sketched below).
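For example, a hypothetical unpack helper for that last variant (d(d+1)/2 optimization variables instead of d^2):

using LinearAlgebra

# unpack a vector of length d(d+1)/2 into a lower-triangular matrix
function unpack_lower(x, d)
    L = zeros(eltype(x), d, d)   # eltype(x) keeps this AD-friendly
    k = 0
    for j in 1:d, i in j:d       # fill the lower triangle column by column
        L[i, j] = x[k += 1]
    end
    return LowerTriangular(L)
end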

In addition to enforcing PSD matrices Sigma, this should be significantly cheaper to compute than your previous version, because it avoids the need to re-compute the Cholesky factor L for the logdet or the inversion/solve — you have it by construction.

(I assume that your real problem is more complicated, though, since this one has an analytical solution.)

PS. I would strongly consider using reverse-mode AD if your matrix gets much bigger than 10\times 10. The cost of forward-mode AD scales linearly with the number of parameters, whereas reverse-mode should cost roughly one additional objective-function evaluation. For that matter, it’s not too difficult to write down the analytical reverse-mode gradient of this function if you know a little matrix calculus. To use AD effectively, it’s important to have some idea of how it works.
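For instance, here is a sketch of that analytic gradient for the parameterization above (derived by differentiating through \Sigma = L L^T; treat it as an unverified illustration):

using LinearAlgebra

# with Sigma = L * L' and G = inv(Sigma) - inv(Sigma) * data * inv(Sigma),
# the chain rule through Sigma gives the gradient 2 * tril(G * L)
function log_normal_pdf_grad(X, data)
    L = LowerTriangular(X)
    Sinv = L' \ (L \ Matrix{eltype(X)}(I, size(X)...))  # inv(Sigma)
    G = Sinv - Sinv * data * Sinv
    return 2 * tril(G * L)  # the unused upper triangle of X gets zero gradient
end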

PPS. You can speed this function up by at least another factor of two, I think. As I understand it, your data matrix here is itself an SPD covariance matrix. If you Cholesky factorize it to U = cholesky(data).U before doing optimization, then you can use the identity \text{tr}(\Sigma^{-1} \text{data}) = \text{tr}(L^{-T} L^{-1} U^{T} U) = \text{tr}(L^{-1} U^{T} U L^{-T}) = \Vert L^{-1} U^{T} \Vert_F^2, which is simply sum(abs2, L \ U'), saving you a triangular solve.
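That is, something like this sketch, with U precomputed once outside the optimization loop:

using LinearAlgebra

# precompute U = cholesky(Hermitian(data)).U once, then:
function log_normal_pdf_chol(X, U)
    L = LowerTriangular(X)
    logdeterminant = sum(log ∘ abs2, diag(L))   # = logdet(Sigma)
    return logdeterminant + sum(abs2, L \ U')   # adds tr(Sigma \ data)
end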

PPPS. Even with the above transformation, if your optimization algorithm takes a step that makes a diagonal entry of X exactly zero, you will get a singularity. You can reduce the chances of this by replacing L with something like L += eps(norm(L)) * I. But honestly this problem seems unlikely to occur for unconstrained diagonal entries.

Hi, I am using the change of variables by optimizing through the Cholesky factor and applying jitter, essentially in the same way as your example, and this is what I was doing before in MATLAB. And I still lose positive definiteness and get the same PosDefException error (factorization failed).

I’ve tried adding jitter at each function call, and I tried computing the Cholesky both inside and outside of the objective function, with the same result. This does not happen with MATLAB for the same objective without AD.

The code I posted cannot throw that exception — it never needs to perform any factorization — so you cannot be doing the same thing as what I suggested.

I’ve tried adding jitter at each function call

I don’t know what that means, or why you expect it to help.

This does not happen with MATLAB for the same objective without AD.

Matlab just ignores a slight asymmetry of the matrix when computing a Cholesky factorization (its chol documentation explicitly says “If A is nonsymmetric, then chol treats the matrix as symmetric and uses only the diagonal and upper triangle of A.”), whereas Julia is more picky and throws an error.

If you want Julia to do the same thing as Matlab, just set Sigma_sym = Hermitian(Sigma), which uses only the diagonal and upper triangle of Sigma.