In general, optimization algorithms aren’t going to be able to deduce implicit constraints like wanting a matrix to be symmetric. Even if the objective function is written in such a way that the gradient is symmetric for a symmetric input, roundoff errors can cause it to break the symmetry. Nor can they generally detect or enforce the implicit domain of your function, even if your function diverges or throws an exception outside the domain (as is the case here for non-SPD matrices): because optimization algorithms take discrete steps, they might accidentally take a step that is too large and lands outside the domain. You really should enforce the constraints yourself, by choosing an optimization algorithm that can enforce some constraints for you, and/or by using a change of variables that implicitly enforces the constraints.
Have you tried the change-of-variables suggestions above? That is, just write something like:
using LinearAlgebra

function log_normal_pdf(X, data)
    L = LowerTriangular(X) # ignore the upper triangle of X
    # Sigma = Hermitian(L * L') # need not be explicitly computed
    logdeterminant = sum(log ∘ abs2, diag(L)) # = logdet(Sigma), since det(Sigma) = det(L)^2
    trace = tr(L' \ (L \ data)) # = tr(Sigma \ data) = tr(inv(Sigma) * data)
    return logdeterminant + trace
end
where X is an unconstrained d \times d matrix (or, if you want, you could constrain the upper triangle to be zero, since it is not used anyway, or even omit the upper triangle from your optimization variables to begin with).
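If you do want to omit the upper triangle from the optimization variables, one way (a sketch; unpack_lower is a hypothetical helper, not anything built in) is to store the d(d+1)/2 lower-triangle entries in a flat vector and unpack them inside the objective:

# Build a d×d lower-triangular matrix from the d(d+1)/2 entries of a flat
# vector v, taken in column-major order of the lower triangle.
function unpack_lower(v, d)
    L = zeros(eltype(v), d, d)   # eltype(v) so this still works with ForwardDiff duals
    k = 1
    for j in 1:d, i in j:d
        L[i, j] = v[k]
        k += 1
    end
    return LowerTriangular(L)
end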
In addition to enforcing PSD matrices Sigma, this should be significantly cheaper to compute than your previous version, because it avoids the need to re-compute the Cholesky factor L for the logdet or the inversion/solve — you have it by construction.
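To be concrete, here is a rough sketch of how you might hook this up to an unconstrained optimizer, using the full d×d parametrization for simplicity (this assumes Optim.jl; the starting point and names like nll are just placeholders):

using LinearAlgebra, Optim

d = size(data, 1)
nll(x) = log_normal_pdf(reshape(x, d, d), data)  # optimizer sees a flat, unconstrained vector
x0 = vec(Matrix{Float64}(I, d, d))               # start from L = I, i.e. Sigma = I
result = optimize(nll, x0, LBFGS(); autodiff = :forward)
Lhat = LowerTriangular(reshape(Optim.minimizer(result), d, d))
Sigmahat = Hermitian(Lhat * Lhat')               # the recovered covariance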
(I assume that your real problem is more complicated, though, since this one has an analytical solution.)
PS. I would strongly consider using reverse-mode AD if your matrix gets much bigger than 10\times 10. The cost of forward-mode AD scales linearly with the number of parameters, whereas reverse-mode should cost roughly one additional objective-function evaluation. For that matter, it’s not too difficult to write down the analytical reverse-mode gradient of this function if you know a little matrix calculus. To use AD effectively, it’s important to have some idea of how it works.
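For concreteness, here is a sketch of what that analytical gradient works out to (my derivation, using d logdet(Sigma) = tr(Sigma^{-1} dSigma) and d(Sigma^{-1}) = -Sigma^{-1} dSigma Sigma^{-1} with Sigma = L L'; worth checking against ForwardDiff on a small example before trusting it):

using LinearAlgebra

# Gradient of log_normal_pdf with respect to X. Only the lower triangle of X is
# used, so the result is projected onto the lower triangle with tril.
# With W = inv(Sigma) * data, the gradient with respect to L is 2*(I - W)*inv(L)'.
function log_normal_pdf_grad(X, data)
    L = LowerTriangular(X)
    W = L' \ (L \ data)                # = inv(Sigma) * data
    return tril(2 * (I - W) * inv(L)')
end

As a sanity check, for a 1×1 matrix X = [x] this reduces to 2/x - 2*data/x^3, which matches differentiating log(x^2) + data/x^2 by hand.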
PPS. You can speed this function up by at least another factor of two, I think. As I understand it, your data matrix here is itself an SPD covariance matrix. If you Cholesky factorize it to U = cholesky(data).U before doing optimization, then you can use the identity \text{tr}(\Sigma^{-1} \text{data}) = \text{tr}(L^{-T} L^{-1} U^{T} U) = \text{tr}(L^{-1} U^{T} U L^{-T}) = \Vert L^{-1} U^{T} \Vert_F^2, which is simply sum(abs2, L \ U'), saving you a triangular solve.
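In code, that version might look something like this (log_normal_pdf_fast is just a made-up name):

using LinearAlgebra

U = cholesky(data).U   # factor data = U'U once, before the optimization loop

function log_normal_pdf_fast(X, U)
    L = LowerTriangular(X)
    logdeterminant = sum(log ∘ abs2, diag(L))  # = logdet(Sigma), as before
    trace = sum(abs2, L \ U')                  # = tr(inv(Sigma) * data), since data = U'U
    return logdeterminant + trace
end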
PPPS. Even with the above transformation, if your optimization algorithm takes a step that makes a diagonal entry of X exactly zero, you will get a singularity. You can reduce the chances of this by replacing L with something like L += eps(norm(L)) * I. But honestly this problem seems unlikely to occur for unconstrained diagonal entries.
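In the functions above, that tweak would go right after constructing L, e.g.:

L = LowerTriangular(X)
L += eps(norm(L)) * I   # nudge the diagonal away from exactly zero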