Numerical inaccuracies when constructing correlation matrices

I’m trying to sample correlation matrices using the vine method explained here:
How to generate a large full-rank random correlation matrix with some strong correlations present? - Cross Validated (which is based on the Lewandowski, Kurowicka & Joe (LKJ) 2009 paper, https://doi.org/10.1016/j.jmva.2009.04.008).

using Random
using LinearAlgebra
Random.seed!(123)

function random_correlation_matrix(d::Int)
    P = fill(0.0, d, d)
    S = Matrix{Float64}(I, d, d)

    for k in 1:(d - 1), i in (k + 1):d
        P[k, i] = (rand() - 0.5) * 2 # scale to [-1, 1]

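        # convert the partial correlation to a raw correlation with the recursive vine formula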
        p = P[k, i]
        for l in (k - 1):-1:1
            p = p * sqrt((1 - P[l, i] ^ 2) * (1 - P[l, k] ^ 2)) + P[l, i] * P[l, k]
        end
        
        S[k, i] = p
        S[i, k] = p
    end

    permutation = shuffle(1:d)
    S = S[permutation, permutation]

    return S
end

S1 = random_correlation_matrix(10);
S2 = random_correlation_matrix(1000);

This works fine for lower dimensions d, but fails for higher dimensions:

isposdef(S1) # true
isposdef(S2) # false

This seems to be due to accumulating floating-point rounding errors, as the smallest eigenvalues are all close to 0 but negative:

minimum(eigvals(S2)) # -8.46e-14

Is there a method to heal an almost positive semidefinite matrix like this (or tweaks to make the algorithm more robust)?

I’ve found a cor_nearPD function in the Bigsimr.jl package that does exactly what I want, but the problem seems to be simple enough that I’d like to avoid adding another dependency.

I also tried doing an eigendecomposition and simply replacing the negative eigenvalues in D, but that didn’t work out because I’m lacking the background in linear algebra to know what I’m doing.
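Roughly, this is what I attempted (just a sketch of the clipping idea; the function name and tolerance are placeholders). Clipping and reconstructing like this leaves a matrix whose diagonal is no longer exactly 1:

# Naive clipping attempt: replace negative eigenvalues and reconstruct.
function clip_eigenvalues(S::AbstractMatrix; tol = 1e-12)
    vals, vecs = eigen(Symmetric(S))
    vals = max.(vals, tol)                # clip tiny negative eigenvalues
    return vecs * Diagonal(vals) * vecs'  # diagonal is now only approximately 1
end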

Here’s the code I use. It’s not the greatest code and needs some updating for clarity.

using LinearAlgebra
function quickpd!(x::AbstractMatrix, posdtol = 1e-8)

    n = size(x,1)

    diagIdx = 1:n+1:n^2 # linear indices of the diagonal entries

    issymmetric(x) || throw(ArgumentError("quickpd! expects a symmetric matrix"))

    vals, vecs = eigen(x)

    eips = posdtol * maximum(abs.(vals))

    vals[vals .< eips] .= eips

    o_diag = diag(x)

    X = vecs * (vals .* transpose(vecs))

    D = sqrt.(max.(eips, o_diag) ./ diag(X))
    
    x .= D .* X .* repeat(transpose(D), inner=(n,1))

    x[diagIdx] .= 1.0

    return x
end

Is the vine method different from random sampling from an LKJ distribution in Distributions.jl? I know you said you don’t want to introduce a dependency, but I’m just curious:

using Distributions, Random, LinearAlgebra
Random.seed!(123)

S1 = rand(LKJ(10, 1))
S2 = rand(LKJ(1000, 1))

Cool, thanks! I tried it out and it actually works:

isposdef(quickpd!(copy(S2))) # false, but adding Symmetric:
isposdef(Symmetric(quickpd!(copy(S2)))) # true

Could you maybe explain what’s going on in these lines (or give a link) so I can brush up my linear algebra?

    D = sqrt.(max.(eips, o_diag) ./ diag(X))
    x .= D .* X .* repeat(transpose(D), inner=(n,1))

Yes, from my (limited) understanding, using the LKJ distribution is indeed identical to this function (I’m not sure about the eta = 1, though).
The initials of the LKJ distribution come from the authors of the paper linked above.

My use case is a bit more generic, though: I want the partial correlation coefficients to come from a mixture of Beta distributions, so that I can simulate correlation matrices where some of the correlation coefficients are large and others small (I left this part out of the minimal example because it wasn’t relevant).
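For what it’s worth, the sampling step I have in mind looks roughly like this (the mixture components and weights are just placeholder values for illustration):

using Distributions

# Placeholder mixture: mostly near-zero partial correlations (Beta(20, 20) is
# concentrated around 0.5, i.e. around 0 after rescaling), plus some strongly
# positive ones from Beta(8, 2).
mix = MixtureModel([Beta(20, 20), Beta(8, 2)], [0.7, 0.3])

# Rescale a draw from [0, 1] to a partial correlation in [-1, 1]; this would
# replace `P[k, i] = (rand() - 0.5) * 2` in random_correlation_matrix above.
sample_partial_correlation(m) = 2 * rand(m) - 1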


The basic issue is that trying to make a matrix exactly semidefinite is inherently nonrobust to roundoff errors. Different ways of computing the eigenvalues will give slightly different values due to differences in floating-point effects, so even if you make the eigenvalues ≥ 0 for one computation, you might get slightly negative eigenvalues from another computation/algorithm with the same matrix.
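For example (just an illustration of this point, not code from the original posts), computing the same spectrum from a symmetrically permuted copy of the matrix typically gives a slightly different smallest eigenvalue, which can land on either side of zero:

# Mathematically these two matrices have identical spectra, but the computed
# smallest eigenvalues generally differ slightly due to roundoff.
p = shuffle(1:size(S2, 1))
minimum(eigvals(Symmetric(S2)))
minimum(eigvals(Symmetric(S2[p, p])))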

The question is, what are you trying to do with the matrix S2 that requires it to be strictly semidefinite? It is generally better to try to fix your subsequent computation to be tolerant of tiny negative eigenvalues in such cases. For example, sqrt(Symmetric(A)) includes a backwards-stable correction to return a real matrix square root for nearly semidefinite matrices (see julia#35057).
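As a small sketch of that idea (assuming the goal is to draw vectors with covariance S2), the square-root factor can be used directly in place of a Cholesky factor:

L = sqrt(Symmetric(S2))     # real even when S2 has tiny negative eigenvalues
z = L * randn(size(S2, 1))  # z has covariance ≈ S2, since L * L' ≈ S2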

Alternatively, if you don’t care too much about accuracy, you can just do something like S2 + 1e-8*norm(S2)*I, of course (sacrificing half of the significant digits).
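That is, something along these lines (just restating the expression above as a usage example):

S2shifted = S2 + 1e-8 * norm(S2) * I  # shift the whole spectrum up slightly
isposdef(S2shifted)                   # should now hold, at the cost of some accuracy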

You should also consider the possibility that you have a bug in your code, or are mis-using the algorithm, or that the source you are using is incorrect. Your linked article claims that the algorithm generates “full-rank” correlation matrices, but your S2 matrix is extremely rank deficient: it is a 1000x1000 matrix with numerical rank (number of non-negligible eigenvalues) of only 64. Is this really what you want?


Thank you! It does not matter to me whether the matrix is strictly semidefinite (that’s impossible anyway); I just want it to behave nicely enough that isposdef(A) returns true, so I can do things like sampling from a multivariate normal without getting a PosDefException. The ultimate goal is to use it for a Gaussian copula.
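For context, the end goal looks roughly like this (just a sketch; gaussian_copula_sample is a placeholder name, and it still needs S to pass the positive-definiteness check):

using Distributions, LinearAlgebra

# Sketch: draw n samples from a Gaussian copula with correlation matrix S.
function gaussian_copula_sample(S::AbstractMatrix, n::Int)
    d = size(S, 1)
    mvn = MvNormal(zeros(d), Symmetric(S))  # throws PosDefException if S is not positive definite
    z = rand(mvn, n)                        # d × n correlated standard normals
    return cdf.(Normal(), z)                # uniform marginals with Gaussian dependence
end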

I believe the full-rank claim just comes from the Stack Exchange post. It is of course entirely possible that there is a bug in the code, that I’m misusing the algorithm, or that the person in that Stack Exchange post did something wrong; it might even be all of the above!

But for now I’m happy with getting correlation matrices where some elements are high and others low, and the LKJ paper seemed like a good fit because it allows me to set the partial correlations to whatever I want.