Numerical inaccuracies when constructing correlation matrices

I’m trying to sample correlation matrices using the vine method explained here:
How to generate a large full-rank random correlation matrix with some strong correlations present? - Cross Validated (which is based on the Lewandowski, Kurowicka & Joe (LKJ) 2009 paper, https://doi.org/10.1016/j.jmva.2009.04.008).

using Random
using LinearAlgebra
Random.seed!(123)

function random_correlation_matrix(d::Int)
    P = fill(0.0, d, d)
    S = Matrix{Float64}(I, d, d)

    for k in 1:(d - 1), i in (k + 1):d
        P[k, i] = (rand() - 0.5) * 2 # scale to [-1, 1]

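        # convert the partial correlation to a raw correlation with the recursive vine formula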
        p = P[k, i]
        for l in (k - 1):-1:1
            p = p * sqrt((1 - P[l, i] ^ 2) * (1 - P[l, k] ^ 2)) + P[l, i] * P[l, k]
        end
        
        S[k, i] = p
        S[i, k] = p
    end

    permutation = shuffle(1:d)
    S = S[permutation, permutation]

    return S
end

S1 = random_correlation_matrix(10);
S2 = random_correlation_matrix(1000);

This works fine for lower dimensions d, but fails for higher dimensions:

isposdef(S1) # true
isposdef(S2) # false

This seems to be due to accumulating floating-point rounding errors, as the smallest eigenvalues are all close to 0 but negative:

minimum(eigvals(S2)) # -8.46e-14

Is there a method to heal an almost positive semidefinite matrix like this (or tweaks to make the algorithm more robust)?

I’ve found a cor_nearPD function in the Bigsimr.jl package that does exactly what I want, but the problem seems to be simple enough that I’d like to avoid adding another dependency.

I also tried doing an eigendecomposition and simply replacing the negative eigenvalues in D, but that didn’t work out because I’m lacking the background in linear algebra to know what I’m doing.
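Roughly, this is what I attempted (just a sketch of the clipping idea; the function name and tolerance are placeholders). Clipping and reconstructing like this leaves a matrix whose diagonal is no longer exactly 1:

# Naive clipping attempt: replace negative eigenvalues and reconstruct.
function clip_eigenvalues(S::AbstractMatrix; tol = 1e-12)
    vals, vecs = eigen(Symmetric(S))
    vals = max.(vals, tol)                # clip tiny negative eigenvalues
    return vecs * Diagonal(vals) * vecs'  # diagonal is now only approximately 1
end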

Here’s the code I use. It’s not the greatest code and needs some updating for clarity.

using LinearAlgebra
function quickpd!(x::AbstractMatrix, posdtol = 1e-8)

    n = size(x,1)

    diagIdx = 1:n+1:n^2 # linear indices of the diagonal entries

    issymmetric(x) || throw(ArgumentError("quickpd! expects a symmetric matrix"))

    vals, vecs = eigen(x)

    eips = posdtol * maximum(abs.(vals))

    vals[vals .< eips] .= eips

    o_diag = diag(x)

    X = vecs * (vals .* transpose(vecs))

    D = sqrt.(max.(eips, o_diag) ./ diag(X))
    
    x .= D .* X .* repeat(transpose(D), inner=(n,1))

    x[diagIdx] .= 1.0

    return x
end

Is the vine method different from random sampling from an LKJ distribution in Distributions.jl? I know you said you don’t want to introduce a dependency, but I’m just curious:

using Distributions, Random, LinearAlgebra
Random.seed!(123)

S1 = rand(LKJ(10, 1))
S2 = rand(LKJ(1000, 1))

Cool, thanks! I tried it out and it actually works:

isposdef(quickpd!(copy(S2))) # false, but adding Symmetric:
isposdef(Symmetric(quickpd!(copy(S2)))) # true

Could you maybe explain what’s going on in these lines (or give a link) so I can brush up my linear algebra?

    D = sqrt.(max.(eips, o_diag) ./ diag(X))
    x .= D .* X .* repeat(transpose(D), inner=(n,1))

Yes, from my (limited) understanding, using the LKJ distribution is indeed identical to this function (I’m not sure about the eta = 1, though).
The initials of the LKJ distribution come from the authors of the paper linked above.

My use case is a bit more generic, though: I want the partial correlation coefficients to come from a mixture of Beta distributions, so that I can simulate correlation matrices where some of the correlation coefficients are large and others small (I left this part out of the minimal example because it wasn’t relevant).
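For what it’s worth, the sampling step I have in mind looks roughly like this (the mixture components and weights are just placeholder values for illustration):

using Distributions

# Placeholder mixture: mostly near-zero partial correlations (Beta(20, 20) is
# concentrated around 0.5, i.e. around 0 after rescaling), plus some strongly
# positive ones from Beta(8, 2).
mix = MixtureModel([Beta(20, 20), Beta(8, 2)], [0.7, 0.3])

# Rescale a draw from [0, 1] to a partial correlation in [-1, 1]; this would
# replace `P[k, i] = (rand() - 0.5) * 2` in random_correlation_matrix above.
sample_partial_correlation(m) = 2 * rand(m) - 1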


The basic issue is that trying to make a matrix exactly semidefinite is inherently nonrobust to roundoff errors. Different ways of computing the eigenvalues will give slightly different values due to differences in floating-point effects, so even if you make the eigenvalues ≥ 0 for one computation, you might get slightly negative eigenvalues from another computation/algorithm with the same matrix.
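For example (just an illustration of this point, not code from the original posts), computing the same spectrum from a symmetrically permuted copy of the matrix typically gives a slightly different smallest eigenvalue, which can land on either side of zero:

# Mathematically these two matrices have identical spectra, but the computed
# smallest eigenvalues generally differ slightly due to roundoff.
p = shuffle(1:size(S2, 1))
minimum(eigvals(Symmetric(S2)))
minimum(eigvals(Symmetric(S2[p, p])))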

The question is, what are you trying to do with the matrix S2 that requires it to be strictly semidefinite? It is generally better to try to fix your subsequent computation to be tolerant of tiny negative eigenvalues in such cases. For example, sqrt(Symmetric(A)) includes a backwards-stable correction to return a real matrix square root for nearly semidefinite matrices (see julia#35057).
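As a small sketch of that idea (assuming the goal is to draw vectors with covariance S2), the square-root factor can be used directly in place of a Cholesky factor:

L = sqrt(Symmetric(S2))     # real even when S2 has tiny negative eigenvalues
z = L * randn(size(S2, 1))  # z has covariance ≈ S2, since L * L' ≈ S2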

Alternatively, if you don’t care too much about accuracy, you can just do something like S2 + 1e-8*norm(S2)*I, of course (sacrificing half of the significant digits).
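That is, something along these lines (just restating the expression above as a usage example):

S2shifted = S2 + 1e-8 * norm(S2) * I  # shift the whole spectrum up slightly
isposdef(S2shifted)                   # should now hold, at the cost of some accuracy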

You should also consider the possibility that you have a bug in your code, or are mis-using the algorithm, or that the source you are using is incorrect. Your linked article claims that the algorithm generates “full-rank” correlation matrices, but your S2 matrix is extremely rank deficient: it is a 1000x1000 matrix with numerical rank (number of non-negligible eigenvalues) of only 64. Is this really what you want?


Thank you! It does not matter to me whether the matrix is strictly semidefinite (that’s impossible anyway); I just want it to behave nicely enough that isposdef(A) returns true, so I can do things like sampling from a multivariate normal without getting a PosDefException. The ultimate goal is to use it for a Gaussian copula.
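For context, the end goal looks roughly like this (just a sketch; gaussian_copula_sample is a placeholder name, and it still needs S to pass the positive-definiteness check):

using Distributions, LinearAlgebra

# Sketch: draw n samples from a Gaussian copula with correlation matrix S.
function gaussian_copula_sample(S::AbstractMatrix, n::Int)
    d = size(S, 1)
    mvn = MvNormal(zeros(d), Symmetric(S))  # throws PosDefException if S is not positive definite
    z = rand(mvn, n)                        # d × n correlated standard normals
    return cdf.(Normal(), z)                # uniform marginals with Gaussian dependence
end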

I believe the full-rank claim just comes from the Stack Exchange post. It is of course entirely possible that there is a bug in the code, that I’m misusing the algorithm, or that the person in that Stack Exchange post did something wrong; it might even be all of the above!

But for now I’m happy with getting correlation matrices where some elements are high and others low, and the LKJ paper seemed like a good fit because it allows me to set the partial correlations to whatever I want.