Variance-Covariance matrix with missing data

I am working with financial data where I have 600 matrices that are all 61 by 25000.
I am computing the variance-covariance matrix but running into performance issues because of missing data.
The way I am handling the missing data is to compute pairwise covariances across columns.
Is there any way I can speed this up?
I have access to a 128 epic system with 220 GB of memory.

using BenchmarkTools, Random, Missings, StatsBase

M = Matrix{Union{Float64, Missing}}(undef,600,25000)
M .= rand(600,25000) 

ix = rand(CartesianIndices(M), 20_000)  
M[ix] .= missing

function my_cov(x)
    nc = size(x,2);
    t = zeros(nc, nc)
    Threads.@threads for i in 1:nc
        for j in 1:nc
            if i <= j
               sx, sy = skipmissings(x[:, i], x[:, j])
                t[i, j] = cov(collect(sx), collect(sy))
                t[j, i] = t[i, j]
            end 
        end
    end
    return t
end


@benchmark my_cov(M) 

Thanks a lot for the suggestions

Just in case: would it be possible to use NaNs instead of missings?
The function nancov() in the NaNStatistics.jl package is very fast.

1 Like

Good idea.

In case you want something different, here is an idea:
(a) for each pair replace (x,y) with 0 if either x or y contains a missing/NaN
(b) keep track of the number of non-missing/NaN and adjust the covariance estimate accordingly.

1 Like

I think you want pairwise which lives in StatsBase.jl. But Iā€™m not 100% sure how it works. You should read the docs and see if it helps you.

1 Like

Thanks a lot for the suggestions.

The pairwise solution is this and runs just as fast as my solution above.

@benchmark pairwise(cov, eachcol(M), skipmissing =:pairwise, symmetric =true) 

I know this was not your question, but typically people try not to compute covariance matrices this large as they may not be very trustworthy. You might want to look into shrinkage or other high dimensional estimation techniques. Just an FYI

1 Like