Variance-Covariance matrix with missing data

Baba_Yara_Fahiz · June 24, 2022, 2:10pm

I am working with financial data where I have 600 matrices that are all 61 by 25000.
I am computing the variance-covariance matrix but running into performance issues because of missing data.
The way I am handling the missing data is to compute pairwise covariances across columns.
Is there any way I can speed this up?
I have access to a 128 epic system with 220 GB of memory.

using BenchmarkTools, Random, Missings, StatsBase

M = Matrix{Union{Float64, Missing}}(undef,600,25000)
M .= rand(600,25000) 

ix = rand(CartesianIndices(M), 20_000)  
M[ix] .= missing

function my_cov(x)
    nc = size(x,2);
    t = zeros(nc, nc)
    Threads.@threads for i in 1:nc
        for j in 1:nc
            if i <= j
               sx, sy = skipmissings(x[:, i], x[:, j])
                t[i, j] = cov(collect(sx), collect(sy))
                t[j, i] = t[i, j]
            end 
        end
    end
    return t
end


@benchmark my_cov(M)

Thanks a lot for the suggestions

rafael.guerra · June 24, 2022, 6:01pm

Just in case: would it be possible to use NaNs instead of missings?
The function nancov() in the NaNStatistics.jl package is very fast.

Paul_Soderlind · June 24, 2022, 6:15pm

Good idea.

In case you want something different, here is an idea:
(a) for each pair replace (x,y) with 0 if either x or y contains a missing/NaN
(b) keep track of the number of non-missing/NaN and adjust the covariance estimate accordingly.

pdeffebach · June 24, 2022, 6:19pm

I think you want pairwise which lives in StatsBase.jl. But I’m not 100% sure how it works. You should read the docs and see if it helps you.

Baba_Yara_Fahiz · June 24, 2022, 8:20pm

Thanks a lot for the suggestions.

The pairwise solution is this and runs just as fast as my solution above.

@benchmark pairwise(cov, eachcol(M), skipmissing =:pairwise, symmetric =true)

tbeason · June 25, 2022, 12:17am

I know this was not your question, but typically people try not to compute covariance matrices this large as they may not be very trustworthy. You might want to look into shrinkage or other high dimensional estimation techniques. Just an FYI

Topic		Replies	Views
Covariance from DataFrame or TimeArray New to Julia statistics , dataframes , finance	17	2033	October 24, 2021
Most efficient implementation of covariance matrix Performance statistics , multithreading , nerdsnipe , cov , covariance-matrix	12	911	May 28, 2024
Three million linear regressions Performance	4	464	May 23, 2022
Weighted covariance, how to compute? Statistics	9	2912	November 25, 2017
Improving Speed for a Novice Performance	7	2198	May 14, 2018

Variance-Covariance matrix with missing data

Related topics