I am working with financial data where I have 600 matrices that are all 61 by 25000.
I am computing the variance-covariance matrix but running into performance issues because of missing data.
I handle the missing data by computing the covariance pairwise across columns, using only the rows where both columns are observed.
Is there any way I can speed this up?
I have access to a 128-core EPYC system with 220 GB of memory.
using BenchmarkTools, Random, Missings, StatsBase
M = Matrix{Union{Float64, Missing}}(undef,600,25000)
M .= rand(600,25000)
ix = rand(CartesianIndices(M), 20_000)
M[ix] .= missing
function my_cov(x)
    nc = size(x, 2)
    t = zeros(nc, nc)
    Threads.@threads for i in 1:nc
        for j in 1:nc
            if i <= j
                sx, sy = skipmissings(x[:, i], x[:, j])
                t[i, j] = cov(collect(sx), collect(sy))
                t[j, i] = t[i, j]
            end
        end
    end
    return t
end
@benchmark my_cov($M)
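One direction that might avoid the per-pair slicing and `collect` calls is to express the pairwise-complete covariance as a few dense matrix products, so the work lands in BLAS. The following is only a sketch under assumptions, not something I have tested at full scale: the name `pairwise_cov_blas` is hypothetical, every column is assumed to have at least one observed value, and every pair of columns is assumed to share at least two observed rows (otherwise the entry comes out as NaN or Inf).

```julia
using Statistics

# Hypothetical helper: pairwise-complete covariance via matrix products.
# Missing entries are masked with a 0/1 indicator matrix instead of being
# filtered pair by pair.
function pairwise_cov_blas(x::AbstractMatrix)
    # W[k, i] is 1.0 where x[k, i] is observed and 0.0 where it is missing.
    W = Float64.(.!ismissing.(x))
    # Shift each column by its observed mean to limit cancellation error.
    # Covariance is unchanged by a constant shift of either column.
    mu = [mean(skipmissing(c)) for c in eachcol(x)]
    # Shifted data, with missing entries replaced by 0.0 so they drop out of sums.
    X0 = coalesce.(x .- mu', 0.0)
    N   = W' * W      # N[i, j]: number of rows where columns i and j are both observed
    Sxy = X0' * X0    # Sxy[i, j]: sum of products over those common rows
    Sx  = X0' * W     # Sx[i, j]: sum of column i over rows where column j is also observed
    # Sum of column j over the common rows is Sx[j, i], hence the transpose.
    return (Sxy .- Sx .* Sx' ./ N) ./ (N .- 1)
end
```

Calling `pairwise_cov_blas(M)` should return the same kind of `nc` by `nc` matrix as `my_cov(M)`, up to floating-point rounding. For 25000 columns each intermediate matrix is about 5 GB, so several of them are alive at once; that should fit in 220 GB but is worth keeping in mind. The number of BLAS threads can be adjusted with `LinearAlgebra.BLAS.set_num_threads`.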
Thanks a lot for any suggestions.