Covariance from DataFrame or TimeArray

rvaj · October 14, 2020, 4:24pm

I have a large data set consisting of daily stock returns over 30 years and 7000 stocks. The time series do not align with each other across the set (i.e. dropmissing(data) returns no rows). If I compute the covariance matrix in pandas/Python, I get a result that reflects pairwise covariances for overlapping periods. How do I do the same thing with a DataFrame or TimeArray without computing at the pairwise level? Please leave aside the issue of whether the pandas result is actually sensible. I am just trying to understand how to deal with lots of missing data in Julia without degrading performance by hacking the calculation manually.

pdeffebach · October 14, 2020, 4:43pm

Unfortunately, the only way I can think of to do this is to calculate the pairwise covariances manually. Missings.jl has the function skipmissings, which takes several arrays and returns one iterator per array, each skipping every index at which any of the arrays is missing.

julia> using DataFrames, Missings, Statistics

julia> df = DataFrame(
           a = [rand() < .2 ? missing : rand() for i in 1:100],
           b = [rand() < .2 ? missing : rand() for i in 1:100],
           c = [rand() < .2 ? missing : rand() for i in 1:100]
       );

julia> function covmat(df)
       nc = ncol(df)
       t = zeros(nc, nc)
       for (i, c1) in enumerate(eachcol(df))
           for (j, c2) in enumerate(eachcol(df))
               # sx and sy skip every row where either column is missing
               sx, sy = skipmissings(c1, c2)
               t[i, j] = cov(collect(sx), collect(sy))
           end
       end
       return t
       end;

julia> covmat(df)
3×3 Matrix{Float64}:
  0.0748675   0.00742271  -0.00240989
  0.00742271  0.0805814    0.00814958
 -0.00240989  0.00814958   0.0702716

It’s unfortunate there isn’t a better way to do this.

rvaj · October 14, 2020, 6:23pm

Thanks for the answer. I was afraid of that. In this case, the same task pandas handles in 5 minutes takes 45 minutes with Julia. I would have to parallelize the code just to have any hope of matching that timing.

pdeffebach · October 14, 2020, 6:37pm

Wow, that's a big performance hit, sorry to hear that.

A few performance suggestions:

Use var(skipmissing(c1)) when i == j.
Only fill in the upper triangle of t, since the matrix is symmetric, i.e. change the iteration to

nc = ncol(df)
for i in 1:nc
    for j in i:nc
        c1 = df[!, i]
        c2 = df[!, j]
        ...
    end
end
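
Putting both suggestions together might look roughly like the sketch below. It is only an illustration: the names cov_complete and covmat_upper are made up here, and it takes a plain vector of columns (for a DataFrame, something like collect(eachcol(df))) so that it depends on nothing beyond Statistics.

```julia
using Statistics

# Hypothetical helper: covariance over the rows where both columns are present.
function cov_complete(x, y)
    keep = .!(ismissing.(x) .| ismissing.(y))
    return cov(Float64.(x[keep]), Float64.(y[keep]))
end

# `cols` is assumed to be a vector of equally long columns.
function covmat_upper(cols)
    nc = length(cols)
    t = zeros(nc, nc)
    for i in 1:nc
        # Diagonal: a plain variance, no pairing with another column needed.
        t[i, i] = var(skipmissing(cols[i]))
        for j in (i + 1):nc
            t[i, j] = cov_complete(cols[i], cols[j])
            t[j, i] = t[i, j]   # mirror into the lower triangle
        end
    end
    return t
end
```

This roughly halves the number of pairwise calls, but each call still copies the overlapping rows.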

Hopefully someone else can chime in with better ideas. Really sorry you are taking such a performance hit right after switching to Julia.

rvaj · October 14, 2020, 6:48pm

Yeah, I already did those things and the performance hit included those. I can live with the disappointment as I am crazy about Julia, but I guess I can’t swear off pandas yet.

pdeffebach · October 14, 2020, 6:52pm

We would really like to make this faster. Can you do us a favor and confirm that the performance hit is just when you have to do the missing values thing? And that covariance in general is not slow.

rvaj · October 14, 2020, 7:16pm

Oh, yeah, that’s fine. I am packing my data in a TimeArray with 5719 rows and 7091 columns. Running cov(values(data)) on that answers in 4 seconds. The issue is that, to deal with missings, we must extract copies of the data from the underlying data structure and run cov(.) individually on the requisite elements. That is costly even for pandas, but my feeling is that the work is done on the underlying raw data using views rather than by copying, etc.
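
One way to avoid the per-pair copies altogether is to express every pairwise sum as a matrix product over zero-filled data and a 0/1 observation mask. This is not something proposed in this thread, so treat it as a sketch under assumptions: pairwise_cov_masked is a made-up name, the input is a plain matrix such as values(data), and the one-pass formula it uses can lose precision when the means are large relative to the spread.

```julia
using Statistics

# Hypothetical sketch: all pairwise-complete covariances from three matrix products.
function pairwise_cov_masked(X::AbstractMatrix)
    M = Float64.(.!ismissing.(X))   # 1.0 where observed, 0.0 where missing
    Z = coalesce.(X, 0.0)           # data with missings replaced by zero
    N = M' * M                      # N[i, j]: number of rows where i and j are both observed
    S = Z' * Z                      # S[i, j]: sum of x_i * x_j over those rows
    A = Z' * M                      # A[i, j]: sum of x_i over those rows (A[j, i] is the sum of x_j)
    # Bessel-corrected covariance; entries with N <= 1 come out as NaN or Inf.
    return (S .- A .* A' ./ N) ./ (N .- 1)
end
```

The trade-off is memory: it allocates several dense column-by-column matrices (about 400 MB each for 7091 columns) in exchange for doing the work in BLAS calls.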

pdeffebach · October 14, 2020, 8:29pm

That’s good to know. I worked on a PR a while ago to make covariance work lazily with skipmissing-related items. It did not get merged but this would have been a good use-case.

pdeffebach · October 15, 2020, 1:41pm

Could you also benchmark a case of the above function where you pass a view of all the indices that are not missing for both arrays? You can pass that directly to cov. Maybe that will be faster.
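
Something along these lines is what I have in mind. It is a minimal sketch, and cov_view is just an illustrative name:

```julia
using Statistics

# Hypothetical helper: collect the indices where both vectors are present,
# then pass views (no copies of the data) straight to cov.
function cov_view(x::AbstractVector, y::AbstractVector)
    idx = [i for i in eachindex(x, y) if !ismissing(x[i]) && !ismissing(y[i])]
    return cov(view(x, idx), view(y, idx))
end
```

The index vector is still allocated once per pair, so this reduces the copying rather than eliminating it.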

rvaj · October 15, 2020, 4:09pm

That is indeed faster. Now Julia takes 805 seconds to Python's 268 seconds.

pdeffebach · October 15, 2020, 4:15pm

Great! But this is definitely something we can work on more.

pdeffebach · October 15, 2020, 4:33pm

Finally, maybe this will help

julia> function mycov(x, y)
       t = zip(x, y)
       s1 = 0.0
       s2 = 0.0
       sc = 0.0
       counter = 0
       for ti in t
           t1 = first(ti)
           t2 = last(ti)
           s1 += t1
           s2 += t2
           sc += t1 * t2
           counter += 1
       end
       return (counter / (counter - 1)) * ((sc / counter) - (s1 * s2) / (counter^2))
       end

You can run this after doing sx, sy = skipmissings(x, y)

EDIT: this is maybe faster

julia> using BenchmarkTools, Missings

julia> x = [rand() < .2 ? missing : rand() for i in 1:10_000];

julia> y = [rand() < .2 ? missing : rand() for i in 1:10_000];

julia> sx, sy = skipmissings(x, y);

julia> function mycov2(x, y)
       s1 = 0.0
       s2 = 0.0
       sc = 0.0
       counter = 0
       for i in eachindex(x)
           t1 = x[i]
           t2 = y[i]
           s1 += t1
           s2 += t2
           sc += t1 * t2
           counter += 1
       end
       return ((counter-1) / counter) * ((sc / counter) - (s1 * s2) / (counter^2))
       end;

julia> @btime mycov2(sx, sy);
  83.645 μs (1 allocation: 16 bytes)

julia> @btime mycov(sx, sy);
  102.753 μs (1 allocation: 16 bytes)

pboes · October 24, 2021, 9:24am

I know this is not the point of the post, but in case anybody blindly copies and pastes this code: I believe there might be a typo in mycov2. The prefactor in the final line should be inverted, as is the case in mycov (unless I’m missing something?).

pboes · October 24, 2021, 9:25am

Hi, I ran into the same issue and am wondering whether there has been movement on this front. Cheers. Paul

rafael.guerra · October 24, 2021, 4:42pm

This could be a use case for the Impute.jl package, which runs extremely fast in the example below.

NB: no idea about the caveats of financial data, but provided a MWE, FWIW.

using Statistics, DataFrames
using Impute: Interpolate, NOCB, LOCF

M = Matrix{Union{Float64, Missing}}(undef,5719,7091)
M .= rand(5719,7091)
# cov(M)  # Computation takes 2.3 s with no missings

ix = rand(CartesianIndices(M), 20_000)  
M[ix] .= missing         # 20_000 missings in 5719x7091 Matrix M
df = DataFrame(M,:auto)

# NOCB (Next Observation Carried Backward)
# LOCF (Last Observation Carried Forward)
C = Interpolate() ∘ NOCB() ∘ LOCF() 
dg = C(df)         # DataFrame imputation (very fast)
cov(Matrix(dg))    # with no missings takes only 2.3 s

pdeffebach · October 24, 2021, 5:28pm

Wait what’s the change I need to make? Don’t want a bad example floating around.

pboes · October 24, 2021, 7:43pm

My understanding is that your functions mycov and mycov2 above are both meant to calculate the Bessel-corrected sample covariance. If so, then I believe in mycov2 you should change

return ((counter-1) / counter) * ((sc / counter) - (s1 * s2) / (counter^2))

to

return (counter / (counter - 1)) * ((sc / counter) - (s1 * s2) / (counter^2))

(which is already how you calculate the return value in mycov).
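
For anyone who wants to check this, here is the function with only the prefactor changed, compared against Statistics.cov on plain vectors without missing values. The name mycov2_corrected is just for illustration.

```julia
using Statistics

# mycov2 from above with the Bessel prefactor n / (n - 1) instead of (n - 1) / n.
function mycov2_corrected(x, y)
    s1 = 0.0
    s2 = 0.0
    sc = 0.0
    counter = 0
    for i in eachindex(x)   # same iteration as the original mycov2
        t1 = x[i]
        t2 = y[i]
        s1 += t1
        s2 += t2
        sc += t1 * t2
        counter += 1
    end
    return (counter / (counter - 1)) * ((sc / counter) - (s1 * s2) / (counter^2))
end

x = collect(1.0:10.0)
y = x .^ 2
println(mycov2_corrected(x, y) ≈ cov(x, y))
```

With the original (n - 1) / n prefactor the result is too small by a factor of ((n - 1) / n)^2, which is easy to overlook for large n.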

pdeffebach · October 24, 2021, 7:52pm

Okay. It appears to be too late to edit my response. Hopefully people will see the rest of this thread before blindly copying and pasting.
