I have a large data set of daily stock returns covering 30 years and 7000 stocks. The time series do not align with each other across the set (i.e. dropmissing(data) returns no rows). If I compute the covariance matrix in pandas/Python, I get a result that reflects pairwise covariances over the overlapping periods. How do I do the same thing with a DataFrame or TimeArray without computing at the pairwise level myself? Please leave aside the issue of whether the pandas result is actually sensible; I am just trying to understand how to deal with lots of missing data in Julia without degrading performance by hacking the calculation manually.
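A toy illustration of the misalignment I mean (hypothetical two-column example; the real data is much larger):
julia> using DataFrames
julia> data = DataFrame(a = [1.0, missing], b = [missing, 2.0]);
julia> nrow(dropmissing(data))
0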
Unfortunately, the only way I can think of to do this is to calculate the pairwise covariances manually. Missings.jl has the function skipmissings, which returns a pair of iterators that skip the positions where either input is missing.
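A quick illustration of skipmissings: positions where either vector is missing are dropped from both.
julia> using Missings
julia> x = [1.0, missing, 3.0];
julia> y = [missing, 2.0, 4.0];
julia> sx, sy = skipmissings(x, y);
julia> collect(sx), collect(sy)
([3.0], [4.0])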
julia> using DataFrames, Missings, Statistics
julia> df = DataFrame(
           a = [rand() < .2 ? missing : rand() for i in 1:100],
           b = [rand() < .2 ? missing : rand() for i in 1:100],
           c = [rand() < .2 ? missing : rand() for i in 1:100]
       );
julia> function covmat(df)
           nc = ncol(df)
           t = zeros(nc, nc)
           for (i, c1) in enumerate(eachcol(df))
               for (j, c2) in enumerate(eachcol(df))
                   # keep only the rows where both columns are non-missing
                   sx, sy = skipmissings(c1, c2)
                   t[i, j] = cov(collect(sx), collect(sy))
               end
           end
           return t
       end;
julia> covmat(df)
3×3 Matrix{Float64}:
  0.0748675   0.00742271  -0.00240989
  0.00742271  0.0805814    0.00814958
 -0.00240989  0.00814958   0.0702716
It’s unfortunate there isn’t a better way to do this.
Thanks for the answer. I was afraid of that. In this case, the same task pandas handles in 5 minutes takes 45 minutes in Julia. I would have to parallelize the code just to have a hope of matching that timing.
Wow, that's a big performance hit, sorry to hear that.
A few performance suggestions:
- Use var(skipmissing(c1)) when i == j
- Only fill in t as an upper triangular matrix, i.e. change the iteration to the following (both suggestions are combined in the sketch after the snippet):
nc = ncol(df)
for i in 1:nc
    for j in i:nc
        c1 = df[!, i]
        c2 = df[!, j]
        ...
    end
end
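Putting both suggestions together, a minimal sketch (covmat2 is just an illustrative name, mirroring covmat from above; the lower triangle is filled by symmetry):
julia> function covmat2(df)
           nc = ncol(df)
           t = zeros(nc, nc)
           for i in 1:nc
               c1 = df[!, i]
               t[i, i] = var(skipmissing(c1))   # diagonal entries need only one vector
               for j in (i + 1):nc
                   c2 = df[!, j]
                   sx, sy = skipmissings(c1, c2)
                   t[i, j] = t[j, i] = cov(collect(sx), collect(sy))
               end
           end
           return t
       end;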
Hopefully someone else can chime in with better ideas. Really sorry you are taking such a performance hit right after switching to Julia.
Yeah, I had already done those things; the timings above include them. I can live with the disappointment, as I am crazy about Julia, but I guess I can't swear off pandas yet.
We would really like to make this faster. Can you do us a favor and confirm that the performance hit only appears when you have to handle missing values, and that covariance in general is not slow?
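For example, something along these lines (made-up sizes, just to get a dense no-missings baseline):
julia> using BenchmarkTools, Statistics
julia> A = rand(5_000, 7_000);   # dense matrix, no missings
julia> @btime cov($A);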
Oh, yeah, that's fine. I am packing my data in a TimeArray with 5719 rows and 7091 columns. Running cov(values(data)) on that returns in 4 seconds. The issue is that, to deal with missings, we must extract copies of the data from the underlying data structure and run cov(.) individually on the requisite elements. That is costly even for pandas, but my feeling is that pandas does the work on the underlying raw data using views rather than by copying, etc.
That’s good to know. I worked on a PR a while ago to make covariance work lazily with skipmissing-related items. It did not get merged, but this would have been a good use case.
Could you also benchmark a case of the above function where you pass a view of all the indices that are not missing in either array? You can pass that directly to cov. Maybe that will be faster.
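A minimal sketch of that idea, reusing the covmat setup from above (covmat_views is just an illustrative name):
julia> function covmat_views(df)
           nc = ncol(df)
           t = zeros(nc, nc)
           for i in 1:nc, j in i:nc
               c1 = df[!, i]
               c2 = df[!, j]
               # indices where neither column is missing
               keep = findall(@. !ismissing(c1) & !ismissing(c2))
               # views avoid copying the data before calling cov
               t[i, j] = t[j, i] = cov(view(c1, keep), view(c2, keep))
           end
           return t
       end;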
That is indeed faster. Now Julia takes 805 seconds to Python's 268 seconds.
Great! But this is definitely something we can work on more.
Finally, maybe this will help:
julia> function mycov(x, y)
           t = zip(x, y)
           s1 = 0.0       # running sum of x
           s2 = 0.0       # running sum of y
           sc = 0.0       # running sum of x * y
           counter = 0
           for ti in t
               t1 = first(ti)
               t2 = last(ti)
               s1 += t1
               s2 += t2
               sc += t1 * t2
               counter += 1
           end
           # Bessel-corrected one-pass formula: n/(n-1) * (E[xy] - E[x]E[y])
           return (counter / (counter - 1)) * ((sc / counter) - (s1 * s2) / (counter^2))
       end
You can run this after doing sx, sy = skipmissings(x, y).
EDIT: this is maybe faster:
julia> using Missings, BenchmarkTools
julia> x = [rand() < .2 ? missing : rand() for i in 1:10_000];
julia> y = [rand() < .2 ? missing : rand() for i in 1:10_000];
julia> sx, sy = skipmissings(x, y);
julia> function mycov2(x, y)
           s1 = 0.0
           s2 = 0.0
           sc = 0.0
           counter = 0
           for i in eachindex(x)
               t1 = x[i]
               t2 = y[i]
               s1 += t1
               s2 += t2
               sc += t1 * t2
               counter += 1
           end
           return ((counter-1) / counter) * ((sc / counter) - (s1 * s2) / (counter^2))
       end;
julia> @btime mycov2(sx, sy);
83.645 μs (1 allocation: 16 bytes)
julia> @btime mycov(sx, sy);
102.753 μs (1 allocation: 16 bytes)
I know this is not the point of the post, but in case anybody blindly copy/pastes this code: I believe there is a typo in mycov2. The prefactor in the final line should be inverted, as is the case in mycov (unless I'm missing something?).
Hi, I ran into the same issue and am wondering whether there has been any movement on this front. Cheers, Paul
This could be a use case for the Impute.jl package, which runs extremely fast in the example below.
NB: I have no idea about the caveats of financial data, but here is an MWE, FWIW.
using Statistics, DataFrames
using Impute: Interpolate, NOCB, LOCF

M = Matrix{Union{Float64, Missing}}(undef, 5719, 7091)
M .= rand(5719, 7091)
# cov(M)  # computation takes 2.3 s with no missings

ix = rand(CartesianIndices(M), 20_000)
M[ix] .= missing    # 20_000 missings in the 5719x7091 matrix M

df = DataFrame(M, :auto)

# NOCB (Next Observation Carried Backward)
# LOCF (Last Observation Carried Forward)
C = Interpolate() ∘ NOCB() ∘ LOCF()
dg = C(df)          # DataFrame imputation (very fast)

cov(Matrix(dg))     # with no missings, takes only 2.3 s
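To verify the chain left no missings before calling cov, one can do a quick check:
julia> any(ismissing, Matrix(dg))
false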
Wait, what's the change I need to make? I don't want a bad example floating around.
My understanding is that your functions mycov and mycov2 above are both meant to calculate the Bessel-corrected sample covariance. If so, then I believe in mycov2 you should change
return ((counter-1) / counter) * ((sc / counter) - (s1 * s2) / (counter^2))
to
return (counter / (counter - 1)) * ((sc / counter) - (s1 * s2) / (counter^2))
(which is already how you calculate the return value in mycov).
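For anyone copy/pasting, here is mycov2 with that one-line fix applied (otherwise identical to the definition above):
julia> function mycov2(x, y)
           s1 = 0.0
           s2 = 0.0
           sc = 0.0
           counter = 0
           for i in eachindex(x)
               t1 = x[i]
               t2 = y[i]
               s1 += t1
               s2 += t2
               sc += t1 * t2
               counter += 1
           end
           # Bessel-corrected: n/(n-1) * (E[xy] - E[x]E[y])
           return (counter / (counter - 1)) * ((sc / counter) - (s1 * s2) / (counter^2))
       end;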
Okay. It appears it's too late to edit my response. Hopefully people will see the rest of this thread before blindly copying and pasting.