I have a large data set consisting of daily stock returns over 30 years and 7000 stocks. The time series do not align with each other across the set (i.e. dropmissing(data) returns no rows). If I compute the covariance matrix in pandas/Python, I get a result that reflects pairwise covariances for overlapping periods. How do I do the same thing with a DataFrame or TimeArray without computing at the pairwise level. Please leave aside the issue of whether the pandas result is actually sensible. I am just trying to understand how to deal with lots of missing data in Julia without degrading performance by hacking the calculation manually.

unfortunately, the only way I can think to do this is to calculate the pairwise covariances manually. `Missings.jl`

has the function `skipmissings`

which returns a pair of iterators that skip missing elements in *both* iterators.

```
julia> df = DataFrame(
a = [rand() < .2 ? missing : rand() for i in 1:100],
b = [rand() < .2 ? missing : rand() for i in 1:100],
c = [rand() < .2 ? missing : rand() for i in 1:100]
);
julia> function covmat(df)
nc = ncol(df)
t = zeros(nc, nc)
for (i, c1) in enumerate(eachcol(df))
for (j, c2) in enumerate(eachcol(df))
sx, sy = skipmissings(c1, c2)
t[i, j] = cov(collect(sx), collect(sy))
end
end
return t
end;
julia> covmat(df)
3×3 Matrix{Float64}:
0.0748675 0.00742271 -0.00240989
0.00742271 0.0805814 0.00814958
-0.00240989 0.00814958 0.0702716
```

It’s unfortunate there isn’t a better way to do this.

Thanks for the answer. I was afraid of that. In this case, the same task pandas handles in 5 minutes takes 45 minutes with Julia. It would be necessary to parallelize the code just to have a hope of making the same timing.

Wow thats a big performance hit, sorry to hear that.

A few performance suggestions:

- Use
`var(skipmissing(c1))`

when`i == j`

- Only fill in
`t`

to be an upper triangular matrix, i.e. change the iteration to be

```
nc = ncol(df)
for i in 1:nc
for j in i:nc
c1 = df[!, i]
c2 = df[!, j]
...
end
end
```

Hopefully someone else can chime in with better ideas. Really sorry you are taking such a performance hit right after switching to Julia.

Yeah, I already did those things and the performance hit included those. I can live with the disappointment as I am crazy about Julia, but I guess I can’t swear off pandas yet.

We would really like to make this faster. Can you do us a favor and confirm that the performance hit is *just* when you have to do the missing values thing? And that covariance in general is not slow.

Oh, yeah, that’s fine. I am packing my data in a TimeArray with 5719 rows and 7091 columns. Running cov(values(data)) on that answers in 4 seconds. The issue is that, to deal with missings, we must extract copies of the data from the underlying data structure and run cov(.) individually on the requisite elements. That is costly even for pandas, but my feeling is that the work is done on the underlying raw data using views rather than by copying, etc.

That’s good to know. I worked on a PR a while ago to make `covariance`

work lazily with `skipmissing`

-related items. It did not get merged but this would have been a good use-case.

Could you also benchmark a case of the above function where you pass a `view`

of all the indices that are not missing for both arrays? You can pass that directly to `cov`

. Maybe that will be faster.

That is indeed faster. Now Julia is 805 seconds to pythons 268 seconds.

Great! But this is definitely something we can work on more.

Finally, maybe this will help

```
julia> function mycov(x, y)
t = zip(x, y)
s1 = 0.0
s2 = 0.0
sc = 0.0
counter = 0
for ti in t
t1 = first(ti)
t2 = last(ti)
s1 += t1
s2 += t2
sc += t1 * t2
counter += 1
end
return (counter / (counter - 1)) * ((sc / counter) - (s1 * s2) / (counter^2))
end
```

You can run this after doing `sx, sy = skipmissings(x, y)`

EDIT: this is maybe faster

```
julia> x = [rand() < .2 ? missing : rand() for i in 1:10_000];
julia> y = [rand() < .2 ? missing : rand() for i in 1:10_000];
julia> sx, sy = skipmissings(x, y);
julia> function mycov2(x, y)
s1 = 0.0
s2 = 0.0
sc = 0.0
counter = 0
for i in eachindex(x)
t1 = x[i]
t2 = y[i]
s1 += t1
s2 += t2
sc += t1 * t2
counter += 1
end
return ((counter-1) / counter) * ((sc / counter) - (s1 * s2) / (counter^2))
end;
julia> @btime mycov2(sx, sy);
83.645 μs (1 allocation: 16 bytes)
julia> @btime mycov(sx, sy);
102.753 μs (1 allocation: 16 bytes)
```