Most efficient implementation of covariance matrix

Hi everyone,

I was computing some rather large covariance matrices and noticed that that implementation in Statistics is not multithreaded. TLDR, I want the following to be as fast as possible (ideally, for even larger matrices):

julia> a = rand(20000,1500);
@time vcov(a);
  8.743414 seconds (1.51 k allocations: 17.235 MiB)

I went down the rabbit hole for some other packages that implement this, but it seems to me that neither of them are making use of threads. The only package I found that tried this was VectorizedStatistics.jl, but I found this to be slower in practice than the implementation from Base.

I feel like it can’t be so difficult to make this multithreaded, but I cannot imagine no one has done this before, unless there are good reasons that I am missing.
I read somewhere here on discourse that one typically does not want to actually use cov for large matrices, but rather use other estimators (i.e. from CovarianceEstimators.jl) that implement shrinking. However as far as I can tell they all seem to not be multithreaded either.

Ultimately, I would like to use a Cholesky factorization to solve a linear system. Is this problem so ill-conditioned for large matrices that it is pointless to even try to make it more efficient? So far it was good sufficient to add some small diagonal shift to the covariance matrix.

I hope someone can give me more ideas :slight_smile:

Perhaps you can comment on the application you have in mind. In some geospatial applications we avoid building the covariance of all locations at once, because that doesn’t fit into memory, or is too slow.

1 Like

Sure, I am experimenting with implementing a stochastic reconfiguration method:
(see also Quantum Geometric Tensor and Stochastic Reconfiguration — NetKet)

It is somewhat similar to stochastic gradient descend. One tries to optimize a bunch of variational parameters using the quantum geometric tensor, which is essentially a covariance matrix S = <X_k X_l> - <X_k> <X_l> where X is a vector that is estimated using a Markov Chain.

The parameters are optimized by solving a linear system S δα = F
To optimize many parameters, one should have many measurements of X_k so that the covariance matrix is well approximated. However, actually computing the covariance matrix is rather slow.

As you say, I can imagine that practice, this is done very differently. I am absolutely no expert on this to be honest. Perhaps one uses a lazy representation of the covariance matrix instead?
I hope this clarifies things :slight_smile:

The cov implementation in Statistics is multithreaded, because the slowest part of the cov computation is a matrix multiplication, which uses BLAS … and BLAS is multithreaded (as well as SIMD-vectorized).

In particular, if you have an m \times n matrix A, then to compute the covarance matrix you first subtract the column mean to obtain X, which has \Theta(mn) cost; then compute X^* X, which has \Theta(mn^2) cost; then do some normalization, which has \Theta(n^2) cost. So, in cases like yours where m and n are large, the X^* X computation should completely dominate.

You could try different matrix-multiplication libraries, like MKL.jl or Octavian.jl or AppleAccelerate.jl, to see if they are faster than the default OpenBLAS library on your machine. You could also try Float32 precision to see if that is accurate enough for your application.

That’s not surprising. LoopVectorization.jl is very good, but it’s best on single linear-time (BLAS1-like or maybe BLAS2-like) loops; higher-level transformations (like in Octavian.jl) are required to make matrix multiplication (BLAS3) fast enough to compete with optimized BLAS libraries for large matrices.


strange, I only see one thread usage, despite setting the number of BLAS threads to 4 with BLAS.set_num_threads(4).
As a test, if I make a call to eigen for some large matrix, I do see higher thread usage.
Is there something misconfigured on my end?

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, haswell)
  Threads: 5 on 4 virtual cores

Perhaps it would make sense to solve S\,\delta\alpha=F using an interative method. In this case you don’t need to materialize the covariance matrix, you just need the result of the linear transformation that maps \delta\alpha to S\,\delta\alpha. Up to some fiddling, this map basically reduces to two matrix-vector multiplications. Have a look at LinearMaps.jl. (Please be warned that I haven’t thought through the details).


How are you determining thread usage?

If I simply look at the performance, the impact of multiple threads is clearly apparent:

julia> using Statistics, BenchmarkTools, LinearAlgebra

julia> a = rand(20000,1500);

julia> BLAS.set_num_threads(1);

julia> @btime cov($a) evals=1;
  968.765 ms (11 allocations: 246.06 MiB)

julia> BLAS.set_num_threads(2);

julia> @btime cov($a) evals=1;
  524.270 ms (11 allocations: 246.06 MiB)
1 Like

Not for me, unfortunately:

julia> a = rand(4800,10000);
julia> BLAS.set_num_threads(4)

julia> @time cov(a);
 11.101160 seconds (12 allocations: 1.103 GiB)

julia> BLAS.set_num_threads(1)

julia> @time cov(a);
 10.656343 seconds (12 allocations: 1.103 GiB, 0.59% gc time)

I can only guess there is some issue with the BLAS call.

This is an interesting idea, thanks for the link. I believe this will not be faster here, because I expect that evaluating the matrix vector product once will likely have to actually compute each matrix element (though its probably a good idea to check this, perhaps there is a shortcut I am overlooking).

This seems to be the answer! Upon switching to MKL.jl, I see the expected result:

julia> @time cov(a);
  9.637042 seconds (12 allocations: 1.103 GiB, 0.93% gc time)

julia> BLAS.set_num_threads(4)

julia> @time cov(a);
  3.076901 seconds (12 allocations: 1.103 GiB, 1.68% gc time)

For Float32 this time is even reduced by another factor of 2 (yes, I think the precision loss should not matter here).
Thanks a bunch!
I suppose the only remaining mystery is why BLAS does not want to use threads for me, though MKL seems to be faster here anyway.


Could this be linked to the weird interactions between Julia threads and BLAS threads?

I get performance similar to this whether I use julia -t 1 or julia -t 4. Is that what you mean?

In particular, I get the following times with a single julia thread:

11.757666 seconds (12 allocations: 1.103 GiB, 0.04% gc time)
11.723894 seconds (12 allocations: 1.103 GiB, 0.12% gc time)

and the following times with 4 julia threads:

11.804927 seconds (12 allocations: 1.103 GiB, 0.05% gc time)
11.698108 seconds (12 allocations: 1.103 GiB, 0.13% gc time)

Hm, the page doesnt go too much into detail regarding this, but according to the guide everything should be fine when the number of BLAS threads is set correctly. So possibly being a bug with my BLAS artifact?

yup, that’s also what I get without MKL. Could you share what hardware youre running this on? Maybe thats our common denominator

I don’t think this is hardware-related