I was computing some rather large covariance matrices and noticed that the implementation in Statistics is not multithreaded. TL;DR, I want the following to be as fast as possible (ideally, for even larger matrices):

julia> a = rand(20000, 1500);

julia> @time cov(a);
  8.743414 seconds (1.51 k allocations: 17.235 MiB)

I went down the rabbit hole of other packages that implement this, but it seems to me that none of them make use of threads. The only package I found that tries is VectorizedStatistics.jl, but in practice I found it to be slower than the implementation in Statistics.

I feel like it can't be that difficult to make this multithreaded, but I can't imagine no one has done it before, unless there are good reasons that I am missing.
I read somewhere here on Discourse that one typically does not want to actually use cov for large matrices, but rather other estimators (e.g. from CovarianceEstimation.jl) that implement shrinkage. However, as far as I can tell, none of them are multithreaded either.

Ultimately, I would like to use a Cholesky factorization to solve a linear system. Is this problem so ill-conditioned for large matrices that it is pointless to even try to make it more efficient? So far it has been sufficient to add a small diagonal shift to the covariance matrix.
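For concreteness, a minimal sketch of what I mean by the diagonal shift (the shift ε is hand-tuned, and the matrix sizes here are just placeholders):

```julia
using LinearAlgebra, Statistics

A = rand(2_000, 200)                 # placeholder data
S = cov(A)
F = rand(size(S, 1))

ε = 1e-8                             # small hand-tuned diagonal shift
C = cholesky(Symmetric(S) + ε * I)   # errors if S + εI is not positive definite
δα = C \ F                           # solve (S + εI) δα = F

# since ε is tiny relative to the entries of S, δα nearly solves S δα = F
@assert norm(S * δα - F) / norm(F) < 1e-3
```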

Perhaps you can comment on the application you have in mind. In some geospatial applications we avoid building the covariance of all locations at once, because that doesn’t fit into memory, or is too slow.

It is somewhat similar to stochastic gradient descent. One optimizes a set of variational parameters using the quantum geometric tensor, which is essentially a covariance matrix S_{kl} = \langle X_k X_l \rangle - \langle X_k \rangle \langle X_l \rangle, where X is a vector that is estimated using a Markov chain.

The parameters are optimized by solving a linear system S\,\delta\alpha = F.
To optimize many parameters, one should have many measurements of X_k so that the covariance matrix is well approximated. However, actually computing the covariance matrix is rather slow.

As you say, I can imagine that in practice this is done very differently. I am absolutely no expert on this, to be honest. Perhaps one uses a lazy representation of the covariance matrix instead?
I hope this clarifies things.

The cov implementation in Statistics is multithreaded, because the slowest part of the cov computation is a matrix multiplication, which uses BLAS … and BLAS is multithreaded (as well as SIMD-vectorized).

In particular, if you have an m \times n matrix A, then to compute the covariance matrix you first subtract the column mean to obtain X, which has \Theta(mn) cost; then compute X^* X, which has \Theta(mn^2) cost; then do some normalization, which has \Theta(n^2) cost. So, in cases like yours where m and n are large, the X^* X computation should completely dominate.
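To illustrate, here is a hand-rolled version with the same three-step structure (not the exact Statistics internals, just a sketch of the cost breakdown):

```julia
using Statistics, LinearAlgebra

# Covariance with the same structure as Statistics.cov:
# center the columns, one BLAS-3 call (X'X), then normalize by m - 1.
function mycov(A::AbstractMatrix)
    m = size(A, 1)
    X = A .- mean(A, dims = 1)   # Θ(mn): subtract the column means
    S = X' * X                   # Θ(mn²): the dominant BLAS-3 gemm
    S ./ (m - 1)                 # Θ(n²): normalization
end

A = rand(1_000, 50)
@assert mycov(A) ≈ cov(A)        # agrees with Statistics.cov up to rounding
```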

You could try different matrix-multiplication libraries, like MKL.jl or Octavian.jl or AppleAccelerate.jl, to see if they are faster than the default OpenBLAS library on your machine. You could also try Float32 precision to see if that is accurate enough for your application.
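For instance, trying single precision only requires converting the input (a sketch; MKL.jl can similarly be swapped in with `using MKL`, which replaces the active BLAS via libblastrampoline):

```julia
using Statistics

a = rand(20_000, 1_500)
@time cov(a)              # Float64 baseline, BLAS-backed
@time cov(Float32.(a))    # often roughly 2× faster, if the precision suffices
```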

That’s not surprising. LoopVectorization.jl is very good, but it’s best on single linear-time (BLAS1-like or maybe BLAS2-like) loops; higher-level transformations (like in Octavian.jl) are required to make matrix multiplication (BLAS3) fast enough to compete with optimized BLAS libraries for large matrices.

Strange, I see only one thread in use, despite setting the number of BLAS threads to 4 with BLAS.set_num_threads(4).
As a test, if I make a call to eigen for some large matrix, I do see higher thread usage.
Is there something misconfigured on my end?
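One diagnostic worth running (a guess, not a definite fix) is to check which BLAS backend is actually loaded and whether the thread setting took effect:

```julia
using LinearAlgebra

BLAS.set_num_threads(4)
@show BLAS.get_num_threads()   # should print 4 if the setting took effect
@show BLAS.get_config()        # lists the loaded BLAS backend (e.g. OpenBLAS)
```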

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 4 × Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, haswell)
Threads: 5 on 4 virtual cores
Environment:
JULIA_PKG_USE_CLI_GIT = true
JULIA_NUM_THREADS = auto

Perhaps it would make sense to solve S\,\delta\alpha=F using an iterative method. In this case you don't need to materialize the covariance matrix; you just need the result of the linear map that takes \delta\alpha to S\,\delta\alpha. Up to some fiddling, this map basically reduces to two matrix-vector multiplications. Have a look at LinearMaps.jl. (Please be warned that I haven't thought through the details.)
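A sketch of that idea, with my guess at the details (using LinearMaps.jl plus conjugate gradient from IterativeSolvers.jl; the map applies S through two matrix-vector products without ever forming S):

```julia
using LinearAlgebra, Statistics
using LinearMaps, IterativeSolvers

A = rand(20_000, 1_500)
X = A .- mean(A, dims = 1)              # centered data
m, n = size(X)

# Lazy covariance operator: S*v == X'*(X*v)/(m-1), Θ(mn) per application
# instead of the Θ(mn²) cost of materializing S up front.
S = LinearMap(v -> X' * (X * v) ./ (m - 1), n; issymmetric = true)

F = rand(n)
δα = cg(S, F)                           # CG assumes S is symmetric posdef
```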

I can only guess there is some issue with the BLAS call.

This is an interesting idea, thanks for the link. I believe this will not be faster here, because I expect that evaluating the matrix-vector product even once will likely require computing each matrix element (though it's probably a good idea to check this; perhaps there is a shortcut I am overlooking).

For Float32, the time is even reduced by another factor of 2 (yes, I think the precision loss should not matter here).
Thanks a bunch!
I suppose the only remaining mystery is why BLAS does not want to use threads for me, though MKL seems to be faster here anyway.

Hm, the page doesn't go into much detail on this, but according to the guide everything should be fine when the number of BLAS threads is set correctly. So could it be a bug with my BLAS artifact?

Yup, that's also what I get without MKL. Could you share what hardware you're running this on? Maybe that's our common denominator.