Most efficient implementation of covariance matrix

The cov implementation in Statistics is multithreaded, because the slowest part of the cov computation is a matrix multiplication, which uses BLAS … and BLAS is multithreaded (as well as SIMD-vectorized).
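
For example, you can check (and adjust) the BLAS thread count from within Julia (a quick sanity check, not a tuning recommendation):

```julia
using LinearAlgebra

BLAS.get_num_threads()   # how many threads the BLAS backend is currently using
BLAS.set_num_threads(8)  # e.g. match your number of physical cores
```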

In particular, if you have an m \times n matrix A, then to compute the covariance matrix you first subtract the column means to obtain X, which has \Theta(mn) cost; then compute X^* X, which has \Theta(mn^2) cost; then do some normalization, which has \Theta(n^2) cost. So, in cases like yours where m and n are large, the X^* X computation should completely dominate.
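
To make that cost breakdown concrete, here is a minimal sketch of the three steps (mycov is a hypothetical illustration, not the actual Statistics.cov source, which handles more cases):

```julia
using Statistics, LinearAlgebra

# Rows of A are observations, columns are variables.
function mycov(A::AbstractMatrix)
    m = size(A, 1)
    X = A .- mean(A, dims=1)  # subtract column means: Θ(mn)
    C = X' * X                # BLAS3 matrix multiply: Θ(mn²)
    return C ./ (m - 1)       # normalization: Θ(n²)
end

A = randn(1000, 200)
mycov(A) ≈ cov(A)  # true, up to roundoff
```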

You could try different matrix-multiplication libraries, like MKL.jl or Octavian.jl or AppleAccelerate.jl, to see if they are faster than the default OpenBLAS library on your machine. You could also try Float32 precision to see if that is accurate enough for your application.
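
A sketch of both ideas (the matrix sizes here are placeholders; MKL.jl replaces the BLAS backend for the whole session once loaded):

```julia
using MKL        # swaps OpenBLAS for MKL via libblastrampoline; load it first
using Statistics

A = randn(10_000, 2_000)  # placeholder sizes

# Float32 halves memory traffic and is typically much faster in BLAS,
# at the cost of ~7 significant digits instead of ~16:
cov(Float32.(A))
```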

That’s not surprising. LoopVectorization.jl is very good, but it’s best on simple loops that make a single pass over the data (BLAS1-like, or maybe BLAS2-like, operations); higher-level transformations (like those in Octavian.jl) are required to make matrix multiplication (BLAS3) fast enough to compete with optimized BLAS libraries for large matrices.
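
For instance, one way to see this on your own machine (a sketch; timings are machine-dependent, and Octavian.jl uses Julia's native threads, so start Julia with several threads, e.g. julia -t auto):

```julia
using Octavian, BenchmarkTools

A = randn(2_000, 2_000)
B = randn(2_000, 2_000)
C = similar(A)

@btime $A * $B;                       # default BLAS (OpenBLAS)
@btime Octavian.matmul!($C, $A, $B);  # pure-Julia blocked + multithreaded matmul

C ≈ A * B  # same result up to roundoff
```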
