[ANN] CovarianceEstimation.jl -- fast and lightweight covariance estimation

package
announcement
statistics

#1

Hello,

CovarianceEstimation.jl is a new package for estimating covariance matrices from samples. It currently focuses on linear shrinkage methods (with one nonlinear shrinkage algorithm as well) and is written in pure Julia.

There is already a similar package, CovarianceMatrices.jl, but it computes the covariance of the coefficients of regression models. CovarianceEstimation.jl also aims to be more lightweight.

I would like to thank Thibaut Lienart for his valuable contributions. Thanks to his work, the package turned out to be significantly faster than scikit-learn and corpcor.

Kind regards,
Mateusz Baran


#2

I’m not a statistician… why would one estimate covariance instead of running the standard cov function?


#3

I’m not a statistician either, but the issue is that the standard estimators implemented by cov are quite poor, especially when you have more features than observations. There are other estimators that, by assuming certain properties of the analyzed random variables, can return an answer that is closer to the true covariance. For example, you can use such estimators to improve the predictive power of LDA.


#4

The standard “canonical” covariance estimator is known to be badly ill-conditioned in a large number of use cases: for instance, when the number of samples is close to or below the number of features, and more generally when the number of features is large. So if, for instance, you want to recover the precision matrix (the inverse of the covariance), which is useful for estimating dependence structures, inverting the estimate will blow up.
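To make the conditioning issue concrete, here is a minimal sketch using only Julia's standard library (not the package itself). The sizes `n = 10` and `p = 50` and the identity ground-truth covariance are arbitrary choices for illustration. Because centering removes one degree of freedom, the sample covariance can have rank at most `n - 1`, so with `n < p` it is singular and cannot be reliably inverted.

```julia
using LinearAlgebra, Random, Statistics

Random.seed!(1)
n, p = 10, 50            # illustrative sizes: fewer observations than features
X = randn(n, p)          # rows are observations; the true covariance is the identity

S = cov(X)               # standard sample covariance, a p×p matrix

# With n < p the rank of S is at most n - 1, so S is singular.
println("size of S:           ", size(S))
println("rank of S:           ", rank(S))
println("smallest eigenvalue: ", minimum(eigvals(Symmetric(S))))
println("condition number:    ", cond(S))
```

The smallest eigenvalue should be zero up to round-off and the condition number correspondingly enormous, which is why a precision matrix obtained from this estimate is not usable.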

There’s a simple benchmark in the docs showing a plot of the MSE between the generating covariance matrix (“ground truth”) and the recovered one, where you can see that the estimators we implemented get significantly lower MSE than the canonical estimator.

Finally, it’s worth noting that linear shrinkage estimators (and even, in some cases, the nonlinear one we implemented) have the same computational complexity as the base estimator, so they are pretty much just as quick to compute :slight_smile: (this is worth pointing out because the literature on covariance estimators is very messy and a number of methods are completely impractical from a computational perspective…)
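As a rough illustration of what linear shrinkage does, the sketch below blends the sample covariance with a scaled identity target. It uses only the standard library and is not the package's API: the helper `shrink_to_identity` is hypothetical, and the intensity `lambda = 0.5` is hand-picked here, whereas the point of the methods in the package is to choose the intensity from the data.

```julia
using LinearAlgebra, Random, Statistics

# Hypothetical helper: shrink S toward mu * I, where mu is the average variance.
# lambda in [0, 1] is the shrinkage intensity (fixed by hand in this sketch).
function shrink_to_identity(S::AbstractMatrix, lambda::Real)
    p = size(S, 1)
    mu = tr(S) / p
    # One elementwise combination of p×p matrices on top of computing S.
    return (1 - lambda) * S + lambda * mu * I
end

Random.seed!(1)
n, p = 10, 50            # illustrative sizes: fewer observations than features
X = randn(n, p)          # rows are observations; the true covariance is the identity

S = cov(X)
S_shrunk = shrink_to_identity(S, 0.5)

println("isposdef(S):        ", isposdef(S))
println("isposdef(S_shrunk): ", isposdef(S_shrunk))
println("cond(S_shrunk):     ", cond(S_shrunk))
```

Every eigenvalue of the shrunk matrix is at least `lambda * mu`, so it is positive definite and can be inverted safely, and the extra work beyond computing `S` is a single pass over its entries.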