Most efficient implementation of covariance matrix

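Could this be linked to the way Julia threads and BLAS threads interact? If the code is already parallelised with Julia threads and BLAS is also running multithreaded, the two thread pools can oversubscribe the cores and slow things down.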

https://docs.julialang.org/en/v1/manual/performance-tips/#man-multithreading-linear-algebra
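
One quick way to check is to compare timings with BLAS restricted to a single thread. Below is a minimal sketch, assuming Julia 1.6 or later (for `BLAS.get_num_threads`). `cov_blas` is a hypothetical helper written for illustration, not the code from this thread; it computes the covariance with an explicit matrix product so that a BLAS call is involved.

```julia
using LinearAlgebra
using Statistics

# Hypothetical helper for illustration: covariance of the columns of X
# computed through a centered matrix product (which dispatches to BLAS).
function cov_blas(X::AbstractMatrix{<:Real})
    n = size(X, 1)
    Xc = X .- mean(X, dims=1)       # center each column
    return (Xc' * Xc) ./ (n - 1)    # unbiased estimate, same convention as Statistics.cov
end

# Julia threads are fixed at startup (-t / JULIA_NUM_THREADS);
# BLAS threads are a separate pool that can be changed at runtime.
julia_threads = Threads.nthreads()
old_blas_threads = BLAS.get_num_threads()

# With an outer Threads.@threads loop, one BLAS thread avoids oversubscription.
BLAS.set_num_threads(1)

X = randn(1_000, 50)
C = cov_blas(X)

# Restore the previous setting (guarded, since some versions may return nothing).
old_blas_threads === nothing || BLAS.set_num_threads(old_blas_threads)
```

If the timings improve noticeably with `BLAS.set_num_threads(1)`, the thread interaction is a likely culprit; if they do not change, the cause is probably elsewhere.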