Most efficient implementation of covariance matrix

Not for me, unfortunately:

julia> a = rand(4800,10000);
julia> BLAS.set_num_threads(4)

julia> @time cov(a);
 11.101160 seconds (12 allocations: 1.103 GiB)

julia> BLAS.set_num_threads(1)

julia> @time cov(a);
 10.656343 seconds (12 allocations: 1.103 GiB, 0.59% gc time)

I can only guess there is some issue with the BLAS call.

This is an interesting idea, thanks for the link. I believe this will not be faster here, because I expect that evaluating the matrix vector product once will likely have to actually compute each matrix element (though its probably a good idea to check this, perhaps there is a shortcut I am overlooking).