Hi everyone, I was computing some rather large covariance matrices and noticed that the implementation in Statistics is not multithreaded. TL;DR: I want the following to be as fast as possible (ideally, for even larger matrices):
julia> a = rand(20000,1500); @time vcov(a);
8.743414 seconds (1.51…

Salmon: "I was computing some rather large covariance matrices and noticed that the implementation in Statistics is not multithreaded."

The cov implementation in Statistics is multithreaded, because the slowest part of the cov computation is a matrix multiplication, which uses BLAS … a…
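To illustrate the point about matrix multiplication, here is a minimal sketch of a covariance computed as centering followed by one matrix product. It is not the actual Statistics source, and `cov_via_matmul` is a made-up name used only for this example. The product `centered' * centered` is the part that would be handed to BLAS.

```julia
using Statistics, LinearAlgebra

# Hypothetical helper, not part of Statistics: covariance of the columns
# of `a`, written as centering plus a single matrix product.
function cov_via_matmul(a::AbstractMatrix)
    n = size(a, 1)
    centered = a .- mean(a, dims=1)           # subtract each column's mean
    return (centered' * centered) ./ (n - 1)  # the expensive step, a BLAS call
end

a = rand(200, 30)
c = cov_via_matmul(a)
```

On a small input like this, `c` should agree with `cov(a)` up to floating-point rounding.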

Perhaps you can comment on the application you have in mind. In some geospatial applications we avoid building the covariance of all locations at once, because that doesn’t fit into memory, or is too slow.

Sure, I am experimenting with implementing a stochastic reconfiguration method: https://arxiv.org/abs/cond-mat/0502553 (see also "Quantum Geometric Tensor and Stochastic Reconfiguration" in the NetKet documentation). It is somewhat similar to stochastic gradient descent. One tries to optimize a bunch of variational par…

stevengj: "The cov implementation in Statistics is multithreaded, because the slowest part of the cov computation is a matrix multiplication, which uses BLAS … and BLAS is multithreaded."

Strange, I only see one thread usage, despite setting the number of BLAS threads to 4 with BLAS.set…

Perhaps it would make sense to solve $S\,\delta\alpha = F$ using an iterative method. In this case you don't need to materialize the covariance matrix, you just need the result of the linear transformation that maps $\delta\alpha$ to $S\,\delta\alpha$. Up to some fiddling, this map basically reduces to two…
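A rough sketch of that idea, under some assumptions: I take S to be the sample covariance of a matrix `O` with one row per sample, I use a hand-written conjugate gradient loop instead of a package, and the names `cov_mul` and `cg_solve` as well as the random data are made up for illustration. The covariance matrix is never formed; each iteration only needs two matrix-vector products.

```julia
using Statistics, LinearAlgebra

# Matrix-free product v -> cov(O) * v, using two matrix-vector products.
# `Oc` is the column-centered sample matrix (n samples by p parameters).
cov_mul(Oc, v) = (Oc' * (Oc * v)) ./ (size(Oc, 1) - 1)

# Plain conjugate gradient for S x = F, where S is only available through `mul`.
# Assumes S is symmetric positive definite.
function cg_solve(mul, F; tol=1e-10, maxiter=1000)
    x = zero(F)
    r = copy(F)              # residual F - S*x for the initial guess x = 0
    d = copy(r)              # search direction
    rs = dot(r, r)
    for _ in 1:maxiter
        Sd = mul(d)
        α = rs / dot(d, Sd)
        x .+= α .* d
        r .-= α .* Sd
        rs_new = dot(r, r)
        sqrt(rs_new) <= tol * norm(F) && break
        d .= r .+ (rs_new / rs) .* d
        rs = rs_new
    end
    return x
end

O = rand(500, 40)            # stand-in data, not from the original post
Oc = O .- mean(O, dims=1)
F = rand(40)
x = cg_solve(v -> cov_mul(Oc, v), F)
```

If S is close to singular, a small diagonal shift inside `mul` may be needed for the iteration to converge; that is an assumption on my side, not something stated above.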

Salmon: "strange, I only see one thread usage"

How are you determining thread usage? If I simply look at the performance, the impact of multiple threads is clearly apparent:
julia> using Statistics, BenchmarkTools, LinearAlgebra
julia> a = rand(20000,1500);
julia> BLAS.set_num_threa…

stevengj: "How are you determining thread usage? If I simply look at the performance, the impact of multiple threads is clearly apparent"

Not for me, unfortunately:
julia> a = rand(4800,10000);
julia> BLAS.set_num_threads(4)
julia> @time cov(a);
11.101160 seconds (12 allocations: 1.…

stevengj: "You could try different matrix-multiplication libraries, like MKL.jl or Octavian.jl or AppleAccelerate.jl, to see if they are faster than the default OpenBLAS library on your machine. You could also try Float32 precision to see if that is accurate enough for your application." …
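For the Float32 suggestion, one way to check whether single precision is accurate enough might look like the sketch below. The matrix size is arbitrary and much smaller than the ones discussed here, and the variable names are only illustrative.

```julia
using Statistics

a = rand(2000, 100)
a32 = Float32.(a)        # single-precision copy of the same data

c64 = cov(a)             # reference result in Float64
c32 = cov(a32)           # same computation in Float32

# Largest absolute deviation of the single-precision result
max_err = maximum(abs.(c64 .- c32))
```

Whether the resulting `max_err` is acceptable depends on the application; timing both calls with `@time` or BenchmarkTools on the real matrix size would show how much speed is gained.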

I get performance similar to this whether I use julia -t 1 or julia -t 4. Is that what you mean? In particular, I get the following times with a single Julia thread:
11.757666 seconds (12 allocations: 1.103 GiB, 0.04% gc time)
11.723894 seconds (12 allocations: 1.103 GiB, 0.12% gc time)
and the f…

gdalle: "Could this be linked to the weird interactions between Julia threads and BLAS threads?"

Hm, the page doesn't go too much into detail regarding this, but according to the guide everything should be fine when the number of BLAS threads is set correctly. So possibly being a bug wit…

Most efficient implementation of covariance matrix


gdalle May 27, 2024, 9:35pm 10

Could this be linked to the weird interactions between Julia threads and BLAS threads?

https://docs.julialang.org/en/v1/manual/performance-tips/#man-multithreading-linear-algebra

Bad performances when using Multithreading and Distributed with heavy LinearAlgebra calculations
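To check how the two thread settings interact on a given machine, something like the following sketch could help. It prints the number of Julia threads and BLAS threads, then times `cov` for two BLAS thread counts. The matrix size and the thread counts 1 and 4 are arbitrary choices for illustration, and a single `@elapsed` call is only a crude measurement.

```julia
using LinearAlgebra, Statistics

println("Julia threads: ", Threads.nthreads())
println("BLAS threads:  ", BLAS.get_num_threads())

a = rand(2000, 500)
cov(a)   # warm-up run so that compilation is not included in the timings

# Time cov for different BLAS thread counts
timings = Dict{Int,Float64}()
for n in (1, 4)
    BLAS.set_num_threads(n)
    timings[n] = @elapsed cov(a)
end

for n in sort(collect(keys(timings)))
    println("BLAS threads = ", n, ": ", timings[n], " s")
end
```

Running this once with `julia -t 1` and once with `julia -t 4` would show whether the Julia thread count changes the picture.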
