I am dealing with huge sparse arrays with dimensions greater than 10^6, and I am interested in finding only a few extremal eigenvalues. To this end, I tried the Arpack package, but I noticed that when the sparse array gets too big, the number of CPUs in use decreases. For example:
using LinearAlgebra
using Arpack
using SparseArrays
BLAS.set_num_threads(16)
A = sprand(100000, 100000, 0.001)
eigs(A, nev=20, which=:SR)
Then I get this from the top command:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
38936 ahmed 20 0 19.901g 0.016t 133852 R 1597 2.2 21:09.33 julia
Now, if I increase the dimension of A:
using LinearAlgebra
using Arpack
using SparseArrays
BLAS.set_num_threads(16)
A = sprand(1000000, 1000000, 0.001)
eigs(A, nev=20, which=:SR)
Then the top command reads:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
38936 ahmed 20 0 34.919g 0.030t 133852 R 155.7 4.1 44:02.02 julia
One notices that not all CPUs are being used; the drop (from roughly 1600% CPU down to about 156%) is in fact drastic. Any explanation for this weird behavior, or any idea what might be wrong? This has been tested on two machines, one with 12 physical cores and one with 24 physical cores.
I also tested similar packages such as KrylovKit.jl, and they show the same behavior, which suggests it may be related to how Julia itself handles operations on huge sparse arrays. Any help is appreciated.
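For reference, here is a minimal way to see the effect without eigs. This is just a sketch on my part; it assumes the bottleneck is the sparse matrix–vector product, which Julia's built-in SparseArrays computes on a single thread:

using LinearAlgebra
using SparseArrays

A = sprand(1_000_000, 1_000_000, 0.001)
x = rand(1_000_000)
# Watch `top` while this runs: the sparse matvec stays on one core,
# regardless of BLAS.set_num_threads.
for _ in 1:100
    A * x
end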
If you’re using a Krylov method (or any other method that relies on matrix–vector products), then perhaps this package could be useful: https://github.com/jagot/ThreadedSparseArrays.jl. It tries to speed up the multiplication by threading over the output columns, so for matrix–vector products you have to use mul!(y, transpose(At), x) with At = ThreadedSparseMatrixCSC(A), where A is your sparse matrix.
The package is mostly a proof-of-concept right now, but seems to work.
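As a rough sketch of how that might plug into eigs (my sketch, not from the package's README: I'm assuming eigs accepts a LinearMaps.jl operator, and I store the transpose of A so that the threaded transposed product ends up computing A*x):

using LinearAlgebra
using SparseArrays
using Arpack
using LinearMaps
using ThreadedSparseArrays

A = sprand(1_000_000, 1_000_000, 0.001)
# Only transposed/adjoint products are threaded, so store transpose(A)
# and apply it transposed, which computes A*x for real A:
At = ThreadedSparseMatrixCSC(sparse(transpose(A)))
op = LinearMap{Float64}((y, x) -> mul!(y, transpose(At), x), size(A, 1);
                        ismutating=true)
vals, vecs = eigs(op, nev=20, which=:SR)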
BLAS is used for things like orthogonalization and scalar products in Krylov methods, but direct sparse solvers can’t use it in a significant way (I think).
That’s true, but my impression is that usually the matrix-vector products account for the bulk of the time (unless you are building up a huge subspace without restarting).
@stevengj:

> That’s true, but my impression is that usually the matrix-vector products account for the bulk of the time (unless you are building up a huge subspace without restarting).
I’ve had cases where we had to do a few hundred Krylov iterations per Newton step because our preconditioner, while scalable, was still pretty bad. The Jacobian-vector product was expensive, but the orthogonalization was expensive enough to be painful. Someone from the Trilinos project pointed me at classical Gram-Schmidt (applied twice!) for the orthogonalization, and that helped a lot.
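To make that concrete, here is a sketch (mine, not the Trilinos code) of classical Gram-Schmidt applied twice (CGS2): each pass is two matrix–vector products, which BLAS can thread, whereas modified Gram-Schmidt is a long sequence of vector–vector operations with little threading to exploit.

using LinearAlgebra

# Orthogonalize v against the orthonormal columns of Q. One pass of
# classical Gram-Schmidt is numerically fragile; running it twice
# ("twice is enough") restores orthogonality in practice.
function cgs2!(v::AbstractVector, Q::AbstractMatrix)
    for _ in 1:2
        h = Q' * v            # a single gemv instead of k separate dots
        mul!(v, Q, h, -1, 1)  # v .-= Q*h, again a single gemv
    end
    return v
end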
I don’t have much experience with sparse matrices. Can you expand on why sparse algorithms (e.g. SpMV) can’t use threading? Is it because they’re bandwidth-bound?
I don’t think it’s so much that they can’t, but rather that Julia’s built-in sparse library has not been threaded yet. See e.g. this open PR that the package I mentioned above was based on/inspired by.
Then of course, depending on which storage you use (column-major CSC or row-major CSR), some operations are easier to parallelize than others.
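To illustrate the storage point with a minimal example (my own sketch, not the package's actual implementation): in CSC, stored column j of A contributes only to element j of transpose(A)*x, so columns can be split across threads with no write conflicts, whereas threading a plain A*x over columns would have threads racing on shared entries of y.

using SparseArrays
using Base.Threads

# y = transpose(A)*x for a real CSC matrix, threading over columns:
# thread handling column j writes only y[j], so no locks are needed.
function threaded_At_mul_x(A::SparseMatrixCSC, x::AbstractVector)
    rv, nz = rowvals(A), nonzeros(A)
    y = zeros(promote_type(eltype(A), eltype(x)), size(A, 2))
    @threads for j in 1:size(A, 2)
        acc = zero(eltype(y))
        for k in nzrange(A, j)
            acc += nz[k] * x[rv[k]]
        end
        y[j] = acc
    end
    return y
end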