Hi,
I am working with large sparse matrices (10^5 x 10^5). I would like to make full use of the cores / threads of my CPU to accelerate the multiplication process. As of now, I am using mul!:
using SparseArrays,LinearAlgebra
n = 10000
L = sprand(Float64,n,n,0.3)
u0 = sprand(Float64,n,0.3)
u1 = similar(u0)
mul!(u1,L,u0);
However, this only uses one thread.
Is there a package or a trick to parallelise the multiplication? I know I could always do something like:
function parallel_mul!(u1, L, u0)
    # u1 must be a dense vector so each thread can write its row independently
    Threads.@threads for i in 1:size(L, 1)
        u1[i] = dot(L[i, :], u0)
    end
    return u1
end
But maybe there is something more straightforward? Otherwise, maybe it is worth making parallel_mul! more robust and implementing it in some package?
There is often no advantage to having the right-hand-side vector be sparse unless you have hypersparse matrices (but then you shouldn't use the CSC sparse format anyway).
For threaded sparse matvec (at least in the transpose case), see https://github.com/JuliaLang/julia/pull/29525. It cannot really be merged yet because the threading infrastructure in Base is not there yet.
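In the meantime, one can hand-roll something along the lines of that PR. The sketch below (threaded_spmv! is a hypothetical name, not an existing API) threads over blocks of columns of the CSC matrix, accumulating into per-block buffers so that concurrent writes to the output never collide, and then reduces the buffers at the end:

```julia
using SparseArrays, LinearAlgebra, Base.Threads

# Threaded sparse matrix-vector product y = A*x for a CSC matrix.
# Each task owns a block of columns and a private output buffer,
# avoiding data races on y; buffers are summed serially afterwards.
function threaded_spmv!(y::AbstractVector, A::SparseMatrixCSC, x::AbstractVector)
    n = size(A, 2)
    nblocks = min(nthreads(), n)
    blocks = collect(Iterators.partition(1:n, cld(n, nblocks)))
    buffers = [zeros(eltype(y), length(y)) for _ in 1:length(blocks)]
    rv, nz = rowvals(A), nonzeros(A)
    @threads for b in 1:length(blocks)
        buf = buffers[b]
        for j in blocks[b]
            xj = x[j]
            iszero(xj) && continue
            for k in nzrange(A, j)
                buf[rv[k]] += nz[k] * xj
            end
        end
    end
    fill!(y, zero(eltype(y)))
    for buf in buffers
        y .+= buf
    end
    return y
end
```

Whether this beats single-threaded mul! depends on the matrix: SpMV is memory-bandwidth-bound, so the per-thread buffers (one dense vector each) can eat the gains for very tall matrices.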
Note that this matrix is neither large nor sparse by the standards of realistic applications, so benchmarks of this case may be misleading.
30% nonzero is not usually sparse enough to be worth using sparse algorithms for (unless the sparsity has a very special pattern). Real sparse matrices typically have only a small number of nonzero elements per row, e.g. ≲ 10 is typical in FEM-like situations. A moderate-sized problem of this sort (I wouldn't call it "large") would be 10⁵×10⁵ with 10·10⁵ nonzeros, or 0.01% (1e-4) sparsity. (For example sprand(10^5, 10^5, 1e-4), though sparsity patterns are rarely random in practice.)
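To make the arithmetic concrete, here is what that density works out to (the sizes are just the ones quoted above; nnz counts will vary slightly from run to run since sprand draws the pattern randomly):

```julia
using SparseArrays

# 10^5 x 10^5 matrix at 1e-4 density: roughly 10*10^5 = 10^6 stored
# entries out of 10^10 possible, i.e. about 10 nonzeros per row.
A = sprand(10^5, 10^5, 1e-4)
density = nnz(A) / length(A)   # ≈ 1e-4
per_row = nnz(A) / size(A, 1)  # ≈ 10
```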
Thanks all, now it makes sense. I knew about BLAS.set_num_threads(), but it did not change anything in my case because computation time was limited, as @Per said, by memory bandwidth.
Right - the actual matrices I am playing with have 1e-3 sparsity, with a block pattern, where in this case it really makes sense to use sparse algorithms.
Wow, yes, it changes a lot! I don't really understand why, but it is for sure the way to go.
Now, using dense matrices, BLAS does indeed run multi-threaded, and setting multiple threads speeds up the operation considerably.
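For reference, a minimal check of the dense path (the thread count 4 here is illustrative; set it to your number of physical cores):

```julia
using LinearAlgebra

BLAS.set_num_threads(4)   # illustrative; match your physical core count

A = rand(2_000, 2_000)
B = rand(2_000, 2_000)
C = similar(A)
mul!(C, A, B)   # dense matmul dispatches to multithreaded BLAS gemm
```

Note that dense matvec (gemv) is still memory-bound, so the big multi-threaded wins show up mainly for matrix-matrix products.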