Is there an easy way to parallelise matrix multiplication?

Hi,
I am working with large sparse matrices (10^5 x 10^5). I would like to make full use of the cores / threads of my CPU to accelerate the multiplication process. As of now, I am using mul!:

using SparseArrays, LinearAlgebra
n = 10000
L = sprand(Float64,n,n,0.3)
u0 = sprand(Float64,n,0.3)
u1 = similar(u0)
mul!(u1,L,u0);

However, this only uses one thread.
Is there a package or a trick to parallelise the multiplication? I know I could always do something like:

function parallel_mul!(u1, L, u0)
    # naive row-wise matvec; note that row slicing a CSC matrix is slow
    Threads.@threads for i in 1:size(L, 1)
        u1[i] = dot(L[i, :], u0)
    end
    return u1
end

But maybe there is something more straightforward? Otherwise, maybe it is worth making parallel_mul! more robust and implementing it in some package?
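For what it's worth, here is a race-free sketch of that idea for a CSC sparse matrix: partition the columns among threads and give each chunk its own output buffer (the name threaded_matvec! is made up for illustration, not a package API):

```julia
using SparseArrays, LinearAlgebra

# Sketch: thread-parallel y = A*x for A::SparseMatrixCSC.
# Each chunk of columns accumulates into its own buffer, and the buffers
# are summed at the end, so there are no write races on y.
function threaded_matvec!(y::AbstractVector, A::SparseMatrixCSC, x::AbstractVector)
    m, n = size(A)
    nt = max(Threads.nthreads(), 1)
    chunks = collect(Iterators.partition(1:n, cld(n, nt)))
    bufs = [zeros(eltype(y), m) for _ in chunks]
    rows = rowvals(A)
    vals = nonzeros(A)
    Threads.@threads for c in eachindex(chunks)
        buf = bufs[c]
        for j in chunks[c]
            xj = x[j]
            for k in nzrange(A, j)   # nonzeros of column j
                buf[rows[k]] += vals[k] * xj
            end
        end
    end
    fill!(y, zero(eltype(y)))
    for b in bufs
        y .+= b
    end
    return y
end
```

Whether this beats the serial mul! depends heavily on size, density, and memory bandwidth.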

I think BLAS will run multi-threaded by default in Julia, but otherwise you can control it via BLAS.set_num_threads.

However, a matrix-vector multiplication, as in your example, is limited by memory bandwidth, not CPU, so there's no point running it multi-threaded.
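For reference, controlling the BLAS thread count looks like this (a minimal sketch; BLAS.get_num_threads to query the setting only exists on newer Julia versions):

```julia
using LinearAlgebra

BLAS.set_num_threads(1)                 # force single-threaded BLAS for benchmarking
# ... run the benchmark ...
BLAS.set_num_threads(Sys.CPU_THREADS)   # let BLAS use all hardware threads
```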

The matrix is sparse in the example.

This is not true (dense matrix, n = 50000):

julia> BLAS.set_num_threads(1)

julia> @time mul!(u1,L,u0);
  1.429343 seconds (4 allocations: 160 bytes)

julia> BLAS.set_num_threads(8)

julia> @time mul!(u1,L,u0);
  0.577491 seconds (4 allocations: 160 bytes)

Using a dense RHS will improve performance by a lot:

julia> @time mul!(u1,L,u0);
  0.584000 seconds (4 allocations: 160 bytes)

julia> u0_d = rand(Float64,n);

julia> u1_d = similar(u0_d);

julia> @time mul!(u1_d,L,u0_d);
  0.022939 seconds (4 allocations: 160 bytes)

There is often no advantage to having the vector RHS be sparse unless you have hyper-sparse matrices (but then you shouldn't use the CSC sparse format anyway).

For threaded sparse matvec (at least in the transpose case), see https://github.com/JuliaLang/julia/pull/29525. It cannot really be merged yet because the threading infrastructure in Base is not there yet.


That's interesting. On my system, there's no difference in timing when changing the number of threads.


Note that this matrix is neither large nor sparse by the standards of realistic applications, so benchmarks of this case may be misleading.

30% nonzero is not usually sparse enough to be worth using sparse algorithms for (unless the sparsity has a very special pattern). Real sparse matrices typically have only a small number of nonzero elements per row, e.g. ≲ 10 is typical in FEM-like situations. A moderate-sized problem of this sort (I wouldn't call it "large") would be 10⁵×10⁵ with 10·10⁵ nonzeros, or 0.01% (1e-4) sparsity. (For example sprand(10^5, 10^5, 1e-4), though sparsity patterns are rarely random in practice.)
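As a sketch of what such a problem looks like (random pattern used only for illustration):

```julia
using SparseArrays, LinearAlgebra

n = 10^5
A = sprand(n, n, 10 / n)   # ~10 nonzeros per row on average, i.e. 1e-4 density
x = rand(n)                # dense RHS, as recommended above
y = similar(x)
mul!(y, A, x)              # work scales with nnz(A), not n^2
```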


Thanks all, now it makes sense. I knew about BLAS.set_num_threads() but it did not change anything in my case because, as @Per said, the computation time was limited by memory bandwidth.

Right - the actual matrices I am playing with have 1e-3 sparsity with a block pattern, so in this case it really makes sense to use sparse algorithms.

Wow, yes, it changes a lot! I don't really understand why, but it is for sure the way to go.

Now, using dense vectors, BLAS indeed runs multi-threaded, and setting multiple threads speeds up the operation a lot.

You were using a sparse matrix, which is not handled by BLAS but by Julia's native sparse matrix multiplication, so how many threads BLAS uses has no effect.
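One way to see this from the REPL is to ask which method mul! dispatches to (the exact method and module shown vary across Julia versions, so treat this as a sketch):

```julia
using SparseArrays, LinearAlgebra

A_sparse = sprand(1000, 1000, 0.01)
A_dense  = rand(1000, 1000)
x = rand(1000)
y = similar(x)

# The sparse path is pure-Julia code from SparseArrays, so BLAS settings
# do not affect it; the dense Float64 path ends up calling into BLAS (gemv).
@which mul!(y, A_sparse, x)
@which mul!(y, A_dense, x)
```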


Oh, right… Got it!