Is there an easy way to parallelise matrix multiplication?

Hi,
I am working with large sparse matrices (10^5 x 10^5). I would like to make full use of the cores / threads of my CPU to accelerate the multiplication process. As of now, I am using mul!:

using SparseArrays, LinearAlgebra
n = 10000
L = sprand(Float64, n, n, 0.3)
u0 = sprand(Float64, n, 0.3)
u1 = similar(u0)
mul!(u1, L, u0);

However, this only uses one thread.
Is there a package or a trick to parallelise the multiplication? I know I could always do something like:

function parallel_mul!(u1, L, u0)
    # Note: indexing rows of a CSC matrix (L[i, :]) is slow; this is just a sketch.
    Threads.@threads for i in 1:size(L, 1)
        u1[i] = dot(L[i, :], u0)
    end
    return u1
end

But maybe there is something more straightforward? Otherwise, it may be worth making parallel_mul! more robust and implementing it in some package?

I think BLAS will run multi-threaded by default in Julia, but otherwise you can control it via BLAS.set_num_threads.
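For reference, a minimal sketch of checking and setting the BLAS thread count (BLAS.get_num_threads requires Julia ≥ 1.6):

```julia
using LinearAlgebra

# Query and set the number of threads used by the dense BLAS backend.
println(BLAS.get_num_threads())  # default thread count on this machine
BLAS.set_num_threads(4)          # use 4 threads for dense BLAS operations
println(BLAS.get_num_threads())
```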

However, a matrix-vector multiplication, as in your example, is limited by memory bandwidth, not CPU, so there’s no point running it multi-threaded.

The matrix is sparse in the example.

This is not true (dense matrix, n = 50000):

julia> BLAS.set_num_threads(1)

julia> @time mul!(u1,L,u0);
  1.429343 seconds (4 allocations: 160 bytes)

julia> BLAS.set_num_threads(8)

julia> @time mul!(u1,L,u0);
  0.577491 seconds (4 allocations: 160 bytes)

Using a dense RHS will improve performance by a lot:

julia> @time mul!(u1,L,u0);
  0.584000 seconds (4 allocations: 160 bytes)

julia> u0_d = rand(Float64, n);

julia> u1_d = similar(u0_d);

julia> @time mul!(u1_d,L,u0_d);
  0.022939 seconds (4 allocations: 160 bytes)

There is often no advantage to having the RHS vector be sparse unless you have hypersparse matrices (but then you shouldn't use the CSC sparse format anyway).
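As a concrete pattern (a sketch with made-up sizes), densify the sparse RHS once and reuse it for repeated products:

```julia
using SparseArrays, LinearAlgebra

n  = 10_000
L  = sprand(n, n, 1e-3)   # sparse matrix at a realistic density
u0 = sprand(n, 0.3)       # sparse RHS vector
u0_d = Vector(u0)         # densify once, costs O(n) memory
u1_d = similar(u0_d)
mul!(u1_d, L, u0_d)       # hits the sparse-matrix × dense-vector kernel
```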

For threaded sparse matvec (at least in the transpose case), see https://github.com/JuliaLang/julia/pull/29525. It cannot really be merged yet because the threading infrastructure in Base is not there yet.
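For intuition, here is a minimal sketch (not the linked PR's actual implementation) of why the transpose case threads easily: in CSC storage, column j of A contributes only to y[j] in y = A'x, so columns can be split across threads with no write conflicts:

```julia
using SparseArrays, LinearAlgebra

# Threaded transpose matvec, y = A' * x, for real-valued CSC matrices.
# Each thread handles a disjoint set of columns; no locking needed
# because column j writes only to y[j].
function tmul!(y, A::SparseMatrixCSC, x)
    rows = rowvals(A)
    vals = nonzeros(A)
    Threads.@threads for j in 1:size(A, 2)
        s = zero(eltype(y))
        for k in nzrange(A, j)
            s += vals[k] * x[rows[k]]
        end
        y[j] = s
    end
    return y
end
```

(The non-transposed product A*x is harder to thread over columns, since different columns scatter into the same output entries.)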


That’s interesting. On my system, there’s no difference in timing when changing the number of threads.

Note that this matrix is neither large nor sparse by the standards of realistic applications, so benchmarks of this case may be misleading.

30% nonzero is usually not sparse enough for sparse algorithms to be worthwhile (unless the sparsity has a very special pattern). Real sparse matrices typically have only a small number of nonzero elements per row, e.g. ≲ 10 is typical in FEM-like situations. A moderately sized problem of this sort (I wouldn’t call it “large”) would be 10⁵×10⁵ with 10⋅10⁵ nonzeros, or 0.01% (1e-4) density. (For example sprand(10^5, 10^5, 1e-4), though sparsity patterns are rarely random in practice.)
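To make the scale concrete, a quick check (the counts are approximate, since sprand draws nonzero positions randomly):

```julia
using SparseArrays

A = sprand(10^5, 10^5, 1e-4)  # ~10 nonzeros per row
println(nnz(A))               # roughly 10^6 stored entries
println(nnz(A) / length(A))   # density ≈ 1e-4
```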


Thanks all, now it makes sense. I knew about BLAS.set_num_threads() but it did not change anything in my case, since computation time was limited by memory bandwidth, as @Per said.

Right - the actual matrices I am playing with have 1e-3 density, with a block pattern, in which case it really makes sense to use sparse algorithms.

Wow, yes, it changes a lot! I don’t really understand why, but it is for sure the way to go.

Now that I am using dense matrices, BLAS indeed runs multi-threaded, and setting multiple threads speeds up the operation a lot.

You were using a sparse matrix, which does not use BLAS but Julia’s native sparse matrix multiplication, so the number of BLAS threads has no effect.


Oh, right… Got it!