Best practice: threaded sparse matrix assembly

What is current state-of-the-art regarding threaded sparse matrix assembly?

We’re currently having an issue where sparse matrix assembly is taking more time than the linear solver (at low resolutions). One way to improve that is to use multithreading.
In C, I use to compute IJV independently per thread and then reassemble everything into a single triplet. This was not trivial and rather error prone.

I could imagine that there a better ways of dealing with this in Julia tooling, in particular with tools like ExtendableSparse ?
If someone has hints, or links to MWE of threaded sparse assemblies, it would be very welcome.

Cheers!

This may be specific to FinEtools, but perhaps you will find it useful. GitHub - PetrKryslUCSD/FinEtoolsMultithreading.jl: Parallel finite element library · GitHub
A paper was published as well https://www.sciencedirect.com/science/article/pii/S0045782524003323

Thank you for the answer, I will read through the paper!

Here, I paste a naive MWE, which I guess probably shows how things should not be done. Improvements are welcome !

using Base.Threads, ExtendableSparse, LinearAlgebra

@views function serial_assembly!(M, n)
    for i=1:n
        if i==1
            M[i,i  ] =  1.0  
            M[i,i+1] = -1.0
        elseif i==n
            M[i,i-1] = -1.0  
            M[i,i  ] =  1.0
        else
            M[i,i-1] = -1.0  
            M[i,i  ] =  2.0
            M[i,i+1] = -1.0
        end
    end
end

@views function threaded_assembly!(M, M_loc, n)
    Threads.@threads for i=1:n
        tid = threadid()
        if i==1
            M_loc[tid-1][i,i  ] =  1.0  
            M_loc[tid-1][i,i+1] = -1.0
        elseif i==n
            M_loc[tid-1][i,i-1] = -1.0  
            M_loc[tid-1][i,i  ] =  1.0
        else
            M_loc[tid-1][i,i-1] = -1.0  
            M_loc[tid-1][i,i  ] =  2.0
            M_loc[tid-1][i,i+1] = -1.0
        end
    end
end

let 
    n  = 10_000_000
    # Serial
    M1 = ExtendableSparseMatrix(n,n)
    @time serial_assembly!(M1, n)
    # Threaded
    M2 = ExtendableSparseMatrix(n,n)
    M2_loc = [ExtendableSparseMatrix(n,n) for i=1:nthreads()]
    @time threaded_assembly!(M2, M2_loc, n)
    # Reduce. Looks like a conversion to sparse is needed to allow for .+
    # This makes things a slightly faster
    @time for k=1:nthreads()
        M2 .= sparse(M2) .+ sparse(M2_loc[k])
    end
    # Check
    @show norm(M1 .- M2)
end

In this 1D example, the reduction is super slow (threaded slower than serial). For 2D problems, there is a bit of benefit of doing this, but it’s still a very inefficient way. What could be the most straightforward way to thread such type of assembly?

Your problem is not multithreading, it is updating sparse matrices in-place. Don’t do that. M[i,j] = 1.0 costs O(length(M.nzind)); if you do that for every nonzero entry, then you have quadratic runtime in the number of nonzero entries.

Look at the manual Sparse Arrays · The Julia Language

Instead of M[i,j] = 1.0, you do push!(I, i); push!(J, j); push!(V,1.0), and at some point you do M=sparse(I,J,V).

The final step of building the sparse matrix is basically a sort (conversion from COO to CSC), i.e. you have loglinear runtime in the number of entries.

If your code is very sparse centric (or need multi dimensional tensor sparse arrays), use Sparse and Structured Utilities · Finch.jl and fsparse or fsparse!.

I’m not sure that’s really the case here because I’m using ExtendableSparse. It’s indeed a bit faster to do the IJV then sparse() but I find using ExtendableSparseMatrix is more convenient in general.

What is really slow here is the reduction. But maybe it’s a bad idea to reduce directly sparse matrices (maybe one could gather the triplets back and recombine them and sparse a final matrix).

Thanks @yolhan_mannes for bringing Finch.jl in the discussion. One more package to discover !

edit: one way to speed up the reduction is indeed:

    @time begin 
        I = Int[]
        J = Int[]
        V = Float64[]
        for k=1:nthreads()
            I_loc, J_loc, V_loc = findnz(M2_loc[k])
            append!(I, I_loc)
            append!(J, J_loc)
            append!(V, V_loc)
        end
        M3 = sparse(I, J, V)
    end

instead of

   @time begin 
        for k=1:nthreads()
            M2 .= sparse(M2) .+ sparse(M2_loc[k])
        end
    end

This approach does not seem to take into account that the outputs of the various threads themselves are already sorted. I think an implementation using a sorting method that takes this structure into account, e.g. what MergeSort.jl seems to provide, might lead to a further improvement?

Yes, I kind of assume that sparse() would do its own internal reordering. I can have a look into MergeSorted.jl but I’m sure how this would combine with the internals of sparse().
…and the “faster” solution I discussed above seems to be only faster in the 1D MWE but not in the 2D (less sparse) production code.

Multi-threaded assembly · Ferrite.jl might also interest you.

Author of ExtendableSparse here: This package attempts to provide an efficient interface for assembling sparse matrices using the more intuitive (IMHO) A[i,j]+=x idiom. It uses an internal linked list structure which allows to collect entries in a efficient way and a flush! call wich turns this into a CSC structure after the end of an assembyl cycle. The parallel version of the code may still have some rough edges, so the MWE is welcome to explore these.

EDIT: Best practice: threaded sparse matrix assembly · Issue #71 · WIAS-PDELib/ExtendableSparse.jl · GitHub

You could do that by storing an internal CooBuffer::Vector{Tuple{Ti,Ti,Tv}}, or alternatively an internal CooDict::Dict{Tuple{Ti,Ti}, Tv} instead of the linked list.

So, to put some more numbers to it:

using BenchmarkTools, SparseArrays
begin
mutable struct CooBuffer
       const I::Vector{Int}
       const J::Vector{Int}
       const V::Vector{Float64}
 end

CooBuffer()=CooBuffer(Int[], Int[], Float64[]);

function Base.setindex!(b::CooBuffer, v, i, j)
       push!(b.I, i); push!(b.J, j); push!(b.V, v);
end

function Base.sizehint!(c::CooBuffer, n)
sizehint!(c.I, n); sizehint!(c.J, n); sizehint!(c.V, n); 
end

mutable struct CooBuffer2
       cur_size::Int
       const max_size::Int
       const I::Vector{Int}
       const J::Vector{Int}
       const V::Vector{Float64}
 end

CooBuffer2(n) = CooBuffer2(0, n, zeros(Int, n), zeros(Int, n), zeros(Float64, n))
function Base.setindex!(b::CooBuffer2, v, i, j)
       b.cur_size += 1
       idx = b.cur_size
       b.I[idx] = i
       b.J[idx] = j
       b.V[idx] = v
      nothing
end

function serial_assembly!(M, n)
    for i=1:n
        if i==1
            M[i,i  ] =  1.0  
            M[i,i+1] = -1.0
        elseif i==n
            M[i,i-1] = -1.0  
            M[i,i  ] =  1.0
        else
            M[i,i-1] = -1.0  
            M[i,i  ] =  2.0
            M[i,i+1] = -1.0
        end
    end
end



N=10_000_000

println("no sizehint")
@btime begin
   tmp = CooBuffer()
   serial_assembly!(tmp, N)
end

println("sizehint")
@btime begin
   tmp = CooBuffer()
   sizehint!(tmp, 3*N)
   serial_assembly!(tmp, N)
end
println("no push")
@btime begin
   tmp = CooBuffer2(3*N)
   serial_assembly!(tmp, N)
end

println("deepcopy")
tmp = CooBuffer2(3*N)
serial_assembly!(tmp, N)
@btime deepcopy(tmp);

sparsifyBuff(b, m, n) = sparse(b.I, b.J, b.V, m, n, nothing)
tmp = CooBuffer()
serial_assembly!(tmp, N)
println("coo->csc")
@btime sparsifyBuff(tmp, N, N);
nothing
end

With that, I get output

no sizehint
  325.538 ms (136 allocations: 2.65 GiB)
sizehint
  142.526 ms (11 allocations: 686.65 MiB)
no push
  55.298 ms (11 allocations: 686.65 MiB)
deepcopy
  50.393 ms (17 allocations: 686.65 MiB)
coo->csc
  145.477 ms (22 allocations: 1.12 GiB)

From that, we can conclude: try to sizehint your buffers. Maybe don’t use push!, that apparently has issues with loop unrolling (the extra cost is ~1 ns per pushed element). With that, I roughly reach “allocate + memcpy” speed for construction.

Unless you find a good way to multithread coo->csc conversion, there will be very limited gains from multithreading the assembly of the coo matrix. If you decide to multithread that, don’t use nthreads, you’re memory bandwidth bound, not compute bound. Look up how many cores are needed to saturate your memory.

Also don’t forget to check

cat /sys/kernel/mm/transparent_hugepage/enabled 
[always] madvise never

If I set that to madvise (which some linux distros have as default), my timings degrade to

no sizehint
  651.638 ms (136 allocations: 2.65 GiB)
sizehint
  200.707 ms (11 allocations: 686.65 MiB)
no push
  129.375 ms (11 allocations: 686.65 MiB)
deepcopy
  123.075 ms (17 allocations: 686.65 MiB)
coo->csc
  231.989 ms (22 allocations: 1.12 GiB)

Just to clarify: the “linked list” format I mentioned is modeled after the one described in Saad’s whitepaper. This allows to identify already existing entries, and to support “+=” , “-=” and the like. I don’t see if this is possible with COO. Also, COO uses huge intermediate storage when assembling e.g. 3D FEM problems.

Thanks! After reading your code a little, it is clear that the linked list is not as catastrophic as I assumed if the matrix is very sparse. And benchmarking it along the other variants above shows that it’s performing reasonably well. (350 ms for collecting entries in serial_assembly!, 350 ms for LNK->CSC conversion)

SparseArrays.sparse(I,I,V,m,n,+) supports that (it adds up different values with same indices). I used the mergefuntion nothing because OPs example never adds to indices.

If you need to support M[i,j] = f(M[i,j]) then this doesn’t fly anymore.

Btw, your library could keep the linked lists sorted without extra cost? Since an insert always searches for the (i,j)-pair anyways, you could simply insert at the correct position?

If you have a lot of cases where you need to add multiple values for the same (i,j) position, then there is a trade-off between Vector and linked-list / Dict-based buffering: Vector-based buffering will consolidate such entries at the very end, at the price of needing more intermediate storage, and linked-list / Dict-based buffering will do so immediately, at the cost of more expensive inserts. Dict-based will give you faster inserts/edits, but will cost you ~20%-50% intermediate storage overhead.

It’s not plausible to me that the linked list format “lies on the pareto frontier”, i.e. is ever a good idea. That being said, you’re the experienced expert, and if you say so, then I’m gonna believe you. (but '94 is a long time ago, and 30 years of hardware progress have not been kind to linked lists – memory capacity, bandwidth and compute have all outpaced memory latency.)

That “linked list” format is implemented using arrays of pointers which can be grown at the end via push!. Cache-wise it is of course not optimal as column data are not contiguous in memory, but it is very convenient API-wise.