Best practice: threaded sparse matrix assembly

Thank you for the answer, I will read through the paper!

Here, I paste a naive MWE, which I guess probably shows how things should not be done. Improvements are welcome !

using Base.Threads, ExtendableSparse, LinearAlgebra

@views function serial_assembly!(M, n)
    for i=1:n
        if i==1
            M[i,i  ] =  1.0  
            M[i,i+1] = -1.0
        elseif i==n
            M[i,i-1] = -1.0  
            M[i,i  ] =  1.0
        else
            M[i,i-1] = -1.0  
            M[i,i  ] =  2.0
            M[i,i+1] = -1.0
        end
    end
end

@views function threaded_assembly!(M, M_loc, n)
    Threads.@threads for i=1:n
        tid = threadid()
        if i==1
            M_loc[tid-1][i,i  ] =  1.0  
            M_loc[tid-1][i,i+1] = -1.0
        elseif i==n
            M_loc[tid-1][i,i-1] = -1.0  
            M_loc[tid-1][i,i  ] =  1.0
        else
            M_loc[tid-1][i,i-1] = -1.0  
            M_loc[tid-1][i,i  ] =  2.0
            M_loc[tid-1][i,i+1] = -1.0
        end
    end
end

let 
    n  = 10_000_000
    # Serial
    M1 = ExtendableSparseMatrix(n,n)
    @time serial_assembly!(M1, n)
    # Threaded
    M2 = ExtendableSparseMatrix(n,n)
    M2_loc = [ExtendableSparseMatrix(n,n) for i=1:nthreads()]
    @time threaded_assembly!(M2, M2_loc, n)
    # Reduce. Looks like a conversion to sparse is needed to allow for .+
    # This makes things a slightly faster
    @time for k=1:nthreads()
        M2 .= sparse(M2) .+ sparse(M2_loc[k])
    end
    # Check
    @show norm(M1 .- M2)
end

In this 1D example, the reduction is super slow (threaded slower than serial). For 2D problems, there is a bit of benefit of doing this, but it’s still a very inefficient way. What could be the most straightforward way to thread such type of assembly?