Increasing the solution speed of sparse linear system

Hello,

Does Julia have any package for sparse linear solvers for LU decomposed sparse matrices that are multi-threaded?

I want to improve the solution speed of a big system which I have to solve ~500-700 times.

    using LinearAlgebra
    using SparseArrays
    function create_mat(n::Int)
        diagonal = spdiagm(0 => [1.0 for i in 1:n]) 
        off_diagonal = spdiagm(-1 => [-0.01 for i in 2:n])
        mat = diagonal + off_diagonal + off_diagonal' 
        return mat
    end
    
    
    n = 1000000 
    A = create_mat(n)
    lu_A = lu(A , check=true);

    x = randn(n);

    for i = 1:500
        x = lu_A \ x;
    end

I am interested in keeping the LU factorization.
This example appears to be solving with a single thread.
Are there any multi-threaded sparse linear solvers I can use to solve this system faster?

Best Regards

1 Like

You likely want to use BandedMatrices.jl if you have sparse matrices with a banded structure since they will give you O(n) factorization algorithms. In this case, it will still run single threaded, but it will be a lot more efficient.

function create_band_mat(n::Int)
    return Symmetric(BandedMatrix(-1=>fill(-.01, n-1), 0=>fill(1.0,n), 1=>fill(-.01,n-1))) 
end
julia> @time A_band_chol = cholesky(A_band);
  0.098926 seconds (3 allocations: 22.888 MiB)
julia> @time A_band_chol\x;
  0.019270 seconds (2 allocations: 7.629 MiB)
2 Likes

Hi, thanks for the reply.

I would like to keep the LU factorization (my actual matrix is not exactly banded) and just increase the speed of lu_A \ x.

Given that A will remain constant. Is it possible to perform the ‘\’ operation concurrently and not in serial?

You can do all solves at once:

x=randn(n,500)

lu_A\x

or if you don’t have all x before you can at least use ldiv!(y,lu_A,x) to avoid allocations in each solve.

And the solve is multi-threaded through BLAS, but you can also try Pardiso.jl as an alternative.

There is LinearSolve.jl, which defines/wraps a lot of linear solvers in a common interface. It belongs to the SciML ecosystem, which is typically very performant.

1 Like

Here is my attempt at using LinearSolve.

using LinearAlgebra, SparseArrays, LinearSolve, Sparspak
import Pardiso
using BenchmarkTools

n = 1000000
A = create_mat(n)
lu_A = lu(A , check=true);
x = rand(n);
y=zeros(n);

@btime solve(LinearProblem(A, x;), UMFPACKFactorization());
@btime solve(LinearProblem(A, x;), KLUFactorization());
@btime solve(LinearProblem(A, x;), MKLPardisoFactorize());
@btime solve(LinearProblem(A, x;), SparspakFactorization());
@btime lu_A\x;
@btime ldiv!(y, lu_A, x);
  814.338 ms (95 allocations: 1.04 GiB)
  259.144 ms (56 allocations: 549.32 MiB)
  1.280 s (47 allocations: 118.26 MiB)
  1.275 s (182 allocations: 595.22 MiB)
  17.547 ms (4 allocations: 15.26 MiB)
  15.321 ms (0 allocations: 0 bytes)

The times are not fair has LinearSolve have to decompose + solve.
Can LinearSolve know that it only needs to decompose the matrix once and solve for many “b”'s?

Yes

Shouldn’t this work?

function solve(A, x)
    algs = [
                UMFPACKFactorization(),
                KLUFactorization(),
                MKLPardisoFactorize(),
                SparspakFactorization()
            ]

    for i in 1:4
        prob = LinearProblem(A, x);
        linsolve=init(prob);  
        sol1=solve!(linsolve, algs[i]);
        linsolve.b = x;
        @btime sol2=solve!(linsolve, algs[i]);
    end 
end

solve(A, x)

Interesting. That’s worth digging into. @Oscar_Smith does a quick flamegraph reveal anything?

This is just missing some necessary macro interpolation. This works.

function benchsolve(A, x)
    algs = [
                UMFPACKFactorization(),
                KLUFactorization(),
                MKLPardisoFactorize(),
                SparspakFactorization()
            ]
 
    for alg in algs
        prob = LinearProblem(A, x);
        linsolve=init(prob, alg);
        linsolve.b = x;
        @btime sol2=solve!($linsolve, $alg) setup=$linsolve.b.=$x;
    end 
end

Updated times:

UMFPACKFactorization(), 15.598 ms (0 allocations: 0 bytes)
KLUFactorization(),  13.361 ms (1 allocation: 16 bytes)
MKLPardisoFactorize(), (4 allocations: 15.26 MiB)
SparspakFactorization(), (10 allocations: 22.89 MiB)

lu_A\x;  17.547 ms (4 allocations: 15.26 MiB)
ldiv!(lu_A, x);  15.321 ms (0 allocations: 0 bytes)

I guess there is a slight improvement with KLUFactorization().
But only MKLPardisoFactorize() used my machine resources. The rest used ~30% capacity. So I guess they are not working in parallel.

It is one that I think is inherently pretty much impossible to parallelize. The function runs quickly, and the dependency chain is pretty long. If you want parallelism, your best bet is to have separate cores running different linear solves.

1 Like

Well it’s possible the new ParU method in SuiteSparse could do better. @rayegun how’s that coming along?

Is it normal for the allocations to be high ?

function benchsolve(A, x)
    algs = [
                # UMFPACKFactorization(),
                # KLUFactorization(),
                MKLPardisoFactorize(),
                # SparspakFactorization()
            ]
    for alg in algs
        prob = LinearProblem(A, x);
        linsolve=init(prob, alg);

        for i in 1:500
            solve!(linsolve, alg);
            linsolve.b .= linsolve.u;
        end
    end 
end

@btime benchsolve(A, x);

Gives: 13.833 s (3042 allocations: 7.55 GiB)

No, not normal. Something does seem off there. What’s the reproducer?