Worse Performance with Approximate Minimum Degree Reordering of Sparse Matrices + LU Than `\`?

I’m trying to test the supposed performance boost (as seen in Matlab) when AMD reordering is applied before LU factorization with lu when solving Ax = b, where A is a real-valued n \times n sparse matrix. Here is my attempt, with the same data (A is about 5200 \times 5200) loaded from the Matlab function (the original formulation is J\Delta x = -F):

using SparseArrays, LinearAlgebra, AMD, BenchmarkTools, MAT

A = matread("test/case2869_J.mat")["J"];
b = -matread("test/case2869_F.mat")["F"];
p = amd(A);

native_sol(A, b) = A \ b
lu_sol(A, b) = lu(A) \ b
function lu_amd_sol(A, b, p)
    x = zero.(b)
    x[p] = lu(A[p,p]) \ b[p]  # factorize the AMD-permuted system, then scatter back to undo the permutation
    return x
end

@info("native"); @btime native_sol($A, $b);     @assert(norm(A*native_sol(A,b) - b) <= 1e-10)
@info("lu");     @btime lu_sol($A, $b);         @assert(norm(A*lu_sol(A,b) - b) <= 1e-10)
@info("lu amd"); @btime lu_amd_sol($A, $b, $p); @assert(norm(A*lu_amd_sol(A,b,p) - b) <= 1e-10)

Here is the result:

[ Info: native
  13.180 ms (76 allocations: 7.79 MiB)
[ Info: lu
  13.296 ms (74 allocations: 7.75 MiB)
[ Info: lu amd
  13.655 ms (99 allocations: 8.66 MiB)

In Matlab, the AMD+LU method is about 2.8 times faster (~4.9 ms on average). Not only is there no performance improvement in Julia, but the built-in \ operator is even faster than either the lu or the lu_amd version.

The sparsity patterns of J before and after AMD reordering are shown below (plotted with spy in Matlab; Plots.spy in Julia gives the same result):

[spy plots of J before and after AMD reordering]

Any idea why this might be? Perhaps I’m doing something wrong?

Thanks in advance!


A \ b and lu(A) \ b should do exactly the same thing. From the documentation of \ (accessible by typing ?\ in the REPL):

For non-triangular square matrices, an LU factorization is used.
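
You can check this equivalence directly (a quick sketch, reusing the A and b from your script):

x1 = A \ b
x2 = lu(A) \ b
x1 ≈ x2   # true: for a general square sparse A, both go through the same UMFPACK LU path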

Furthermore, it seems that lu(::SparseMatrixCSC) automatically applies a fill-reducing ordering and does not provide an option to turn this off. The documentation isn’t quite clear, but I recommend you have a look at it yourself by typing ?lu and scrolling down to the lu(A::SparseMatrixCSC; check = true) -> F::UmfpackLU section. What this means is that there is no benefit in replacing lu(A) with (p = amd(A); lu(A[p,p])), since lu(A) will not preserve the AMD ordering but will instead apply its own fill-reducing ordering.
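
For instance, you can inspect the ordering UMFPACK chose through the factorization object (F.q is the right permutation vector listed in the ?lu documentation):

F = lu(A)
issorted(F.q)   # typically false: UMFPACK applied its own fill-reducing column ordering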

I suspect the better performance of Matlab arises because Matlab recognises that your matrix is symmetric and uses a Cholesky instead of an LU factorisation. As far as I know, Julia does not do that for you, but you can achieve the same effect by calling cholesky() instead of lu(). Like lu(), cholesky() will by default automatically apply a fill-reducing ordering, but this function also allows you to specify your own ordering through cholesky(A, perm = amd(A)). Once again, I recommend you have a look at the documentation by typing ?cholesky and scrolling down to the cholesky(A; shift = 0.0, check = true, perm = nothing) -> CHOLMOD.Factor section.
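
Concretely, something along these lines (a sketch; amd is from AMD.jl as in your code):

F_default = cholesky(A)                 # CHOLMOD picks its own fill-reducing ordering
F_amd     = cholesky(A, perm = amd(A))  # force the AMD ordering instead
x = F_amd \ b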

Hope this helps. If it does, I would be curious to see the runtimes for cholesky(A) and cholesky(A, perm=amd(A)).


Thanks for your reply! That’s very interesting; I didn’t realize \ already does the LU factorization for you. Should I maybe open an issue about clarifying the documentation and/or adding a custom-permutation option to lu?

I think Matlab actually does use LU + AMD rather than Cholesky; here’s their relevant source code:

q = amd(A);     %% permutation vector for AMD reordering
[L, U, p] = lu(A(q,q), 'vector');
x = zeros(size(A, 1), 1);
x(q) = U \ ( L \ b(q(p)) );

Unfortunately, the matrices I’m working with are generally not positive definite or perfectly symmetric, so I can’t use Cholesky here. However, I played around with cholesky() on randomly generated symmetric positive definite matrices of roughly the same size and number of nonzeros; here are the code and results:

using SparseArrays, LinearAlgebra, AMD, BenchmarkTools, Arpack

function amd_build(n::Int)
    ts = time()
    psorted = true
    A = spzeros(n, n)
    I = spdiagm(0 => ones(n))  # sparse identity (shadows LinearAlgebra.I locally)
    p = zeros(Int, n)

    while psorted && time() - ts < 60.0  # retry for up to 60 s if amd returns the trivial (sorted) permutation
        A = sprand(n, n, 0.001)
        A = 0.5 .* (A + A')    # symmetrize
        A .+= 20 * I           # shift the diagonal so A is diagonally dominant, hence positive definite
        p = amd(A)
        psorted = issorted(p)
    end

    psorted && @warn("p is still sorted")
    return A, p
end

n = 5200;
b = randn(n);
A, p = amd_build(n);
@info("Nonzeros in A: $(nnz(A))")
@assert(issymmetric(A))
@assert(isposdef(A))

@info("Cholesky");
Ac = cholesky(A);
sol1 = Ac \ b;
@assert(norm(A*sol1 - b) <= 1e-10);
@btime $Ac \ $b;

@info("Cholesky + amd");
Aca = cholesky(A, perm=p);
sol2 = Aca \ b;
@assert(norm(A*sol2 - b) <= 1e-10);
@btime $Aca \ $b;

And the result:

[ Info: Nonzeros in A: 59110
[ Info: Cholesky
  4.186 ms (12 allocations: 180.43 KiB)
[ Info: Cholesky + amd
  4.202 ms (12 allocations: 180.43 KiB)

My Julia version info:

Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = "C:\Users\Me\AppData\Local\atom\app-1.49.0\atom.exe"  -a
  JULIA_NUM_THREADS = 6

This is pretty strange…

For non-positive-definite matrices you can use the ldlt() factorisation. Also, another thing worth trying is Symmetric(A) \ b. The extra Symmetric wrapper indicates that you know the matrix in question is symmetric, so that \ can dispatch to more efficient algorithms (e.g. ldlt).
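
A minimal sketch, assuming A is a sparse symmetric (possibly indefinite) matrix:

F = ldlt(A)            # LDL' via CHOLMOD; does not require positive definiteness
x1 = F \ b
x2 = Symmetric(A) \ b  # the wrapper lets \ dispatch to a symmetry-aware solver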

I don’t think your runtime observations are strange. If you don’t specify a permutation, cholesky() will determine one for you, and apparently the one it finds is about as good as AMD. If you want to compare against an unpermuted factorization, use perm = 1:n.
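
That is (sketch):

F_unperm = cholesky(A, perm = 1:size(A, 1))  # identity permutation: no fill-reducing reordering
F_amd    = cholesky(A, perm = amd(A))        # explicit AMD ordering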


You’re right; here are the results for unpermuted Cholesky vs AMD Cholesky:

[ Info: Nonzeros in A: 59466
[ Info: Cholesky
  7.399 ms (12 allocations: 181.81 KiB)
[ Info: Cholesky + amd
  4.148 ms (12 allocations: 179.90 KiB)

Back to my application: the matrices I have are not actually symmetric, so the result of J \ -F is not equal to that of Symmetric(J) \ -F (the Symmetric wrapper silently mirrors the upper triangle onto the lower one), although performance-wise the latter does improve to the Matlab level.

I opened an issue on adding perm support in lu here.
