Inverting a symmetric matrix is not faster than inverting a random one

I thought that inverting a symmetric matrix would be faster, since there are optimised algorithms for that, and I also saw that there is a method explicitly for Symmetric, but maybe I am overlooking something trivial, so please help :wink:

This is the method I found with methods(inv)

[9] inv(A::LinearAlgebra.Symmetric{<:Any, <:StridedMatrix{T} where T}) in LinearAlgebra at /Users/tamasgal/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.8/LinearAlgebra/src/symmetric.jl:647

and in the example below M is a subtype of that Symmetric type:

julia> N = 1000; A = rand(N, N); M = Symmetric(A'A);

julia> typeof(M) <: LinearAlgebra.Symmetric{<:Any, <:StridedMatrix{T} where T}
true

So why is there no difference between a random matrix and a symmetric one?

julia> using BenchmarkTools, LinearAlgebra

julia> N = 100; A = rand(N, N); M = A'A;

julia> @benchmark inv(M)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  415.816 μs …   5.373 ms  ┊ GC (min … max): 0.00% … 90.70%
 Time  (median):     473.493 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   481.312 μs ± 105.862 μs  ┊ GC (mean ± σ):  0.54% ±  2.66%

  416 μs           Histogram: frequency by time          575 μs <

 Memory estimate: 129.17 KiB, allocs estimate: 5.

julia> N = 100; A = rand(N, N); M = Symmetric(A'A);

julia> @benchmark inv(M)
BenchmarkTools.Trial: 9563 samples with 1 evaluation.
 Range (min … max):  437.397 μs …   5.713 ms  ┊ GC (min … max): 0.00% … 90.19%
 Time  (median):     509.356 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   518.800 μs ± 133.612 μs  ┊ GC (mean ± σ):  0.91% ±  3.63%

  437 μs           Histogram: frequency by time          600 μs <

 Memory estimate: 129.20 KiB, allocs estimate: 6.

Also, I see O(N^2) scaling, so I guess it’s using the Coppersmith–Winograd algorithm and not something specialised like LDU:

julia> N = 10; A = rand(N, N); M = Symmetric(A'A);

julia> @btime inv(M);
  1.442 μs (5 allocations: 6.17 KiB)

julia> N = 100; A = rand(N, N); M = Symmetric(A'A);

julia> @btime inv(M);
  146.709 μs (6 allocations: 129.20 KiB)

julia> N = 1000; A = rand(N, N); M = Symmetric(A'A);

julia> @btime inv(M);
  16.408 ms (6 allocations: 8.13 MiB)

Not an expert here, but when you use BenchmarkTools, don’t forget to interpolate all global variables with a dollar sign to avoid skewed results. I’m assuming it doesn’t change much in this case?

As far as I can tell from the source code, both code paths lead to an lu call and the Symmetric method mostly makes sure that the output also becomes Symmetric.

I suspect that you are drawing overly strong conclusions about complexity from your samples.

If you follow the chain of calls (tip: use @edit inv(Symmetric(randn(2,2)))) you’ll see that in the end the same method (lu) is called. You can access LAPACK’s specialized symmetric routines with something like

using LinearAlgebra   # LAPACK lives in the LinearAlgebra stdlib

function inv_sym(A)
    # Bunch-Kaufman factorization followed by inversion; both overwrite A in place
    A, ipiv, info = LAPACK.sytrf!('U', A)
    LAPACK.sytri!('U', A, ipiv)
end
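
Note that sytrf!/sytri! overwrite their input and only fill the requested triangle of the result, so a non-mutating variant might look like the following sketch (inv_sym_copy is just an illustrative name):

function inv_sym_copy(A::AbstractMatrix)
    B = Matrix(A)                       # work on a copy instead of overwriting A
    B, ipiv, _ = LAPACK.sytrf!('U', B)  # Bunch-Kaufman factorization
    LAPACK.sytri!('U', B, ipiv)         # invert in place using the factorization
    return Symmetric(B, :U)             # only the 'U' triangle of B holds the inverse
end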

I tested on 1000×1000, 100×100 and 20×20 matrices.

With OpenBLAS, for 1k and 100, both this and inv took about the same time; for 20, the second (inv, so just lu) was about 3x faster.

With MKL, for 1k it took the same time, for 100 inv was about twice as slow, and for 20 inv was about 3x as slow.

In my experience OpenBLAS regularly displays very strange performance characteristics. I’d start by using MKL, then maybe investigate direct LAPACK calls if you find the Julia code paths are suboptimal. (The other thing is that there are very few cases where you need explicit inverses, so maybe start by seeing if you can’t get rid of them.)
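
For reference, switching to MKL is just a package load (a sketch, assuming the MKL.jl package is installed; it only helps on x86 CPUs):

using MKL             # swaps the default BLAS/LAPACK backend to MKL
using LinearAlgebra

BLAS.get_config()     # shows which BLAS/LAPACK libraries are currently active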

This has nothing to do with the underlying algorithm (which is not Coppersmith–Winograd or any other fancy algorithm, just plain old O(N^3)), and is probably just a sign that the problem is bandwidth-bound. You should see O(N^3) scaling for much larger matrices.

No worries, I checked that before of course. No problems here :wink:

This isn’t just symmetric, it’s symmetric positive-definite, in which case you could call the Cholesky factorization, which really is faster than lu.
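
For example (a rough sketch; exact timings will vary by machine and BLAS):

using LinearAlgebra, BenchmarkTools

N = 1000; A = rand(N, N); M = Symmetric(A'A);   # positive definite by construction

@btime inv(cholesky($M));   # inverse computed from the Cholesky factorization
@btime inv($M);             # default path, which goes through lu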

(Of course, the usual caveats about matrix inversion apply: most of the time, you don’t need to compute an inverse explicitly, because you can just work with the factorization instead, e.g. to repeatedly solve different right-hand sides. When you do A \ b, it never computes a matrix inverse.)
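
A minimal sketch of that pattern, reusing M from above (b and c are just example right-hand sides):

b = rand(N); c = rand(N);

F = cholesky(M)   # factor once (O(N^3))
x = F \ b         # each subsequent solve is only O(N^2)
y = F \ c         # the same factorization handles further right-hand sides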

Thanks, I am a bit confused since inv_sym() is about twice as slow as inv(), at least on my M1. I don’t have access to MKL on this machine since it’s not Intel.

Anyway, I think it would make sense to use the power of (multiple) dispatch to call the optimised algorithm, otherwise I don’t really see the point of a special method for inv(). I mean, the first thing that comes to my mind is that it’s probably choosing a better algorithm…

julia> N = 100; A = rand(N, N); M = A'A;

julia> @benchmark inv($M)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  385.735 μs …  14.834 ms  ┊ GC (min … max): 0.00% … 96.37%
 Time  (median):     467.223 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   477.210 μs ± 208.396 μs  ┊ GC (mean ± σ):  1.18% ±  3.48%

  386 μs           Histogram: frequency by time          564 μs <

 Memory estimate: 129.17 KiB, allocs estimate: 5.

julia> @benchmark inv_sym($M)
BenchmarkTools.Trial: 6344 samples with 1 evaluation.
 Range (min … max):  676.431 μs …  10.389 ms  ┊ GC (min … max): 0.00% … 92.72%
 Time  (median):     768.999 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   784.270 μs ± 195.641 μs  ┊ GC (mean ± σ):  0.64% ±  2.66%

  676 μs           Histogram: frequency by time          928 μs <

 Memory estimate: 51.88 KiB, allocs estimate: 4.

Anyway, I think it would make sense to use the power of (multiple) dispatch to call the optimised algorithm, otherwise I don’t really see the point of a special method for inv().

The problem is that which algorithm is faster depends very much on your matrix size, BLAS implementation and hardware, so it’s tricky for Julia to know the right thing to do. In this case, I suspect it’s just that OpenBLAS’s LDLT is not very good. There’s the open issue BLAS support for M1 ARM64 via Apple's Accelerate · Issue #42312 · JuliaLang/julia · GitHub for a faster BLAS on Apple hardware.

julia> N = 4000; A = rand(N, N); M = Symmetric(A'A);

julia> @benchmark inv($M)
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (min … max):  537.033 ms … 539.106 ms  ┊ GC (min … max): 0.14% … 0.08%
 Time  (median):     538.214 ms               ┊ GC (median):    0.14%
 Time  (mean ± σ):   538.154 ms ± 673.172 μs  ┊ GC (mean ± σ):  0.12% ± 0.03%

  537 ms           Histogram: frequency by time          539 ms <

 Memory estimate: 139.68 MiB, allocs estimate: 6.

julia> N = 8000; A = rand(N, N); M = Symmetric(A'A);

julia> @benchmark inv($M)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  4.096 s …  4.107 s  ┊ GC (min … max): 0.01% … 0.01%
 Time  (median):     4.101 s             ┊ GC (median):    0.01%
 Time  (mean ± σ):   4.101 s ± 7.605 ms  ┊ GC (mean ± σ):  0.01% ± 0.00%

  4.1 s         Histogram: frequency by time        4.11 s <

 Memory estimate: 529.36 MiB, allocs estimate: 6.

As others have noted, this uses LU, which calls LAPACK’s dgetrf. LAPACK does blocking to improve performance where appropriate, but the routines opt out of this for small matrices. Somewhere in your increasing n, LAPACK would have switched to using blocking, effectively changing the details of the algorithm. And in general LAPACK is not especially optimized for smaller matrices. You need to go to some larger matrices if you want to observe the O(n^3) behavior. I’ve never heard of Coppersmith–Winograd actually being used for real matrix computations.

Regarding the quadratic time you’re seeing: that’s just because LU factorizations have a large quadratic overhead and show quadratic-looking runtime for a shockingly long time. Looking at the plot below, you can see that for sizes between 8x8 and 512x512, the time for inv is almost completely quadratic, but for 2048x2048 and above, the time is almost completely cubic.

7 posts were split to a new topic: Asymptotically faster matrix-multiplication algorithms

As you observed yourself, it’s not clear that the specialized algorithms are necessarily faster, which might explain why this hasn’t been implemented.

It allows propagation of the Symmetric type, which seems worthwhile in itself since later computations might be able to make good use of it.
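
A small illustration of what that buys you (a sketch):

using LinearAlgebra

B = rand(3, 3)
M = Symmetric(B'B)
Minv = inv(M)
Minv isa Symmetric   # true: the Symmetric wrapper is preserved, so later calls
                     # (e.g. eigen(Minv)) can dispatch to symmetric-specialized methods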

That’s a very interesting plot. I tried to predict the crossing point with a back-of-the-envelope computation, using data from https://freecontent.manning.com/performance-limits-and-profiling/, which quotes 260 Gflops/s and a bandwidth of 22 GB/s. Given that we transfer 8N^2 bytes and do N^3 flops on them, that suggests the crossing point is around N = 100, whereas we observe a much higher crossing point. Any idea what is wrong in the above computation? I ignored caches, but cache effects, if anything, should decrease the crossing point, not increase it. Or is memory latency rather than bandwidth the limiting factor?
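
(Spelling the estimate out, ignoring constant factors such as the 2/3 in LU’s flop count: \frac{N^3}{260 \times 10^9 \,\text{flop/s}} \approx \frac{8 N^2}{22 \times 10^9 \,\text{B/s}} \;\Rightarrow\; N \approx \frac{8 \times 260}{22} \approx 95.)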

Absolutely, propagating the (symmetric) type of course makes sense; I did not think about that!

There are a few things you’re missing. The first is multi-threading, which lets you gain extra efficiency in the 100-400 range. Also, pivoting and computing the inverse on top of the LU add a decent amount of n^2 overhead.

Computing the inv on top of LU is O(n^3).

In general, multithreading makes this a lot more difficult to make sense of, because as the problem gets bigger it starts to benefit from more threads, somewhat counteracting the O(n^3) growth. To separate out this effect, I would benchmark with a single thread:

julia> using LinearAlgebra

julia> BLAS.set_num_threads(1)

With a single thread, I’ve typically gotten pretty clear \Theta(n^3) LU behavior for n \gtrsim 100 in the past, e.g. in this notebook.
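
A quick way to check the scaling yourself (a sketch; the sizes are illustrative and the larger ones take a while):

using LinearAlgebra, BenchmarkTools
BLAS.set_num_threads(1)

times = Float64[]
for N in (500, 1000, 2000, 4000)
    A = rand(N, N); M = Symmetric(A'A)
    push!(times, @belapsed inv($M))
end
# for cubic scaling, each doubling of N should increase the time by roughly 8x
println(times[2:end] ./ times[1:end-1])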
