# Inverting a symmetric matrix is not faster than inverting a random one

I thought that inverting a symmetric matrix would be faster, since there are optimised algorithms for that case, and I also saw that there is a method explicitly for `Symmetric`. But maybe I am overlooking something trivial, so please help.

This is the method I found with `methods(inv)`:

``````
[9] inv(A::LinearAlgebra.Symmetric{<:Any, <:StridedMatrix{T} where T}) in LinearAlgebra at /Users/tamasgal/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.8/LinearAlgebra/src/symmetric.jl:647
``````

and in the example below the type of `M` is a subtype of that `Symmetric` signature:

``````
julia> N = 1000; A = rand(N, N); M = Symmetric(A'A);

julia> typeof(M) <: LinearAlgebra.Symmetric{<:Any, <:StridedMatrix{T} where T}
true
``````

So why is there no difference between a random matrix and a symmetric one?

``````
julia> using BenchmarkTools, LinearAlgebra

julia> N = 100; A = rand(N, N); M = A'A;

julia> @benchmark inv(M)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  415.816 μs …   5.373 ms  ┊ GC (min … max): 0.00% … 90.70%
 Time  (median):     473.493 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   481.312 μs ± 105.862 μs  ┊ GC (mean ± σ):  0.54% ±  2.66%

  [histogram omitted]
  416 μs           Histogram: frequency by time          575 μs <

 Memory estimate: 129.17 KiB, allocs estimate: 5.

julia> N = 100; A = rand(N, N); M = Symmetric(A'A);

julia> @benchmark inv(M)
BenchmarkTools.Trial: 9563 samples with 1 evaluation.
 Range (min … max):  437.397 μs …   5.713 ms  ┊ GC (min … max): 0.00% … 90.19%
 Time  (median):     509.356 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   518.800 μs ± 133.612 μs  ┊ GC (mean ± σ):  0.91% ±  3.63%

  [histogram omitted]
  437 μs           Histogram: frequency by time          600 μs <

 Memory estimate: 129.20 KiB, allocs estimate: 6.
``````

Also, I see O(N^2) scaling, so I guess it's using the Coppersmith–Winograd algorithm and not something specialised like LDU:

``````
julia> N = 10; A = rand(N, N); M = Symmetric(A'A);

julia> @btime inv(M);
  1.442 μs (5 allocations: 6.17 KiB)

julia> N = 100; A = rand(N, N); M = Symmetric(A'A);

julia> @btime inv(M);
  146.709 μs (6 allocations: 129.20 KiB)

julia> N = 1000; A = rand(N, N); M = Symmetric(A'A);

julia> @btime inv(M);
  16.408 ms (6 allocations: 8.13 MiB)
``````

Not an expert here, but when you use BenchmarkTools, don't forget to interpolate all global variables with a dollar sign to avoid skewed results. I'm assuming it doesn't change much in this case?
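For concreteness, interpolation just means splicing the global in with `$` (a minimal sketch; the matrix and size are chosen arbitrarily):

```julia
using BenchmarkTools, LinearAlgebra

N = 100; A = rand(N, N); M = Symmetric(A'A)

# Without $, M is a non-constant global: every evaluation pays for a
# dynamic lookup and dispatch on top of the work being measured.
@btime inv(M);

# With $, the current value of M is spliced into the benchmark
# expression, so only inv itself is timed.
@btime inv($M);
```

For an operation in the hundreds of microseconds the difference is indeed in the noise; it matters far more for sub-microsecond benchmarks.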

As far as I can tell from the source code, both code paths lead to an `lu` call, and the `Symmetric` method mostly makes sure that the output is also `Symmetric`.

I suspect that you are drawing overly strong complexity conclusions from your samples.


If you follow the chain of calls (tip: use `@edit inv(Symmetric(randn(2,2)))`) you'll see that in the end the same method (`lu`) is called. You can access LAPACK's specialized symmetric routines with something like

``````
function inv_sym(A)
    # Bunch–Kaufman factorization of the upper triangle (overwrites A)
    A, ipiv, info = LAPACK.sytrf!('U', A)
    # invert in place, reusing the factorization and pivots
    LAPACK.sytri!('U', A, ipiv)
end
``````
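One caveat with this helper: `sytrf!` overwrites its argument, and `sytri!` only writes the inverse into the requested triangle, so you typically want to pass a copy and re-wrap the result. A hypothetical usage sketch (redefining the same helper so the snippet is self-contained):

```julia
using LinearAlgebra

# same helper as above
function inv_sym(A)
    A, ipiv, info = LAPACK.sytrf!('U', A)
    LAPACK.sytri!('U', A, ipiv)
end

N = 100
A = rand(N, N)
M = A'A                      # dense symmetric matrix

U = inv_sym(copy(M))         # pass a copy: the factorization works in place
Minv = Symmetric(U, :U)      # only the upper triangle of U holds the inverse

Minv * M ≈ I                 # should hold up to rounding
```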

I tested on 1000×1000, 100×100, and 20×20 matrices.

With OpenBLAS, for 1k and 100, both this and `inv` took about the same time; for 20, plain `inv` (so just `lu`) was about 3x faster.

With MKL, for 1k both took the same time; for 100, `inv` was about twice as slow; and for 20, `inv` was about 3x as slow.

In my experience OpenBLAS regularly displays very strange performance characteristics. I'd start by using MKL, then maybe investigate direct LAPACK calls if you find the Julia code paths are suboptimal. (The other thing is that there are very few cases where you need explicit inverses, so maybe start by seeing if you can't get rid of them.)


This has nothing to do with the underlying algorithm (which is not Coppersmith–Winograd or any other fancy algorithm, just plain old O(N^3)), and is probably just a sign that the problem is bandwidth-bound. You should see O(N^3) scaling for much larger matrices.


No worries, I checked that before, of course. No problems here.

This isn't just symmetric, it's symmetric positive-definite, in which case you could call the `cholesky` factorization, which really is faster than `lu`.

(Of course, the usual caveats about matrix inversion apply: most of the time you don't need to compute an inverse explicitly, because you can just work with the factorization instead, e.g. to repeatedly solve different right-hand sides. When you do `A \ b`, it never computes a matrix inverse.)
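A minimal sketch of that advice (the `N * I` shift is only there to guarantee the example matrix is well-conditioned):

```julia
using LinearAlgebra

N = 100
A = rand(N, N)
M = Symmetric(A'A + N * I)   # symmetric positive-definite
b = rand(N)

C = cholesky(M)              # SPD-specialized factorization

# If you really need the explicit inverse, build it from the factorization:
Minv = inv(C)

# Usually you don't: solving with the factorization is cheaper, more
# accurate, and C can be reused for many right-hand sides.
x = C \ b

Minv * b ≈ x                 # both approaches agree up to rounding
```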


Thanks, I am a bit confused, since `inv_sym()` is twice as slow compared to `inv()`, at least on my M1. I don't have access to MKL on this machine since it's not Intel.

Anyway, I think it would make sense to use the power of (multiple) dispatch to call the optimised algorithm; otherwise I don't really see the point of a special method for `inv()`. I mean, the first thing that comes to my mind is that it's probably choosing a better algorithm…

``````
julia> N = 100; A = rand(N, N); M = A'A;

julia> @benchmark inv($M)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  385.735 μs …  14.834 ms  ┊ GC (min … max): 0.00% … 96.37%
 Time  (median):     467.223 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   477.210 μs ± 208.396 μs  ┊ GC (mean ± σ):  1.18% ±  3.48%

  [histogram omitted]
  386 μs           Histogram: frequency by time          564 μs <

 Memory estimate: 129.17 KiB, allocs estimate: 5.

julia> @benchmark inv_sym($M)
BenchmarkTools.Trial: 6344 samples with 1 evaluation.
 Range (min … max):  676.431 μs …  10.389 ms  ┊ GC (min … max): 0.00% … 92.72%
 Time  (median):     768.999 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   784.270 μs ± 195.641 μs  ┊ GC (mean ± σ):  0.64% ±  2.66%

  [histogram omitted]
  676 μs           Histogram: frequency by time          928 μs <

 Memory estimate: 51.88 KiB, allocs estimate: 4.
``````

> Anyway, I think it would make sense to use the power of (multiple) dispatch to call the optimised algorithm, otherwise I don't really see the point of a special method for `inv()`.

The problem is that which algorithm is faster depends very much on your matrix size, BLAS implementation, and hardware, so it's tricky for Julia to know the right thing to do. In this case, I suspect it's just that OpenBLAS's LDLT is not very good. There's the open issue BLAS support for M1 ARM64 via Apple's Accelerate · Issue #42312 · JuliaLang/julia · GitHub for a faster BLAS on Apple hardware.

``````
julia> N = 4000; A = rand(N, N); M = Symmetric(A'A);

julia> @benchmark inv($M)
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (min … max):  537.033 ms … 539.106 ms  ┊ GC (min … max): 0.14% … 0.08%
 Time  (median):     538.214 ms               ┊ GC (median):    0.14%
 Time  (mean ± σ):   538.154 ms ± 673.172 μs  ┊ GC (mean ± σ):  0.12% ± 0.03%

  [histogram omitted]
  537 ms           Histogram: frequency by time          539 ms <

 Memory estimate: 139.68 MiB, allocs estimate: 6.

julia> N = 8000; A = rand(N, N); M = Symmetric(A'A);

julia> @benchmark inv($M)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  4.096 s …  4.107 s  ┊ GC (min … max): 0.01% … 0.01%
 Time  (median):     4.101 s             ┊ GC (median):    0.01%
 Time  (mean ± σ):   4.101 s ± 7.605 ms  ┊ GC (mean ± σ):  0.01% ± 0.00%

  [histogram omitted]
  4.1 s         Histogram: frequency by time        4.11 s <

 Memory estimate: 529.36 MiB, allocs estimate: 6.
``````

As others have noted, this uses LU, which calls LAPACK's `dgetrf`. LAPACK does blocking to improve performance where appropriate, but the routines opt out of this for small matrices. Somewhere in your increasing N, LAPACK would have switched to using blocking, effectively changing the details of the algorithm. And in general LAPACK is not especially optimized for smaller matrices. You need to go to some larger matrices if you want to observe the O(N^3) behavior. I've never heard of Coppersmith–Winograd actually being used for real matrix computations.


Regarding the quadratic time you're seeing: LU factorizations have a high quadratic overhead term and show quadratic-looking runtime for a shockingly long time. Looking at the plot below, you can see that for sizes between 8x8 and 512x512 the time for `inv` is almost completely quadratic, but for 2048x2048 and above it is almost completely cubic.
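One way to see this transition yourself is to estimate the local scaling exponent from timings at doubling sizes. A rough sketch using `@elapsed` (not a careful BenchmarkTools run, so expect noise):

```julia
using LinearAlgebra

# t(2n)/t(n) ≈ 2^p, where p is the local scaling exponent:
# p stays near 2 for small n and approaches 3 for large n.
function local_exponents(sizes)
    times = map(sizes) do n
        A = rand(n, n)
        M = Symmetric(A'A)
        inv(M)                                   # warm-up (compilation)
        minimum(@elapsed(inv(M)) for _ in 1:3)   # best of three runs
    end
    [log2(times[i+1] / times[i]) for i in 1:length(times)-1]
end

local_exponents([128, 256, 512, 1024])
```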


7 posts were split to a new topic: Asymptotically faster matrix-multiplication algorithms

As you observed yourself, it's not clear that the specialized algorithms are necessarily faster, so that might explain why this hasn't been implemented.

It allows propagation of the `Symmetric` type, which seems worthwhile in itself, since later computations might be able to make good use of it.
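For example, the symmetry tag survives the inversion, so downstream operations keep dispatching to symmetric-specialized methods (a small sketch):

```julia
using LinearAlgebra

A = rand(50, 50)
M = Symmetric(A'A)

Minv = inv(M)
Minv isa Symmetric     # true: the special inv method re-wraps its result

# Later calls can exploit the tag, e.g. the eigensolver for Symmetric
# matrices returns guaranteed-real eigenvalues.
vals = eigvals(Minv)
eltype(vals)           # a real element type, not Complex
```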


Thatβs a very interesting plot. I tried to predict the crossing point with a back of the envelope computation, using data from Performance Limits and Profiling - Manning which quotes 260Gflops/s and a bandwidth of 22GB/s. Given we transfer 8 N^2 bytes and do N^3 flops on them, that suggests the crossing point is around N=100, whereas we observe a much higher crossing point. Any idea what is wrong in the above computation? I ignored caches, but cache effects if anything should decrease the crossing point, not increase it. Or is memory latency rather than bandwidth the limiting factor?


Absolutely, propagating the type (symmetric) of course makes sense; I did not think about that!

Thereβs a few things youβre missing. The first is multi-threading which lets you gain extra efficiency in the 100-400 range. Also, pivoting and computing the `inv` on top of the `LU` add a decent amount of n^2 overhead.

Computing the `inv` on top of `LU` is O(n^3).

In general, multithreading makes this a lot more difficult to make sense of, because as the problem gets bigger it starts to benefit from more threads, somewhat counteracting the O(n^3) growth. To separate out this effect, I would benchmark with a single thread:

``````
julia> using LinearAlgebra