I thought that inverting a symmetric matrix would be faster, since there are optimised algorithms for that case and I also saw that there is a method explicitly for Symmetric, but maybe I am overlooking something trivial, so please help.
This is the method I found with methods(inv)
[9] inv(A::LinearAlgebra.Symmetric{<:Any, <:StridedMatrix{T} where T}) in LinearAlgebra at /Users/tamasgal/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.8/LinearAlgebra/src/symmetric.jl:647
and in the example below M is a subtype of that Symmetric type:
julia> N = 1000; A = rand(N, N); M = Symmetric(A'A);
julia> typeof(M) <: LinearAlgebra.Symmetric{<:Any, <:StridedMatrix{T} where T}
true
So why is there no difference between a random matrix and a symmetric one?
julia> using BenchmarkTools, LinearAlgebra
julia> N = 100; A = rand(N, N); M = A'A;
julia> @benchmark inv(M)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  415.816 μs …   5.373 ms  ┊ GC (min … max): 0.00% … 90.70%
 Time  (median):     473.493 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   481.312 μs ± 105.862 μs  ┊ GC (mean ± σ):  0.54% ± 2.66%
 [histogram: 416 μs to 575 μs, frequency by time]
Memory estimate: 129.17 KiB, allocs estimate: 5.
julia> N = 100; A = rand(N, N); M = Symmetric(A'A);
julia> @benchmark inv(M)
BenchmarkTools.Trial: 9563 samples with 1 evaluation.
 Range (min … max):  437.397 μs …   5.713 ms  ┊ GC (min … max): 0.00% … 90.19%
 Time  (median):     509.356 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   518.800 μs ± 133.612 μs  ┊ GC (mean ± σ):  0.91% ± 3.63%
 [histogram: 437 μs to 600 μs, frequency by time]
Memory estimate: 129.20 KiB, allocs estimate: 6.
Also, I see O(N^2) scaling, so I guess it's using the Coppersmith-Winograd algorithm and not something specialised like an LDU factorisation:
julia> N = 10; A = rand(N, N); M = Symmetric(A'A);
julia> @btime inv(M);
1.442 μs (5 allocations: 6.17 KiB)
julia> N = 100; A = rand(N, N); M = Symmetric(A'A);
julia> @btime inv(M);
146.709 μs (6 allocations: 129.20 KiB)
julia> N = 1000; A = rand(N, N); M = Symmetric(A'A);
julia> @btime inv(M);
16.408 ms (6 allocations: 8.13 MiB)
Not an expert here, but when you use BenchmarkTools, don't forget to interpolate all global variables with a dollar sign to avoid skewed results. I'm assuming it doesn't change much in this case?
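For example, something like this (just a minimal sketch reusing the M from above):
julia> using BenchmarkTools, LinearAlgebra
julia> N = 100; A = rand(N, N); M = Symmetric(A'A);
julia> @benchmark inv($M)   # $M interpolates the global, so looking up M is not part of what gets timed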
As far as I can tell from the source code, both code paths lead to an lu call and the Symmetric method mostly makes sure that the output also becomes Symmetric.
I suspect that you are drawing too strong a conclusion about complexity from your samples.
If you follow the chain of calls (tip: use @edit inv(Symmetric(randn(2,2)))) you'll see that in the end the same method (lu) is called. You can access LAPACK's specialized symmetric routines with something like:
function inv_sym(A)
    # Bunch-Kaufman LDLT factorization of the symmetric matrix A, computed in place
    A, ipiv, info = LAPACK.sytrf!('U', A)
    # invert from that factorization; the inverse is stored in the upper triangle of A
    LAPACK.sytri!('U', A, ipiv)
end
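Note that sytrf!/sytri! overwrite their input and, as far as I know, only fill the 'U' triangle of the result, so you would call this on a dense copy, e.g. (just a usage sketch):
julia> using LinearAlgebra
julia> N = 100; A = rand(N, N); M = Symmetric(A'A);
julia> Minv = Symmetric(inv_sym(Matrix(M)));   # Matrix(M) makes a dense copy that inv_sym can overwrite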
I tested on 1000x1000, 100x100 and 20x20 matrices.
With OpenBLAS, for 1k and 100, both this and inv took about the same time; for 20, the second (inv, so just lu) was about 3x faster.
With MKL, for 1k it took the same time, for 100 inv was about twice as slow, and for 20 inv was about 3x as slow.
In my experience OpenBLAS regularly displays very strange performance characteristics. I'd start by using MKL, then maybe investigate direct LAPACK calls if you find the Julia code paths are suboptimal. (The other thing is that there are very few cases where you need explicit inverses, so maybe start by seeing if you can't get rid of them.)
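A minimal sketch of the MKL route, assuming the MKL.jl package is installed in your environment (loading it swaps the libblastrampoline backend):
julia> using MKL                  # switches the BLAS/LAPACK backend from OpenBLAS to MKL
julia> using LinearAlgebra
julia> BLAS.get_config()          # should now report an MKL library as the active backend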
This has nothing to do with the underlying algorithm (which is not Coppersmith-Winograd or any other fancy algorithm, just plain old O(N^3)); it is probably just a sign that the problem is bandwidth-bound. You should see O(N^3) scaling for much larger matrices.
This isn't just symmetric, it's symmetric positive-definite, in which case you could call the cholesky factorization, which really is faster than lu.
(Of course, the usual caveats about matrix inversion apply: most of the time you don't need to compute an inverse explicitly, because you can just work with the factorization instead, e.g. to repeatedly solve for different right-hand sides. When you do A \ b it never computes a matrix inverse.)
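A minimal sketch of that, using only plain LinearAlgebra calls:
julia> using LinearAlgebra
julia> N = 100; A = rand(N, N); M = Symmetric(A'A);   # positive-definite by construction (almost surely)
julia> C = cholesky(M);    # specialised SPD factorization (LAPACK potrf)
julia> x = C \ rand(N);    # solve M x = b from the factorization, no explicit inverse needed
julia> Minv = inv(C);      # only if you really do need the explicit inverse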
Thanks, I am a bit confused since inv_sym() is twice as slow compared to inv(), at least on my M1. I don't have access to MKL on this machine since it's not Intel.
Anyway, I think it would make sense to use the power of (multiple) dispatch to call the optimised algorithm, otherwise I don't really see the point of a special method for inv(). I mean, the first thing that comes to my mind is that it's probably choosing a better algorithm…
julia> N = 100; A = rand(N, N); M = A'A;
julia> @benchmark inv($M)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  385.735 μs …  14.834 ms  ┊ GC (min … max): 0.00% … 96.37%
 Time  (median):     467.223 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   477.210 μs ± 208.396 μs  ┊ GC (mean ± σ):  1.18% ± 3.48%
 [histogram: 386 μs to 564 μs, frequency by time]
Memory estimate: 129.17 KiB, allocs estimate: 5.
julia> @benchmark inv_sym($M)
BenchmarkTools.Trial: 6344 samples with 1 evaluation.
 Range (min … max):  676.431 μs …  10.389 ms  ┊ GC (min … max): 0.00% … 92.72%
 Time  (median):     768.999 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   784.270 μs ± 195.641 μs  ┊ GC (mean ± σ):  0.64% ± 2.66%
 [histogram: 676 μs to 928 μs, frequency by time]
Memory estimate: 51.88 KiB, allocs estimate: 4.
Anyway, I think it would make sense to use the power of (multiple) dispatch to call the optimised algorithm, otherwise I don't really see the point of a special method for inv().
The problem is that which algorithm is faster very much depends on your matrix size, BLAS implementation and hardware, so it's tricky for Julia to know the right thing to do. In this case, I suspect it's just that OpenBLAS's LDLT is not very good. There's the open issue "BLAS support for M1 ARM64 via Apple's Accelerate" (Issue #42312 · JuliaLang/julia · GitHub) for a faster BLAS on Apple.
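If you want to see the factorization cost in isolation, one way (a rough sketch, not a rigorous comparison) is to time the raw LAPACK calls behind each route:
julia> using BenchmarkTools, LinearAlgebra
julia> N = 100; A = rand(N, N); S = A'A;
julia> @btime LAPACK.getrf!(B) setup=(B = copy($S)) evals=1;       # LU factorization, the inv path
julia> @btime LAPACK.sytrf!('U', B) setup=(B = copy($S)) evals=1;  # Bunch-Kaufman LDLT, the inv_sym path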
julia> N = 4000; A = rand(N, N); M = Symmetric(A'A);
julia> @benchmark inv($M)
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (min … max):  537.033 ms … 539.106 ms  ┊ GC (min … max): 0.14% … 0.08%
 Time  (median):     538.214 ms               ┊ GC (median):    0.14%
 Time  (mean ± σ):   538.154 ms ± 673.172 μs  ┊ GC (mean ± σ):  0.12% ± 0.03%
 [histogram: 537 ms to 539 ms, frequency by time]
Memory estimate: 139.68 MiB, allocs estimate: 6.
julia> N = 8000; A = rand(N, N); M = Symmetric(A'A);
julia> @benchmark inv($M)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  4.096 s … 4.107 s  ┊ GC (min … max): 0.01% … 0.01%
 Time  (median):     4.101 s            ┊ GC (median):    0.01%
 Time  (mean ± σ):   4.101 s ± 7.605 ms ┊ GC (mean ± σ):  0.01% ± 0.00%
 [histogram: 4.1 s to 4.11 s, frequency by time]
Memory estimate: 529.36 MiB, allocs estimate: 6.
As others have noted, this uses LU, which calls LAPACK dgetrf. LAPACK does blocking to improve performance where appropriate, but the routines opt out of this for small matrices. Somewhere along your increasing n, LAPACK would have switched to using blocking, effectively changing the details of the algorithm. And in general LAPACK is not especially optimized for smaller matrices. You need to go to larger matrices if you want to observe the O(n^3) behavior. I've never heard of Coppersmith-Winograd actually being used for real matrix computations.
As for the quadratic time you're seeing, that's just because LU factorizations have high quadratic overhead and therefore look like they have quadratic runtime for a shockingly long time. Looking at the plot below, you can see that for sizes between 8x8 and 512x512 the time for inv is almost completely quadratic, but for 2048x2048 and above it is almost completely cubic.
As you observed yourself, it's not clear that the specialized algorithms are necessarily faster, which might explain why this hasn't been implemented.
It allows propagation of the Symmetric type, which seems worthwhile in itself since later computations might be able to make good use of it.
That's a very interesting plot. I tried to predict the crossing point with a back-of-the-envelope computation, using data from https://freecontent.manning.com/performance-limits-and-profiling/ which quotes 260 GFLOP/s and a bandwidth of 22 GB/s. Given that we transfer 8N^2 bytes and do N^3 flops on them, that suggests the crossing point is around N = 100, whereas we observe a much higher crossing point. Any idea what is wrong in the above computation? I ignored caches, but cache effects, if anything, should decrease the crossing point, not increase it. Or is memory latency rather than bandwidth the limiting factor?
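Spelling out that estimate (the peak numbers are just the ones quoted above, so treat them as illustrative):
julia> peak_flops = 260e9;   # flop/s
julia> bandwidth  = 22e9;    # bytes/s
julia> N_cross = 8 * peak_flops / bandwidth;   # from N^3/peak_flops == 8N^2/bandwidth, ≈ 94.5, i.e. roughly N = 100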
There are a few things you're missing. The first is multithreading, which lets you gain extra efficiency in the 100-400 range. Also, pivoting and computing the inverse on top of the LU add a decent amount of n^2 overhead.
In general, multithreading makes this a lot more difficult to make sense of, because as the problem gets bigger it starts to benefit from more threads, somewhat counteracting the O(n^3) growth. To separate out this effect, I would benchmark with a single thread:
julia> using LinearAlgebra
julia> BLAS.set_num_threads(1)
With a single thread, I've typically gotten pretty clear \Theta(n^3) LU behavior for n \gtrsim 100 in the past, e.g. in this notebook.
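A minimal sketch of that kind of scaling check (the sizes are chosen arbitrarily; @belapsed returns seconds):
julia> using BenchmarkTools, LinearAlgebra
julia> BLAS.set_num_threads(1)
julia> for N in (64, 128, 256, 512, 1024, 2048)
           A = rand(N, N)
           M = Symmetric(A'A)
           t = @belapsed inv($M)        # once cubic, the time should grow roughly 8x per doubling of N
           println(N, "  ", t, " s")
       end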