Hi all,
I'm working on a problem where I need to solve various Ax=b systems, where A is a dense (but usually not very large) matrix. Initially I did this by computing inverses directly and multiplying. Awful practice, but it worked for the cases I was dealing with at the time. Now I want to do this more properly by computing the LU factorization and applying it. In particular, I need to solve one AX=B system (with all three being square matrices) and one Ax=b system. So I went ahead and ran some benchmarks comparing computing and applying the LU factorization against simply using the matrix inverse:
using LinearAlgebra
using BenchmarkTools
n=20
A = rand(n,n); B = rand(n,n); C = rand(n,n); b = rand(n); c = rand(n);
##
@benchmark begin
LU_A = lu($A);
C .= LU_A\$B
c .= LU_A\$b
end
##
@benchmark begin
inv_A = inv($A)
mul!($C,inv_A,$B)
mul!($c,inv_A,$b)
end
The benchmark using LU gives me a median time of 6.7 microseconds, while using the inverse and in-place multiplication gives 5.1 microseconds. Is it really the case that, for a problem this size, the inverse is the fastest way to go (despite the potential numerical problems), or am I missing some obvious optimization?
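For context, the usual argument for lu over inv (besides accuracy) is reuse: if the same A is solved against many right-hand sides, the factorization is computed once at O(n^3) cost and each subsequent solve only pays for the O(n^2) triangular substitutions. A minimal sketch of that pattern, with the number of right-hand sides chosen purely for illustration:
using LinearAlgebra
n = 20
A = rand(n, n)
F = lu(A)                           # factorize once
xs = [F \ rand(n) for _ in 1:100]   # every solve reuses the factorization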
I also thought that for something this size StaticArrays would make things much faster:
using LinearAlgebra
using BenchmarkTools
using StaticArrays
n=20
As = @MMatrix rand(n,n); Bs = @MMatrix rand(n,n); Cs = @MMatrix rand(n,n); b = rand(n); cs = @MVector rand(n);
##
@benchmark begin
LU_A = lu($As);
$Cs .= LU_A\$Bs;
$cs .= LU_A\$b;
end
##
@benchmark begin
inv_A = inv($As)
mul!($Cs,inv_A,$Bs)
mul!($cs,inv_A,$b)
end
This gives me median times of 8.2 and 5.9 microseconds, both slower than with regular arrays, and again with the direct calculation of the inverse being the fastest.
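One thing worth noting about the StaticArrays attempt: the static path mainly pays off when the sizes are compile-time literals and the arrays are small (as far as I know, the StaticArrays documentation suggests a rule of thumb of around 100 elements, so a 20×20 matrix with 400 elements may already be past the sweet spot). A minimal sketch of a fully static version with literal sizes, assuming \ on static arrays is used for both right-hand sides:
using LinearAlgebra, StaticArrays
# Literal sizes let the compiler specialize the solve completely.
As = @SMatrix rand(20, 20)
Bs = @SMatrix rand(20, 20)
bs = @SVector rand(20)
Xs = As \ Bs   # static solve, matrix right-hand side
xs = As \ bs   # static solve, vector right-hand side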
There is an in-place version of \ called ldiv! that you can use for the lu case. inv calls lu under the hood, so it should be possible to make lu strictly faster. In your case you use static arrays in one of the versions, and that might use another algorithm under the hood.
These are the timings I get
julia> @benchmark begin
LU_A = lu($A);
C .= LU_A\$B
c .= LU_A\$b
end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.488 μs … 262.204 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.390 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.604 μs ±   3.049 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  9.49 μs         Histogram: frequency by time         12.7 μs <
 Memory estimate: 6.97 KiB, allocs estimate: 6.
julia> @benchmark begin
LU_A = lu($A);
ldiv!(C, LU_A, $B)
ldiv!(c, LU_A, $b)
end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.247 μs … 264.840 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.119 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.191 μs ±   2.612 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  9.25 μs         Histogram: frequency by time         10.9 μs <
 Memory estimate: 3.53 KiB, allocs estimate: 4.
julia> @benchmark begin
inv_A = inv($A)
mul!($C,inv_A,$B)
mul!($c,inv_A,$b)
end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.290 μs … 290.237 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.221 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.546 μs ±   4.169 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  10.3 μs         Histogram: frequency by time         13.8 μs <
 Memory estimate: 13.59 KiB, allocs estimate: 4.
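A side note on the allocation numbers above: lu(A) allocates its own copy of A before factorizing. If even that copy matters, lu! can factorize a preallocated buffer in place; a minimal sketch (Awork is just an illustrative name for the scratch buffer):
using LinearAlgebra
n = 20
A = rand(n, n); B = rand(n, n); b = rand(n)
C = similar(B); c = similar(b)
Awork = similar(A)      # scratch buffer, reusable across calls
copyto!(Awork, A)       # lu! overwrites its argument, so factorize a copy
F = lu!(Awork)          # in-place LU, avoids lu's internal copy of A
ldiv!(C, F, B)          # C = A \ B
ldiv!(c, F, b)          # c = A \ b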
And with
julia> LinearAlgebra.BLAS.set_num_threads(1)
julia> @benchmark begin
LU_A = lu($A);
ldiv!(C, LU_A, $B)
ldiv!(c, LU_A, $b)
end
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.851 μs … 225.703 μs  ┊ GC (min … max): 0.00% … 88.63%
 Time  (median):     5.004 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.116 μs ±   3.541 μs  ┊ GC (mean ± σ):  1.11% ± 1.59%
  4.85 μs         Histogram: frequency by time         5.99 μs <
 Memory estimate: 3.53 KiB, allocs estimate: 4.
Thanks! Those are a couple of tricks I did not know about. With the following I'm getting 6.1 microseconds for LU and 5.1 microseconds for inv:
using LinearAlgebra
using BenchmarkTools
LinearAlgebra.BLAS.set_num_threads(8)
n=20
A = rand(n,n); B = rand(n,n); C = rand(n,n); b = rand(n); c = rand(n);
##
@benchmark begin
LU_A = lu($A);
ldiv!($C,LU_A,$B)
ldiv!($c,LU_A,$b)
end
##
@benchmark begin
inv_A = inv($A)
mul!($C,inv_A,$B)
mul!($c,inv_A,$b)
end
So they're quite close! However, if I go down to one thread (which is how I would mostly use it, since parallelization happens at a higher level), I get worse timing for LU: 7.3 microseconds for LU versus 5.1 for inv (so I guess inv is not really being parallelized).
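To see where the single-threaded time goes, it can help to time the factorization and the triangular solves separately; a small sketch under the same n = 20 setup (the setup keyword prepares the factorization outside the timed expression):
using LinearAlgebra
using BenchmarkTools
LinearAlgebra.BLAS.set_num_threads(1)
n = 20
A = rand(n, n); B = rand(n, n); C = similar(B)
@benchmark lu($A)                                 # cost of the factorization alone
@benchmark ldiv!($C, F, $B) setup = (F = lu(A))   # cost of the triangular solves alone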