Performance of lu(A)\B versus inv(A)*B to solve AX=B

Hi all,

I’m working on a problem where I need to solve various Ax=b systems, where A is a dense (but usually not very large) matrix. Initially I was doing this by explicitly computing inverses and multiplying. Awful practice, but it worked for the cases I was dealing with at the time. Now I want to do this more properly by computing the LU factorization and applying it. In particular, I need to solve one AX=B system (all square matrices) and one Ax=b system. So I ran some benchmarks comparing computing and applying the LU factorization versus simply using the matrix inverse:

using LinearAlgebra
using BenchmarkTools
n=20
A = rand(n,n); B = rand(n,n); C = rand(n,n); b = rand(n); c = rand(n);
##
@benchmark begin
    LU_A = lu($A);
    $C .= LU_A\$B
    $c .= LU_A\$b
end
##
@benchmark begin
    inv_A = inv($A)
    mul!($C,inv_A,$B)
    mul!($c,inv_A,$b)
end

The LU benchmark gives me a median time of 6.7 microseconds, while the inverse plus in-place multiplication comes in at 5.1 microseconds. Is it really the case that for a problem this size the inverse is the fastest way to go (despite the potential numerical problems), or am I missing some obvious optimization?
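
(As an aside on the numerical-problems part, this is the kind of check I have in mind. The ill-conditioned matrix below is just an illustrative construction, not my actual problem; for a well-conditioned A of this size the two answers should be essentially the same.)

using LinearAlgebra
# Sketch: residuals on an artificially ill-conditioned matrix
n = 20
Q1, _ = qr(rand(n, n)); Q2, _ = qr(rand(n, n))
A_ill = Matrix(Q1) * Diagonal(10.0 .^ LinRange(0, -10, n)) * Matrix(Q2)'  # cond(A_ill) ≈ 1e10
b_ill = rand(n)
norm(A_ill * (lu(A_ill) \ b_ill) - b_ill)   # residual of the LU solve
norm(A_ill * (inv(A_ill) * b_ill) - b_ill)  # residual via the explicit inverse, typically larger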

I also thought that, for something this size, StaticArrays would make things much faster:

using LinearAlgebra
using BenchmarkTools
using StaticArrays
n=20
As = @MMatrix rand(n,n); Bs = @MMatrix rand(n,n); Cs = @MMatrix rand(n,n); b = rand(n); cs = @MVector rand(n);
##
@benchmark begin
    LU_A = lu($As);
    $Cs .= LU_A\$Bs;
    $cs .= LU_A\$b;
end
##
@benchmark begin
    inv_A = inv($As)
    mul!($Cs,inv_A,$Bs)
    mul!($cs,inv_A,$b)
end

This gives median times of 8.2 and 5.9 microseconds, both slower than with regular arrays, and again with the explicit inverse coming out fastest.
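
Maybe 20×20 is simply past the point where StaticArrays pays off (their docs suggest a rough cutoff of around 100 elements, if I remember correctly). At genuinely small sizes the fully static path does seem to win; a quick sketch with illustrative sizes:

using LinearAlgebra, BenchmarkTools, StaticArrays

As4 = @SMatrix rand(4, 4)   # immutable SMatrix: the size is part of the type
bs4 = @SVector rand(4)
Ad4 = Matrix(As4); bd4 = Vector(bs4)

@btime $As4 \ $bs4          # StaticArrays' specialized solve, no heap allocation
@btime $Ad4 \ $bd4          # generic dense path, mostly overhead at this size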

There is an in-place version of \ called ldiv! that you can use in the lu case. inv calls lu under the hood, so it should be possible to make the lu path strictly faster. In your case you also use StaticArrays in one of the snippets, and that may use a different algorithm under the hood.

These are the timings I get:

julia> @benchmark begin
           LU_A = lu($A);
           C .= LU_A\$B
           c .= LU_A\$b
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.488 μs … 262.204 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.390 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.604 μs ±   3.049 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  9.49 μs         Histogram: frequency by time         12.7 μs <

 Memory estimate: 6.97 KiB, allocs estimate: 6.

julia> @benchmark begin
           LU_A = lu($A);
           ldiv!(C, LU_A, $B)
           ldiv!(c, LU_A, $b)
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.247 μs … 264.840 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.119 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.191 μs ±   2.612 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  9.25 μs         Histogram: frequency by time         10.9 μs <

 Memory estimate: 3.53 KiB, allocs estimate: 4.

julia> @benchmark begin
           inv_A = inv($A)
           mul!($C,inv_A,$B)
           mul!($c,inv_A,$b)
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.290 μs … 290.237 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.221 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.546 μs ±   4.169 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  10.3 μs         Histogram: frequency by time         13.8 μs <

 Memory estimate: 13.59 KiB, allocs estimate: 4.

And with a single BLAS thread:

julia> LinearAlgebra.BLAS.set_num_threads(1)

julia> @benchmark begin
           LU_A = lu($A);
           ldiv!(C, LU_A, $B)
           ldiv!(c, LU_A, $b)
       end
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.851 μs … 225.703 μs  ┊ GC (min … max): 0.00% … 88.63%
 Time  (median):     5.004 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.116 μs ±   3.541 μs  ┊ GC (mean ± σ):  1.11% ±  1.59%

  4.85 μs         Histogram: frequency by time        5.99 μs <

 Memory estimate: 3.53 KiB, allocs estimate: 4.
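
If you want to shave off the copy of A that lu makes, there is also lu!, which factorizes its argument in place. A rough sketch, assuming you can keep a scratch buffer around (the small pivot vector is still allocated on each call):

Awork = similar(A)      # scratch buffer, allocated once up front
copyto!(Awork, A)       # refresh the buffer; lu! overwrites its argument
F = lu!(Awork)          # in-place factorization, no new matrix allocation
ldiv!(C, F, B)
ldiv!(c, F, b)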

Thanks! Those are a couple of tricks I did not know about. With the following I’m getting 6.1 microseconds for LU and 5.1 microseconds for inv:

using LinearAlgebra
using BenchmarkTools

LinearAlgebra.BLAS.set_num_threads(8)

n=20
A = rand(n,n); B = rand(n,n); C = rand(n,n); b = rand(n); c = rand(n);
##
@benchmark begin
    LU_A = lu($A);
    ldiv!($C,LU_A,$B)
    ldiv!($c,LU_A,$b)
end
##
@benchmark begin
    inv_A = inv($A)
    mul!($C,inv_A,$B)
    mul!($c,inv_A,$b)
end

So they’re quite close! However, if I go down to one thread (which is how I would mostly use it, since parallelization is done at a higher level), LU gets worse: 7.3 microseconds versus 5.1 for inv (so I guess inv is not really being parallelized anyway).
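
One more thing probably worth separating out (just a sketch): if the same A were reused across many solves, the factorization or the inverse would only be computed once, and then it is the per-solve cost that matters:

using LinearAlgebra, BenchmarkTools

n = 20
A = rand(n, n); b = rand(n); c = similar(b)

F    = lu(A)    # factor once, reuse
invA = inv(A)   # likewise

@btime ldiv!($c, $F, $b);    # per solve: two triangular back-substitutions
@btime mul!($c, $invA, $b);  # per solve: one matrix-vector product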