Hi all,
I'm working on a problem where I need to solve various Ax=b systems, where A is a dense (but usually not very large) matrix. Initially I did this by computing inverses directly and multiplying. Awful practice, but it worked for the cases I was dealing with at the time. Now I want to do this more properly by computing the LU factorization and applying it. In particular, I need to solve one AX=B system (with all three being square matrices) and one Ax=b system. So I went ahead and ran some benchmarks comparing computing and applying the LU factorization against simply using the matrix inverse:
using LinearAlgebra
using BenchmarkTools
n=20
A = rand(n,n); B = rand(n,n); C = rand(n,n); b = rand(n); c = rand(n);
##
@benchmark begin
LU_A = lu($A);
C .= LU_A\$B
c .= LU_A\$b
end
##
@benchmark begin
inv_A = inv($A)
mul!($C,inv_A,$B)
mul!($c,inv_A,$b)
end
The benchmark using LU gives me a median time of 6.7 microseconds, while using the inverse and in-place multiplication gives 5.1 microseconds. Is it really the case that, for a problem this size, the inverse is the fastest way to go (despite the potential numerical problems), or am I missing some obvious optimization?
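For context, the usual argument for lu over inv (besides accuracy) is reuse: if the same A is solved against many right-hand sides, the factorization is computed once at O(n^3) cost and each subsequent solve only pays for the O(n^2) triangular substitutions. A minimal sketch of that pattern, with the number of right-hand sides chosen purely for illustration:
using LinearAlgebra
n = 20
A = rand(n, n)
F = lu(A)                           # factorize once
xs = [F \ rand(n) for _ in 1:100]   # every solve reuses the factorization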
I also thought that for something this size StaticArrays would make things much faster:
using LinearAlgebra
using BenchmarkTools
using StaticArrays
n=20
As = @MMatrix rand(n,n); Bs = @MMatrix rand(n,n); Cs = @MMatrix rand(n,n); b = rand(n); cs = @MVector rand(n);
##
@benchmark begin
LU_A = lu($As);
$Cs .= LU_A\$Bs;
$cs .= LU_A\$b;
end
##
@benchmark begin
inv_A = inv($As)
mul!($Cs,inv_A,$Bs)
mul!($cs,inv_A,$b)
end
This gives me median times of 8.2 and 5.9 microseconds, both slower than with regular arrays, and again with the direct calculation of the inverse being the fastest.
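One thing worth noting about the StaticArrays attempt: the static path mainly pays off when the sizes are compile-time literals and the arrays are small (as far as I know, the StaticArrays documentation suggests a rule of thumb of around 100 elements, so a 20×20 matrix with 400 elements may already be past the sweet spot). A minimal sketch of a fully static version with literal sizes, assuming \ on static arrays is used for both right-hand sides:
using LinearAlgebra, StaticArrays
# Literal sizes let the compiler specialize the solve completely.
As = @SMatrix rand(20, 20)
Bs = @SMatrix rand(20, 20)
bs = @SVector rand(20)
Xs = As \ Bs   # static solve, matrix right-hand side
xs = As \ bs   # static solve, vector right-hand side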
There is an in-place version of \ called ldiv! that you can use for the lu case. inv calls lu under the hood, so it should be possible to make lu strictly faster. In your case you use static arrays in one of the versions, and that might use another algorithm under the hood.
These are the timings I get
julia> @benchmark begin
LU_A = lu($A);
C .= LU_A\$B
c .= LU_A\$b
end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.488 μs … 262.204 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.390 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.604 μs ±   3.049 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  9.49 μs         Histogram: frequency by time         12.7 μs <
 Memory estimate: 6.97 KiB, allocs estimate: 6.
julia> @benchmark begin
LU_A = lu($A);
ldiv!(C, LU_A, $B)
ldiv!(c, LU_A, $b)
end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   9.247 μs … 264.840 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.119 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.191 μs ±   2.612 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  9.25 μs         Histogram: frequency by time         10.9 μs <
 Memory estimate: 3.53 KiB, allocs estimate: 4.
julia> @benchmark begin
inv_A = inv($A)
mul!($C,inv_A,$B)
mul!($c,inv_A,$b)
end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.290 μs … 290.237 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.221 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.546 μs ±   4.169 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  10.3 μs         Histogram: frequency by time         13.8 μs <
 Memory estimate: 13.59 KiB, allocs estimate: 4.
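A side note on the allocation numbers above: lu(A) allocates its own copy of A before factorizing. If even that copy matters, lu! can factorize a preallocated buffer in place; a minimal sketch (Awork is just an illustrative name for the scratch buffer):
using LinearAlgebra
n = 20
A = rand(n, n); B = rand(n, n); b = rand(n)
C = similar(B); c = similar(b)
Awork = similar(A)      # scratch buffer, reusable across calls
copyto!(Awork, A)       # lu! overwrites its argument, so factorize a copy
F = lu!(Awork)          # in-place LU, avoids lu's internal copy of A
ldiv!(C, F, B)          # C = A \ B
ldiv!(c, F, b)          # c = A \ b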
And with
julia> LinearAlgebra.BLAS.set_num_threads(1)
julia> @benchmark begin
LU_A = lu($A);
ldiv!(C, LU_A, $B)
ldiv!(c, LU_A, $b)
end
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.851 μs … 225.703 μs  ┊ GC (min … max): 0.00% … 88.63%
 Time  (median):     5.004 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.116 μs ±   3.541 μs  ┊ GC (mean ± σ):  1.11% ± 1.59%
  4.85 μs         Histogram: frequency by time         5.99 μs <
 Memory estimate: 3.53 KiB, allocs estimate: 4.
Thanks! Those are a couple of tricks I did not know about. With the following I'm getting 6.1 microseconds for LU and 5.1 microseconds for inv:
using LinearAlgebra
using BenchmarkTools
LinearAlgebra.BLAS.set_num_threads(8)
n=20
A = rand(n,n); B = rand(n,n); C = rand(n,n); b = rand(n); c = rand(n);
##
@benchmark begin
LU_A = lu($A);
ldiv!($C,LU_A,$B)
ldiv!($c,LU_A,$b)
end
##
@benchmark begin
inv_A = inv($A)
mul!($C,inv_A,$B)
mul!($c,inv_A,$b)
end
So they're quite close! However, if I go down to one thread (which is how I would mostly use it, since parallelization happens at a higher level), I get worse timing for LU: 7.3 microseconds for LU versus 5.1 for inv (so I guess inv is not really being parallelized).
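To see where the single-threaded time goes, it can help to time the factorization and the triangular solves separately; a small sketch under the same n = 20 setup (the setup keyword prepares the factorization outside the timed expression):
using LinearAlgebra
using BenchmarkTools
LinearAlgebra.BLAS.set_num_threads(1)
n = 20
A = rand(n, n); B = rand(n, n); C = similar(B)
@benchmark lu($A)                                 # cost of the factorization alone
@benchmark ldiv!($C, F, $B) setup = (F = lu(A))   # cost of the triangular solves alone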