Performance gotcha in linear algebra lu()

PetrKryslUCSD · November 29, 2018, 5:14am

Curiosity killed the cat. I know. But I was curious: how would a naïve implementation of the LU factorization fare compared to what is built in and supported by BLAS 1, 2, 3? The result surprised me as I was able to beat the built-in lu!()! Not by much, but still. (EDIT: Note well, the beaten version was the generic LU factorization, not the blas-supported one.)

The complete code is here. (Note: both versions of the factorization are without pivoting.)

My configuration:

julia> versioninfo(verbose = true)                      
Julia Version 1.0.1                                     
Commit 0d713926f8 (2018-09-29 19:05 UTC)                
Platform Info:                                          
  OS: Windows (x86_64-w64-mingw32)                      
      Microsoft Windows [Version 10.0.17134.345]        
  CPU: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz:        
              speed         user         nice          sys         idle          irq                                   
       #1  2208 MHz    3902015            0      3664234     33590234       406906  ticks                              
       #2  2208 MHz    3563296            0      2253140     35339640        36734  ticks                              
       #3  2208 MHz    4818578            0      2935328     33402171        40687  ticks                              
       #4  2208 MHz    5637328            0      2455843     33062906        31609  ticks                              
                                                                                                                       
  Memory: 15.927024841308594 GB (7806.58984375 MB free)                                                                
  Uptime: 41156.1347257 sec                                                                                            
  Load Avg:  0.0  0.0  0.0                                                                                             
  WORD_SIZE: 64                                                                                                        
  LIBM: libopenlibm                                                                                                    
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

I’m sure this is not the definitive investigation of the matter, but you may find it of interest.

antoine-levitt · November 29, 2018, 10:32am

I can reproduce, I get almost 2x speedup on N=1000. I’m very confused by why BLAS isn’t more efficient.

foobar_lv2 · November 29, 2018, 1:02pm

That’s because lu(a, Val(false)) does not call into BLAS and instead uses the fallback generic_lufact!. Nevertheless, your naive code benchmarks faster than the LinearAlgebra generic fallback, so it would be interesting to see why and whether the stdlib version can be improved.

Tip: Don’t benchmark the inplace versions, use copy. Otherwise you will perform the next lu on the result of the last lu, with the possibility of hitting slow subnormals in the benchmark loop. The cost of creating a copy is negligible compared to the cost of the lu factorization.

juthohaegeman · November 29, 2018, 1:40pm

Or use: @btime lu!(b) setup=(b=copy($a))

foobar_lv2 · November 29, 2018, 1:54pm

Nope, setup does not reliably work for this.

julia> a=Ref(3.0);
julia> function f(a)
       a[] = -sqrt(a[])
       nothing
       end
julia> @btime f(a) setup=(a[]=1.0)
ERROR: DomainError with -1.0:

Need to do

julia> @btime f(a) setup=(a[]=1.0) evals=1
  49.000 ns (0 allocations: 0 bytes)
julia> @btime f($a) setup=(a[]=1.0) evals=1
  30.000 ns (0 allocations: 0 bytes)

But if you use @benchmarkable to set setup and evals, and then tune!, like BenchBasemarks.jl, then the evals is overwritten:

julia> bm = @benchmarkable f($a) setup=(a[]=1.0) evals=1
julia> @benchmark bm
BenchmarkTools.Trial:
[...]
julia> tune!(bm)
ERROR: DomainError with -1.0:

That’s a bug / bad design decision imo, and an issue is already open on basebenchmarks and benchmarktools.

PetrKryslUCSD · November 29, 2018, 4:03pm

With the suggestion of @juthohaegeman (concerning the setup of the benchmark test), and in order to test the blas-supported solver (running @btime lu!(b, Val(true)) setup=(b=copy($a)), i. e. with pivoting enabled), I get the following graph:

I do have some explanations now.

The built-in generic_lufact! leaves some performance on the table in the loop

The performance of the naïve implementation is matched (for larger number of equations) if this loop is replaced as

            # Update the rest
            for j = k+1:n
                nAkj = -A[k,j]
                for i = k+1:m
                    A[i,j] += A[i,k]*nAkj
                end
            end

When pivoting is turned off, generic_lufact! still allocates the pivot-position vector, and goes through the motions of working with it. This costs some performance for small matrix sizes. I suppose that is the price of doing business with the LU factorization object which expects the pivot vector during construction.

I guess there is a lesson here: For larger matrices, even if one is sure that no pivoting is necessary, it should not be turned off, because the call is routed to generic_lufact! with an attendant order-of-magnitude slow down. For small matrices it may pay to avoid the creation of the factorization object and use the naïve factorization.

antoine-levitt · November 29, 2018, 5:55pm

It should probably be noted in the docstring that the routine without pivoting is not BLAS, as this is a non trivial performance gotcha. OpenBLAS is not very fast for small matrices, it would be interesting to see what MKL does here.

foobar_lv2 · November 29, 2018, 9:02pm

The other kind of gotcha in loops is also really important: We know that the write to A[i,j]does not alias the read of A[k,j], but the compiler doesn’t know that. At least that’s what I think is going on.

Do you plan to make a PR?

PetrKryslUCSD · November 30, 2018, 12:13am

I have to admit I don’t know yet what to propose. One thing is the documentation, but there’s also the possibility of producing some code that would allow for additional optimization of the choice of the method based on the size of the problem.

stevengj · November 30, 2018, 4:49am

The pivot=false option is basically only useful for pedagogy (comparing to hand calculations or toy code for tiny matrices), and should never be used in real problems because it can make the calculation numerically unstable. Improving its performance would serve no practical purpose.

I agree that the documentation should be clearer on the meaning of the pivot argument.

PetrKryslUCSD · November 30, 2018, 4:30pm

That is a fair point. But we shouldn’t miss the important part of the story: the performance of the BLAS-supported LU factorization is erratic for smaller matrix sizes, and in fact it is outperformed by the generic Julia version (both WITH PIVOTING).

With just a small change to the code (the “better” implementation of the generic LU factorization), the performance can be improved further.

The break-even point between lu! and generic_lufact! is around 200 equations. NB: For 10 equations or less, the static-array solver implementation may provide further improvements in speed with the generic version of the factorization. (I don’t know if the static array can be passed to the BLAS-supported solver.)

I think it might be of interest to compare with the MKL solver. If anyone has access to it, would you please run the code in Testing LU · GitHub and post the generated graph?

EDIT: Additional results for complex matrices. Same computer as above.

PetrKryslUCSD · November 30, 2018, 4:49pm

Do you have access to MKL-Julia? Could you run the test https://gist.github.com/PetrKryslUCSD/758d19f03ee2f643bf1f76fe9f5c40ac?

PetrKryslUCSD · November 30, 2018, 5:54pm

Result from a Linux machine:

julia> versioninfo(verbose = true)
Julia Version 1.1.0-DEV.509
Commit 1db6047018 (2018-10-21 02:53 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
      "CentOS release 6.9 (Final)"
  uname: Linux 2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC 2013 x86_64 x86_64
  CPU: AMD Opteron(tm) Processor 6380                 :
                 speed         user         nice          sys         idle          irq
       #1-64  1400 MHz  4259689828 s      22696 s  423913693 s  79618358095 s      42188 s

  Memory: 252.2855682373047 GB (198149.6875 MB free)
  Uptime: 1.3177519e7 sec
  Load Avg:  1.37744140625  1.42236328125  1.591796875
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, bdver1)

The “better” generic LU is now slower than the original!? No idea why.

antoine-levitt · November 30, 2018, 6:45pm

Unfortunately I don’t at the moment.

Also be careful that benchmarking on small matrices is tricky (cache and branch predictions effects), although that probably doesn’t explain why BLAS is so slow.

dpo · November 30, 2018, 6:56pm

I’m curious whether you see similar improvements for the Cholesky decomposition, where pivoting isn’t needed.

PetrKryslUCSD · November 30, 2018, 6:58pm

Correct, that might be an interesting experiment as well.

Juan · November 30, 2018, 10:32pm

If I had Julia with MKL built… Will all packages automatically benefit from it or just the packages that were explicitly designed to use it?

kristoffer.carlsson · November 30, 2018, 10:36pm

Any function that ends up calling BLAS would use MKL instead.

Ajaychat3 · December 1, 2018, 6:17am

If I install Mkl without compiling Julia with it, would I be able to use mkl for compilation of my code? If so how?

tkoolen · December 1, 2018, 7:37am

I suppose you can ccall specific functions from the dynamic libs, but why would you want to?

Topic		Replies	Views
LU factorization performance issue New to Julia linearalgebra	30	714	June 6, 2022
Linear solver \(A, B) performance vs Matlab A\b General Usage	32	7667	May 21, 2017
OpenBLAS is faster than Intel MKL on AMD Hardware (Ryzen) Performance blas , lapack	40	36453	June 19, 2020
Julia matrix-multiplication performance Performance linearalgebra	20	8647	October 30, 2022
[ANN]: PaddedMatrices.jl, Julia BLAS and partially sized arrays Package Announcements performance , blas	32	5779	July 5, 2020

Performance gotcha in linear algebra lu()

Related topics