Performance gotcha in linear algebra lu()

Curiosity killed the cat, I know. But I was curious: how would a naïve implementation of the LU factorization fare against the built-in one, backed by BLAS levels 1, 2, and 3? The result surprised me: I was able to beat the built-in lu!. Not by much, but still. (EDIT: Note well, the beaten version was the generic LU factorization, not the BLAS-supported one.)

[Figure: timing of the naïve LU factorization vs the built-in lu! across matrix sizes]

The complete code is here. (Note: both versions of the factorization are without pivoting.)
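
For reference, here is a minimal sketch of what such a naïve, unpivoted, right-looking LU looks like (the name naivelu! and the exact loop order are my reconstruction, not necessarily the code in the linked gist):

function naivelu!(A::Matrix{T}) where {T}
    m, n = size(A)
    @inbounds for k in 1:min(m, n)
        Akkinv = inv(A[k,k])
        for i in k+1:m              # multipliers: the k-th column of L
            A[i,k] *= Akkinv
        end
        for j in k+1:n              # rank-1 update of the trailing block
            nAkj = -A[k,j]
            for i in k+1:m
                A[i,j] += A[i,k]*nAkj
            end
        end
    end
    A                               # unit-lower L and upper U packed in place
end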

My configuration:

julia> versioninfo(verbose = true)                      
Julia Version 1.0.1                                     
Commit 0d713926f8 (2018-09-29 19:05 UTC)                
Platform Info:                                          
  OS: Windows (x86_64-w64-mingw32)                      
      Microsoft Windows [Version 10.0.17134.345]        
  CPU: Intel(R) Core(TM) i7-6650U CPU @ 2.20GHz:        
              speed         user         nice          sys         idle          irq                                   
       #1  2208 MHz    3902015            0      3664234     33590234       406906  ticks                              
       #2  2208 MHz    3563296            0      2253140     35339640        36734  ticks                              
       #3  2208 MHz    4818578            0      2935328     33402171        40687  ticks                              
       #4  2208 MHz    5637328            0      2455843     33062906        31609  ticks                              
                                                                                                                       
  Memory: 15.927024841308594 GB (7806.58984375 MB free)                                                                
  Uptime: 41156.1347257 sec                                                                                            
  Load Avg:  0.0  0.0  0.0                                                                                             
  WORD_SIZE: 64                                                                                                        
  LIBM: libopenlibm                                                                                                    
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)       

I’m sure this is not the definitive investigation of the matter, but you may find it of interest.

I can reproduce this; I get almost a 2x speedup at N = 1000. I’m very confused about why BLAS isn’t more efficient.

That’s because lu(a, Val(false)) does not call into BLAS and instead uses the fallback generic_lufact!. Nevertheless, your naive code benchmarks faster than the LinearAlgebra generic fallback, so it would be interesting to see why and whether the stdlib version can be improved.
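
A quick way to make the code path explicit (a sketch against the Julia 1.0 stdlib; generic_lufact! is internal API, called here only for illustration):

using LinearAlgebra

a = rand(1000, 1000)

# Val(false) takes the pure-Julia fallback; it can be called directly:
F1 = LinearAlgebra.generic_lufact!(copy(a), Val(false))

# Val(true) on a BLAS element type reaches LAPACK.getrf! instead:
F2 = lu!(copy(a), Val(true))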

Tip: Don’t benchmark the in-place versions directly; use a copy. Otherwise you will perform the next lu on the result of the last lu, with the possibility of hitting slow subnormals in the benchmark loop. The cost of creating a copy is negligible compared to the cost of the lu factorization.
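
A sketch of the pitfall and the fix (the matrix size and Val(false) are arbitrary choices for illustration):

using LinearAlgebra, BenchmarkTools

a = rand(1000, 1000)

# Pitfall: lu! keeps refactorizing its own output across the benchmark
# loop, which can drift into slow subnormal territory:
@btime lu!($a, Val(false))

# Safer: factorize a fresh copy each time; the O(n^2) copy is negligible
# next to the O(n^3) factorization:
@btime lu!(copy($a), Val(false))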

Or use: @btime lu!(b) setup=(b=copy($a))

Nope, setup does not reliably work for this.

julia> using BenchmarkTools
julia> a = Ref(3.0);
julia> function f(a)
       a[] = -sqrt(a[])
       nothing
       end
julia> @btime f(a) setup=(a[]=1.0)
ERROR: DomainError with -1.0:

Instead you need evals=1, because setup runs once per sample, not once per evaluation:

julia> @btime f(a) setup=(a[]=1.0) evals=1
  49.000 ns (0 allocations: 0 bytes)
julia> @btime f($a) setup=(a[]=1.0) evals=1
  30.000 ns (0 allocations: 0 bytes)

But if you use @benchmarkable to set setup and evals, and then tune!, as BaseBenchmarks.jl does, then the evals setting is overwritten:

julia> bm = @benchmarkable f($a) setup=(a[]=1.0) evals=1
julia> run(bm)
BenchmarkTools.Trial:
[...]
julia> tune!(bm)
ERROR: DomainError with -1.0:

That’s a bug / bad design decision, in my opinion, and issues are already open on BaseBenchmarks.jl and BenchmarkTools.jl.

With the suggestion of @juthohaegeman (concerning the setup of the benchmark test), and in order to test the BLAS-supported solver (running @btime lu!(b, Val(true)) setup=(b=copy($a)), i.e. with pivoting enabled), I get the following graph:

[Figure: timings with the pivoted, LAPACK-backed lu! included]

I do have some explanations now.

  1. The built-in generic_lufact! leaves some performance on the table in its update loop:
    [screenshot of the trailing-submatrix update loop in generic_lufact!]

The performance of the naïve implementation is matched (for larger numbers of equations) if this loop is replaced with:

            # Update the rest of the trailing submatrix
            for j = k+1:n
                nAkj = -A[k,j]   # hoist the negated pivot-row entry out of the inner loop
                for i = k+1:m
                    A[i,j] += A[i,k]*nAkj
                end
            end
  2. When pivoting is turned off, generic_lufact! still allocates the pivot-position vector and goes through the motions of working with it. This costs some performance for small matrix sizes. I suppose that is the price of doing business with the LU factorization object, which expects the pivot vector during construction.

I guess there is a lesson here: for larger matrices, even if one is sure that no pivoting is necessary, it should not be turned off, because the call is then routed to generic_lufact!, with an attendant order-of-magnitude slowdown. For small matrices it may pay to avoid the creation of the factorization object and use the naïve factorization.
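
A sketch of how one might measure that routing on a matrix that is safe to factorize without pivoting (the diagonally dominant construction is my own, not from the benchmark script):

using LinearAlgebra, BenchmarkTools

n = 1000
a = rand(n, n) + 2n*I       # diagonally dominant: no pivoting needed

@btime lu!(b, Val(true))  setup=(b=copy($a)) evals=1   # LAPACK getrf!
@btime lu!(b, Val(false)) setup=(b=copy($a)) evals=1   # generic_lufact!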

It should probably be noted in the docstring that the routine without pivoting is not BLAS-backed, as this is a non-trivial performance gotcha. OpenBLAS is not very fast for small matrices; it would be interesting to see what MKL does here.

The other kind of gotcha in loops is also really important: we know that the write to A[i,j] does not alias the read of A[k,j], but the compiler doesn’t know that. At least that’s what I think is going on.
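
To make the aliasing point concrete, here is a sketch of the two trailing-update kernels side by side (the function names are mine):

function update_naive!(A, k)        # the shape of the stdlib loop
    m, n = size(A)
    for j in k+1:n, i in k+1:m
        # The compiler cannot prove the store to A[i,j] never hits A[k,j]
        # (it would need to know i != k), so A[k,j] may be reloaded on
        # every inner iteration:
        A[i,j] -= A[i,k]*A[k,j]
    end
    A
end

function update_hoisted!(A, k)      # the "better" loop from above
    m, n = size(A)
    for j in k+1:n
        nAkj = -A[k,j]              # load once; the false dependency is gone
        for i in k+1:m
            A[i,j] += A[i,k]*nAkj
        end
    end
    A
end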

Do you plan to make a PR?

I have to admit I don’t know yet what to propose. One thing is the documentation, but there’s also the possibility of code that would choose the method based on the size of the problem.

The pivot=false option is basically only useful for pedagogy (comparing to hand calculations or toy code for tiny matrices), and should never be used in real problems because it can make the calculation numerically unstable. Improving its performance would serve no practical purpose.
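
A sketch of why (the exact exception type varies across Julia versions; in 1.0 a zero pivot raises SingularException when check=true):

using LinearAlgebra

# A zero pivot makes the unpivoted factorization fail outright:
A = [0.0 1.0; 1.0 0.0]
# lu(A, Val(false)) throws; lu(A, Val(true)) is fine.

# A merely tiny pivot silently destroys accuracy instead:
B = [1e-16 1.0; 1.0 1.0]
F = lu(B, Val(false); check=false)
maximum(abs.(F.L*F.U .- B))     # ≈ 1.0, i.e. B is not recovered at all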

I agree that the documentation should be clearer on the meaning of the pivot argument.

That is a fair point. But we shouldn’t miss the important part of the story: the performance of the BLAS-supported LU factorization is erratic for smaller matrix sizes, and in fact it is outperformed by the generic Julia version (both WITH PIVOTING).

With just a small change to the code (the “better” implementation of the generic LU factorization), the performance can be improved further.

[Figure: timings with the improved generic_lufact! loop included]

The break-even point between lu! and generic_lufact! is around 200 equations. NB: for 10 equations or fewer, a static-array implementation may provide further speed improvements over the generic version of the factorization. (I don’t know if a static array can be passed to the BLAS-supported solver.)
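
A sketch of the static-array route (assuming StaticArrays.jl’s lu specialization; the size is a hypothetical choice, not benchmarked here):

using StaticArrays, LinearAlgebra

A = @SMatrix rand(8, 8)
b = @SVector rand(8)

F = lu(A)       # StaticArrays' own small-matrix factorization, no BLAS
x = A \ b       # solve without heap allocation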

I think it might be of interest to compare with the MKL solver. If anyone has access to it, would you please run the code in Testing LU · GitHub and post the generated graph?

EDIT: Additional results for complex matrices. Same computer as above.
[Figure: timings for complex matrices]

Do you have access to MKL-Julia? Could you run the test https://gist.github.com/PetrKryslUCSD/758d19f03ee2f643bf1f76fe9f5c40ac?

Result from a Linux machine:
[Figure: timings on the Linux / AMD Opteron machine]

julia> versioninfo(verbose = true)
Julia Version 1.1.0-DEV.509
Commit 1db6047018 (2018-10-21 02:53 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
      "CentOS release 6.9 (Final)"
  uname: Linux 2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22 00:31:26 UTC 2013 x86_64 x86_64
  CPU: AMD Opteron(tm) Processor 6380                 :
                 speed         user         nice          sys         idle          irq
       #1-64  1400 MHz  4259689828 s      22696 s  423913693 s  79618358095 s      42188 s

  Memory: 252.2855682373047 GB (198149.6875 MB free)
  Uptime: 1.3177519e7 sec
  Load Avg:  1.37744140625  1.42236328125  1.591796875
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, bdver1)

The “better” generic LU is now slower than the original!? No idea why.

Unfortunately I don’t have access to it at the moment.

Also be careful: benchmarking on small matrices is tricky (cache and branch-prediction effects), although that probably doesn’t explain why BLAS is so slow.

I’m curious whether you see similar improvements for the Cholesky decomposition, where pivoting isn’t needed.

Correct, that might be an interesting experiment as well.
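
For what it’s worth, the analogous naïve kernel for Cholesky would look something like this (a sketch, not benchmarked; lower triangular, overwriting A in place):

function naivechol!(A::Matrix{T}) where {T}
    n = size(A, 1)
    @inbounds for k in 1:n
        A[k,k] = sqrt(A[k,k])
        for i in k+1:n              # scale the k-th column of L
            A[i,k] /= A[k,k]
        end
        for j in k+1:n              # symmetric rank-1 update, lower part only
            nAjk = -A[j,k]
            for i in j:n
                A[i,j] += A[i,k]*nAjk
            end
        end
    end
    A
end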

If I had Julia built with MKL… would all packages automatically benefit from it, or just the packages that were explicitly designed to use it?

Any function that ends up calling BLAS would use MKL instead.

If I install MKL without compiling Julia against it, would I be able to use MKL from my code? If so, how?

I suppose you can ccall specific functions from the dynamic libs, but why would you want to?
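
That said, a rough sketch of what a direct call could look like, assuming the LP64 libmkl_rt library (32-bit integers) is on the search path; mkl_getrf! is a hypothetical name:

function mkl_getrf!(A::Matrix{Float64})
    m, n = size(A)
    ipiv = Vector{Int32}(undef, min(m, n))
    info = Ref{Int32}(0)
    ccall((:dgetrf_, "libmkl_rt"), Cvoid,
          (Ref{Int32}, Ref{Int32}, Ptr{Float64}, Ref{Int32},
           Ptr{Int32}, Ptr{Int32}),
          m, n, A, max(1, m), ipiv, info)
    info[] == 0 || error("dgetrf failed with info = $(info[])")
    A, ipiv
end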