Performance gotcha in linear algebra lu()

I am not adept at C, so I would write functions in Julia for specific purposes. I am just trying to understand whether compiling my functions against MKL would be possible when Julia itself is not compiled with MKL, and whether this approach would give any performance advantage.

As of now, I have no specific issues concerning Julia performance.

For MKL vs OpenBLAS benchmarks, see Intel Developer Zone (obviously to be taken with a grain of salt…) and GitHub - RoyiAvital/MatlabJuliaMatrixOperationsBenchmark: Benchmark MATLAB & Julia for Matrix Operations (although the methodology with respect to benchmarking, threading, etc. was unclear there). There are a number of posts online with pieces of information, but no clear conclusion (it depends on the operation, the matrix size, the architecture, the threading…)


I have access to a Linux machine set up for relatively accurate benchmarking (no hyperthreading, pegged CPU frequency) with MKL Julia.

For future reference, if you want more responses: I can only access this machine over SSH, with no graphical interface. Your code requires PGFPlotsX, which is a great plotting library, but I had to run the code to find out that it is a requirement. Is it really necessary for producing the data needed here? You may also want to print versioninfo() and LinearAlgebra.BLAS.vendor() in the script.
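For anyone adapting the script, a minimal preamble along these lines (a sketch; assuming Julia 1.0, where BLAS.vendor() is still available) makes the environment self-documenting:

using InteractiveUtils, LinearAlgebra

# Record the exact environment alongside the benchmark numbers.
versioninfo()
@show LinearAlgebra.BLAS.vendor()   # :mkl or :openblas
@show Base.Threads.nthreads()       # Julia threads (not BLAS threads)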

That said, here’s the program output:

(Base.Threads).nthreads() = 1
(Ns, ts) = testinglu.timeit(LinearAlgebra.lu!) = ([5, 10, 15, 20, 30, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000], [3.07847e-7, 8.02338e-7, 1.78465e-6, 6.84148e-6, 1.73975e-5, 3.86158e-5, 8.95044e-5, 0.00027118, 0.00060874, 0.00104886, 0.00172513, 0.00240483, 0.00322736, 0.00438552, 0.00566971, 0.00727742, 0.0435355])
(Base.Threads).nthreads() = 1
(Ns, ts) = testinglu.timeit(LinearAlgebra.generic_lufact!) = ([5, 10, 15, 20, 30, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000], [1.64724e-7, 4.55662e-7, 1.18657e-6, 2.42105e-6, 7.1917e-6, 2.96379e-5, 0.000220847, 0.00169683, 0.00563228, 0.0132562, 0.0253846, 0.0432297, 0.0681191, 0.102371, 0.14459, 0.214419, 1.9939])
(Base.Threads).nthreads() = 1
(Ns, ts) = testinglu.timeit(testinglu.better_generic_lufact!) = ([5, 10, 15, 20, 30, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000], [1.69068e-7, 4.70305e-7, 1.23869e-6, 2.04597e-6, 5.18641e-6, 1.61539e-5, 9.21582e-5, 0.00066053, 0.00232595, 0.00564915, 0.0105165, 0.0178123, 0.0277523, 0.0422416, 0.0594111, 0.0897464, 1.29526])
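(testinglu.timeit comes from the script posted upthread. For readers without it, a minimal harness of the same shape might look like the sketch below; BenchmarkTools and the size list are illustrative, not the original code.)

using BenchmarkTools, LinearAlgebra, Random

# Time a factorization routine f! over a range of matrix sizes,
# returning the sizes and the minimum time per call in seconds.
function timeit(f!, Ns = [5, 10, 20, 50, 100, 200, 500, 1000])
    ts = Float64[]
    for N in Ns
        A = rand(MersenneTwister(1234), N, N) + N * I  # diagonally dominant, so well-conditioned
        t = @belapsed $(f!)(B) setup=(B = copy($A)) evals=1
        push!(ts, t)
    end
    Ns, ts
end

Ns, ts = timeit(LinearAlgebra.lu!)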

And the plot (scp'd over and then converted to PNG, because Discourse doesn’t support PDF natively, another annoyance):

[plot: timings of lu!, generic_lufact!, and better_generic_lufact! vs. matrix size]

Additional info:

julia> using InteractiveUtils

julia> versioninfo(verbose = true)
Julia Version 1.0.2
Commit d789231 (2018-11-08 20:11 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      Ubuntu 16.04.5 LTS
  uname: Linux 4.4.0-134-generic #160-Ubuntu SMP Wed Aug 15 14:58:00 UTC 2018 x86_64 x86_64
  CPU: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz:
              speed         user         nice          sys         idle          irq
       #1  3600 MHz    1778097 s     105803 s     174268 s  445391346 s          0 s
       #2  3600 MHz    1713252 s     103801 s     187317 s  444278727 s          0 s
       #3  3600 MHz    1669932 s     116043 s     164717 s  445512201 s          0 s
       #4  3600 MHz    2549077 s     124092 s     190222 s  443796316 s          0 s

  Memory: 15.538345336914062 GB (3322.86328125 MB free)
  Uptime: 4.475557e6 sec
  Load Avg:  0.20751953125  0.37939453125  0.54296875
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

# snip
julia> using LinearAlgebra

julia> LinearAlgebra.BLAS.vendor()
:mkl

The printed output isn’t from the same run as the plot, because running the file as a script doesn’t print the plot file location.


This is great. Thank you Twan.

Apologies for the plotting code. I thought it best to post the code as-is, but perhaps it wasn’t such a great idea because people’s plotting needs and habits vary quite a bit.

Now for a bit of thinking about the meaning of these results. It would seem that for smaller matrices (say, up to size 50) it is not a bad idea to use the generic LU factorization instead of the built-in BLAS-supported one, irrespective of whether the underlying library is MKL or OpenBLAS. However, if it is OpenBLAS, it REALLY pays to use the generic LU factorization up to matrices of size 100 or 200.
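A size-based dispatch exploiting that crossover could look like the sketch below; the cutoff of 50 is just the ballpark read off the plots above and would need tuning per machine and per BLAS library.

using LinearAlgebra

# Below the cutoff the pure-Julia factorization wins (the BLAS/LAPACK
# call overhead dominates); above it, the BLAS-backed lu! takes over.
# Note: generic_lufact! is an internal LinearAlgebra function, used
# here only because it already appears in the benchmarks above.
adaptive_lu!(A::Matrix{Float64}, cutoff::Int = 50) =
    size(A, 1) <= cutoff ? LinearAlgebra.generic_lufact!(A) : lu!(A)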


Thanks for the links. This may be useful.

At least on my system (Linux, Haswell), OpenBLAS does significantly better for small (n < 200) matrices when using BLAS.set_num_threads(1), giving results similar to @tkoolen’s MKL results in that range.
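For anyone reproducing this, the single-threaded numbers just require setting the BLAS thread count before timing; note that this is a process-global setting, affecting all subsequent BLAS calls:

using LinearAlgebra

BLAS.set_num_threads(1)   # restrict OpenBLAS/MKL to one thread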


I’ve noticed that for Cholesky decompositions as well: https://github.com/JuliaRobotics/RigidBodyDynamics.jl/issues/500. I think it’s https://github.com/xianyi/OpenBLAS/issues/1622.


Automatically? Or does the developer of the package need to change something (use MKL syntax)?
Say I’m using GLM.jl or MixedModels.jl.

Automatically.

My tests were run with one thread as well, both on the Windows system and on the Linux SMP machine.

Apologies, I have to backpedal here: my testing methodology was flawed (I controlled the number of Julia threads, not the number of BLAS threads; the distinction is sketched in code after the list below). Here is the picture for the Windows laptop, corrected so that only a single thread is allowed in BLAS routines:
[plot: corrected single-threaded BLAS timings on the Windows laptop]

  1. The good news is that with a single thread, BLAS-supported LU is a lot more efficient for small matrices, even though, at least on my configuration, BLAS LU is still slower than the Julia generic version for small matrices.
  2. One piece of bad news is that for larger matrices, BLAS-supported LU with a single thread is noticeably slower than with multiple threads (around a factor of 2 on my two-core laptop).
  3. Hence the second piece of bad news: it appears that the user must control the number of BLAS threads manually to get predictable and (somewhat) close-to-optimal performance.
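To make the flaw concrete, here is the distinction that tripped me up, in code (a sketch; the thread counts are illustrative):

using LinearAlgebra

# Two independent knobs that are easy to conflate:
Base.Threads.nthreads()     # Julia threads; fixed at startup by JULIA_NUM_THREADS
BLAS.set_num_threads(1)     # BLAS threads; can be changed at runtime

# Starting Julia with JULIA_NUM_THREADS=1 does NOT restrict BLAS:
# the BLAS library manages its own thread pool unless told otherwise.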

Excellent observation!

I’m not sure what to make of the referenced OpenBLAS thread though. It seemed to suggest at one point that the issue was solved. So why are we still seeing it now?


@PetrKryslUCSD, @tkoolen, @kristoffer.carlsson, @antoine-levitt, @stevengj, @Ralph_Smith

I have a hypothesis about what is going on here. I ran into the same “erratic” behavior while benchmarking Julia’s cholesky function. One likely explanation is that, above a certain matrix size, the LAPACK LU and Cholesky algorithms switch into partitioned (blocked) mode, effectively transforming some of the operations into matrix-matrix operations that BLAS can parallelize (see Higham, 2011, p. 5, referenced in the link below). This is expected to pay off only when the matrix-matrix operations involve large matrices. The complete analysis and code, with figures, is here.
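One way to probe this hypothesis, offered here as a sketch (the sizes and thread counts are illustrative), is to time lu! with one and with several BLAS threads across the suspected threshold; if LAPACK only switches to the blocked algorithm past some size, the extra threads should only start paying off beyond that point:

using BenchmarkTools, LinearAlgebra, Random

mat(N) = rand(MersenneTwister(0), N, N) + N * I  # well-conditioned test matrix

for N in (50, 100, 200, 400, 800)
    BLAS.set_num_threads(1)
    t1 = @belapsed lu!(B) setup=(B = copy($(mat(N)))) evals=1
    BLAS.set_num_threads(4)
    t4 = @belapsed lu!(B) setup=(B = copy($(mat(N)))) evals=1
    println("N = $N: 1 thread ", t1, " s, 4 threads ", t4, " s")
end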

PS: it seems to me that the LAPACK threshold for switching into partitioned mode should be raised, or, even better, a way to optimize the threshold for given hardware could be provided as an option, a bit in the spirit of FFTW. Also, pure-Julia versions of the partitioned LU and Cholesky algorithms should be able to achieve performance comparable to LAPACK.

Yeah, this is known, though no action has been taken. See Functions are not guarded against early threading · Issue #1886 · xianyi/OpenBLAS · GitHub.