LU factorization performance issue

I haven’t checked the OpenBLAS versions, but I’m guessing Julia 1.7 is still shipping an OpenBLAS predating Tiger Lake support.
I also recommend using MKL on Julia >= 1.7.
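
For anyone who wants to try this, a minimal sketch of switching backends with the MKL.jl package (assuming Julia >= 1.7, where libblastrampoline lets the BLAS backend be swapped at runtime):

```julia
using LinearAlgebra

# Show the BLAS libraries currently forwarded to (OpenBLAS by default)
BLAS.get_config()

# Loading MKL.jl (after `] add MKL`) swaps the backend for this session
using MKL

# get_config() should now list MKL instead of OpenBLAS
BLAS.get_config()
```

Note that `using MKL` only affects the current session; put it in your startup.jl if you want it every time.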


Even if the user is on an AMD or ARM CPU?

I personally am really annoyed by Intel’s history with MKL, and by the fact that we will never be able to ship MKL with Julia anyway.


Definitely not if they’re on ARM, but worth considering if they’re on AMD; I would recommend benchmarking there, though.
But OP is on Intel, where it’s very unlikely to get worse performance, and for things like small-size LU it will probably get substantially better performance than OpenBLAS.

I wouldn’t say “never”. Microsoft ships R with MKL.
Microsoft has more lawyers than JC, but R also has a lot more GPL code than Julia (R is predominantly GPL, while Julia has very little).

EDIT:
I get

julia> using LinearAlgebra

julia> const B = rand(10000,10000);

julia> @time lu(B);
  1.504326 seconds (4 allocations: 763.016 MiB, 0.37% gc time)

julia> @time lu(B);
  1.517499 seconds (4 allocations: 763.016 MiB, 0.47% gc time)

julia> BLAS.set_num_threads(Sys.CPU_THREADS÷2);

julia> @time lu(B);
  1.284092 seconds (4 allocations: 763.016 MiB, 5.09% gc time)

julia> @time lu(B);
  1.233056 seconds (4 allocations: 763.016 MiB, 0.32% gc time)

julia> using MKL

julia> @time lu(B);
  1.047044 seconds (4 allocations: 763.016 MiB, 6.31% gc time)

julia> @time lu(B);
  0.976319 seconds (4 allocations: 763.016 MiB, 0.59% gc time)

julia> versioninfo()
Julia Version 1.9.0-DEV.635
Commit 5ef75cbf5b (2022-05-24 19:07 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.3 (ORCJIT, cascadelake)
  Threads: 36 on 36 virtual cores

Trying on a 32 core AMD Epyc Zen3 system:

julia> const B = rand(10000,10000);

julia> @time lu(B);
 17.353556 seconds (4 allocations: 763.016 MiB, 0.13% gc time)

julia> @time lu(B);
 19.108399 seconds (4 allocations: 763.016 MiB, 0.51% gc time)

julia> BLAS.set_num_threads(Sys.CPU_THREADS÷2);

julia> @time lu(B);
  2.869467 seconds (4 allocations: 763.016 MiB, 4.66% gc time)

julia> @time lu(B);
  2.691990 seconds (4 allocations: 763.016 MiB, 0.07% gc time)

julia> using MKL

julia> @time lu(B);
  2.953514 seconds (4 allocations: 763.016 MiB, 2.14% gc time)

julia> @time lu(B);
  2.400246 seconds (4 allocations: 763.016 MiB, 0.18% gc time)

julia> BLAS.set_num_threads(Sys.CPU_THREADS÷2);

julia> @time lu(B);
  1.819789 seconds (4 allocations: 763.016 MiB, 5.38% gc time)

julia> @time lu(B);
  1.749505 seconds (4 allocations: 763.016 MiB, 0.18% gc time)

julia> versioninfo()
Julia Version 1.9.0-DEV.634
Commit 39a24eb0d0 (2022-05-22 22:04 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7513 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.3 (ORCJIT, znver3)
  Threads: 32 on 64 virtual cores

MKL is about 50% faster than OpenBLAS when both use 32 threads, and far faster than OpenBLAS with the default thread count.


Do you have to call set_num_threads() again after using MKL, or does the earlier setting carry over?

I have tried using MKL with a sparse-matrix package and got a speed boost of more than 50%!
Surface Pro 7, with

Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, icelake-client)
Environment:
  JULIA_SSL_CA_ROOTS_PATH =

No, but the earlier ccall to an OpenBLAS function isn’t going to change MKL’s settings.
Note that setting the number of threads again improves MKL’s performance (>2 s → <2 s).

Anyway, just to confirm:

julia> using LinearAlgebra

julia> BLAS.get_num_threads()
64

julia> BLAS.set_num_threads(Sys.CPU_THREADS÷2);

julia> BLAS.get_num_threads()
32

julia> using MKL

julia> BLAS.get_num_threads()
64

julia> BLAS.set_num_threads(Sys.CPU_THREADS÷2);

julia> BLAS.get_num_threads()
32

What I wonder now is: do the sparse-matrix routines of CHOLMOD and UMFPACK take advantage of MKL
behind the scenes? Enabling MKL made absolutely no difference when calling cholesky and lu.

FWIW, GitHub - carstenbauer/julia-mkl-amd: Intel MKL vs OpenBLAS on AMD HPC CPUs in Julia


For the record, Julia v1.7.3 comes with OpenBLAS 0.3.13:

Support for Tiger Lake was added in OpenBLAS v0.3.14.


@Abhilash you can also try to set

export OPENBLAS_CORETYPE=SKYLAKEX

before starting Julia to force the Skylake-X kernels; that’s what newer OpenBLAS versions use for this CPU anyway:

It just doesn’t detect it automatically in OpenBLAS 0.3.13.
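
As a sketch of how that looks in practice (the variable has to be set in the shell before the Julia process starts; it cannot be changed from within a running session):

```shell
# Force OpenBLAS's Skylake-X kernels when auto-detection fails
# (e.g. on Tiger Lake with the OpenBLAS 0.3.13 shipped by Julia 1.7)
export OPENBLAS_CORETYPE=SKYLAKEX
julia
```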

I seem to remember there was an environment variable to show what target OpenBLAS chooses dynamically (I guess in your case it’s falling back to the generic x86_64 kernels), but I can’t find it in the documentation (nor grepping getenv in the source code of OpenBLAS, so maybe I dreamed it).
