LU factorization performance issue

I haven’t checked the OpenBLAS versions, but I’m guessing Julia 1.7 is still shipping an OpenBLAS predating Tiger Lake support.
I also recommend using MKL on Julia >= 1.7.
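
For anyone who wants to try this, a minimal sketch of switching backends with the MKL.jl package (assuming Julia >= 1.7, where libblastrampoline lets the BLAS backend be swapped at runtime):

```julia
using LinearAlgebra

# Show the BLAS libraries currently forwarded to (OpenBLAS by default)
BLAS.get_config()

# Loading MKL.jl (after `] add MKL`) swaps the backend for this session
using MKL

# get_config() should now list MKL instead of OpenBLAS
BLAS.get_config()
```

Note that `using MKL` only affects the current session; put it in your startup.jl if you want it every time.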


Even if the user is on an AMD or ARM CPU?

I personally am really annoyed by Intel’s history with MKL, and by the fact that we will never be able to ship MKL with Julia anyway.


Definitely not if they’re on ARM, but worth considering if they’re on AMD; I would recommend benchmarking there, though.
But OP is on Intel, where it’s very unlikely to get worse performance, and for things like small-size LU it will probably get substantially better performance than OpenBLAS.

I wouldn’t say “never”. Microsoft ships R with MKL.
Microsoft has more lawyers than JC, but R also has a lot more GPL code than Julia (R is predominantly GPL, while Julia has very little).

EDIT:
I get

julia> using LinearAlgebra

julia> const B = rand(10000,10000);

julia> @time lu(B);
  1.504326 seconds (4 allocations: 763.016 MiB, 0.37% gc time)

julia> @time lu(B);
  1.517499 seconds (4 allocations: 763.016 MiB, 0.47% gc time)

julia> BLAS.set_num_threads(Sys.CPU_THREADS÷2);

julia> @time lu(B);
  1.284092 seconds (4 allocations: 763.016 MiB, 5.09% gc time)

julia> @time lu(B);
  1.233056 seconds (4 allocations: 763.016 MiB, 0.32% gc time)

julia> using MKL

julia> @time lu(B);
  1.047044 seconds (4 allocations: 763.016 MiB, 6.31% gc time)

julia> @time lu(B);
  0.976319 seconds (4 allocations: 763.016 MiB, 0.59% gc time)

julia> versioninfo()
Julia Version 1.9.0-DEV.635
Commit 5ef75cbf5b (2022-05-24 19:07 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.3 (ORCJIT, cascadelake)
  Threads: 36 on 36 virtual cores

Trying on a 32 core AMD Epyc Zen3 system:

julia> const B = rand(10000,10000);

julia> @time lu(B);
 17.353556 seconds (4 allocations: 763.016 MiB, 0.13% gc time)

julia> @time lu(B);
 19.108399 seconds (4 allocations: 763.016 MiB, 0.51% gc time)

julia> BLAS.set_num_threads(Sys.CPU_THREADS÷2);

julia> @time lu(B);
  2.869467 seconds (4 allocations: 763.016 MiB, 4.66% gc time)

julia> @time lu(B);
  2.691990 seconds (4 allocations: 763.016 MiB, 0.07% gc time)

julia> using MKL

julia> @time lu(B);
  2.953514 seconds (4 allocations: 763.016 MiB, 2.14% gc time)

julia> @time lu(B);
  2.400246 seconds (4 allocations: 763.016 MiB, 0.18% gc time)

julia> BLAS.set_num_threads(Sys.CPU_THREADS÷2);

julia> @time lu(B);
  1.819789 seconds (4 allocations: 763.016 MiB, 5.38% gc time)

julia> @time lu(B);
  1.749505 seconds (4 allocations: 763.016 MiB, 0.18% gc time)

julia> versioninfo()
Julia Version 1.9.0-DEV.634
Commit 39a24eb0d0 (2022-05-22 22:04 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7513 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.3 (ORCJIT, znver3)
  Threads: 32 on 64 virtual cores

MKL is about 50% faster than OpenBLAS when both use 32 threads, and far faster than OpenBLAS with the default thread count.


Do you have to call set_num_threads() again after using MKL, or does the earlier setting carry over?

I have tried using MKL with a sparse-matrix package and got a speed boost of more than 50%!
Surface Pro 7, with

Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, icelake-client)
Environment:
  JULIA_SSL_CA_ROOTS_PATH =

No, but the earlier ccall to an OpenBLAS function isn’t going to change MKL’s settings.
Note that setting the number of threads again improves MKL’s performance (>2 s → <2 s).

Anyway, just to confirm:

julia> using LinearAlgebra

julia> BLAS.get_num_threads()
64

julia> BLAS.set_num_threads(Sys.CPU_THREADS÷2);

julia> BLAS.get_num_threads()
32

julia> using MKL

julia> BLAS.get_num_threads()
64

julia> BLAS.set_num_threads(Sys.CPU_THREADS÷2);

julia> BLAS.get_num_threads()
32

What I wonder now is: do the sparse-matrix routines of CHOLMOD and UMFPACK take advantage of MKL
behind the scenes? Enabling MKL made absolutely no difference when calling cholesky and lu.

FWIW, GitHub - carstenbauer/julia-mkl-amd: Intel MKL vs OpenBLAS on AMD HPC CPUs in Julia


For the record, Julia v1.7.3 comes with OpenBLAS 0.3.13:

Support for Tiger Lake was added in OpenBLAS v0.3.14.


@Abhilash you can also try to set

export OPENBLAS_CORETYPE=SKYLAKEX

before starting Julia to force the Skylake-X kernels; that’s what newer OpenBLAS versions use for this CPU anyway:

It just doesn’t detect it automatically in OpenBLAS 0.3.13.
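
As a sketch of how that looks in practice (the variable has to be set in the shell before the Julia process starts; it cannot be changed from within a running session):

```shell
# Force OpenBLAS's Skylake-X kernels when auto-detection fails
# (e.g. on Tiger Lake with the OpenBLAS 0.3.13 shipped by Julia 1.7)
export OPENBLAS_CORETYPE=SKYLAKEX
julia
```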

I seem to remember there was an environment variable to show what target OpenBLAS chooses dynamically (I guess in your case it’s falling back to the generic x86_64 kernels), but I can’t find it in the documentation (nor grepping getenv in the source code of OpenBLAS, so maybe I dreamed it).
