I haven’t checked the OpenBLAS versions, but I’m guessing Julia 1.7 still ships an OpenBLAS that predates Tiger Lake support.
I also recommend using MKL on Julia >= 1.7.
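For anyone who wants to try this: MKL.jl is a normal package, and on Julia >= 1.7 loading it swaps the libblastrampoline backend at runtime. A minimal sketch:
julia> using Pkg; Pkg.add("MKL")  # one-time install of the MKL.jl package
julia> using MKL  # swaps the BLAS/LAPACK backend from OpenBLAS to MKL for this session
julia> using LinearAlgebra
julia> BLAS.get_config()  # should now list libmkl_rt rather than libopenblas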
Even if the user is on an AMD or ARM CPU?
I personally am really annoyed by Intel’s history with MKL, and by the fact that we will never be able to ship MKL with Julia anyway.
Definitely not if they’re on ARM, but it’s worth considering if they’re on AMD; I would recommend benchmarking there first, though.
But the OP is on Intel, where MKL is very unlikely to give worse performance, and for things like small-size LU it will probably give substantially better performance than OpenBLAS.
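An easy way to check this on a given machine is to time the same factorization under both backends; a minimal sketch using BenchmarkTools (the 100×100 size is just an arbitrary stand-in for “small”, and restarting Julia between backends gives a cleaner comparison):
julia> using LinearAlgebra, BenchmarkTools
julia> A = rand(100, 100);
julia> @btime lu($A);  # once with the default OpenBLAS, once after `using MKL`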
I wouldn’t say “never”. Microsoft ships R with MKL.
Microsoft has more lawyers than JC, but R also has a lot more GPL code than Julia (R is predominantly GPL, while Julia has very little).
EDIT:
I get
julia> using LinearAlgebra
julia> const B = rand(10000,10000);
julia> @time lu(B);
1.504326 seconds (4 allocations: 763.016 MiB, 0.37% gc time)
julia> @time lu(B);
1.517499 seconds (4 allocations: 763.016 MiB, 0.47% gc time)
julia> BLAS.set_num_threads(Sys.CPU_THREADS÷2);
julia> @time lu(B);
1.284092 seconds (4 allocations: 763.016 MiB, 5.09% gc time)
julia> @time lu(B);
1.233056 seconds (4 allocations: 763.016 MiB, 0.32% gc time)
julia> using MKL
julia> @time lu(B);
1.047044 seconds (4 allocations: 763.016 MiB, 6.31% gc time)
julia> @time lu(B);
0.976319 seconds (4 allocations: 763.016 MiB, 0.59% gc time)
julia> versioninfo()
Julia Version 1.9.0-DEV.635
Commit 5ef75cbf5b (2022-05-24 19:07 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.3 (ORCJIT, cascadelake)
Threads: 36 on 36 virtual cores
Trying on a 32-core AMD EPYC Zen 3 system:
julia> const B = rand(10000,10000);
julia> @time lu(B);
17.353556 seconds (4 allocations: 763.016 MiB, 0.13% gc time)
julia> @time lu(B);
19.108399 seconds (4 allocations: 763.016 MiB, 0.51% gc time)
julia> BLAS.set_num_threads(Sys.CPU_THREADS÷2);
julia> @time lu(B);
2.869467 seconds (4 allocations: 763.016 MiB, 4.66% gc time)
julia> @time lu(B);
2.691990 seconds (4 allocations: 763.016 MiB, 0.07% gc time)
julia> using MKL
julia> @time lu(B);
2.953514 seconds (4 allocations: 763.016 MiB, 2.14% gc time)
julia> @time lu(B);
2.400246 seconds (4 allocations: 763.016 MiB, 0.18% gc time)
julia> BLAS.set_num_threads(Sys.CPU_THREADS÷2);
julia> @time lu(B);
1.819789 seconds (4 allocations: 763.016 MiB, 5.38% gc time)
julia> @time lu(B);
1.749505 seconds (4 allocations: 763.016 MiB, 0.18% gc time)
julia> versioninfo()
Julia Version 1.9.0-DEV.634
Commit 39a24eb0d0 (2022-05-22 22:04 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 64 × AMD EPYC 7513 32-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.3 (ORCJIT, znver3)
Threads: 32 on 64 virtual cores
MKL is about 50% faster than OpenBLAS when using 32 threads, and far faster with the default number of threads.
Do you have to set_num_threads() back to the default after using MKL?
I have tried using MKL with a sparse-matrix package and got more than a 50% speed boost!
Surface Pro 7, with
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, icelake-client)
Environment:
JULIA_SSL_CA_ROOTS_PATH =
No, but the earlier ccall to an OpenBLAS function isn’t going to change MKL’s settings. Notice that setting the number of threads again improved MKL’s performance above (from over 2 s to under 2 s).
Anyway, just to confirm:
julia> using LinearAlgebra
julia> BLAS.get_num_threads()
64
julia> BLAS.set_num_threads(Sys.CPU_THREADS÷2);
julia> BLAS.get_num_threads()
32
julia> using MKL
julia> BLAS.get_num_threads()
64
julia> BLAS.set_num_threads(Sys.CPU_THREADS÷2);
julia> BLAS.get_num_threads()
32
What I wonder now is: do the sparse-matrix routines of CHOLMOD and UMFPACK take advantage of MKL behind the scenes? Enabling MKL made absolutely no difference when calling cholesky and lu.
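One thing that is easy to verify is which backend libblastrampoline is forwarding the dense BLAS/LAPACK calls to; that doesn’t settle whether the sparse solvers route through it, but it at least rules out MKL not being loaded at all:
julia> using LinearAlgebra, MKL
julia> BLAS.get_config()  # lists the loaded backends; should mention libmkl_rt after `using MKL`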
For the record, Julia v1.7.3 comes with OpenBLAS 0.3.13:
https://github.com/JuliaLang/julia/blob/v1.7.3/stdlib/OpenBLAS_jll/Project.toml#L3
Support for Tiger Lake was added in OpenBLAS v0.3.14.
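You can also read the bundled version from a running session instead of checking GitHub; a quick sketch that greps the stdlib’s own Project.toml:
julia> toml = joinpath(Sys.STDLIB, "OpenBLAS_jll", "Project.toml");
julia> filter(contains("version"), readlines(toml))  # shows the bundled OpenBLAS version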
@Abhilash you can also try to set
export OPENBLAS_CORETYPE=SKYLAKEX
before starting Julia to force the Skylake-X kernels; that’s what OpenBLAS uses for Tiger Lake anyway:
https://github.com/xianyi/OpenBLAS/blob/5e9a91259158aaccb70343f398df7394f12c6222/cpuid_x86.c#L1463-L1467
OpenBLAS 0.3.13 just doesn’t detect this CPU automatically.
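Concretely (shell syntax; the large lu call is just a quick way to verify the override helps):
$ OPENBLAS_CORETYPE=SKYLAKEX julia
julia> using LinearAlgebra
julia> @time lu(rand(10000, 10000));  # compare against the same call without the override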
I seem to remember there was an environment variable to show what target OpenBLAS chooses dynamically (I guess in your case it’s falling back to the generic x86_64 kernels), but I can’t find it in the documentation (nor by grepping for getenv in the OpenBLAS source code, so maybe I dreamed it).