Hi everyone,
I’ve been having to compute a lot of large dense non-Hermitian eigenproblems on an HPC cluster that uses AMD EPYC Milan CPUs. I thought I could use Intel MKL, since it usually does a better job for `zgeev` (and presumably many other LAPACK routines) than OpenBLAS.
It has been known for a while that Intel actively slows down AMD CPUs in its MKL library. At first there was the `MKL_DEBUG_CPU_TYPE` environment variable, which Intel removed in later versions of MKL. Fortunately, there is a workaround: preloading a fake library using `LD_PRELOAD`, as described here. Unfortunately, this trick does not work for Julia, as discussed in a related post.
Let me illustrate the potential performance difference. We can trick MKL using Julia v1.6.7 built from source with `USE_INTEL_MKL=1`. Then, using the `LD_PRELOAD` hack gives:
```julia
julia> using LinearAlgebra

julia> BLAS.set_num_threads(8)

julia> n=200; a = randn(ComplexF64,n,n);

julia> @time eigen(a);
  0.835277 seconds (1.67 M allocations: 102.661 MiB, 1.57% gc time, 78.93% compilation time)

julia> n=2000; a = randn(ComplexF64,n,n);

julia> @time eigen(a);
  5.833620 seconds (26 allocations: 133.959 MiB, 0.40% gc time)
```
Without the `LD_PRELOAD`:
```julia
julia> using LinearAlgebra

julia> BLAS.set_num_threads(8)

julia> n=200; a = randn(ComplexF64,n,n);

julia> @time eigen(a);
  0.740193 seconds (1.67 M allocations: 102.664 MiB, 2.00% gc time, 90.19% compilation time)

julia> n=2000; a = randn(ComplexF64,n,n);

julia> @time eigen(a);
  10.117982 seconds (26 allocations: 133.959 MiB, 0.21% gc time)
```
And using Julia v1.9 with MKL.jl (with or without `LD_PRELOAD`):
```julia
julia> using MKL

julia> using LinearAlgebra

julia> BLAS.set_num_threads(8)

julia> n=200; a = randn(ComplexF64,n,n);

julia> @time eigen(a);
  1.192825 seconds (1.67 M allocations: 109.709 MiB, 6.84% gc time, 89.60% compilation time)

julia> n=2000; a = randn(ComplexF64,n,n);

julia> @time eigen(a);
  9.663940 seconds (26 allocations: 133.974 MiB, 0.08% gc time)
```
For completeness, Julia v1.9 with OpenBLAS:
```julia
julia> using LinearAlgebra

julia> BLAS.set_num_threads(8)

julia> n=200; a = randn(ComplexF64,n,n);

julia> @time eigen(a);
  1.312937 seconds (1.67 M allocations: 109.224 MiB, 8.13% gc time, 91.58% compilation time)

julia> n=2000; a = randn(ComplexF64,n,n);

julia> @time eigen(a);
  8.914101 seconds (26 allocations: 126.161 MiB, 0.10% gc time)
```
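As a sanity check when comparing runs like these, Julia ≥ 1.7 can report which BLAS/LAPACK backend libblastrampoline is actually forwarding to, which makes it easy to confirm whether MKL or OpenBLAS is in use:

```julia
using LinearAlgebra

# Lists the currently loaded BLAS/LAPACK libraries, e.g. libopenblas64_
# by default, or libmkl_rt when `using MKL` ran first.
BLAS.get_config()
```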
Now, since I would like to avoid using Julia 1.6.7, and since MKL_jll.jl exists: what can we do to make the best use of MKL on AMD CPUs? Is there a simple way of changing MKL.jl or MKL_jll.jl to make the `LD_PRELOAD` hack work, or is there a Julia-internal solution?
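One Julia-internal avenue I have wondered about (a sketch only, not something I have verified on the cluster): since Julia 1.7 the BLAS backend is swapped at runtime through libblastrampoline, so in principle one could forward to a system-provided MKL inside a session that was launched under the `LD_PRELOAD` hack. The library path below is hypothetical and would need adjusting to the cluster’s MKL install:

```julia
using LinearAlgebra

# Hypothetical location of MKL's single dynamic library on the system;
# clear=true drops the previously forwarded backend (e.g. OpenBLAS) first.
BLAS.lbt_forward("/opt/intel/mkl/lib/intel64/libmkl_rt.so"; clear=true)
```

Whether the preloaded fake library would then be consulted by MKL’s dispatcher is exactly the open question, since it has to be resolved before any MKL symbols load.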
I hope I haven’t missed any discussions on this somewhere.