How to circumvent Intel's AMD discrimination in MKL from Julia v1.7 onwards?

Hi everyone,
I have ended up needing to solve a lot of large, dense, non-Hermitian eigenproblems on an HPC cluster that uses AMD EPYC Milan CPUs. I thought I could use Intel MKL, since it usually does a better job for zgeev (and presumably many other LAPACK routines) than OpenBLAS.

It has been known for a while that Intel actively slows down AMD CPUs in its MKL library. First, there was the MKL_DEBUG_CPU_TYPE flag, which Intel then removed in later versions of MKL. Fortunately, there is a workaround: preloading a fake library using LD_PRELOAD, as described here. Unfortunately, this trick does not work for Julia, as discussed in a related post.

I illustrate the potential performance difference:

We can trick MKL using Julia v1.6.7 built from source with USE_INTEL_MKL=1. Then, using the LD_PRELOAD hack gives

julia> using LinearAlgebra

julia> BLAS.set_num_threads(8)

julia> n=200; a = randn(ComplexF64,n,n);

julia> @time eigen(a);
  0.835277 seconds (1.67 M allocations: 102.661 MiB, 1.57% gc time, 78.93% compilation time)

julia> n=2000; a = randn(ComplexF64,n,n);

julia> @time eigen(a);
  5.833620 seconds (26 allocations: 133.959 MiB, 0.40% gc time)

Without LD_PRELOAD:

julia> using LinearAlgebra

julia> BLAS.set_num_threads(8)

julia> n=200; a = randn(ComplexF64,n,n);

julia> @time eigen(a);
  0.740193 seconds (1.67 M allocations: 102.664 MiB, 2.00% gc time, 90.19% compilation time)

julia> n=2000; a = randn(ComplexF64,n,n);

julia> @time eigen(a);
 10.117982 seconds (26 allocations: 133.959 MiB, 0.21% gc time)

And using Julia v1.9 with MKL.jl (with or without LD_PRELOAD):

julia> using MKL

julia> using LinearAlgebra

julia> BLAS.set_num_threads(8)

julia> n=200; a = randn(ComplexF64,n,n);

julia> @time eigen(a);
  1.192825 seconds (1.67 M allocations: 109.709 MiB, 6.84% gc time, 89.60% compilation time)

julia> n=2000; a = randn(ComplexF64,n,n);

julia> @time eigen(a);
  9.663940 seconds (26 allocations: 133.974 MiB, 0.08% gc time)
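
When comparing backends like this, it can help to confirm which library Julia is actually forwarding BLAS/LAPACK calls to. A minimal sketch, assuming Julia v1.7 or later (where libblastrampoline is used); the helper name `loaded_blas_names` is made up for illustration:

```julia
using LinearAlgebra

# Return the file names of all BLAS/LAPACK libraries currently loaded
# through libblastrampoline (hypothetical helper, Julia >= 1.7).
loaded_blas_names() = [basename(lib.libname) for lib in BLAS.get_config().loaded_libs]

# Print each loaded library together with its integer interface (lp64 or ilp64).
for lib in BLAS.get_config().loaded_libs
    println(basename(lib.libname), "  (interface: ", lib.interface, ")")
end
```

After `using MKL`, I would expect an MKL entry to show up here instead of OpenBLAS, but this only tells you which library is loaded, not which code path MKL picks internally on an AMD CPU.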

For completeness, Julia v1.9 with OpenBLAS:

julia> using LinearAlgebra

julia> BLAS.set_num_threads(8)

julia> n=200; a = randn(ComplexF64,n,n);

julia> @time eigen(a);
  1.312937 seconds (1.67 M allocations: 109.224 MiB, 8.13% gc time, 91.58% compilation time)

julia> n=2000; a = randn(ComplexF64,n,n);

julia> @time eigen(a);
  8.914101 seconds (26 allocations: 126.161 MiB, 0.10% gc time)
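
The n=200 timings above are dominated by compilation, so only the n=2000 numbers say anything about the BLAS backend. A rough sketch of a timing helper that warms up first and keeps the fastest of a few runs; the name `time_eigen` and the `reps` parameter are my own choices, not anything from MKL.jl:

```julia
using LinearAlgebra

# Time `eigen` on a random n×n complex matrix and return the fastest of
# `reps` runs in seconds (hypothetical helper).
function time_eigen(n::Integer; reps::Integer = 3)
    a = randn(ComplexF64, n, n)
    eigen(randn(ComplexF64, 8, 8))   # warm-up call so compilation is not timed
    return minimum(@elapsed(eigen(a)) for _ in 1:reps)
end

BLAS.set_num_threads(8)
println("n = 200: ", time_eigen(200), " s")
```

For the comparison in this post one would call `time_eigen(2000)` once per setup (with and without the hack, MKL vs. OpenBLAS).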

Now, since I would like to avoid using Julia 1.6.7, and since MKL_jll.jl exists, what can we do to make the best use of MKL on AMD CPUs? Is there a simple way of changing MKL.jl or MKL_jll.jl to make the LD_PRELOAD hack work, or is there a Julia-internal solution?

I hope I haven’t missed any discussions on this somewhere.


Okay, I actually found a hacky solution myself for now:

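using MKL
# Directory containing the MKL libraries shipped by MKL_jll
mklpath = dirname(MKL.MKL_jll.libmkl_rt_path)

cd(mklpath)

# Remove the symlinks (they point to libmkl_rt.so.2 and libmkl_core.so.2)
rm("libmkl_rt.so")
rm("libmkl_core.so")

# Shim function that makes MKL's vendor check report an Intel CPU
write("libamdmkl.c","int mkl_serv_intel_cpu_true() {return 1;}")
# Rebuild both names as shim libraries linked against the real ones;
# the $ORIGIN rpath resolves them relative to this directory
run(`gcc -shared -o libmkl_core.so -Wl,-rpath=''\$ORIGIN'' libamdmkl.c libmkl_core.so.2`)
run(`gcc -shared -o libmkl_rt.so -Wl,-rpath=''\$ORIGIN'' libamdmkl.c libmkl_rt.so.2`)
rm("libamdmkl.c")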

The original libmkl_rt.so and libmkl_core.so are just symlinks to libmkl_rt.so.2 and libmkl_core.so.2, so we don’t mess things up too much (for those who are worried).
I guess there are many reasons not to do this, or cases where it fails, but it works on my system and I guess on many similar ones as well!

EDIT: Actually, this didn’t work properly at first. It didn’t link correctly to the libmkl_rt.so.2 library; the hack only worked because I had another MKL version in the library path. I don’t know how to link the library properly without relative paths, though.
I have updated the code so that it works with relative paths. Maybe someone will come up with something more elegant!
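
To check that a shim built this way really exports the symbol and returns 1, one can build it in a scratch directory and call it through Libdl. This is only a sketch under some assumptions: `gcc` is on the PATH, the function name `build_and_call_shim` is made up, and I added `-fPIC`, which the commands above do not use. It tests the shim on its own and does not touch the MKL installation.

```julia
using Libdl

# Build a tiny shared library exporting `mkl_serv_intel_cpu_true` in a
# temporary directory, load it, call the function, and return its result.
# Hypothetical helper; assumes gcc is available on PATH.
function build_and_call_shim()
    mktempdir() do dir
        src = joinpath(dir, "libamdmkl.c")
        lib = joinpath(dir, "libamdmkl." * Libdl.dlext)
        write(src, "int mkl_serv_intel_cpu_true() {return 1;}")
        run(`gcc -shared -fPIC -o $lib $src`)
        handle = Libdl.dlopen(lib)
        try
            fptr = Libdl.dlsym(handle, :mkl_serv_intel_cpu_true)
            return ccall(fptr, Cint, ())   # call the shim function
        finally
            Libdl.dlclose(handle)          # unload before the directory is removed
        end
    end
end
```

Whether MKL then actually picks the fast code path still has to be judged from timings like the ones in the first post.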
