Acceleration of Intel MKL on AMD Ryzen CPUs

Some of us use Intel MKL with Julia for improved performance.
Intel MKL is composed of several code paths for different CPU features (SSE2, SSE4, AVX2, AVX512, etc.).
One of the issues with MKL is that it discriminates against non-Intel CPUs and uses the generic SSE2 code path even on AVX2-capable CPUs.

This specifically hurts the Ryzen 3xxx series, which has better AVX2 performance than Intel’s comparable CPUs.

It seems people found a way around it: by defining a system / environment variable, users can force Intel MKL to use the AVX2 code path and skip the CPU dispatching mechanism.

One could read about it:

Though the above targets MATLAB, I think it should work for Julia + MKL.

On Windows it requires a launcher batch file:

@echo off

set MKL_DEBUG_CPU_TYPE=5
matlab.exe 

It seems MKL_DEBUG_CPU_TYPE=5 selects the AVX2-capable CPU code path.
Instead of launching MATLAB, one should launch Julia.
The same should hold on other OSes.
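For Julia specifically, one could also set the variable from within Julia itself, e.g. in ~/.julia/config/startup.jl, as long as it happens before MKL is loaded. A minimal sketch, assuming MKL reads MKL_DEBUG_CPU_TYPE when the library is first loaded:

# ~/.julia/config/startup.jl
# Must run before MKL is loaded, since the variable is read at load time.
ENV["MKL_DEBUG_CPU_TYPE"] = "5"  # 5 = AVX2-capable code path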

I wonder if one could integrate this trick into Juno (on Julia Pro, for that matter).


How to Set the Environment Variable in Juno

In order to set the environment variable in Juno, one could do the following:

  1. Create a Launcher for Juno
    One could create a script file or batch file that launches Juno and sets the variable. For instance, in Windows, see the launcher defined in Guide: How to Create a Portable Julia Pro Installation for Windows.
  2. Edit the Init File of Juno
    • Open the Command Pane (Ctrl + Shift + p).
    • Type Init Script and choose: Application: Open Your Init Script.
    • A file named init.coffee or init.js will open. Add process.env["MKL_DEBUG_CPU_TYPE"] = "5" as its last line.
    • Save and restart Juno.

Admitting my ignorance here: I thought one had to compile and link Julia from source in order to use Intel MKL.
I am sure an expert will be along soon to correct me…
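(For what it’s worth: on recent Julia versions no source build is needed; the MKL.jl package swaps the BLAS/LAPACK backend at runtime. A quick sanity check, assuming MKL.jl is installed:)

using MKL            # load first, before doing any linear algebra
using LinearAlgebra
BLAS.get_config()    # should now list an MKL library instead of OpenBLAS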

https://www.reddit.com/r/Amd/comments/e4klj0/intel_is_still_sneakily_sabotaging_amd/


latest:

" Intel MKL has been known to use a SSE code paths on AMD CPUs that support newer SIMD instructions such as those that use the Zen microarchitecture. A (by now) well-known trick has been to set the MKL_DEBUG_CPU_TYPE environment variable to the value 5 to force the use of AVX2 kernels on AMD Zen CPUs. Unfortunately, this variable has been removed from Intel MKL 2020 Update 1 and later. This can be confirmed easily by running a program that uses MKL with ltrace -e getenv ."

https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html

EDIT:
Read more:

  • “Good news: Intel seems to be adding Zen kernels”
  • “Bad news: sgemm is not yet implemented”
  • “A temporary workaround”

Hacker News discussion:


wow, more active hostility from Intel, who would have thought. /s

edit: nvm, the tone fooled me. But again, removing the option before implementing everything in BLAS for Zen is still bad, since it removes an *existing* solution.

Note that the rest of the article shows that they removed it because they added Zen-specific code paths that are as fast or faster.


Have there been any developments on this lately? I’m asking because I’m considering buying a Ryzen computer, but since Julia is such a big part of my work, I won’t do it if I know the performance is going to be worse than on an Intel one.

Thanks!

Yes. LoopVectorization.jl, TriangularSolve.jl, RecursiveFactorization.jl, and Octavian.jl are all very optimized on my Ryzen 5950X and outperform MKL on it. That’s what SciML defaults to under the hood now. Since the pure-Julia BLAS tools are good enough, this issue is effectively nullified. (Note that they do not have full coverage of BLAS/LAPACK, though.)
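For instance, a minimal sketch of the pure-Julia matrix multiply from Octavian.jl (the matrix sizes here are arbitrary):

using Octavian  # pure-Julia matmul built on LoopVectorization

A = rand(500, 500); B = rand(500, 500)
C = Octavian.matmul(A, B)  # compare timings against A * B from your BLAS
C ≈ A * B                  # should be true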


If you’re using Linux you may trick MKL into taking the Intel code path on your Ryzen, which is the best you’ll be able to get from MKL.
I’m pretty sure the discrimination will stop in newer MKL versions, probably removed by Intel itself.

Anyhow, buy Ryzen. Nothing from Intel will beat the Ryzen 5950X / Ryzen 5900X unless you go Intel HEDT.

Hi:
What is the state of the art on this issue now? My interest is in running principal component analysis on very large (non-sparse) matrices.
More precisely, I’d like to know, for a Ryzen 9 7950X (AM5):

  • Does Intel MKL outperform the Julia built-in package (on Ryzen 9)?
  • Is AMD’s AOCL available in Julia, or could it be soon?
  • If the answer to the last question is “yes”, does AOCL outperform MKL (on Ryzen 9)?

It would be very useful to have the answers to these questions, so thanks in advance.


From my experience last year I can say that parallel simulations using ModelingToolkit work MUCH better with MKL on Ryzen than without…

I cannot say anything about other use cases…

What is not clear to me is what you mean by “Julia built-in package”… Do you mean OpenBLAS (https://www.openblas.net/)?

LoopVectorization.jl etc. are much faster than anything else for small to medium-sized problems, but increase the compilation/load time… For very small problems, up to about 100 elements, StaticArrays.jl is best…

Thank you for your response, @ufechner7. To be clearer about what I mean by “Julia built-in package”: I use

using LinearAlgebra

and, afterwards,

F = svd(non_missing_anomalies)

where non_missing_anomalies is a very large, non-sparse matrix.
I think that, to use MKL, I should use:

using MKL
using LinearAlgebra

But will this suffice? Do I need to set an environment variable as suggested in previous posts?

Yes. The environment variable is no longer needed for newer versions of MKL.

I mean, benchmark your use case yourself…

using MKL # optional; must come at the very beginning
using LinearAlgebra
using BenchmarkTools

BLAS.set_num_threads(16) # try 8 or 16

A = rand(4096, 4096)  # stand-in for your non_missing_anomalies matrix
@benchmark svd($A)

Try it with and without MKL…

OK, setting the number of threads can also make a difference.

It’s worth noting that MKL works great on Ryzen, but appears to be deliberately sabotaged to work really badly on AMD Epyc (server) chips.


The only solution is waiting for “AOCL wrappers for faster sparse and dense operations on AMD CPUs” (Issue #430 · SciML/LinearSolve.jl · GitHub) to materialize.

The flexibility of Julia’s Linear Algebra framework should make it a reality one day.


LoopVectorization should still be faster than StaticArrays at most sizes below 100, if you can avoid memcpy and allocations. Unfortunately, that is currently easier said than done. AmulB below calls memcpy twice, even though both calls are unnecessary.
The first copies an SArray to another place on the stack; it should simply forward the pointer instead.
The other copies stack memory from one place to another; it should just replace all stores through the source pointer with stores through the destination pointer.
So it’s conceivable that AmulB will match AmulB!'s performance below…
…but those changes aren’t going to happen before Julia 1.11.

julia> using StaticArrays, LoopVectorization, BenchmarkTools

julia> @inline function AmulB!(C, A, B)
         @turbo for n ∈ indices((C,B),2), m ∈ indices((C,A),1)
           Cmn = zero(eltype(C))
           for k ∈ indices((A,B),(2,1))
             Cmn += A[m,k] * B[k,n]
           end
           C[m,n] = Cmn
         end
         return C
       end
AmulB! (generic function with 1 method)

julia> @inline function AmulB(A::SMatrix{M,K,T}, B::SMatrix{K,N,S}) where {M,K,N,S,T}
         SMatrix(AmulB!(MMatrix{M,N,promote_type(T,S)}(undef), A, B))
       end
AmulB (generic function with 1 method)

julia> M=K=N=7; A = @SMatrix(rand(M,K)); B = @SMatrix(rand(K,N));

julia> AmulB(A,B) ≈ A*B
true

julia> @benchmark AmulB($(Ref(A))[], $(Ref(B))[])
BenchmarkTools.Trial: 10000 samples with 986 evaluations.
 Range (min … max):  51.460 ns … 234.389 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     53.489 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   54.375 ns ±   8.143 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▁▇█▇▇▆▄▁                                                 
  ▂▂▃▃▅████████▆▅▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂ ▃
  51.5 ns         Histogram: frequency by time         64.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark $(Ref(A))[] * $(Ref(B))[]
BenchmarkTools.Trial: 10000 samples with 985 evaluations.
 Range (min … max):  46.831 ns … 62.294 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     47.868 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   47.871 ns ±  0.941 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▅▃     ▄▂▂▆█▂                                              
  ▃████▅▄▄████████▆▅▄▄▄▃▃▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▂▂▁▁▁▂▁▂▂▂ ▃
  46.8 ns         Histogram: frequency by time        52.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> Am = MMatrix(A); Bm = MMatrix(B); Cm = MMatrix{M,N,Base.promote_eltype(Am,Bm)}(undef);

julia> @benchmark AmulB!($Cm, $Am, $Bm)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  29.005 ns … 605.039 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     29.538 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   29.671 ns ±   5.795 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▂ ▁   ▃▆▂█▂▄▁                                      
  ▁▁▁▁▁▁▁▁▃▄██▇█▇█▇███████▆█▅▇▅▆▆▃▄▃▄▄▃▃▂▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  29 ns           Histogram: frequency by time         30.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7513 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 64 default, 0 interactive, 32 GC (on 64 virtual cores)
Environment:
  LD_RUN_PATH = /usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/
  LD_LIBRARY_PATH = /usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/
  JULIA_PATH = @.
  JULIA_NUM_THREADS = 64

I am sorry, I am new to Julia and I easily get lost in the library ecosystem. In order to take advantage of LoopVectorization.jl and/or StaticArrays.jl, should I use:

using LoopVectorization
using LinearAlgebra

or

using StaticArrays
using LinearAlgebra

and will the svd from LinearAlgebra take advantage of LoopVectorization and/or StaticArrays?

To use StaticArrays.jl you must declare your vectors/arrays as such, e.g.

using StaticArrays

vec = SA[1, 2, 3]

(well, don’t use global variables, this is just an example)

They come in two flavors, mutable or immutable, choose whatever you need.
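A small sketch of the two flavors (the values here are arbitrary):

using StaticArrays

v  = SVector(1.0, 2.0, 3.0)   # immutable, fixed-size
mv = MVector(1.0, 2.0, 3.0)   # mutable counterpart
mv[1] = 10.0                  # in-place updates only work on the mutable flavor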

LoopVectorization.jl is another beast; you need to use the correct macros, e.g.

using LoopVectorization  # provides the @turbo macro

function mydotavx(a, b)
    s = 0.0
    @turbo for i ∈ eachindex(a, b)
        s += a[i] * b[i]
    end
    s
end
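
A quick usage check (the data is arbitrary):

a = rand(1_000); b = rand(1_000)
mydotavx(a, b) ≈ sum(a .* b)  # should be true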

And the order of the using commands doesn’t matter, with one exception: to use MKL you must write using MKL at the very beginning…


Ok, but my intent is to compute a singular value decomposition (svd). How can I take advantage of these libraries for this specific purpose?