LoopVectorization should still be faster than StaticArrays at most sizes below 100, if you can avoid memcpy and allocations. Unfortunately, that is currently easier said than done. AmulB below calls memcpy twice, even though both calls are unnecessary. The first copies an SArray to another place on the stack; it should simply forward the existing pointer instead. The second copies stack memory from one place to another; it should just replace all stores through the source pointer with stores through the destination pointer. So it's conceivable that AmulB will match AmulB!'s performance below, but those changes aren't going to happen before Julia 1.11.
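To make that first copy concrete, here is a toy example (separate from the benchmark below, and only an analogy for what LoopVectorization does internally): an SMatrix is immutable and doesn't expose a pointer, so code that wants to read it through a raw pointer first has to copy it into mutable, addressable memory such as an MMatrix.

using StaticArrays

# Toy illustration only: reading an SMatrix through a raw pointer requires
# copying it into mutable, addressable memory first (here an MMatrix).
# That copy is the kind of memcpy discussed above.
function sum_via_pointer(A::SMatrix{M,N,Float64}) where {M,N}
    Am = MMatrix(A)                 # copies the SMatrix into an MMatrix
    GC.@preserve Am begin
        p = pointer(Am)
        s = 0.0
        for i in 1:M*N
            s += unsafe_load(p, i)  # read element i through the raw pointer
        end
        s
    end
end

Atoy = @SMatrix rand(4, 4)
sum_via_pointer(Atoy) ≈ sum(Atoy)   # true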
julia> using StaticArrays, LoopVectorization, BenchmarkTools
julia> @inline function AmulB!(C, A, B)
           @turbo for n ∈ indices((C,B),2), m ∈ indices((C,A),1)
               Cmn = zero(eltype(C))
               for k ∈ indices((A,B),(2,1))
                   Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
           return C
       end
AmulB! (generic function with 1 method)
julia> @inline function AmulB(A::SMatrix{M,K,T}, B::SMatrix{K,N,S}) where {M,K,N,S,T}
           SMatrix(AmulB!(MMatrix{M,N,promote_type(T,S)}(undef), A, B))
       end
AmulB (generic function with 1 method)
julia> M=K=N=7; A = @SMatrix(rand(M,K)); B = @SMatrix(rand(K,N));
julia> AmulB(A,B) ≈ A*B
true
julia> @benchmark AmulB($(Ref(A))[], $(Ref(B))[])
BenchmarkTools.Trial: 10000 samples with 986 evaluations.
Range (min … max): 51.460 ns … 234.389 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 53.489 ns ┊ GC (median): 0.00%
Time (mean ± σ): 54.375 ns ± 8.143 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▇█▇▇▆▄▁
▂▂▃▃▅████████▆▅▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂ ▃
51.5 ns Histogram: frequency by time 64.8 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark $(Ref(A))[] * $(Ref(B))[]
BenchmarkTools.Trial: 10000 samples with 985 evaluations.
Range (min … max): 46.831 ns … 62.294 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 47.868 ns ┊ GC (median): 0.00%
Time (mean ± σ): 47.871 ns ± 0.941 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▅▃ ▄▂▂▆█▂
▃████▅▄▄████████▆▅▄▄▄▃▃▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▂▂▁▁▁▂▁▂▂▂ ▃
46.8 ns Histogram: frequency by time 52.4 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> Am = MMatrix(A); Bm = MMatrix(B); Cm = MMatrix{M,N,Base.promote_eltype(Am,Bm)}(undef);
julia> @benchmark AmulB!($Cm, $Am, $Bm)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
Range (min … max): 29.005 ns … 605.039 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 29.538 ns ┊ GC (median): 0.00%
Time (mean ± σ): 29.671 ns ± 5.795 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ ▁ ▃▆▂█▂▄▁
▁▁▁▁▁▁▁▁▃▄██▇█▇█▇███████▆█▅▇▅▆▆▃▄▃▄▄▃▃▂▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
29 ns Histogram: frequency by time 30.6 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7513 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 64 default, 0 interactive, 32 GC (on 64 virtual cores)
Environment:
  LD_RUN_PATH = /usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/
  LD_LIBRARY_PATH = /usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/
  JULIA_PATH = @.
  JULIA_NUM_THREADS = 64
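If you want to see the two memcpy calls for yourself, you can dump the LLVM IR for AmulB after running the session above and look for llvm.memcpy in the output (a quick sanity check, not part of the benchmarks):

# Quick check: after defining AmulB, A, and B as in the session above, the two
# copies discussed at the top should show up as `call void @llvm.memcpy...`
# lines in the generated IR.
using InteractiveUtils   # exports @code_llvm (loaded automatically in the REPL)
@code_llvm debuginfo=:none AmulB(A, B)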