OpenBLAS is faster than Intel MKL on AMD Hardware (Ryzen)

Oh, I didn’t know that. Is there an easy way to check?

There’s a -p option when you start Julia, is that what you mean? I didn’t think I needed this, since I set the environment variable. Am I misunderstanding this?

The docs say,

If the underlying BLAS is using multiple threads, higher flop rates are realized. The number of BLAS threads can be set with BLAS.set_num_threads(n).

If the keyword argument parallel is set to true, peakflops is run in parallel on all the worker processors. The flop rate of the entire parallel computer is returned. When running in parallel, only 1 BLAS thread is used. The argument n still refers to the size of the problem that is solved on each processor.

It doesn’t make sense to me why parallel=true would force single-threaded BLAS, but ok :slight_smile:

That was without:

julia> BLAS.set_num_threads(16)

julia> LinearAlgebra.peakflops(16000)
3.5702446000519916e11

julia> BLAS.set_num_threads(32)

julia> LinearAlgebra.peakflops(16000)
3.293593157654745e11

Me too!

Correct on the first point. On the second, is that a general rule? Didn’t realize that. I’ve heard of people getting the best results with n-1 threads so one could still watch the mouse, etc. But I forget whether n was physical or logical in this case.
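
For reference, a quick sketch for comparing the two counts (this assumes 2-way SMT, so the physical-core count is half of Sys.CPU_THREADS):

using LinearAlgebra

for n in (Sys.CPU_THREADS ÷ 2, Sys.CPU_THREADS)  # physical cores, then logical threads
    BLAS.set_num_threads(n)
    @show n LinearAlgebra.peakflops(16000)
end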

Just saw the additional details you both gave; that makes it much clearer.

Good point, I haven’t even looked into overclocking

Oh, I didn’t know that. Is there an easy way to check?

You can check the build:
https://github.com/JuliaLang/julia/blob/master/deps/blas.mk#L25
If you don’t mind compiling Julia from source, you could also change the line for your OS.
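
You can also check from a running session, without reading the Makefiles; for example (output varies by build, and BLAS.vendor() reports :mkl if you’ve swapped in MKL):

julia> using LinearAlgebra

julia> BLAS.vendor()
:openblas64

julia> BLAS.openblas_get_config()
"OpenBLAS 0.3.5  USE64BITINT DYNAMIC_ARCH NO_AFFINITY Zen MAX_THREADS=32"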

There’s a -p option when you start Julia, is that what you mean? I didn’t think I needed this, since I set the environment variable. Am I misunderstanding this?

Ah, yes, that is what I meant. Which environment variable?
Also, docs on Julia master:

help?> LinearAlgebra.peakflops
  LinearAlgebra.peakflops(n::Integer=2000; parallel::Bool=false)

  peakflops computes the peak flop rate of the computer by using double precision gemm!. By default, if no arguments are specified, it multiplies a matrix of size n x n, where n = 2000. If the underlying BLAS is using multiple threads, higher flop rates are realized. The number of BLAS threads can be set with BLAS.set_num_threads(n).

  If the keyword argument parallel is set to true, peakflops is run in parallel on all the worker processors. The flop rate of the entire parallel computer is returned. When running in parallel, only 1 BLAS thread is used. The argument n still refers to the size of the problem that is solved on each processor.

"When running in parallel, only 1 BLAS thread is used. " I imagine the worker processes are the ones you get from -p or addprocs.

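A minimal sketch of that reading, using the Distributed stdlib:

julia> using Distributed, LinearAlgebra

julia> addprocs(4);

julia> @everywhere using LinearAlgebra  # the workers need it loaded too

julia> LinearAlgebra.peakflops(4000; parallel=true)  # sums over the 4 workers, 1 BLAS thread each
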
Good point, I haven’t even looked into overclocking

I just moderately increase the clock speed, boot, launch Julia and try

X = rand(10^4,10^4);
@time inv(X);
@time inv(X);
@time inv(X);
@time foreach(inv, (X for _ in 1:100));

If the overclock is bad, the computer will normally crash the instant you hit “enter” on the first @time inv(X). If it doesn’t crash, watch the temperatures (I use watch -n0.5 sensors) and speed (watch -n1 "cat /proc/cpuinfo | grep MHz") and see how high the temperatures get, and make sure it’s actually running at the specified speed. Sometimes it’ll heat up and crash or throttle.
Make sure it’s at a temperature you’re comfortable with.

If things crash, you can try and increase voltage. You’d have to look up what is safe.
Also, (obviously) better safe than sorry. It’s not worth risking the computer crashing while you’re actually using it.

I also set XMP settings for memory, so that the RAM actually runs at the advertised speeds. I haven’t bothered to overclock beyond that yet, but it’d be a similar process. Maybe throw in some large dot products / vector sums that are definitely memory-bottlenecked; something like the sketch below.
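
A rough sketch (sizes chosen to blow past the caches; assumes BenchmarkTools):

using BenchmarkTools, LinearAlgebra

x = rand(10^8); y = rand(10^8);  # ~800 MB each, far larger than any cache
@btime dot($x, $y)               # streaming reads: memory-bandwidth bound
@btime $x .+= $y                 # streaming read + write vector sum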

Also, my single-thread benchmarks are actually worse now. Stock, the boost speed changes based on the number of cores working. I was lazy and set the same speed for all working cores (inactive cores still run much slower to save power). So I don’t think a single core boosts quite as high.
But when I’m actually doing real work, all the cores are busy, so it’s not something I worry too much about. It’d just be cool to see great times when running single-threaded @benchmarks. :wink:

3 Likes

Found a nice trick to maximize MKL performance on AMD CPUs, based on Agner Fog’s findings: Anaconda on Windows with MKL Support.

Pay attention to the compilation trick: https://github.com/fo40225/Anaconda-Windows-AMD/blob/master/site.cfg. It just incorporates Agner Fog’s CPU dispatch.

It should be interesting to see a comparison with this trick applied.

I’ve got to watch this space over the next year. I’ll want to put together a machine that is superfast at running Julia, that will also serve as a high-powered gaming machine for my kids (it has to be able to run the upcoming Microsoft Flight Simulator 2020 at high frame rates, at at least 4K), as well as allow me to play with GPUs from Julia.

How high is your budget? Does it include the HEDT lineups?

IIRC, you’ve done a lot of low-level / explicit SIMD for text processing?
Sounds like you’d know how much (if at all) you’d benefit from AVX512.
Thanks to strong competition from AMD, the new Cascade Lake-X CPUs (with AVX512) cost half as much as the previous generation. E.g., $1000 for the 18-core part, which, if overclocked, should be able to hit around 2 teraflops with LinearAlgebra.peakflops(16000), like the much older 7980XE. The 10-core part is $600.

If you won’t benefit from AVX512, the AMD parts will probably be better. The Ryzen 3900X is 12 cores for $500, and the 3950X will be 16 for $750. I have no idea about the new Threadrippers. These chips do better than Cascade Lake on most benchmarks of single-core performance relative to clock speed.

Julia is good at generating AVX512 code, but lots of software, even software that could theoretically make good use of it, like OpenBLAS, does not. Based on reviews, I don’t think games benefit. You’d know about your string libraries.

I could run benchmark scripts if it doesn’t take much work to set up.

@Royi, that would be pretty cool if someone could get that to work, especially with the new 7nm parts.

2 Likes

Do you mean the patch for MKL?
Well, one needs to build it from the lib files with that. It works, as you can see in the repository.

Anyone got their hands on a Ryzen 3900X and have some experiences to share?
I’m thinking about getting one, and am pretty much wondering if it’s a good time to get one, or if there is something up-and-coming that is worth holding out for.

The 3900X seems to perform very well on the typical processor-review tests, but I’m unsure if this conclusion extends to linear algebra as well. It has also been quite a few months since it was released, so I’m not sure if it’s about to be superseded soon.

I have a Threadripper 1950X (not really comparable), but I am really happy with it.

As always, the next generation is around the corner. :slight_smile:
AMD is planning (as far as I am informed) to roll out their Zen 3 chips this year, so if you can wait a little bit longer, these will probably be called something like 4…(X).

2 Likes

You haven’t said what you are comparing the AMD 3900X to.

But generally speaking, for multithreaded linear algebra it will be faster than any Intel CPU with fewer cores that doesn’t have AVX512.

It will also be a decent CPU for single-threaded operations, yet Intel has higher-frequency CPUs which will beat it on those tasks (probably in the 3-10% range).

If you use it with MKL, remember to use the Acceleration of Intel MKL on AMD Ryzen CPU’s trick; see the sketch below. Then you’ll get the best BLAS performance out there for a non-AVX512 12-core CPU.
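
For reference, that trick is just an environment variable (the one tested later in this thread), and it has to be set before MKL is loaded; a minimal sketch:

# From the shell, before starting Julia:   MKL_DEBUG_CPU_TYPE=5 julia
# Or, if MKL is loaded lazily in-session (e.g. by dlopen-ing MKL_jll), set it first:
ENV["MKL_DEBUG_CPU_TYPE"] = "5"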

This is highly debatable. Most Intel CPUs lose 1–2 GHz of clock speed when using AVX512. A table here shows some numbers. Note that this is Skylake; I wonder if this happens with the latest 10 nm chips. I have some colleagues who found that their code was faster with AVX2 than with AVX512, precisely because of this.

2 Likes

I have the “non-X” version in a Clevo laptop.

julia> BLAS.set_num_threads(Sys.CPU_THREADS >> 1);

julia> LinearAlgebra.peakflops(16000)
3.841950244780208e11

julia> versioninfo()
Julia Version 1.6.0-DEV.58
Commit cfd7f48330 (2020-05-17 17:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 9 3900 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, znver1)

The system is running a desktop with some other things open, but really a light load. Repeating, the numbers are about 3.6–3.9e11 flops.

2 Likes

Even if it reduces the clock by 1 GHz, it doubles the throughput, assuming memory bandwidth isn’t the limiting factor. So for a 4 GHz CPU, AVX512 gives you the equivalent of 6 GHz of AVX2 (256-bit) performance instead of the potential 8 GHz. It still gives you a nice boost in performance (50% for those numbers); see the sketch below.
The problem with the clock throttling is in code that mixes AVX512 with scalar, single-threaded code. In that case the clock reduction can hurt overall performance to the point where it doesn’t make sense to use AVX512.
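
As a back-of-the-envelope sketch of that arithmetic (illustrative numbers only):

avx2_clock   = 4.0                        # GHz, clock while running 256-bit AVX2
avx512_clock = 3.0                        # GHz, assuming a 1 GHz AVX512 penalty
width_ratio  = 2                          # AVX512 does twice the doubles per instruction
effective    = avx512_clock * width_ratio # 6.0 "effective GHz", vs 4.0 on AVX2
speedup      = effective / avx2_clock - 1 # 0.5, i.e. the 50% boost above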

But for linear algebra, which is usually multithreaded code, believe me, AVX512 is a great feature.

You can ask @Elrod :-). Or just see his great numbers when his code utilizes AVX512 well. Intel MKL on AVX512 CPUs easily beats CPUs without AVX512.

My newer CPU:

julia> using LinearAlgebra

julia> BLAS.set_num_threads(Sys.CPU_THREADS >> 1)

julia> LinearAlgebra.peakflops(16000)
2.1151237886809092e12

julia> LinearAlgebra.peakflops(16000)
2.1014127914913848e12

julia> LinearAlgebra.peakflops(16000)
2.1263672666765479e12

julia> versioninfo(verbose=true)
Julia Version 1.5.0-DEV.872
Commit 1345a043d2* (2020-05-06 02:30 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  uname: Linux 5.6.11-948.native #1 SMP Wed May 6 00:04:43 PDT 2020 x86_64 unknown
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz:
                 speed         user         nice          sys         idle          irq
       #1-36  3981 MHz  165301374 s       3337 s     797186 s  3161664577 s     376496 s

  Memory: 125.55207824707031 GB (98871.29296875 MB free)
  Uptime: 924561.0 sec
  Load Avg:  1.7197265625  0.60107421875  0.2099609375
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

“Intel Xeon Gold 5120” is not “most Intel CPUs”. For example, mine (above) runs 18 cores without AVX at 4.6 GHz, with AVX(2) at 4.3 GHz, and with AVX512 at 4.1 GHz.

I love it. Obviously I put a lot of effort into making use of it (in particular, LoopVectorization), but it pays off.

That is a pretty “badass” CPU. CPUs without the “Gold” adjective don’t have this behaviour. Please keep working on LoopVectorization, it’s awesome!

2 Likes

If you tell me what to run, I’ll be happy to provide numbers from my Ice Lake i5.

Thanks all for sharing :slight_smile: Chris’ CPU is perhaps a bit pricey, but does of course seem extremely competent!
I think I can live without AVX512; many workloads do not benefit from it at all, but do benefit from a high core count, something AMD offers for a very reasonable price while also offering quite competitive single-core performance. The fact that AMD is soon moving to a new socket with some new technologies still makes me wonder if it’s time to hold out for a few more months though :stuck_out_tongue:

Interesting, let me add an older Xeon just for the sake of completeness. This is my desktop:

julia> using LinearAlgebra

julia> BLAS.set_num_threads(Sys.CPU_THREADS >> 1);

julia> LinearAlgebra.peakflops(16000)
2.036555694312491e11

julia> versioninfo(verbose=true)
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
      "Manjaro Linux"
  uname: Linux 4.14.172-1-MANJARO #1 SMP PREEMPT Fri Feb 28 21:28:00 UTC 2020 x86_64 unknown
  CPU: Intel(R) Xeon(R) CPU E3-1230 v6 @ 3.50GHz:
              speed         user         nice          sys         idle          irq
       #1  3773 MHz   14956550 s    3586749 s    5883343 s  572677402 s          0 s
       #2  3774 MHz   16758671 s    3570021 s    5807040 s  571980878 s          0 s
       #3  3860 MHz   19823855 s    3551664 s    5926076 s  569028854 s          0 s
       #4  3746 MHz   24385969 s    3518101 s    5655991 s  565108513 s          0 s
       #5  3844 MHz   19898516 s    3536843 s    5572273 s  569492118 s          0 s
       #6  3779 MHz   16081160 s    3557193 s    5681533 s  572925080 s          0 s
       #7  3811 MHz   16224667 s    3569879 s    5676842 s  572804372 s          0 s
       #8  3803 MHz   14050354 s    3596319 s    5671032 s  575381836 s          0 s

  Memory: 31.276653289794922 GB (262.8671875 MB free)
  Uptime: 6.068533e6 sec
  Load Avg:  1.87060546875  1.5419921875  1.00341796875
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

It seems MKL didn’t make too much difference in my case. I have a 3800X with all cores at 4.25 GHz. The test is done in WSL2.

Version 1.4.1 installed from apt

julia> BLAS.openblas_get_config()
"OpenBLAS 0.3.7  USE64BITINT DYNAMIC_ARCH NO_AFFINITY Zen MAX_THREADS=16"

julia> BLAS.vendor()
:openblas64

julia> BLAS.set_num_threads(Sys.CPU_THREADS >> 1)

julia> LinearAlgebra.peakflops(16000)
3.133443793347056e11

julia> BLAS.set_num_threads(Sys.CPU_THREADS)

julia> LinearAlgebra.peakflops(16000)
4.313120502037118e11

julia> versioninfo(verbose=true)
Julia Version 1.4.1
Platform Info:
  OS: Linux (x86_64-linux-gnu)
      Ubuntu 20.04 LTS
  uname: Linux 4.19.84-microsoft-standard #1 SMP Wed Nov 13 11:44:37 UTC 2019 x86_64 x86_64
  CPU: AMD Ryzen 7 3800X 8-Core Processor:
                 speed         user         nice          sys         idle          irq
       #1-16  4250 MHz     315688 s       9076 s      16522 s    5332624 s          0 s

  Memory: 24.99040985107422 GB (12217.40625 MB free)
  Uptime: 3548.0 sec
  Load Avg:  3.60107421875  3.52783203125  1.93408203125
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 4

Version 1.4.2, official binary.

julia> BLAS.openblas_get_config()
"OpenBLAS 0.3.5  USE64BITINT DYNAMIC_ARCH NO_AFFINITY Zen MAX_THREADS=32"

julia> BLAS.set_num_threads(Sys.CPU_THREADS >> 1)

julia> LinearAlgebra.peakflops(16000)
2.4798958833634494e11

julia> BLAS.set_num_threads(Sys.CPU_THREADS)

julia> LinearAlgebra.peakflops(16000)
3.476535665311297e11

The performance is a lot worse than 1.4.1.

I used MKL.jl to convert this binary to MKL.

Without setting MKL_DEBUG_CPU_TYPE

julia> BLAS.vendor()
:mkl

julia> BLAS.set_num_threads(Sys.CPU_THREADS >> 1)

julia> LinearAlgebra.peakflops(16000)
3.1590494395556305e11

julia> BLAS.set_num_threads(Sys.CPU_THREADS)

julia> LinearAlgebra.peakflops(16000)
3.1748345868355646e11

With MKL_DEBUG_CPU_TYPE=5

julia> BLAS.set_num_threads(Sys.CPU_THREADS >> 1)

julia> LinearAlgebra.peakflops(16000)
3.1601105540297565e11

julia> BLAS.set_num_threads(Sys.CPU_THREADS)

julia> LinearAlgebra.peakflops(16000)
3.206082916145316e11

So 1.4.1 with OpenBLAS works best here.

It looks like the MKL version won’t utilize all threads, even when set_num_threads is set to 16.
(screenshot: CPU utilization with the MKL build, 16 threads set)

With OpenBLAS and 16 threads.
(screenshot: CPU utilization with OpenBLAS, 16 threads)

I’m not sure if anyone else has seen the same thing.

If you just want to benchmark OpenBLAS and MKL, I’d recommend using OpenBLAS_jll and/or MKL_jll, as I do in LoopVectorization’s benchmarks:

using MKL_jll, OpenBLAS_jll, Libdl, LinearAlgebra

const libMKL = Libdl.dlopen(MKL_jll.libmkl_rt)
const DGEMM_MKL = Libdl.dlsym(libMKL, :dgemm)
const SGEMM_MKL = Libdl.dlsym(libMKL, :sgemm)
const DGEMV_MKL = Libdl.dlsym(libMKL, :dgemv)
const MKL_SET_NUM_THREADS = Libdl.dlsym(libMKL, :MKL_Set_Num_Threads)

const libOpenBLAS = Libdl.dlopen(OpenBLAS_jll.libopenblas)
const DGEMM_OpenBLAS = Libdl.dlsym(libOpenBLAS, :dgemm_64_)
const SGEMM_OpenBLAS = Libdl.dlsym(libOpenBLAS, :sgemm_64_)
const DGEMV_OpenBLAS = Libdl.dlsym(libOpenBLAS, :dgemv_64_)
const OPENBLAS_SET_NUM_THREADS = Libdl.dlsym(libOpenBLAS, :openblas_set_num_threads64_)

# Map Julia's lazy Adjoint/Transpose wrappers to BLAS transpose characters.
istransposed(x) = 'N'
istransposed(x::Adjoint{<:Real}) = 'T'
istransposed(x::Adjoint) = 'C'
istransposed(x::Transpose) = 'T'
# Define gemmmkl!/gemmopenblas! for Float32 and Float64, ccall-ing the pointers looked up above.
for (lib,f) ∈ [(:GEMM_MKL,:gemmmkl!), (:GEMM_OpenBLAS,:gemmopenblas!)]
    for (T,prefix) ∈ [(Float32,:S),(Float64,:D)]
        fm = Symbol(prefix, lib)
        @eval begin
            function $f(C::AbstractMatrix{$T}, A::AbstractMatrix{$T}, B::AbstractMatrix{$T})
                transA = istransposed(A)
                transB = istransposed(B)
                M, N = size(C); K = size(B, 1)
                pA = parent(A); pB = parent(B)
                ldA = stride(pA, 2)
                ldB = stride(pB, 2)
                ldC = stride(C, 2)
                α = one($T)
                β = zero($T)
                ccall(
                    $fm, Cvoid,
                    (Ref{UInt8}, Ref{UInt8}, Ref{Int64}, Ref{Int64}, Ref{Int64}, Ref{$T}, Ref{$T},
                     Ref{Int64}, Ref{$T}, Ref{Int64}, Ref{$T}, Ref{$T}, Ref{Int64}),
                    transA, transB, M, N, K, α, pA, ldA, pB, ldB, β, C, ldC
                )
            end
        end
    end
end
mkl_set_num_threads(N::Integer) = ccall(MKL_SET_NUM_THREADS, Cvoid, (Int32,), N % Int32)
openblas_set_num_threads(N::Integer) = ccall(OPENBLAS_SET_NUM_THREADS, Cvoid, (Int64,), N)

# if you want single threaded
mkl_set_num_threads(1) 
openblas_set_num_threads(1)

# if you want multithreaded
mkl_set_num_threads(Sys.CPU_THREADS)
openblas_set_num_threads(Sys.CPU_THREADS ÷ 2)
# Using the number of physical cores is fastest.
# MKL automatically won't use more than that, so specifying `Sys.CPU_THREADS` is safe.
# OpenBLAS requires you to be exact: `Sys.CPU_THREADS ÷ 2` equals the physical core count
# on most machines (2-way SMT), but not all.
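
A minimal usage sketch on top of those wrappers (assumes BenchmarkTools; C is simply overwritten, since β = 0):

using BenchmarkTools
M = K = N = 100;
A = rand(M, K); B = rand(K, N); C = zeros(M, N);
mkl_set_num_threads(1); openblas_set_num_threads(1)
@btime gemmmkl!($C, $A, $B)
@btime gemmopenblas!($C, $A, $B)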

The performance is a lot worse than 1.4.1.

I’d guess that’s because apt uses OpenBLAS 0.3.7, while the official binary has OpenBLAS 0.3.5.
The official binary for Julia 1.5beta ships OpenBLAS 0.3.9, and OpenBLAS_jll should be able to provide this as well.

3 Likes

Thanks a lot for the code, but Libdl, LinearAlgebra should also be added.

I tested your code with two 100x100 matrices. The results:

1-thread MKL vs OpenBLAS: 45.600 μs vs 47.500 μs
8-thread MKL vs OpenBLAS: 18.500 μs vs 24.800 μs
16-thread MKL vs OpenBLAS: 17.600 μs vs 35.500 μs

Single thread is close. MKL scales better than OpenBLAS.

But for a fairly small matrix, say 10x10, OpenBLAS is faster: 447.475 ns (MKL) vs 300 ns (OpenBLAS), 1 thread.

Hmm… for 1000x1000 matrices:

1-thread MKL vs OpenBLAS: 34.320 ms vs 33.369 ms
8-thread MKL vs OpenBLAS: 7.374 ms vs 9.142 ms

OpenBLAS is faster in single-thread again, but MKL is still ahead in multithread.

Anyway… I’m not doing intensive matrix operations at this moment, so I guess I can live with LinearAlgebra for now. Really appreciate your reply!

1 Like