OpenBLAS is faster than Intel MKL on AMD Hardware (Ryzen)

Questions about MKL vs OpenBLAS come up a lot, for example in comparisons with Matlab (linked to MKL), and a lot of users have issues building with MKL, eg here. Of course, one can easily download an MKL binary with JuliaPro, but then you may have to face down an army of dependency conflicts.

My point here is to compare MKL and OpenBLAS with an AMD processor (Ryzen Threadripper 1950x).
Lots of performance comparisons are already out there, but I figured I would add one that doesn't involve an Intel chip.

Brief summary of results: MKL is generally faster at small sizes, OpenBLAS at large sizes.
All in all, I'd call OpenBLAS the clear winner (given this hardware) because

  • overall run time is easily dominated by a few operations on large matrices.
  • if one is concerned about performance of operations on small matrices it is easy to turn to native Julia. For example, StaticArrays.jl performs well on very small matrices, and IterativeSolvers.jl is extremely effective for small-medium sized matrices. This leaves large matrices as the last domain where Julia really benefits from outside assistance.

Results are probably different with an Intel processor, eg probably this.
Maybe I should've made pretty graphs, but they'd be harder to share on Discourse.

Operating System and Julia Install

I just wiped my hard drive and installed a fresh (x)ubuntu 16.04 last night, for the sake of upgrading ROCm. Ubuntu 16.04 comes with gcc 5.4, which ROCm requires but which doesn't support -march=znver1, so I also built gcc version 7.2 (adding -7.2 as a suffix) to get the latest gcc, g++, and gfortran. (Phoronix's benchmarks suggest gcc 7.2 often produces faster code, but not on the level of Julia 0.7 vs 0.6 :wink:, and that Zen optimizations don't help yet. As an aside, if that's what it looks like when Julia's devs are trying to break things instead of optimizing…)
My Ubuntu install did not come with gfortran, so building gcc 7.2 also provided that dependency.

I just used sudo apt-get install for the remaining dependencies. All I recall installing was build-essential (not sure if this included anything relevant), m4, and pkg-config.

Building with OpenBLAS
You can safely skip most of this. I built the development version of OpenBLAS from source, cloned Julia, and created the following Make.user:

USE_SYSTEM_BLAS=1
USE_SYSTEM_LAPACK=1
LIBBLAS=-lopenblas
LIBBLASNAME=libopenblas
LIBLAPACK=-lopenblas
LIBLAPACKNAME=libopenblas

Then it was simply: make CC=gcc-7.2 CXX=g++-7.2 FC=gfortran-7.2 -j32.
If you're going to do multiple Julia installs with OpenBLAS, not rebuilding it each time may be worth the trouble. Otherwise, you can skip the Make.user and everything will happen automatically. The end result:

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40 (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
  WORD_SIZE: 64
  BLAS: libopenblas (ZEN)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, generic)

and

julia> versioninfo()
Julia Version 0.7.0-DEV.3199
Commit 1238bad (2017-12-27 20:40 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
  WORD_SIZE: 64
  BLAS: libopenblas (ZEN)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, znver1)
Environment:

Building with MKL

I installed the 2018 update 1 versions of MKL (gratis) and Parallel Studio XE (gratis for students, and I believe also open source contributors and classroom educators) in /opt/intel. I remember needing Parallel Studio to satisfy some dependency (I believe it was libimf) months ago, so I just went ahead and downloaded it today rather than checking whether things would work without it.

First, I followed Julia's GitHub README, running
source /opt/intel/bin/compilervars.sh intel64
and creating

USEICC = 1
USEIFC = 1
USE_INTEL_MKL = 1
USE_INTEL_LIBM = 1

but eventually ran into a linking issue like the one described here. Mike Kinghan provided a great answer there, but I'd rather move on and stick with gcc. Therefore, the new Make.user:

USE_INTEL_MKL = 1
USE_INTEL_LIBM = 1

and simply running make CC=gcc-7.2 CXX=g++-7.2 FC=gfortran-7.2 -j32 results in:

julia> versioninfo()
Julia Version 0.7.0-DEV.3199
Commit 1238bad (2017-12-27 20:40 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
  WORD_SIZE: 64
  BLAS: libmkl_rt
  LAPACK: libmkl_rt
  LIBM: libimf
  LLVM: libLLVM-3.9.1 (ORCJIT, znver1)
Environment:

Matrix Sizes

I stuck to NxN square matrices, trying
N = (8,64,512,4096)
to get an idea of how the libraries perform over a broad range.

GEMM

Here is what I ran:

create_g(T, N) = randn(T,N,N)
a = create_g.(Float32, N);
A = create_g.(Float64, N);
b = create_g.(Float32, N);
B = create_g.(Float64, N);
c = similar.(a);
C = similar.(A);
using BenchmarkTools
function bench_f!(C, A, B, f, N, info = "")
    for i in 1:length(N)
        println("Size: $(N[i])" * info)
        @show @benchmark $f($(C[i]), $(A[i]), $(B[i]))
    end
end
bench_f!(C, A, B, A_mul_B!, N, ", Julia-Dev:")
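
The StaticArrays numbers reported below were not produced by bench_f!; something along these lines gives the 8x8 comparison (a sketch, not the exact code I ran):

using StaticArrays, BenchmarkTools

# 8x8 only; StaticArrays isn't meant for the larger sizes here.
# SMatrix multiplication returns a new SMatrix, so no preallocated output is needed.
sA = SMatrix{8,8}(randn(8, 8));
sB = SMatrix{8,8}(randn(8, 8));
@benchmark $sA * $sB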

Below are the results using the same commit of Julia-master (v0.7.0, Commit 1238bad (2017-12-27 20:40 UTC) ), with the only difference being MKL vs OpenBLAS. Whether I ran MKL or OpenBLAS first varied, but if the benchmark caused my CPU fans to turn up, I would wait until they died down before running the next set of benchmarks.

Julia v0.6.2 with OpenBLAS unsurprisingly performed similarly to v0.7.0-dev + OpenBLAS, but did not print nicely with the above function so I excluded it for space.

The outputs:
8x8, Float64

Size: 8, MKL:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     336.964 ns (0.00% GC)
  median time:      344.312 ns (0.00% GC)
  mean time:        349.959 ns (0.00% GC)
  maximum time:     627.195 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     221
Size: 8, OpenBLAS: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     375.439 ns (0.00% GC)
  median time:      379.302 ns (0.00% GC)
  mean time:        381.454 ns (0.00% GC)
  maximum time:     695.751 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     205
StaticArrays:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     136.728 ns (0.00% GC)
  median time:      136.855 ns (0.00% GC)
  mean time:        137.184 ns (0.00% GC)
  maximum time:     189.052 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     867

64x64, Float64

Size: 64, MKL: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.682 Ī¼s (0.00% GC)
  median time:      13.185 Ī¼s (0.00% GC)
  mean time:        13.672 Ī¼s (0.00% GC)
  maximum time:     362.510 Ī¼s (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
Size: 64, OpenBLAS: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     22.823 Ī¼s (0.00% GC)
  median time:      23.584 Ī¼s (0.00% GC)
  mean time:        23.769 Ī¼s (0.00% GC)
  maximum time:     108.314 Ī¼s (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

512x512, Float64

Size: 512, MKL: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.907 ms (0.00% GC)
  median time:      1.984 ms (0.00% GC)
  mean time:        2.051 ms (0.00% GC)
  maximum time:     3.536 ms (0.00% GC)
  --------------
  samples:          2436
  evals/sample:     1
Size: 512, OpenBLAS:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.167 ms (0.00% GC)
  median time:      1.199 ms (0.00% GC)
  mean time:        1.267 ms (0.00% GC)
  maximum time:     2.381 ms (0.00% GC)
  --------------
  samples:          3940
  evals/sample:     1

4096x4096, Float64

Size: 4096, MKL: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     884.305 ms (0.00% GC)
  median time:      887.235 ms (0.00% GC)
  mean time:        888.018 ms (0.00% GC)
  maximum time:     896.666 ms (0.00% GC)
  --------------
  samples:          6
  evals/sample:     1
Size: 4096, OpenBLAS: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     384.890 ms (0.00% GC)
  median time:      390.259 ms (0.00% GC)
  mean time:        398.521 ms (0.00% GC)
  maximum time:     448.346 ms (0.00% GC)
  --------------
  samples:          13
  evals/sample:     1

MKL was ahead at 64x64, but starting at 512x512 OpenBLAS pulled far ahead.
Note that if you're just multiplying 64x64 and 4096x4096 matrices, you'd have to multiply around 50,000 times more of the former for MKL to be faster.
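
That figure comes straight from the minimum Float64 times above (my arithmetic):

# seconds per multiply, from the minimum times reported above
mkl_64, openblas_64 = 11.682e-6, 22.823e-6
mkl_4096, openblas_4096 = 884.305e-3, 384.890e-3
# extra cost of one large MKL multiply / savings per small MKL multiply
(mkl_4096 - openblas_4096) / (openblas_64 - mkl_64)   # ≈ 4.5e4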

Single precision results were comparable.
8x8, Float32

Size: 8, MKL:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     238.978 ns (0.00% GC)
  median time:      246.381 ns (0.00% GC)
  mean time:        247.609 ns (0.00% GC)
  maximum time:     480.341 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     417
Size: 8, OpenBLAS: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     331.460 ns (0.00% GC)
  median time:      333.518 ns (0.00% GC)
  mean time:        334.891 ns (0.00% GC)
  maximum time:     433.795 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     224
StaticArrays
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     116.201 ns (0.00% GC)
  median time:      116.912 ns (0.00% GC)
  mean time:        117.159 ns (0.00% GC)
  maximum time:     159.055 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     916

64x64, Float32

Size: 64, MKL: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.674 Ī¼s (0.00% GC)
  median time:      9.317 Ī¼s (0.00% GC)
  mean time:        9.510 Ī¼s (0.00% GC)
  maximum time:     241.198 Ī¼s (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
Size: 64, OpenBLAS: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.571 Ī¼s (0.00% GC)
  median time:      11.722 Ī¼s (0.00% GC)
  mean time:        11.839 Ī¼s (0.00% GC)
  maximum time:     35.355 Ī¼s (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

512x512, Float32

Size: 512, MKL: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     681.205 Ī¼s (0.00% GC)
  median time:      752.947 Ī¼s (0.00% GC)
  mean time:        895.173 Ī¼s (0.00% GC)
  maximum time:     3.101 ms (0.00% GC)
  --------------
  samples:          5577
  evals/sample:     1
Size: 512, OpenBLAS: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     604.630 Ī¼s (0.00% GC)
  median time:      619.408 Ī¼s (0.00% GC)
  mean time:        645.878 Ī¼s (0.00% GC)
  maximum time:     1.211 ms (0.00% GC)
  --------------
  samples:          7729
  evals/sample:     1

4096x4096, Float32

Size: 4096, MKL: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     336.803 ms (0.00% GC)
  median time:      348.420 ms (0.00% GC)
  mean time:        355.731 ms (0.00% GC)
  maximum time:     417.870 ms (0.00% GC)
  --------------
  samples:          15
  evals/sample:     1
Size: 4096, OpenBLAS: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     183.598 ms (0.00% GC)
  median time:      188.000 ms (0.00% GC)
  mean time:        192.251 ms (0.00% GC)
  maximum time:     227.223 ms (0.00% GC)
  --------------
  samples:          27
  evals/sample:     1

Triangular x General Matrix Multiplication
I figured that rather than calling gemm!, trmm!, etc. directly, I'd stick with the higher-level A_mul_B! interface.

tA = UpperTriangular.(A);
bench_f!(C, tA, B, A_mul_B!, N, ", OpenBLAS:")
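
(For reference, the direct BLAS call this ends up dispatching to for the dense Float64 case is roughly the following; a sketch for illustration, not something I benchmarked separately:)

# Julia 0.6 names; side='L', uplo='U', no transpose, non-unit diagonal.
i = 2                          # e.g. the 64x64 case
copy!(C[i], B[i])
Base.LinAlg.BLAS.trmm!('L', 'U', 'N', 'N', 1.0, A[i], C[i])   # C[i] = UpperTriangular(A[i]) * C[i]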

8x8, Triangle

Size: 8, MKL: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     405.515 ns (0.00% GC)
  median time:      410.370 ns (0.00% GC)
  mean time:        411.963 ns (0.00% GC)
  maximum time:     753.265 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     200
Size: 8, OpenBLAS: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.330 Ī¼s (0.00% GC)
  median time:      5.140 Ī¼s (0.00% GC)
  mean time:        5.269 Ī¼s (0.00% GC)
  maximum time:     32.291 Ī¼s (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7
StaticArrays
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     64.979 ns (0.00% GC)
  median time:      66.874 ns (0.00% GC)
  mean time:        67.034 ns (0.00% GC)
  maximum time:     113.781 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     978

64x64, Triangle

Size: 64, MKL: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.454 Ī¼s (0.00% GC)
  median time:      13.696 Ī¼s (0.00% GC)
  mean time:        13.870 Ī¼s (0.00% GC)
  maximum time:     238.548 Ī¼s (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
Size: 64, OpenBLAS: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.198 Ī¼s (0.00% GC)
  median time:      23.905 Ī¼s (0.00% GC)
  mean time:        24.484 Ī¼s (0.00% GC)
  maximum time:     216.596 Ī¼s (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

512x512, Triangle

Size: 512, MKL: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.367 ms (0.00% GC)
  median time:      1.394 ms (0.00% GC)
  mean time:        1.422 ms (0.00% GC)
  maximum time:     2.153 ms (0.00% GC)
  --------------
  samples:          3513
  evals/sample:     1
Size: 512, OpenBLAS:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     819.315 Ī¼s (0.00% GC)
  median time:      1.274 ms (0.00% GC)
  mean time:        1.179 ms (0.00% GC)
  maximum time:     1.715 ms (0.00% GC)
  --------------
  samples:          4233
  evals/sample:     1

4096x4096, Triangle

Size: 4096, MKL: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     538.547 ms (0.00% GC)
  median time:      574.797 ms (0.00% GC)
  mean time:        569.898 ms (0.00% GC)
  maximum time:     594.784 ms (0.00% GC)
  --------------
  samples:          9
  evals/sample:     1
Size: 4096, OpenBLAS: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     211.023 ms (0.00% GC)
  median time:      212.099 ms (0.00% GC)
  mean time:        216.946 ms (0.00% GC)
  maximum time:     249.050 ms (0.00% GC)
  --------------
  samples:          24
  evals/sample:     1

Again, MKL dominates at small sizes, but loses out as the matrices grow.

Symmetric x General Matrix Multiplication

sA = Symmetric.(A);
bench_f!(C, sA, B, A_mul_B!, N, ", OpenBLAS:")

8x8, Symmetric

Size: 8, MKL: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     396.866 ns (0.00% GC)
  median time:      401.801 ns (0.00% GC)
  mean time:        403.922 ns (0.00% GC)
  maximum time:     626.055 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     201
Size: 8, OpenBLAS: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.547 Ī¼s (0.00% GC)
  median time:      5.046 Ī¼s (0.00% GC)
  mean time:        5.104 Ī¼s (0.00% GC)
  maximum time:     12.286 Ī¼s (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9
StaticArrays
  memory estimate:  848 bytes
  allocs estimate:  5
  --------------
  minimum time:     603.801 ns (0.00% GC)
  median time:      619.858 ns (0.00% GC)
  mean time:        694.029 ns (9.39% GC)
  maximum time:     190.751 Ī¼s (99.59% GC)
  --------------
  samples:          10000
  evals/sample:     176

64x64, Symmetric

Size: 64, MKL: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.740 Ī¼s (0.00% GC)
  median time:      12.153 Ī¼s (0.00% GC)
  mean time:        12.441 Ī¼s (0.00% GC)
  maximum time:     240.722 Ī¼s (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
Size: 64, OpenBLAS: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     29.916 Ī¼s (0.00% GC)
  median time:      34.174 Ī¼s (0.00% GC)
  mean time:        34.917 Ī¼s (0.00% GC)
  maximum time:     238.238 Ī¼s (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

512x512, Symmetric

Size: 512, MKL: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.921 ms (0.00% GC)
  median time:      1.998 ms (0.00% GC)
  mean time:        2.148 ms (0.00% GC)
  maximum time:     4.300 ms (0.00% GC)
  --------------
  samples:          2324
  evals/sample:     1
Size: 512, OpenBLAS:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.263 ms (0.00% GC)
  median time:      1.306 ms (0.00% GC)
  mean time:        1.393 ms (0.00% GC)
  maximum time:     2.338 ms (0.00% GC)
  --------------
  samples:          3584
  evals/sample:     1

4096x4096, Symmetric

Size: 4096, MKL: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     920.034 ms (0.00% GC)
  median time:      952.772 ms (0.00% GC)
  mean time:        959.352 ms (0.00% GC)
  maximum time:     1.024 s (0.00% GC)
  --------------
  samples:          6
  evals/sample:     1
Size: 4096, OpenBLAS: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     380.653 ms (0.00% GC)
  median time:      382.731 ms (0.00% GC)
  mean time:        395.965 ms (0.00% GC)
  maximum time:     444.567 ms (0.00% GC)
  --------------
  samples:          13
  evals/sample:     1

I split matrix factorization results into a separate comment because of character limits.

19 Likes

Cholesky
Moving on to LAPACK, let's create positive definite matrices.

gen_pd(n) = randn(2n, n) |> x -> Symmetric(x' * x)
D = gen_pd.(N);
function bench_f(A, f, N, info = "")
    for i in 1:length(N)
        println("Size: $(N[i])" * info)
        Aᵢ = A[i]
        @show @benchmark $f(B) setup=( B = copy($Aᵢ) ) evals=1
    end
end
bench_f(D, cholfact!, N, ", OpenBLAS:")

Also, because StaticArrays performs poorly here, I wanted to add a different Julia function for comparison:

function julia_chol!(U::AbstractArray{<:Real,2})
    p = size(U,1)
    @inbounds for i ∈ 1:p
        Uᵢᵢ = U[i,i]
        for j ∈ 1:i-1
            Uⱼᵢ = U[j,i]
            for k ∈ 1:j-1
                Uⱼᵢ -= U[k,i] * U[k,j]
            end
            Uⱼᵢ /= U[j,j]
            U[j,i] = Uⱼᵢ
            Uᵢᵢ -= abs2(Uⱼᵢ)
        end
        U[i,i] = √Uᵢᵢ
    end
    U
end

This is a naive algorithm and scales very poorly, but it is still faster for 8x8 matrices and in the same ballpark for 64x64.
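
A quick sanity check that it computes the same factor (not part of the benchmarks; Julia 0.6 syntax):

# The upper triangle written by julia_chol! should match cholfact's U factor.
U_ref   = cholfact(D[1])[:U]
U_naive = julia_chol!(full(D[1]))    # full() gives a plain dense copy of the 8x8 Symmetric matrix
UpperTriangular(U_naive) ≈ U_ref     # should be true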

8x8, Cholfact

Size: 8, MKL:
  memory estimate:  64 bytes
  allocs estimate:  2
  --------------
  minimum time:     210.000 ns (0.00% GC)
  median time:      240.000 ns (0.00% GC)
  mean time:        280.039 ns (0.00% GC)
  maximum time:     8.055 Ī¼s (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
Size: 8, OpenBLAS: 
  memory estimate:  64 bytes
  allocs estimate:  2
  --------------
  minimum time:     561.000 ns (0.00% GC)
  median time:      571.000 ns (0.00% GC)
  mean time:        605.324 ns (0.00% GC)
  maximum time:     43.101 Ī¼s (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
Size: 8, julia_chol!
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     170.000 ns (0.00% GC)
  median time:      171.000 ns (0.00% GC)
  mean time:        176.252 ns (0.00% GC)
  maximum time:     4.769 Ī¼s (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
StaticArrays
  memory estimate:  14.58 KiB
  allocs estimate:  112
  --------------
  minimum time:     64.371 Ī¼s (0.00% GC)
  median time:      66.635 Ī¼s (0.00% GC)
  mean time:        72.561 Ī¼s (6.66% GC)
  maximum time:     34.846 ms (99.43% GC)
  --------------
  samples:          10000
  evals/sample:     1

64x64, Cholfact

Size: 64, MKL: 
  memory estimate:  64 bytes
  allocs estimate:  2
  --------------
  minimum time:     15.560 Ī¼s (0.00% GC)
  median time:      15.930 Ī¼s (0.00% GC)
  mean time:        16.068 Ī¼s (0.00% GC)
  maximum time:     54.824 Ī¼s (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
Size: 64, OpenBLAS: 
  memory estimate:  64 bytes
  allocs estimate:  2
  --------------
  minimum time:     19.497 Ī¼s (0.00% GC)
  median time:      24.216 Ī¼s (0.00% GC)
  mean time:        28.747 Ī¼s (0.00% GC)
  maximum time:     28.050 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
Size: 64, julia_chol!
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     28.744 Ī¼s (0.00% GC)
  median time:      29.185 Ī¼s (0.00% GC)
  mean time:        29.262 Ī¼s (0.00% GC)
  maximum time:     74.320 Ī¼s (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

512x512, Cholfact

Size: 512, MKL: 
  memory estimate:  64 bytes
  allocs estimate:  2
  --------------
  minimum time:     788.878 Ī¼s (0.00% GC)
  median time:      842.179 Ī¼s (0.00% GC)
  mean time:        858.057 Ī¼s (0.00% GC)
  maximum time:     4.506 ms (0.00% GC)
  --------------
  samples:          3553
  evals/sample:     1
Size: 512, OpenBLAS:
  memory estimate:  64 bytes
  allocs estimate:  2
  --------------
  minimum time:     1.048 ms (0.00% GC)
  median time:      1.098 ms (0.00% GC)
  mean time:        1.114 ms (0.00% GC)
  maximum time:     1.702 ms (0.00% GC)
  --------------
  samples:          3097
  evals/sample:     1
Size: 512, julia_chol!
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.201 ms (0.00% GC)
  median time:      16.402 ms (0.00% GC)
  mean time:        16.456 ms (0.00% GC)
  maximum time:     17.533 ms (0.00% GC)
  --------------
  samples:          298
  evals/sample:     1

4096x4096, Cholfact

Size: 4096, MKL: 
  memory estimate:  64 bytes
  allocs estimate:  2
  --------------
  minimum time:     174.107 ms (0.00% GC)
  median time:      176.916 ms (0.00% GC)
  mean time:        177.837 ms (0.00% GC)
  maximum time:     190.099 ms (0.00% GC)
  --------------
  samples:          24
  evals/sample:     1
Size: 4096, OpenBLAS: 
  memory estimate:  64 bytes
  allocs estimate:  2
  --------------
  minimum time:     99.876 ms (0.00% GC)
  median time:      101.212 ms (0.00% GC)
  mean time:        102.345 ms (0.00% GC)
  maximum time:     107.673 ms (0.00% GC)
  --------------
  samples:          39
  evals/sample:     1
Size: 4096, julia_chol!
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.152 s (0.00% GC)
  median time:      12.152 s (0.00% GC)
  mean time:        12.152 s (0.00% GC)
  maximum time:     12.152 s (0.00% GC)
  --------------
  samples:          1
  evals/sample:     1

Eigendecomposition of Positive Definite Matrices

bench_f(D, eigfact!, N, ", OpenBLAS:")

8x8, eigfact of positive definite matrix

Size: 8, MKL:
  memory estimate:  4.38 KiB
  allocs estimate:  11
  --------------
  minimum time:     12.203 Ī¼s (0.00% GC)
  median time:      12.964 Ī¼s (0.00% GC)
  mean time:        13.451 Ī¼s (1.92% GC)
  maximum time:     1.319 ms (98.45% GC)
  --------------
  samples:          10000
  evals/sample:     1
Size: 8, OpenBLAS: 
  memory estimate:  4.41 KiB
  allocs estimate:  11
  --------------
  minimum time:     15.339 Ī¼s (0.00% GC)
  median time:      17.763 Ī¼s (0.00% GC)
  mean time:        17.936 Ī¼s (1.45% GC)
  maximum time:     1.345 ms (97.59% GC)
  --------------
  samples:          10000
  evals/sample:     1
StaticArrays
  memory estimate:  5.59 KiB
  allocs estimate:  12
  --------------
  minimum time:     15.660 Ī¼s (0.00% GC)
  median time:      18.745 Ī¼s (0.00% GC)
  mean time:        23.590 Ī¼s (20.61% GC)
  maximum time:     36.955 ms (99.85% GC)
  --------------
  samples:          10000
  evals/sample:     1

64x64, eigfact of positive definite matrix

Size: 64, MKL: 
  memory estimate:  84.89 KiB
  allocs estimate:  13
  --------------
  minimum time:     312.257 Ī¼s (0.00% GC)
  median time:      321.615 Ī¼s (0.00% GC)
  mean time:        327.857 Ī¼s (1.37% GC)
  maximum time:     1.618 ms (74.54% GC)
  --------------
  samples:          10000
  evals/sample:     1
Size: 64, OpenBLAS: 
  memory estimate:  85.36 KiB
  allocs estimate:  13
  --------------
  minimum time:     693.786 Ī¼s (0.00% GC)
  median time:      736.937 Ī¼s (0.00% GC)
  mean time:        740.982 Ī¼s (0.46% GC)
  maximum time:     1.668 ms (53.11% GC)
  --------------
  samples:          6677
  evals/sample:     1

512x512, eigfact of positive definite matrix

Size: 512, MKL: 
  memory estimate:  4.16 MiB
  allocs estimate:  13
  --------------
  minimum time:     22.577 ms (0.00% GC)
  median time:      23.149 ms (0.00% GC)
  mean time:        23.257 ms (0.71% GC)
  maximum time:     26.396 ms (3.49% GC)
  --------------
  samples:          212
  evals/sample:     1
Size: 512, OpenBLAS:
  memory estimate:  4.16 MiB
  allocs estimate:  13
  --------------
  minimum time:     44.988 ms (0.00% GC)
  median time:      45.718 ms (0.00% GC)
  mean time:        45.967 ms (0.27% GC)
  maximum time:     50.037 ms (1.41% GC)
  --------------
  samples:          108
  evals/sample:     1

4096x4096, eigfact of positive definite matrix

Size: 4096, MKL: 
  memory estimate:  257.47 MiB
  allocs estimate:  16
  --------------
  minimum time:     6.548 s (0.14% GC)
  median time:      6.548 s (0.14% GC)
  mean time:        6.548 s (0.14% GC)
  maximum time:     6.548 s (0.14% GC)
  --------------
  samples:          1
  evals/sample:     1
Size: 4096, OpenBLAS: 
  memory estimate:  257.28 MiB
  allocs estimate:  16
  --------------
  minimum time:     8.973 s (0.10% GC)
  median time:      8.973 s (0.10% GC)
  mean time:        8.973 s (0.10% GC)
  maximum time:     8.973 s (0.10% GC)
  --------------
  samples:          1
  evals/sample:     1

Here, MKL maintained its edge. I recall reading that some of the LAPACK functions from OpenBLAS are reference, rather than optimized, implementations.

Eigendecomposition of general square matrices

bench_f(A, eigfact!, N, ", OpenBLAS:")

8x8, eigfact of general square matrix

Size: 8, MKL:
  memory estimate:  3.64 KiB
  allocs estimate:  19
  --------------
  minimum time:     15.269 Ī¼s (0.00% GC)
  median time:      15.960 Ī¼s (0.00% GC)
  mean time:        16.593 Ī¼s (2.06% GC)
  maximum time:     1.744 ms (97.80% GC)
  --------------
  samples:          10000
  evals/sample:     1
Size: 8, OpenBLAS: 
  memory estimate:  11.41 KiB
  allocs estimate:  17
  --------------
  minimum time:     15.499 Ī¼s (0.00% GC)
  median time:      16.311 Ī¼s (0.00% GC)
  mean time:        17.403 Ī¼s (3.45% GC)
  maximum time:     1.243 ms (98.43% GC)
  --------------
  samples:          10000
  evals/sample:     1

64x64, eigfact of general square matrix

Size: 64, MKL: 
  memory estimate:  117.66 KiB
  allocs estimate:  23
  --------------
  minimum time:     1.754 ms (0.00% GC)
  median time:      1.787 ms (0.00% GC)
  mean time:        1.796 ms (0.39% GC)
  maximum time:     2.998 ms (39.95% GC)
  --------------
  samples:          2774
  evals/sample:     1
Size: 64, OpenBLAS: 
  memory estimate:  166.25 KiB
  allocs estimate:  25
  --------------
  minimum time:     1.250 ms (0.00% GC)
  median time:      1.265 ms (0.00% GC)
  mean time:        1.281 ms (0.57% GC)
  maximum time:     2.253 ms (38.01% GC)
  --------------
  samples:          3887
  evals/sample:     1

512x512, eigfact of general square matrix

Size: 512, MKL: 
  memory estimate:  6.16 MiB
  allocs estimate:  37
  --------------
  minimum time:     288.296 ms (0.00% GC)
  median time:      292.630 ms (0.00% GC)
  mean time:        300.290 ms (0.10% GC)
  maximum time:     361.709 ms (0.00% GC)
  --------------
  samples:          17
  evals/sample:     1
Size: 512, OpenBLAS:
  memory estimate:  6.54 MiB
  allocs estimate:  31
  --------------
  minimum time:     286.010 ms (0.00% GC)
  median time:      288.243 ms (0.00% GC)
  mean time:        289.080 ms (0.10% GC)
  maximum time:     297.906 ms (0.00% GC)
  --------------
  samples:          18
  evals/sample:     1

4096x4096, eigfact of general square matrix

Size: 4096, MKL: 
  memory estimate:  385.25 MiB
  allocs estimate:  81
  --------------
  minimum time:     43.321 s (0.02% GC)
  median time:      43.321 s (0.02% GC)
  mean time:        43.321 s (0.02% GC)
  maximum time:     43.321 s (0.02% GC)
  --------------
  samples:          1
  evals/sample:     1
Size: 4096, OpenBLAS: 
  memory estimate:  388.28 MiB
  allocs estimate:  69
  --------------
  minimum time:     35.017 s (0.03% GC)
  median time:      35.017 s (0.03% GC)
  mean time:        35.017 s (0.03% GC)
  maximum time:     35.017 s (0.03% GC)
  --------------
  samples:          1
  evals/sample:     1

OpenBLAS ties or wins for general eigen-decomposition at all sizes.

QR Decomposition

bench_f(A, qrfact!, N, ", OpenBLAS:")

8x8, qrfact

Size: 8, MKL:
  memory estimate:  1.28 KiB
  allocs estimate:  4
  --------------
  minimum time:     13.195 Ī¼s (0.00% GC)
  median time:      13.405 Ī¼s (0.00% GC)
  mean time:        13.669 Ī¼s (0.00% GC)
  maximum time:     60.815 Ī¼s (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
Size: 8, OpenBLAS: 
  memory estimate:  1.28 KiB
  allocs estimate:  4
  --------------
  minimum time:     16.861 Ī¼s (0.00% GC)
  median time:      16.982 Ī¼s (0.00% GC)
  mean time:        17.279 Ī¼s (0.00% GC)
  maximum time:     56.927 Ī¼s (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
StaticArrays
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     310.129 ns (0.00% GC)
  median time:      314.905 ns (0.00% GC)
  mean time:        315.720 ns (0.00% GC)
  maximum time:     503.892 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     241

64x64, qrfact

Size: 64, MKL: 
  memory estimate:  36.22 KiB
  allocs estimate:  6
  --------------
  minimum time:     663.038 Ī¼s (0.00% GC)
  median time:      697.192 Ī¼s (0.00% GC)
  mean time:        706.418 Ī¼s (0.48% GC)
  maximum time:     1.987 ms (62.28% GC)
  --------------
  samples:          6973
  evals/sample:     1
Size: 64, OpenBLAS: 
  memory estimate:  36.22 KiB
  allocs estimate:  6
  --------------
  minimum time:     452.482 Ī¼s (0.00% GC)
  median time:      494.570 Ī¼s (0.00% GC)
  mean time:        517.678 Ī¼s (0.48% GC)
  maximum time:     12.591 ms (0.00% GC)
  --------------
  samples:          9450
  evals/sample:     1

512x512, qrfact

Size: 512, MKL: 
  memory estimate:  288.22 KiB
  allocs estimate:  6
  --------------
  minimum time:     13.664 ms (0.00% GC)
  median time:      14.642 ms (0.00% GC)
  mean time:        14.648 ms (0.00% GC)
  maximum time:     19.407 ms (0.00% GC)
  --------------
  samples:          328
  evals/sample:     1
Size: 512, OpenBLAS:
  memory estimate:  288.22 KiB
  allocs estimate:  6
  --------------
  minimum time:     12.646 ms (0.00% GC)
  median time:      12.867 ms (0.00% GC)
  mean time:        13.025 ms (0.00% GC)
  maximum time:     18.334 ms (0.00% GC)
  --------------
  samples:          368
  evals/sample:     1

4096x4096, qrfact

Size: 4096, MKL: 
  memory estimate:  2.25 MiB
  allocs estimate:  6
  --------------
  minimum time:     2.596 s (0.00% GC)
  median time:      2.679 s (0.00% GC)
  mean time:        2.679 s (0.00% GC)
  maximum time:     2.763 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1
Size: 4096, OpenBLAS: 
  memory estimate:  2.25 MiB
  allocs estimate:  6
  --------------
  minimum time:     2.061 s (0.00% GC)
  median time:      2.093 s (0.00% GC)
  mean time:        2.101 s (0.00% GC)
  maximum time:     2.150 s (0.00% GC)
  --------------
  samples:          3
  evals/sample:     1
10 Likes

Nice investigative work. One thing I'd mention though:

IterativeSolvers is essentially just sparse matrix multiplication, so it would be nice to test those. Also, its domain is not small-to-medium matrices. Iterative solvers are generally used on large sparse matrices, since decomposition methods don't scale well at all (at some point they don't even fit in memory, since the decompositions are essentially dense). Methods for extremely large matrices are the same iterative solvers, just set up with a parallel implementation of the matrix multiplication, which in Julia could in theory be handled by the array type, so IterativeSolvers.jl would still be the thing to use there (though it might need to cut some allocations).
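
Something along these lines would exercise it (a quick sketch off the top of my head, 0.6-era code like the rest of the thread; the size and density are arbitrary):

using IterativeSolvers

n = 100_000
A = sprandn(n, n, 5 / n)
A = A'A + 10I          # symmetric positive definite, still sparse
b = randn(n)
x = cg(A, b)           # conjugate gradients: only needs A*x products, i.e. sparse matvec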

5 Likes

Shoot, turns out I didn't thoroughly test IterativeSolvers.jl and was instead thinking of the unregistered LinearAlgebra.jl. Here is the gist from a few months back I was thinking of, where the Julia implementation was faster at symmetric eigendecomposition up to 133x133 or 35x35 matrices, depending on the computer.
I'll see if I can get around to comparing sparse matrix solvers tomorrow.

LinearAlgebra hasn't been updated for 0.7 yet, so I'm running it on 0.6.2.

Here, Julia was faster until roughly 133x133. MKL beats it earlier, although this was the one test where MKL maintains its edge vs OpenBLAS. Definitely worth pointing out that Julia is single threaded, while LAPACK reaches nearly 1600% CPU. Perhaps I should recompare after specifying single threaded BLAS/LAPACK, but I don't feel like doing that today.

julia> function gen_pos_def(p::Int)
  X = randn( div(3p, 2), p)
  S = X' * X
  S, Hermitian(S, :L)
end
julia> S20, H20 = gen_pos_def(20);
julia> S35, H35 = gen_pos_def(33);
julia> S40, H40 = gen_pos_def(40);
julia> S100, H100 = gen_pos_def(100);
julia> S133, H133 = gen_pos_def(133);
julia> S150, H150 = gen_pos_def(150);
julia> S300, H300 = gen_pos_def(300);
julia> S1000, H1000 = gen_pos_def(1000);
julia> S4096, H4096 = gen_pos_def(4096);

julia> Julia_eig!(A) = LinearAlgebra.EigenSelfAdjoint.eig!(A)
julia> LAPACK_eig!(A) = Base.LinAlg.LAPACK.syevr!('V', 'A', 'L', A, 1.0, 1.0, 1, 1, eps())

julia> @benchmark Julia_eig!(A) setup=(A=copy($H20)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  8.28 KiB
  allocs estimate:  22
  --------------
  minimum time:     29.215 Ī¼s (0.00% GC)
  median time:      29.997 Ī¼s (0.00% GC)
  mean time:        30.739 Ī¼s (1.93% GC)
  maximum time:     1.568 ms (94.42% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark LAPACK_eig!(A) setup=(A=copy($S20)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  13.72 KiB
  allocs estimate:  15
  --------------
  minimum time:     78.166 Ī¼s (0.00% GC)
  median time:      82.736 Ī¼s (0.00% GC)
  mean time:        85.267 Ī¼s (1.30% GC)
  maximum time:     3.574 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> """@benchmark MKL_LAPACK_eig!(A) setup=(A=copy($S20)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  13.34 KiB
  allocs estimate:  10
  --------------
  minimum time:     62.658 Ī¼s (0.00% GC)
  median time:      67.227 Ī¼s (0.00% GC)
  mean time:        69.026 Ī¼s (1.91% GC)
  maximum time:     2.042 ms (95.54% GC)
  --------------
  samples:          10000
  evals/sample:     1"""

julia> @benchmark Julia_eig!(A) setup=(A=copy($H35)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  20.09 KiB
  allocs estimate:  22
  --------------
  minimum time:     99.698 Ī¼s (0.00% GC)
  median time:      101.220 Ī¼s (0.00% GC)
  mean time:        103.004 Ī¼s (1.27% GC)
  maximum time:     1.465 ms (89.95% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark LAPACK_eig!(A) setup=(A=copy($S35)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  29.02 KiB
  allocs estimate:  15
  --------------
  minimum time:     192.532 Ī¼s (0.00% GC)
  median time:      204.549 Ī¼s (0.00% GC)
  mean time:        207.510 Ī¼s (1.02% GC)
  maximum time:     1.606 ms (81.95% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> """@benchmark MKL_LAPACK_eig!(A) setup=(A=copy($S35)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  28.48 KiB
  allocs estimate:  10
  --------------
  minimum time:     118.562 Ī¼s (0.00% GC)
  median time:      122.721 Ī¼s (0.00% GC)
  mean time:        125.620 Ī¼s (1.84% GC)
  maximum time:     1.639 ms (88.61% GC)
  --------------
  samples:          10000
  evals/sample:     1"""

julia> @benchmark Julia_eig!(A) setup=(A=copy($H40)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  28.25 KiB
  allocs estimate:  22
  --------------
  minimum time:     161.694 Ī¼s (0.00% GC)
  median time:      163.878 Ī¼s (0.00% GC)
  mean time:        166.130 Ī¼s (0.99% GC)
  maximum time:     1.442 ms (85.49% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark LAPACK_eig!(A) setup=(A=copy($S40)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  39.09 KiB
  allocs estimate:  15
  --------------
  minimum time:     268.634 Ī¼s (0.00% GC)
  median time:      285.606 Ī¼s (0.00% GC)
  mean time:        289.395 Ī¼s (0.94% GC)
  maximum time:     1.567 ms (76.52% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> """@benchmark MKL_LAPACK_eig!(A) setup=(A=copy($S40)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  38.44 KiB
  allocs estimate:  10
  --------------
  minimum time:     155.362 Ī¼s (0.00% GC)
  median time:      160.562 Ī¼s (0.00% GC)
  mean time:        164.453 Ī¼s (1.77% GC)
  maximum time:     1.553 ms (87.63% GC)
  --------------
  samples:          10000
  evals/sample:     1"""

julia> @benchmark Julia_eig!(A) setup=(A=copy($H100)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  162.03 KiB
  allocs estimate:  24
  --------------
  minimum time:     1.824 ms (0.00% GC)
  median time:      1.845 ms (0.00% GC)
  mean time:        1.860 ms (0.37% GC)
  maximum time:     2.948 ms (35.73% GC)
  --------------
  samples:          2668
  evals/sample:     1

julia> @benchmark LAPACK_eig!(A) setup=(A=copy($S100)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  189.31 KiB
  allocs estimate:  18
  --------------
  minimum time:     2.005 ms (0.00% GC)
  median time:      2.109 ms (0.00% GC)
  mean time:        2.120 ms (0.43% GC)
  maximum time:     3.279 ms (32.28% GC)
  --------------
  samples:          2337
  evals/sample:     1

julia> """@benchmark MKL_LAPACK_eig!(A) setup=(A=copy($S100)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  188.39 KiB
  allocs estimate:  12
  --------------
  minimum time:     935.148 Ī¼s (0.00% GC)
  median time:      952.491 Ī¼s (0.00% GC)
  mean time:        969.590 Ī¼s (1.35% GC)
  maximum time:     2.176 ms (54.05% GC)
  --------------
  samples:          5049
  evals/sample:     1"""

julia> @benchmark Julia_eig!(A) setup=(A=copy($H133)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  283.88 KiB
  allocs estimate:  24
  --------------
  minimum time:     4.118 ms (0.00% GC)
  median time:      4.171 ms (0.00% GC)
  mean time:        4.210 ms (0.32% GC)
  maximum time:     5.487 ms (18.29% GC)
  --------------
  samples:          1181
  evals/sample:     1

julia> @benchmark LAPACK_eig!(A) setup=(A=copy($S133)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  320.17 KiB
  allocs estimate:  18
  --------------
  minimum time:     4.384 ms (0.00% GC)
  median time:      4.550 ms (0.00% GC)
  mean time:        4.577 ms (0.45% GC)
  maximum time:     6.368 ms (0.00% GC)
  --------------
  samples:          1085
  evals/sample:     1

julia> """@benchmark MKL_LAPACK_eig!(A) setup=(A=copy($S133)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  319.03 KiB
  allocs estimate:  12
  --------------
  minimum time:     1.757 ms (0.00% GC)
  median time:      1.796 ms (0.00% GC)
  mean time:        1.823 ms (1.19% GC)
  maximum time:     3.023 ms (35.37% GC)
  --------------
  samples:          2699
  evals/sample:     1"""

julia> @benchmark Julia_eig!(A) setup=(A=copy($H150)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  360.13 KiB
  allocs estimate:  24
  --------------
  minimum time:     5.808 ms (0.00% GC)
  median time:      5.958 ms (0.00% GC)
  mean time:        6.023 ms (0.38% GC)
  maximum time:     7.441 ms (13.45% GC)
  --------------
  samples:          827
  evals/sample:     1

julia> @benchmark LAPACK_eig!(A) setup=(A=copy($S150)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  400.86 KiB
  allocs estimate:  18
  --------------
  minimum time:     3.929 ms (0.00% GC)
  median time:      4.032 ms (0.00% GC)
  mean time:        4.083 ms (0.64% GC)
  maximum time:     8.619 ms (0.00% GC)
  --------------
  samples:          1213
  evals/sample:     1

julia> """@benchmark MKL_LAPACK_eig!(A) setup=(A=copy($S150)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  399.53 KiB
  allocs estimate:  12
  --------------
  minimum time:     2.172 ms (0.00% GC)
  median time:      2.215 ms (0.00% GC)
  mean time:        2.262 ms (1.18% GC)
  maximum time:     3.846 ms (0.00% GC)
  --------------
  samples:          2167
  evals/sample:     1"""

julia> @benchmark Julia_eig!(A) setup=(A=copy($H300)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  1.39 MiB
  allocs estimate:  24
  --------------
  minimum time:     42.135 ms (0.00% GC)
  median time:      43.972 ms (0.00% GC)
  mean time:        44.055 ms (0.14% GC)
  maximum time:     53.301 ms (0.00% GC)
  --------------
  samples:          114
  evals/sample:     1

julia> @benchmark LAPACK_eig!(A) setup=(A=copy($S300)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  1.47 MiB
  allocs estimate:  18
  --------------
  minimum time:     12.786 ms (0.00% GC)
  median time:      13.168 ms (0.00% GC)
  mean time:        13.246 ms (0.48% GC)
  maximum time:     14.703 ms (6.60% GC)
  --------------
  samples:          374
  evals/sample:     1

julia> """@benchmark MKL_LAPACK_eig!(A) setup=(A=copy($S300)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  1.47 MiB
  allocs estimate:  12
  --------------
  minimum time:     8.234 ms (0.00% GC)
  median time:      8.379 ms (0.00% GC)
  mean time:        8.441 ms (0.75% GC)
  maximum time:     10.117 ms (10.13% GC)
  --------------
  samples:          583
  evals/sample:     1"""

julia> @benchmark Julia_eig!(A) setup=(A=copy($H1000)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  15.31 MiB
  allocs estimate:  24
  --------------
  minimum time:     1.521 s (0.06% GC)
  median time:      1.529 s (0.06% GC)
  mean time:        1.530 s (0.06% GC)
  maximum time:     1.539 s (0.06% GC)
  --------------
  samples:          4
  evals/sample:     1

julia> @benchmark LAPACK_eig!(A) setup=(A=copy($S1000)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  15.57 MiB
  allocs estimate:  20
  --------------
  minimum time:     140.302 ms (0.70% GC)
  median time:      141.813 ms (0.67% GC)
  mean time:        141.873 ms (0.67% GC)
  maximum time:     143.045 ms (0.66% GC)
  --------------
  samples:          36
  evals/sample:     1

julia> """@benchmark MKL_LAPACK_eig!(A) setup=(A=copy($S1000)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  15.56 MiB
  allocs estimate:  12
  --------------
  minimum time:     127.924 ms (0.80% GC)
  median time:      129.967 ms (0.76% GC)
  mean time:        129.972 ms (0.81% GC)
  maximum time:     133.471 ms (0.90% GC)
  --------------
  samples:          39
  evals/sample:     1"""

julia> @benchmark Julia_eig!(A) setup=(A=copy($H4096)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  256.19 MiB
  allocs estimate:  30
  --------------
  minimum time:     149.840 s (0.03% GC)
  median time:      149.840 s (0.03% GC)
  mean time:        149.840 s (0.03% GC)
  maximum time:     149.840 s (0.03% GC)
  --------------
  samples:          1
  evals/sample:     1

julia> @benchmark LAPACK_eig!(A) setup=(A=copy($S4096)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  257.28 MiB
  allocs estimate:  23
  --------------
  minimum time:     8.732 s (0.59% GC)
  median time:      8.732 s (0.59% GC)
  mean time:        8.732 s (0.59% GC)
  maximum time:     8.732 s (0.59% GC)
  --------------
  samples:          1
  evals/sample:     1

julia> """@benchmark MKL_LAPACK_eig!(A) setup=(A=copy($S4096)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  257.47 MiB
  allocs estimate:  15
  --------------
  minimum time:     6.602 s (0.02% GC)
  median time:      6.602 s (0.02% GC)
  mean time:        6.602 s (0.02% GC)
  maximum time:     6.602 s (0.02% GC)
  --------------
  samples:          1
  evals/sample:     1"""

EDIT:
Apparently the LAPACK libraries do not make good use of multithreading (I've struggled to write multithreaded code that actually gets faster, even when it seems obvious how it should!), and even single-threaded they trounce Julia at large sizes. However, single-threaded, OpenBLAS is more than 50% faster than MKL at eigendecomposition of 4096x4096 positive definite matrices. Using all the cores speeds OpenBLAS up >2x, and MKL >5x, giving MKL a roughly 33% edge.
With 32 threads, MKL took 6.585 s and OpenBLAS 7.672 s, so OpenBLAS's syevr! does benefit from hyperthreading (on this platform, with this size).

OpenBLAS:

julia> BLAS.set_num_threads(1)
julia> @benchmark LAPACK_eig!(A) setup=(A=copy($S4096)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  257.28 MiB
  allocs estimate:  23
  --------------
  minimum time:     19.594 s (0.26% GC)
  median time:      19.594 s (0.26% GC)
  mean time:        19.594 s (0.26% GC)
  maximum time:     19.594 s (0.26% GC)
  --------------
  samples:          1
  evals/sample:     1

MKL:

julia> BLAS.set_num_threads(1)

julia> @benchmark LAPACK_eig!(A) setup=(A=copy($S4096)) evals=1
BenchmarkTools.Trial: 
  memory estimate:  257.47 MiB
  allocs estimate:  15
  --------------
  minimum time:     34.245 s (0.00% GC)
  median time:      34.245 s (0.00% GC)
  mean time:        34.245 s (0.00% GC)
  maximum time:     34.245 s (0.00% GC)
  --------------
  samples:          1
  evals/sample:     1
2 Likes

Thanks for doing this. It would be great if you could make a few graphs to summarize the results. AFAICT Discourse allows uploading SVG files.

5 Likes

Very interesting.

It is known that Intel MKL and IPP discriminate against non-Intel CPUs.
The interesting question, in my opinion, is whether AMD Ryzen + OpenBLAS can compete against Intel + Intel MKL for a similar budget.

On much open-source code (namely code which doesn't discriminate, at least not as a policy), it seems AMD can compete pretty well with Intel, and in multithreaded scenarios usually even beat it.

If someone could arrange a set of AMD vs. Intel benchmarks on Julia, it would be interesting and would hopefully make AMD an alternative for a scientific-programming workstation.

1 Like

My knowledge is pretty slanted toward Linux / things benchmarked by Phoronix.

You can look at these benchmarks:

The Ryzen 7 1700 and the i7 7700K or i7 7740 are in the same ballpark.
The Ryzen 7 1800X and the i7 8700K are in about the same price range, as are
the Ryzen 1950X and the i9 7900X.
[I think the AMD chips above are a little cheaper than the Intel chips I'm comparing them with.]

8 vs 6 cores, and 16 vs 10.
Compiling with -j16 or even -j32 is wonderful, and if you're a big fan of Threads.@threads (or things like OpenMP) it is great.
Most of the benchmarks and random things I do are single-threaded, but more serious things, where time matters beyond just a game, are more likely to be parallel or threaded.

Would be interesting to see how AMD with OpenBLAS does vs Intel with MKL, but there are a lot of confounding variables, like OS. As can be seen here, where it is just the ThreadRipper vs the i9 7900X on three different Linux distributions:

As in all his benchmarks, Clear Linux dominates more often than not. This was even the case for a set of tests when Clear Linux was one of the only distributions to have already included Spectre/Meltdown mitigations.
And this is true on both AMD and Intel hardware.
Which is cool because Clear Linux, like a lot of the fastest software, is made and maintained by Intel.
I guess being open source means it doesn't make sense to add goofy things (like dispatching on cpuid rather than on the instructions a processor actually supports) to make it fast only on Intel.

I haven't tried Clear Linux myself, because it seems aimed at headless / cloud-type work rather than home desktop use.

As a word of warning though, Stan (Markov chain Monte Carlo software) warns that if you build it with Intel's compilers, you need to add a couple of extra compiler flags; otherwise the accuracy icc sacrifices in the name of speed will actually stop a few of their algorithms from working.

I should get around to more benchmarks and graphics. Unfortunately, I just wiped that hard drive today (and the associated MKL/Julia install) because I screwed things (graphics drivers) up to the point where it wouldn't boot. @_@

EDIT:
Trying to rebuild Julia, and Julia really wants me to use some ancient commit of OpenBLAS that predates and does not support Zen.
Trying to find out how to override that, while still letting Julia take care of the BLAS build instead of doing it manually like in the opening post. =/

EDIT:
I tried editing

./deps/openblas.version
./deps/blas.mk

but gave up, and just built OpenBLAS via git clone and make, copy-and-pasting most of the flags from blas.mk (not necessary, as in my opening post, but they all sounded like good ideas). Zero issues.
Then, reusing the same Make.user as earlier (but now with the addition of USE_BLAS64=1, and statically linking LLVM because of Mesa OpenGL issues), Julia again builds without a problem.

Building BLAS and Julia separately == everything just works.
Trying to let Julia build BLAS for me == frustration and massive headaches.
The ability to just download a precompiled binary == reason why I shouldn't complain lol.

1 Like

Hi,

I know the "Big Picture" regarding AMD Ryzen vs Intel Core.
I was specifically looking for a comparison for scientific computing / programming.
Being even more specific: basic operations when AMD Ryzen is using OpenBLAS and Intel Core is using Intel MKL.

If you can show something like that, it would be great.

I would even say it is a shame that big sites like Anandtech don't have scientific programming (Python, MATLAB, Julia) as part of their test suites, especially since some of it is free and pretty easy to do.

2 Likes

There are a lot of scientific computing benchmarks here:

[On Newegg] the Intel processor costs $1980 and the AMD $900, under half as much.

It has Numpy, Scikit-Learn, Matrix multiplication, GMP, FFTW, glibc (eg, tanh, sqrt), and lots of other things doing some sort of numerical computing like "MrBayes" primate phylogeny.

But no OpenBLAS with AMD vs MKL with Intel. I agree, it would be cool.
Of course, with most applications (eg, Julia) the majority of software will be the same between them.

Well, for Numpy I'd guess it either uses Intel MKL, in which case it is obvious why the Intel is much better, or it uses OpenBLAS, in which case it means we have a problem with Ryzen :-).

I would like someone to write an ensemble of micro-benchmarks using Python / Julia / MATLAB comparing those systems.

There are probably some ready-made benchmarks out there.
We just need access to those systems.

1 Like

Ryzen is terrible because the bandwidth between cores is so low. Look at SGEMM on Threadripper! Terrible! And impossible to buy.

The interesting question, in my opinion, is whether AMD Ryzen + OpenBLAS can compete against Intel + Intel MKL for similar budget.

Just tested on a 7900X, which costs about as much as the Ryzen Threadripper 1950X and was released at around the same time.
The major differences are 10 cores / 20 threads vs 16 / 32, and two 512-bit FMA units vs two 128-bit units per core.
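
Putting rough numbers on that (my arithmetic; an FMA counts as 2 flops per element, double precision):

skylake_x = 2 * 8 * 2            # two 512-bit FMAs * 8 doubles * 2 = 32 flops/core/cycle
zen1      = 2 * 2 * 2            # two 128-bit units * 2 doubles * 2 = 8 flops/core/cycle
(10 * skylake_x) / (16 * zen1)   # ≈ 2.5x peak advantage for the 7900X, before clock speeds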

Using MKL:

julia> A = randn(4096,4096);

julia> C = randn(4096,4096);

julia> B = randn(4096,4096);

julia> using LinearAlgebra,  BenchmarkTools

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     174.715 ms (0.00% GC)
  median time:      176.507 ms (0.00% GC)
  mean time:        176.510 ms (0.00% GC)
  maximum time:     178.155 ms (0.00% GC)
  --------------
  samples:          29
  evals/sample:     1

Over twice as fast as the Threadripper! At the high-$ end, Intel and its AVX-512 advantage win hands down for matrix multiplication.
And this was with 4096 x 4096 matrices. It was over three times faster with 64x64.

In trying to optimize 8x8 by 8x8 matrix multiplication on the two processors, my best times were about 48 ns vs 11.6 ns for Ryzen vs avx-512.

Maybe I should build Julia with OpenBLAS on it to see how that compares. I'd expect the Intel chip to be faster at all sizes there too, but it could be that at the small end Ryzen + MKL might be faster than Intel + OpenBLAS.

In my original post, I should have paid more attention to how many threads each library was using. I realized OpenBLAS gets faster with BLAS.set_num_threads(1) for some sizes; I think it turns on multithreading too soon.
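
(Checking that is quick; something like this, with the size and thread counts picked arbitrarily:)

using LinearAlgebra, BenchmarkTools

A = randn(64, 64); B = randn(64, 64); C = similar(A);
BLAS.set_num_threads(1);  @btime mul!($C, $A, $B);   # single-threaded
BLAS.set_num_threads(16); @btime mul!($C, $A, $B);   # all physical cores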

EDIT:
Does Julia + OpenBLAS launch with fewer threads now? I thought I had quite the regression, but BLAS.set_num_threads(16) fixed it. Threadripper:

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     369.510 ms (0.00% GC)
  median time:      370.037 ms (0.00% GC)
  mean time:        389.715 ms (0.00% GC)
  maximum time:     622.606 ms (0.00% GC)
  --------------
  samples:          13
  evals/sample:     1
1 Like

Just curious. Given the same budget, which combination would be more performant?

MKL + Intel vs. OpenBLAS + AMD

That might change with the new 7nm Ryzen, but

julia> using LinearAlgebra

julia> BLAS.set_num_threads(Sys.CPU_THREADS >> 1)

julia> LinearAlgebra.peakflops(16000)
1.9732209060275942e12

julia> LinearAlgebra.peakflops(16000)
1.9726505538599666e12

julia> versioninfo()
Julia Version 1.4.0-DEV.0
Commit 2ef0ed159d* (2019-08-17 17:21 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.0 (ORCJIT, skylake)

is going to be well ahead of a Ryzen 2990WX. Similarly, the 7900X (10 core, with avx512) was much faster for BLAS/LAPACK than the Ryzen 1950X (16 core, half-rate avx2).

The new 7 nm Ryzen has full-rate AVX2. I would bet on it over the Intel chips without AVX-512.
As for the HEDT parts with AVX-512, I'll wait to see the new Threadrippers. If AMD offers twice the cores per $, it'll be hard to beat them there (and then AMD would be much faster for all non-AVX workloads).

For the top-of-the-line server parts, I believe the 64-core EPYC costs less than the 28-core Xeon. So while they might be close for extremely vectorizable code that can leverage AVX-512, like matrix multiplication…

1 Like

Yep that's pretty impressive. Here's my 2950X:

julia> LinearAlgebra.peakflops(16000)
2.2912593368633377e11

julia> versioninfo()
Julia Version 1.3.0-rc1.0
Commit 768b25f6a8 (2019-08-18 00:04 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen Threadripper 2950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 32

Oddly enough, parallel=true actually makes this slower, and setting BLAS.set_num_threads(32) only gives about a 50% speedup

1 Like

By default, I thought OpenBLAS was only built with support for 16 threads.

I also think you need to addprocs or start Julia with -p $(nproc) to use parallel, since it disables BLAS multithreading. That is, with parallel=true, each worker process multiplies its own matrices. Still, I also saw a roughly 2x hit on performance from that.
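
That is, something like this (a sketch; the worker count is arbitrary):

using Distributed, LinearAlgebra

addprocs(16)                         # or start Julia with -p 16
@everywhere using LinearAlgebra
@everywhere BLAS.set_num_threads(1)  # one BLAS thread per worker to avoid oversubscription
LinearAlgebra.peakflops(16000, parallel = true)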

Was 2.2912593368633377e11 with or without the 50% increase?

The 7980XE having 8.6 times the FLOPS at around twice the (new) price is a hefty advantage.
They'd both be much cheaper now if anyone is willing to buy them used (I did, off eBay).

EDIT: I am definitely tempted to try and increase overclocks to breach 2 teraflops…

@cscherrer

I'm still trying to learn more in this area, but I think you'll get better performance if you set the number of threads available to Julia to 16 rather than 32. I think you only have 16 physical cores available, with 32 logical threads. If all your threads are being heavily used until completion of the task, it's my understanding that it's better to set the physical core count rather than the logical thread count.

Still trying to fully wrap my head around all that though, so please correct me if I'm not thinking of things correctly.

1 Like

That has been my experience with BLAS/LAPACK (and MCMC simulations making single-threaded calls) as well: performance is best when using the number of physical cores.
For reference, I used BLAS.set_num_threads(18) (I just edited my earlier post to reflect that).
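
If you'd rather not hard-code the count, something like this works (assuming the Hwloc.jl package and its num_physical_cores helper):

using LinearAlgebra, Hwloc

# Sys.CPU_THREADS counts logical threads (32 on a 2950X); Hwloc reports physical cores (16).
BLAS.set_num_threads(Hwloc.num_physical_cores())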

2 Likes

Yes, my (potentially flawed) understanding is that with hyper-threading, you can have one physical core "pretend" to be two threads, so that it swaps back and forth between two tasks. If both tasks involve latency for some reason, then it's like you have twice as many threads; while it's waiting on task 1, it hops over and does work on task 2. However, if neither task has any latency, it actually slows things down by hopping back and forth. Since there's not much latency in most computationally intense mathematical algorithms, you're better off not using the hyper-threading capabilities during the algorithm.

Mostly putting this out there so someone can tell me if I'm confused on the topic :grinning:

My understanding is that it is supposed to help CPUs take advantage of out of order execution.
According to this blog post, recent Intel CPUs can issue 4 instructions total per cycle, and AMD CPUs can issue 5.

But there are a lot of other limitations. The 7980XE is able to execute only up to two 512-bit fused multiply-add (FMA) instructions per clock cycle, while on the 2950X only 1 of the 5 instructions can be a 256-bit FMA.

So if you're running math-heavy code, it is really easy for a single thread, especially one doing matrix multiplication, to max out the floating-point units.

So, no performance benefit from two logical threads per physical core.

Those two logical threads are also going to share some resources, like the L1 cache, meaning they'll likely compete for cache space and inadvertently end up spending more time idle, waiting for memory.


Also, I should point out that I did overclock the 7980XE already. It is currently set to a 3.7 GHz base frequency for AVX-512. According to Silicon Lottery's historical binning statistics, 89% of 7980XEs could achieve that, 72% can do 3.8 GHz, and the top 35% could achieve 3.9 GHz.
I started at the bottom, so I could see if my CPU is part of the 72%. Failing that, I'm sure I could play with something else, like memory or cache speeds, to get that extra 1.5% to reach 2 TFLOPS of double precision.
3.7 GHz should mean that the theoretical peak is:
18 cores * 3.7*10^9 cycles/core/second * 2 fma / cycle * 16 floating point ops / 512-bit DP fma

julia> 18 * 3.7 * 10^9 * 2 * 16
2.1312000000000002e12

MKL is getting impressively close.
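
Concretely (my arithmetic):

1.9732209060275942e12 / 2.1312e12   # ≈ 0.926, i.e. roughly 93% of the theoretical peak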

For a fairer comparison, the 2950X ought to be overclocked similarly. Silicon Lottery doesn't have statistics on the 2950X, but the top 83% of 1950Xs could hit 3.85 GHz.

2 Likes