Questions about MKL vs. OpenBLAS come up a lot, for example in comparisons with MATLAB (which links against MKL), and a lot of users have trouble building Julia with MKL, e.g. here. Of course, one can easily get an MKL binary by downloading JuliaPro, but then you may have to face down an army of dependency conflicts.
My aim here is to compare MKL and OpenBLAS on an AMD processor (Ryzen Threadripper 1950X).
Lots of performance comparisons are already out there, but I figured I would add one that doesn't use an Intel chip.
Brief summary of results: MKL is generally faster at small sizes, OpenBLAS at large sizes.
All in all, I'd call OpenBLAS the clear winner (given this hardware) because
- overall run time is easily dominated by a few operations on large matrices.
- if one is concerned about the performance of operations on small matrices, it is easy to turn to native Julia. For example, StaticArrays.jl performs well on very small matrices, and IterativeSolvers.jl is extremely effective for small-to-medium sized matrices (see the sketch after this list). This leaves large matrices as the last domain where Julia really benefits from outside assistance.
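To illustrate that second point, here is a minimal sketch of solving a medium-sized positive-definite system with IterativeSolvers.jl instead of a LAPACK factorization. This was not benchmarked in this post, and the problem setup (G, A, b, x) is purely illustrative:
using IterativeSolvers
# build a symmetric positive-definite test problem
G = randn(512, 512)
A = G'G + 512I
b = randn(512)
x = cg(A, b)    # conjugate gradients: only matrix-vector products, no factorization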
Results are probably different with an Intel processor, e.g. probably more like this.
Maybe I should've made pretty graphs, but they'd be harder to share on Discourse.
Operating System and Julia Install
I just wiped my hard drive and installed a fresh (x)ubuntu 16.04 last night, for the sake of upgrading ROCm. Ubuntu 16.04 comes with gcc 5.4, which ROCm requires but which doesn't support -march=znver1, so I also built gcc version 7.2 (adding -7.2 as a suffix) to get the latest gcc, g++, and gfortran. (Phoronix's benchmarks suggest gcc 7.2 often produces faster code, though not on the level of Julia 0.7 vs 0.6, and that the Zen-specific optimizations don't help yet. As an aside, if that's what it looks like when Julia's devs are trying to break things instead of optimize…)
My Ubuntu install did not come with gfortran, so building gcc 7.2 also provided that dependency. I just used sudo apt-get install for the remaining dependencies; all I recall installing was build-essential (not sure if it included anything relevant), m4, and pkg-config.
Building with OpenBLAS
You can safely skip most of this. I built the development version of OpenBLAS from source, cloned Julia, and created the following Make.user:
USE_SYSTEM_BLAS=1
USE_SYSTEM_LAPACK=1
LIBBLAS=-lopenblas
LIBBLASNAME=libopenblas
LIBLAPACK=-lopenblas
LIBLAPACKNAME=libopenblas
Then it was simply a matter of running make CC=gcc-7.2 CXX=g++-7.2 FC=gfortran-7.2 -j32.
If you're going to build Julia multiple times, reusing a prebuilt OpenBLAS this way may be worth it. Otherwise, you can skip the Make.user entirely and the build system will take care of OpenBLAS automatically. The end result:
julia> versioninfo()
Julia Version 0.6.2
Commit d386e40 (2017-12-13 18:08 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
WORD_SIZE: 64
BLAS: libopenblas (ZEN)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, generic)
and
julia> versioninfo()
Julia Version 0.7.0-DEV.3199
Commit 1238bad (2017-12-27 20:40 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
WORD_SIZE: 64
BLAS: libopenblas (ZEN)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, znver1)
Environment:
Building with MKL
I installed the 2018 update 1 versions of MKL (gratis) and Parallel Studio XE (gratis for students, and I believe also for open source contributors and classroom educators) in /opt/intel. I remember needing Parallel Studio months ago to satisfy some dependency (I believe it was libimf), so I just went ahead and downloaded it again today rather than testing whether things would work without it.
First, I followed Julia's GitHub README, running
source /opt/intel/bin/compilervars.sh intel64
and creating a Make.user containing
USEICC = 1
USEIFC = 1
USE_INTEL_MKL = 1
USE_INTEL_LIBM = 1
but eventually ran into a linking issue like the one described here. Mike Kinghan provided a great answer there, but I'd rather move on and stick with gcc. Therefore, the new Make.user:
USE_INTEL_MKL = 1
USE_INTEL_LIBM = 1
and simply running make CC=gcc-7.2 CXX=g++-7.2 FC=gfortran-7.2 -j32 results in:
julia> versioninfo()
Julia Version 0.7.0-DEV.3199
Commit 1238bad (2017-12-27 20:40 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
WORD_SIZE: 64
BLAS: libmkl_rt
LAPACK: libmkl_rt
LIBM: libimf
LLVM: libLLVM-3.9.1 (ORCJIT, znver1)
Environment:
Matrix Sizes
I stuck to NxN square matrices, trying N = (8, 64, 512, 4096) to get an idea of how the libraries perform over a broad range of sizes.
GEMM
Here is what I ran:
N = (8, 64, 512, 4096)            # square matrix sizes to benchmark
create_g(T, N) = randn(T, N, N)
a = create_g.(Float32, N);        # tuple of Float32 test matrices, one per size
A = create_g.(Float64, N);        # tuple of Float64 test matrices
b = create_g.(Float32, N);
B = create_g.(Float64, N);
c = similar.(a);                  # preallocated outputs
C = similar.(A);
using BenchmarkTools
function bench_f!(C, A, B, f, N, info = "")
    for i in 1:length(N)
        println("Size: $(N[i])" * info)
        @show @benchmark $f($(C[i]), $(A[i]), $(B[i]))
    end
end
bench_f!(C, A, B, A_mul_B!, N, ", Julia-Dev:")
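The single-precision results further down were presumably produced the same way with the Float32 set; a sketch of the corresponding call (with the label adjusted per build):
bench_f!(c, a, b, A_mul_B!, N, ", MKL:")   # Float32 version of the same benchmark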
Below are the results using the same commit of Julia master (v0.7.0-DEV.3199, commit 1238bad, 2017-12-27 20:40 UTC), with the only difference being MKL vs. OpenBLAS. Whether I ran MKL or OpenBLAS first varied, but if a benchmark caused my CPU fans to spin up, I waited until they died down before running the next set of benchmarks.
Julia v0.6.2 with OpenBLAS unsurprisingly performed similarly to v0.7.0-dev + OpenBLAS, but did not print nicely with the above function so I excluded it for space.
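The rows labeled "StaticArrays" in the 8x8 GEMM results come from multiplying stack-allocated SMatrix versions of the matrices; the exact code isn't reproduced here, but it would have looked roughly like this:
using StaticArrays, BenchmarkTools
sA = SMatrix{8,8}(randn(8, 8))    # stack-allocated 8x8 Float64 matrices
sB = SMatrix{8,8}(randn(8, 8))
@benchmark $sA * $sB              # returns a new SMatrix; no heap allocation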
The outputs:
8x8, Float64
Size: 8, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 336.964 ns (0.00% GC)
median time: 344.312 ns (0.00% GC)
mean time: 349.959 ns (0.00% GC)
maximum time: 627.195 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 221
Size: 8, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 375.439 ns (0.00% GC)
median time: 379.302 ns (0.00% GC)
mean time: 381.454 ns (0.00% GC)
maximum time: 695.751 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 205
StaticArrays:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 136.728 ns (0.00% GC)
median time: 136.855 ns (0.00% GC)
mean time: 137.184 ns (0.00% GC)
maximum time: 189.052 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 867
64x64, Float64
Size: 64, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 11.682 μs (0.00% GC)
median time: 13.185 μs (0.00% GC)
mean time: 13.672 μs (0.00% GC)
maximum time: 362.510 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Size: 64, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 22.823 μs (0.00% GC)
median time: 23.584 μs (0.00% GC)
mean time: 23.769 μs (0.00% GC)
maximum time: 108.314 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
512x512, Float64
Size: 512, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.907 ms (0.00% GC)
median time: 1.984 ms (0.00% GC)
mean time: 2.051 ms (0.00% GC)
maximum time: 3.536 ms (0.00% GC)
--------------
samples: 2436
evals/sample: 1
Size: 512, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.167 ms (0.00% GC)
median time: 1.199 ms (0.00% GC)
mean time: 1.267 ms (0.00% GC)
maximum time: 2.381 ms (0.00% GC)
--------------
samples: 3940
evals/sample: 1
4096x4096, Float64
Size: 4096, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 884.305 ms (0.00% GC)
median time: 887.235 ms (0.00% GC)
mean time: 888.018 ms (0.00% GC)
maximum time: 896.666 ms (0.00% GC)
--------------
samples: 6
evals/sample: 1
Size: 4096, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 384.890 ms (0.00% GC)
median time: 390.259 ms (0.00% GC)
mean time: 398.521 ms (0.00% GC)
maximum time: 448.346 ms (0.00% GC)
--------------
samples: 13
evals/sample: 1
MKL was ahead at 64x64, but from 512x512 onward OpenBLAS pulled far ahead.
Note that if you're only multiplying 64x64 and 4096x4096 matrices, you'd have to perform roughly 50,000 of the former for every one of the latter before MKL comes out ahead overall.
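As a rough check on that figure, using the minimum times reported above:
# MKL's gain per small multiply vs. its loss per large one (minimum times, Float64)
saved_64  = 22.823e-6 - 11.682e-6   # seconds saved per 64x64 GEMM with MKL
lost_4096 = 0.884305  - 0.384890    # seconds lost per 4096x4096 GEMM with MKL
lost_4096 / saved_64                # ≈ 4.5e4 small multiplies to break even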
Single precision results were comparable.
8x8, Float32
Size: 8, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 238.978 ns (0.00% GC)
median time: 246.381 ns (0.00% GC)
mean time: 247.609 ns (0.00% GC)
maximum time: 480.341 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 417
Size: 8, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 331.460 ns (0.00% GC)
median time: 333.518 ns (0.00% GC)
mean time: 334.891 ns (0.00% GC)
maximum time: 433.795 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 224
StaticArrays:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 116.201 ns (0.00% GC)
median time: 116.912 ns (0.00% GC)
mean time: 117.159 ns (0.00% GC)
maximum time: 159.055 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 916
64x64, Float32
Size: 64, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 7.674 μs (0.00% GC)
median time: 9.317 μs (0.00% GC)
mean time: 9.510 μs (0.00% GC)
maximum time: 241.198 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Size: 64, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 11.571 μs (0.00% GC)
median time: 11.722 μs (0.00% GC)
mean time: 11.839 μs (0.00% GC)
maximum time: 35.355 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
512x512, Float32
Size: 512, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 681.205 μs (0.00% GC)
median time: 752.947 μs (0.00% GC)
mean time: 895.173 μs (0.00% GC)
maximum time: 3.101 ms (0.00% GC)
--------------
samples: 5577
evals/sample: 1
Size: 512, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 604.630 μs (0.00% GC)
median time: 619.408 μs (0.00% GC)
mean time: 645.878 μs (0.00% GC)
maximum time: 1.211 ms (0.00% GC)
--------------
samples: 7729
evals/sample: 1
4096x4096, Float32
Size: 4096, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 336.803 ms (0.00% GC)
median time: 348.420 ms (0.00% GC)
mean time: 355.731 ms (0.00% GC)
maximum time: 417.870 ms (0.00% GC)
--------------
samples: 15
evals/sample: 1
Size: 4096, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 183.598 ms (0.00% GC)
median time: 188.000 ms (0.00% GC)
mean time: 192.251 ms (0.00% GC)
maximum time: 227.223 ms (0.00% GC)
--------------
samples: 27
evals/sample: 1
Triangular x General Matrix Multiplication
I figured that rather than calling gemm!, trmm!, etc. directly, I'd stick with the higher-level A_mul_B! interface.
tA = UpperTriangular.(A);
bench_f!(C, tA, B, A_mul_B!, N, ", OpenBLAS:")
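For reference, the lower-level BLAS wrappers that A_mul_B! ultimately dispatches to look roughly like this on Julia 0.6/0.7-dev (a sketch for illustration, not code from the benchmarks; A[1], B[1], C[1] are the 8x8 matrices from the setup above):
# general C = A*B
Base.LinAlg.BLAS.gemm!('N', 'N', 1.0, A[1], B[1], 0.0, C[1])
# triangular multiply: overwrites its last argument with A*B,
# treating A as upper-triangular with a non-unit diagonal
Base.LinAlg.BLAS.trmm!('L', 'U', 'N', 'N', 1.0, A[1], copy(B[1]))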
8x8, Triangle
Size: 8, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 405.515 ns (0.00% GC)
median time: 410.370 ns (0.00% GC)
mean time: 411.963 ns (0.00% GC)
maximum time: 753.265 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 200
Size: 8, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.330 μs (0.00% GC)
median time: 5.140 μs (0.00% GC)
mean time: 5.269 μs (0.00% GC)
maximum time: 32.291 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 7
StaticArrays:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 64.979 ns (0.00% GC)
median time: 66.874 ns (0.00% GC)
mean time: 67.034 ns (0.00% GC)
maximum time: 113.781 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 978
64x64, Triangle
Size: 64, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 12.454 μs (0.00% GC)
median time: 13.696 μs (0.00% GC)
mean time: 13.870 μs (0.00% GC)
maximum time: 238.548 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Size: 64, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 15.198 μs (0.00% GC)
median time: 23.905 μs (0.00% GC)
mean time: 24.484 μs (0.00% GC)
maximum time: 216.596 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
512x512, Triangle
Size: 512, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.367 ms (0.00% GC)
median time: 1.394 ms (0.00% GC)
mean time: 1.422 ms (0.00% GC)
maximum time: 2.153 ms (0.00% GC)
--------------
samples: 3513
evals/sample: 1
Size: 512, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 819.315 μs (0.00% GC)
median time: 1.274 ms (0.00% GC)
mean time: 1.179 ms (0.00% GC)
maximum time: 1.715 ms (0.00% GC)
--------------
samples: 4233
evals/sample: 1
4096x4096, Triangle
Size: 4096, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 538.547 ms (0.00% GC)
median time: 574.797 ms (0.00% GC)
mean time: 569.898 ms (0.00% GC)
maximum time: 594.784 ms (0.00% GC)
--------------
samples: 9
evals/sample: 1
Size: 4096, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 211.023 ms (0.00% GC)
median time: 212.099 ms (0.00% GC)
mean time: 216.946 ms (0.00% GC)
maximum time: 249.050 ms (0.00% GC)
--------------
samples: 24
evals/sample: 1
Again, MKL dominates at small sizes, but loses out as the matrices grow.
Symmetric x General Matrix Multiplication
sA = Symmetric.(A);
bench_f!(C, sA, B, A_mul_B!, N, ", OpenBLAS:")
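As with the triangular case, the corresponding low-level call would be something like this (again just a reference sketch using the matrices defined earlier):
# symmetric multiply: C = A*B with A symmetric, only its upper triangle referenced
Base.LinAlg.BLAS.symm!('L', 'U', 1.0, A[1], B[1], 0.0, C[1])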
8x8, Symmetric
Size: 8, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 396.866 ns (0.00% GC)
median time: 401.801 ns (0.00% GC)
mean time: 403.922 ns (0.00% GC)
maximum time: 626.055 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 201
Size: 8, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.547 μs (0.00% GC)
median time: 5.046 μs (0.00% GC)
mean time: 5.104 μs (0.00% GC)
maximum time: 12.286 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
StaticArrays:
memory estimate: 848 bytes
allocs estimate: 5
--------------
minimum time: 603.801 ns (0.00% GC)
median time: 619.858 ns (0.00% GC)
mean time: 694.029 ns (9.39% GC)
maximum time: 190.751 μs (99.59% GC)
--------------
samples: 10000
evals/sample: 176
64x64, Symmetric
Size: 64, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 10.740 μs (0.00% GC)
median time: 12.153 μs (0.00% GC)
mean time: 12.441 μs (0.00% GC)
maximum time: 240.722 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Size: 64, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 29.916 μs (0.00% GC)
median time: 34.174 μs (0.00% GC)
mean time: 34.917 μs (0.00% GC)
maximum time: 238.238 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
512x512, Symmetric
Size: 512, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.921 ms (0.00% GC)
median time: 1.998 ms (0.00% GC)
mean time: 2.148 ms (0.00% GC)
maximum time: 4.300 ms (0.00% GC)
--------------
samples: 2324
evals/sample: 1
Size: 512, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.263 ms (0.00% GC)
median time: 1.306 ms (0.00% GC)
mean time: 1.393 ms (0.00% GC)
maximum time: 2.338 ms (0.00% GC)
--------------
samples: 3584
evals/sample: 1
4096x4096, Symmetric
Size: 4096, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 920.034 ms (0.00% GC)
median time: 952.772 ms (0.00% GC)
mean time: 959.352 ms (0.00% GC)
maximum time: 1.024 s (0.00% GC)
--------------
samples: 6
evals/sample: 1
Size: 4096, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 380.653 ms (0.00% GC)
median time: 382.731 ms (0.00% GC)
mean time: 395.965 ms (0.00% GC)
maximum time: 444.567 ms (0.00% GC)
--------------
samples: 13
evals/sample: 1
I split matrix factorization results into a separate comment because of character limits.