Questions about MKL vs. OpenBLAS come up a lot, for example in comparisons with MATLAB (which links against MKL), and a lot of users have trouble building Julia with MKL, e.g. here. Of course, one can easily get an MKL binary by downloading JuliaPro, but then you may have to face down an army of dependency conflicts.
My aim here is to compare MKL and OpenBLAS on an AMD processor (Ryzen Threadripper 1950X).
Lots of performance comparisons are already out there, but I figured I would add one that doesn't use an Intel chip.
Brief summary of results: MKL is generally faster at small sizes, OpenBLAS at large sizes.
All in all, I'd call OpenBLAS the clear winner (given this hardware) because
- overall run time is easily dominated by a few operations on large matrices.
- if one is concerned about the performance of operations on small matrices, it is easy to turn to native Julia. For example, StaticArrays.jl performs well on very small matrices, and IterativeSolvers.jl is extremely effective for small-to-medium sized matrices (see the sketch after this list). This leaves large matrices as the last domain where Julia really benefits from outside assistance.
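To illustrate that second point, here is a minimal sketch of solving a medium-sized positive-definite system with IterativeSolvers.jl instead of a LAPACK factorization. This was not benchmarked in this post, and the problem setup (G, A, b, x) is purely illustrative:
using IterativeSolvers
# build a symmetric positive-definite test problem
G = randn(512, 512)
A = G'G + 512I
b = randn(512)
x = cg(A, b)    # conjugate gradients: only matrix-vector products, no factorization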
Results are probably different with an Intel processor, e.g. probably more like this.
Maybe I should've made pretty graphs, but they'd be harder to share on Discourse.
Operating System and Julia Install
I just wiped my hard drive and installed a fresh (x)ubuntu 16.04 last night, for the sake of upgrading ROCm. Ubuntu 16.04 comes with gcc 5.4, which ROCm requires but which doesn't support -march=znver1, so I also built gcc version 7.2 (adding -7.2 as a suffix) to get the latest gcc, g++, and gfortran. (Phoronix's benchmarks suggest gcc 7.2 often produces faster code, though not on the level of Julia 0.7 vs 0.6, and that the Zen-specific optimizations don't help yet. As an aside, if that's what it looks like when Julia's devs are trying to break things instead of optimize…)
My Ubuntu install did not come with gfortran, so building gcc 7.2 also provided that dependency. I just used sudo apt-get install for the remaining dependencies; all I recall installing was build-essential (not sure if it included anything relevant), m4, and pkg-config.
Building with OpenBLAS
You can safely skip most of this. I built the development version of OpenBLAS from source, cloned Julia, and created the following Make.user:
USE_SYSTEM_BLAS=1
USE_SYSTEM_LAPACK=1
LIBBLAS=-lopenblas
LIBBLASNAME=libopenblas
LIBLAPACK=-lopenblas
LIBLAPACKNAME=libopenblas
Then it was simply a matter of running make CC=gcc-7.2 CXX=g++-7.2 FC=gfortran-7.2 -j32.
If you're going to build Julia multiple times, reusing a prebuilt OpenBLAS this way may be worth it. Otherwise, you can skip the Make.user entirely and the build system will take care of OpenBLAS automatically. The end result:
julia> versioninfo()
Julia Version 0.6.2
Commit d386e40 (2017-12-13 18:08 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
WORD_SIZE: 64
BLAS: libopenblas (ZEN)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, generic)
and
julia> versioninfo()
Julia Version 0.7.0-DEV.3199
Commit 1238bad (2017-12-27 20:40 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
WORD_SIZE: 64
BLAS: libopenblas (ZEN)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, znver1)
Environment:
Building with MKL
I installed the 2018 update 1 versions of MKL (gratis) and Parallel Studio XE (gratis for students, and I believe also for open source contributors and classroom educators) in /opt/intel. I remember needing Parallel Studio months ago to satisfy some dependency (I believe it was libimf), so I just went ahead and downloaded it again today rather than testing whether things would work without it.
First, I followed Julia's GitHub README, running
source /opt/intel/bin/compilervars.sh intel64
and creating a Make.user containing
USEICC = 1
USEIFC = 1
USE_INTEL_MKL = 1
USE_INTEL_LIBM = 1
but eventually ran into a linking issue like the one described here. Mike Kinghan provided a great answer there, but I'd rather move on and stick with gcc. Therefore, the new Make.user:
USE_INTEL_MKL = 1
USE_INTEL_LIBM = 1
and simply running make CC=gcc-7.2 CXX=g++-7.2 FC=gfortran-7.2 -j32 results in:
julia> versioninfo()
Julia Version 0.7.0-DEV.3199
Commit 1238bad (2017-12-27 20:40 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
WORD_SIZE: 64
BLAS: libmkl_rt
LAPACK: libmkl_rt
LIBM: libimf
LLVM: libLLVM-3.9.1 (ORCJIT, znver1)
Environment:
Matrix Sizes
I stuck to NxN square matrices, trying N = (8, 64, 512, 4096) to get an idea of how the libraries perform over a broad range of sizes.
GEMM
Here is what I ran:
N = (8, 64, 512, 4096)            # square matrix sizes to benchmark
create_g(T, N) = randn(T, N, N)
a = create_g.(Float32, N);        # tuple of Float32 test matrices, one per size
A = create_g.(Float64, N);        # tuple of Float64 test matrices
b = create_g.(Float32, N);
B = create_g.(Float64, N);
c = similar.(a);                  # preallocated outputs
C = similar.(A);
using BenchmarkTools
function bench_f!(C, A, B, f, N, info = "")
    for i in 1:length(N)
        println("Size: $(N[i])" * info)
        @show @benchmark $f($(C[i]), $(A[i]), $(B[i]))
    end
end
bench_f!(C, A, B, A_mul_B!, N, ", Julia-Dev:")
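The single-precision results further down were presumably produced the same way with the Float32 set; a sketch of the corresponding call (with the label adjusted per build):
bench_f!(c, a, b, A_mul_B!, N, ", MKL:")   # Float32 version of the same benchmark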
Below are the results using the same commit of Julia master (v0.7.0-DEV.3199, commit 1238bad, 2017-12-27 20:40 UTC), with the only difference being MKL vs. OpenBLAS. Whether I ran MKL or OpenBLAS first varied, but if a benchmark caused my CPU fans to spin up, I waited until they died down before running the next set of benchmarks.
Julia v0.6.2 with OpenBLAS unsurprisingly performed similarly to v0.7.0-dev + OpenBLAS, but did not print nicely with the above function so I excluded it for space.
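The rows labeled "StaticArrays" in the 8x8 GEMM results come from multiplying stack-allocated SMatrix versions of the matrices; the exact code isn't reproduced here, but it would have looked roughly like this:
using StaticArrays, BenchmarkTools
sA = SMatrix{8,8}(randn(8, 8))    # stack-allocated 8x8 Float64 matrices
sB = SMatrix{8,8}(randn(8, 8))
@benchmark $sA * $sB              # returns a new SMatrix; no heap allocation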
The outputs:
8x8, Float64
Size: 8, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 336.964 ns (0.00% GC)
median time: 344.312 ns (0.00% GC)
mean time: 349.959 ns (0.00% GC)
maximum time: 627.195 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 221
Size: 8, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 375.439 ns (0.00% GC)
median time: 379.302 ns (0.00% GC)
mean time: 381.454 ns (0.00% GC)
maximum time: 695.751 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 205
StaticArrays:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 136.728 ns (0.00% GC)
median time: 136.855 ns (0.00% GC)
mean time: 137.184 ns (0.00% GC)
maximum time: 189.052 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 867
64x64, Float64
Size: 64, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 11.682 μs (0.00% GC)
median time: 13.185 μs (0.00% GC)
mean time: 13.672 μs (0.00% GC)
maximum time: 362.510 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Size: 64, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 22.823 μs (0.00% GC)
median time: 23.584 μs (0.00% GC)
mean time: 23.769 μs (0.00% GC)
maximum time: 108.314 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
512x512, Float64
Size: 512, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.907 ms (0.00% GC)
median time: 1.984 ms (0.00% GC)
mean time: 2.051 ms (0.00% GC)
maximum time: 3.536 ms (0.00% GC)
--------------
samples: 2436
evals/sample: 1
Size: 512, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.167 ms (0.00% GC)
median time: 1.199 ms (0.00% GC)
mean time: 1.267 ms (0.00% GC)
maximum time: 2.381 ms (0.00% GC)
--------------
samples: 3940
evals/sample: 1
4096x4096, Float64
Size: 4096, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 884.305 ms (0.00% GC)
median time: 887.235 ms (0.00% GC)
mean time: 888.018 ms (0.00% GC)
maximum time: 896.666 ms (0.00% GC)
--------------
samples: 6
evals/sample: 1
Size: 4096, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 384.890 ms (0.00% GC)
median time: 390.259 ms (0.00% GC)
mean time: 398.521 ms (0.00% GC)
maximum time: 448.346 ms (0.00% GC)
--------------
samples: 13
evals/sample: 1
MKL was ahead at 64x64, but from 512x512 onward OpenBLAS pulled far ahead.
Note that if you're only multiplying 64x64 and 4096x4096 matrices, you'd have to perform roughly 50,000 of the former for every one of the latter before MKL comes out ahead overall.
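As a rough check on that figure, using the minimum times reported above:
# MKL's gain per small multiply vs. its loss per large one (minimum times, Float64)
saved_64  = 22.823e-6 - 11.682e-6   # seconds saved per 64x64 GEMM with MKL
lost_4096 = 0.884305  - 0.384890    # seconds lost per 4096x4096 GEMM with MKL
lost_4096 / saved_64                # ≈ 4.5e4 small multiplies to break even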
Single precision results were comparable.
8x8, Float32
Size: 8, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 238.978 ns (0.00% GC)
median time: 246.381 ns (0.00% GC)
mean time: 247.609 ns (0.00% GC)
maximum time: 480.341 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 417
Size: 8, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 331.460 ns (0.00% GC)
median time: 333.518 ns (0.00% GC)
mean time: 334.891 ns (0.00% GC)
maximum time: 433.795 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 224
StaticArrays:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 116.201 ns (0.00% GC)
median time: 116.912 ns (0.00% GC)
mean time: 117.159 ns (0.00% GC)
maximum time: 159.055 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 916
64x64, Float32
Size: 64, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 7.674 μs (0.00% GC)
median time: 9.317 μs (0.00% GC)
mean time: 9.510 μs (0.00% GC)
maximum time: 241.198 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Size: 64, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 11.571 μs (0.00% GC)
median time: 11.722 μs (0.00% GC)
mean time: 11.839 μs (0.00% GC)
maximum time: 35.355 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
512x512, Float32
Size: 512, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 681.205 μs (0.00% GC)
median time: 752.947 μs (0.00% GC)
mean time: 895.173 μs (0.00% GC)
maximum time: 3.101 ms (0.00% GC)
--------------
samples: 5577
evals/sample: 1
Size: 512, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 604.630 μs (0.00% GC)
median time: 619.408 μs (0.00% GC)
mean time: 645.878 μs (0.00% GC)
maximum time: 1.211 ms (0.00% GC)
--------------
samples: 7729
evals/sample: 1
4096x4096, Float32
Size: 4096, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 336.803 ms (0.00% GC)
median time: 348.420 ms (0.00% GC)
mean time: 355.731 ms (0.00% GC)
maximum time: 417.870 ms (0.00% GC)
--------------
samples: 15
evals/sample: 1
Size: 4096, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 183.598 ms (0.00% GC)
median time: 188.000 ms (0.00% GC)
mean time: 192.251 ms (0.00% GC)
maximum time: 227.223 ms (0.00% GC)
--------------
samples: 27
evals/sample: 1
Triangular x General Matrix Multiplication
I figured that rather than calling gemm!, trmm!, etc. directly, I'd stick with the higher-level A_mul_B! interface.
tA = UpperTriangular.(A);
bench_f!(C, tA, B, A_mul_B!, N, ", OpenBLAS:")
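For reference, the lower-level BLAS wrappers that A_mul_B! ultimately dispatches to look roughly like this on Julia 0.6/0.7-dev (a sketch for illustration, not code from the benchmarks; A[1], B[1], C[1] are the 8x8 matrices from the setup above):
# general C = A*B
Base.LinAlg.BLAS.gemm!('N', 'N', 1.0, A[1], B[1], 0.0, C[1])
# triangular multiply: overwrites its last argument with A*B,
# treating A as upper-triangular with a non-unit diagonal
Base.LinAlg.BLAS.trmm!('L', 'U', 'N', 'N', 1.0, A[1], copy(B[1]))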
8x8, Triangle
Size: 8, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 405.515 ns (0.00% GC)
median time: 410.370 ns (0.00% GC)
mean time: 411.963 ns (0.00% GC)
maximum time: 753.265 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 200
Size: 8, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.330 μs (0.00% GC)
median time: 5.140 μs (0.00% GC)
mean time: 5.269 μs (0.00% GC)
maximum time: 32.291 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 7
StaticArrays:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 64.979 ns (0.00% GC)
median time: 66.874 ns (0.00% GC)
mean time: 67.034 ns (0.00% GC)
maximum time: 113.781 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 978
64x64, Triangle
Size: 64, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 12.454 μs (0.00% GC)
median time: 13.696 μs (0.00% GC)
mean time: 13.870 μs (0.00% GC)
maximum time: 238.548 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Size: 64, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 15.198 μs (0.00% GC)
median time: 23.905 μs (0.00% GC)
mean time: 24.484 μs (0.00% GC)
maximum time: 216.596 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
512x512, Triangle
Size: 512, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.367 ms (0.00% GC)
median time: 1.394 ms (0.00% GC)
mean time: 1.422 ms (0.00% GC)
maximum time: 2.153 ms (0.00% GC)
--------------
samples: 3513
evals/sample: 1
Size: 512, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 819.315 μs (0.00% GC)
median time: 1.274 ms (0.00% GC)
mean time: 1.179 ms (0.00% GC)
maximum time: 1.715 ms (0.00% GC)
--------------
samples: 4233
evals/sample: 1
4096x4096, Triangle
Size: 4096, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 538.547 ms (0.00% GC)
median time: 574.797 ms (0.00% GC)
mean time: 569.898 ms (0.00% GC)
maximum time: 594.784 ms (0.00% GC)
--------------
samples: 9
evals/sample: 1
Size: 4096, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 211.023 ms (0.00% GC)
median time: 212.099 ms (0.00% GC)
mean time: 216.946 ms (0.00% GC)
maximum time: 249.050 ms (0.00% GC)
--------------
samples: 24
evals/sample: 1
Again, MKL dominates at small sizes, but loses out as the matrices grow.
Symmetric x General Matrix Multiplication
sA = Symmetric.(A);
bench_f!(C, sA, B, A_mul_B!, N, ", OpenBLAS:")
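As with the triangular case, the corresponding low-level call would be something like this (again just a reference sketch using the matrices defined earlier):
# symmetric multiply: C = A*B with A symmetric, only its upper triangle referenced
Base.LinAlg.BLAS.symm!('L', 'U', 1.0, A[1], B[1], 0.0, C[1])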
8x8, Symmetric
Size: 8, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 396.866 ns (0.00% GC)
median time: 401.801 ns (0.00% GC)
mean time: 403.922 ns (0.00% GC)
maximum time: 626.055 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 201
Size: 8, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.547 μs (0.00% GC)
median time: 5.046 μs (0.00% GC)
mean time: 5.104 μs (0.00% GC)
maximum time: 12.286 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
StaticArrays:
memory estimate: 848 bytes
allocs estimate: 5
--------------
minimum time: 603.801 ns (0.00% GC)
median time: 619.858 ns (0.00% GC)
mean time: 694.029 ns (9.39% GC)
maximum time: 190.751 μs (99.59% GC)
--------------
samples: 10000
evals/sample: 176
64x64, Symmetric
Size: 64, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 10.740 μs (0.00% GC)
median time: 12.153 μs (0.00% GC)
mean time: 12.441 μs (0.00% GC)
maximum time: 240.722 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
Size: 64, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 29.916 μs (0.00% GC)
median time: 34.174 μs (0.00% GC)
mean time: 34.917 μs (0.00% GC)
maximum time: 238.238 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
512x512, Symmetric
Size: 512, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.921 ms (0.00% GC)
median time: 1.998 ms (0.00% GC)
mean time: 2.148 ms (0.00% GC)
maximum time: 4.300 ms (0.00% GC)
--------------
samples: 2324
evals/sample: 1
Size: 512, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.263 ms (0.00% GC)
median time: 1.306 ms (0.00% GC)
mean time: 1.393 ms (0.00% GC)
maximum time: 2.338 ms (0.00% GC)
--------------
samples: 3584
evals/sample: 1
4096x4096, Symmetric
Size: 4096, MKL:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 920.034 ms (0.00% GC)
median time: 952.772 ms (0.00% GC)
mean time: 959.352 ms (0.00% GC)
maximum time: 1.024 s (0.00% GC)
--------------
samples: 6
evals/sample: 1
Size: 4096, OpenBLAS:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 380.653 ms (0.00% GC)
median time: 382.731 ms (0.00% GC)
mean time: 395.965 ms (0.00% GC)
maximum time: 444.567 ms (0.00% GC)
--------------
samples: 13
evals/sample: 1
I split matrix factorization results into a separate comment because of character limits.