To add another data point, here are the results on a 32-core node on our cluster, comparing OpenBLAS and MKL with and without threading:
https://github.com/barche/julia-blas-benchmarks/blob/master/BenchmarkResults.ipynb
I also reran the HPL Linpack test. Here are the results:
- Standard HPL OpenBLAS, 32 MPI processes on a single node: 757 Gflops
- Standard HPL MKL, 32 MPI processes on a single node: 788 Gflops
- Intel HPL MKL, 32 MPI processes on a single node: 814 Gflops
- Intel HPL MKL, 2 MPI processes with 16 threads each on a single node: 963 Gflops
From both tests it seems clear to me that MKL wins once threading comes into play, while single-core performance is much closer, with the possible exception of the Cholesky decomposition and the eigendecomposition.
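
To put the HPL numbers in proportion, here is a minimal Python sketch that expresses each run as a ratio of the standard HPL OpenBLAS result. The Gflops values are copied from the list above; the dictionary keys and the `relative_to` helper are just names I made up for illustration.

```python
# HPL results in Gflops, copied from the list above (single 32-core node).
hpl_gflops = {
    "Standard HPL OpenBLAS, 32 MPI": 757,
    "Standard HPL MKL, 32 MPI": 788,
    "Intel HPL MKL, 32 MPI": 814,
    "Intel HPL MKL, 2 MPI x 16 threads": 963,
}


def relative_to(baseline_key, results):
    """Return every result divided by the result stored under baseline_key."""
    base = results[baseline_key]
    return {name: value / base for name, value in results.items()}


# Hypothetical choice of baseline: the plain OpenBLAS run.
ratios = relative_to("Standard HPL OpenBLAS, 32 MPI", hpl_gflops)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

With pure MPI the gap between the OpenBLAS and MKL builds is a few percent, while the hybrid 2 x 16 configuration is roughly a quarter faster than the OpenBLAS baseline. That is only a single run per configuration on one node, so treat the ratios as indicative.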