Current OpenBLAS versions (January 2022) do not support Intel 11th gen performantly?

If restricted to the Intel family, maybe I should buy a 12980XE for Float64 matrix multiplication?

Why are you restricted to Intel?

10980XE is faster than 12th gen for matrix multiply.
It is also faster than a 32 core Zen3 Epyc for matrix multiply.


wait, really? Is AVX-512 that useful? I thought you were saturating memory bandwidth anyway.

Perhaps I’m not quite right.

julia> using LinearAlgebra; BLAS.set_num_threads(32);

julia> A=randn(10000,10000);

julia> @elapsed A*A
2.670395171

julia> @elapsed A*A
1.909998951

julia> using MKL

julia> @elapsed A*A
1.541350805

julia> @elapsed A*A
1.186835123

julia> versioninfo()
Julia Version 1.8.0-DEV.1184
Commit 722f9d4958 (2021-12-28 14:28 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: AMD EPYC 7513 32-Core Processor

They’re comparable.

What if you throw Octavian at it?

Zen3, 32 core Epyc:

julia> using LinearAlgebra; BLAS.set_num_threads(Sys.CPU_THREADS ÷ 2);

julia> A = rand(10_000,10_000); B = similar(A);

julia> @time mul!(B, A, A);
  2.622767 seconds (2.42 M allocations: 123.016 MiB, 18.41% compilation time)

julia> @time mul!(B, A, A);
  1.939154 seconds

julia> using MKL

julia> @time mul!(B, A, A);
  1.430667 seconds

julia> @time mul!(B, A, A);
  1.258703 seconds

julia> using Octavian

julia> @time matmul!(B, A, A);
 16.063869 seconds (28.60 M allocations: 1.518 GiB, 1.96% gc time, 90.36% compilation time)

julia> @time matmul!(B, A, A);
  1.301244 seconds

Cascadelake-X, 18 core AVX512:

julia> using LinearAlgebra; BLAS.set_num_threads(Sys.CPU_THREADS ÷ 2);

julia> A = rand(10_000,10_000); B = similar(A);

julia> @time mul!(B, A, A);
  1.855408 seconds (2.51 M allocations: 124.746 MiB, 33.53% compilation time)

julia> @time mul!(B, A, A);
  1.130044 seconds

julia> using MKL

julia> @time mul!(B, A, A);
  1.129982 seconds

julia> @time mul!(B, A, A);
  1.129533 seconds

julia> using Octavian

julia> @time matmul!(B, A, A);
 26.057493 seconds (38.22 M allocations: 1.960 GiB, 2.76% gc time, 94.77% compilation time)

julia> @time matmul!(B, A, A);
  1.273121 seconds

When I benchmarked earlier, performance on the Epyc was erratic across matrix sizes, but that could be because other people were using the machine at the same time, so you probably shouldn’t read much into it, other than that it would be harder to get clean performance-vs-size plots on a shared server.

It’s 2× 256-bit FMA / cycle / core × 32 cores vs. 2× 512-bit FMA / cycle / core × 18 cores, so it’s to be expected that the 10980XE does comparably well.
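Back of the envelope, ignoring clock speeds and memory (each FMA counts as 2 flops; a 256-bit vector holds 4 Float64, a 512-bit vector holds 8):

epyc_7513  = 32 * 2 * 4 * 2   # 512 flops/cycle: 32 cores * 2x 256-bit FMA/core
i9_10980xe = 18 * 2 * 8 * 2   # 576 flops/cycle: 18 cores * 2x 512-bit FMA/core
# and, assuming the half-rate 512-bit FMA discussed below:
i9_11900k  =  8 * 1 * 8 * 2   # 128 flops/cycle: 8 cores * 1x 512-bit FMA/core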

But note, of course, that this is a very specific workload!
The 11900K is supposed to have 10%+ more IPC than the 10980XE, but for matrix multiply it has half the IPC: while it has more execution units overall, it is specifically lacking in 512-bit FMA capability.

The same goes for the Alder Lake CPUs.

Before spending a bunch of money on a CPU and motherboard, I’d consider all the kinds of workloads you care about, and how well your options do.
The 10980XE came out in 2019, and was already based on an old architecture then – it’s basically the same as the 7980XE, which was released in 2017.

Intel will hopefully come out with Sapphire Rapids this year, which would be the server version of Alder Lake; it should also have 2× 512-bit FMA units, along with a 2 MiB L2 cache and (at the top of the stack) more cores than the 10980XE.
Some rumors also suggest AMD might support AVX-512 in Zen 4 (in which case, will it be half-rate FMA, as in Ice Lake Client/Rocket Lake/Tiger Lake/Alder Lake before it got disabled, or full rate?).

So, while I do still think the 10980XE is a great chip for SIMD numerical workloads (and especially matmul) [you should be able to find one for <$800, but that plus a motherboard and DDR4 that won’t support future chips is still a lot of money], I’d suggest waiting a little longer, since much newer chip architectures are on the horizon.


I thought MKL was restricted to Intel CPUs, but from Elrod’s benchmarks it seems I was wrong.
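For anyone else wondering, on Julia 1.7+ you can check which backend is actually loaded (a quick sketch; BLAS.get_config() lists the libraries registered with libblastrampoline):

using LinearAlgebra
BLAS.get_config()   # the default build reports libopenblas

using MKL           # MKL.jl swaps in Intel MKL via libblastrampoline
BLAS.get_config()   # should now report libmkl_rt, even on AMD CPUs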

Is there something that needs to be taken into account or configured for AMD Milan processors? I just ran the same code on a 2x24 core AMD EPYC 7443 machine and got this result:

julia> Threads.nthreads()
96

julia> using LinearAlgebra; BLAS.set_num_threads(Sys.CPU_THREADS ÷ 2);

julia> A = rand(10_000,10_000); B = similar(A);

julia> @time mul!(B, A, A);
  3.391364 seconds (2.49 M allocations: 124.344 MiB, 12.61% compilation time)

julia> @time mul!(B, A, A);
  3.178868 seconds

julia> using MKL

julia> @time mul!(B, A, A);
  2.854096 seconds

julia> @time mul!(B, A, A);
  2.724407 seconds

julia> using Octavian

julia> @time matmul!(B, A, A);
 14.762420 seconds (28.56 M allocations: 1.495 GiB, 1.96% gc time, 80.99% compilation time)

julia> @time matmul!(B, A, A);
  3.115852 seconds

julia> versioninfo()
Julia Version 1.8.0-DEV.1405
Commit 2010d95d8a (2022-01-26 17:41 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: AMD EPYC 7443 24-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.0 (ORCJIT, znver3)

Shouldn’t the 7513 and the 7443 be quite comparable, despite the former’s 8 extra cores?

I think I answered my question, which is related to the discussion here: Thread affinitization: pinning Julia threads to cores

If I pin 48 threads to just one of the two CPUs with ThreadPinning.jl, the Octavian.jl time drops to 1.5-1.6 s.
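Something along these lines (a sketch; this assumes pinthreads accepts an explicit list of CPU IDs and that IDs 0:47 all belong to the first socket on this machine; Julia started with -t 48):

using ThreadPinning, Octavian

pinthreads(0:47)            # pin the 48 Julia threads to the first socket's CPU IDs

A = rand(10_000, 10_000); B = similar(A);
@time matmul!(B, A, A);     # first call includes compilation
@time matmul!(B, A, A);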

Just for fun: the Apple M1 Max laptop takes 3.5 s for the 10000×10000 dgemm (with one thread (coprocessor?)) and 1 s for sgemm :wink:

Post the details?

Hi,
sorry for the previous post and for this reply, which is unrelated to the OP’s thread (I just found it funny to compare single-thread M1 performance against many x86 cores on a specific 10^4 × 10^4 dense matrix-matrix product; energetically it is :face_with_spiral_eyes:).

I have evaluated the Accelerate cblas_dgemm performance from C++ in the following code.

C++ gemm call (compile with clang++ -O3 main.cpp -Wc++11-extensions -framework accelerate -o main)
#include <stdio.h>
#include <stdlib.h>

#define LOOP_COUNT 10

#include <vector>
#include <chrono>
#include <iostream>

#include <Accelerate/Accelerate.h>

template <class REAL> 
void matmul(REAL *A, REAL *B, REAL *C, int M, int P, int N, REAL alpha, REAL beta){
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, P, alpha, A, P, B, N, beta, C, N);
}

template <> 
void matmul<float>(float *A, float *B, float *C, int M, int P, int N, float alpha, float beta){
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, P, alpha, A, P, B, N, beta, C, N);
}

template <class REAL>
void bench(int matrix_rank)
{
    REAL *A, *B, *C;
    int i, j, r, size;
    REAL alpha, beta;

    int M = matrix_rank;
    int P = matrix_rank;
    int N = matrix_rank;

    printf("Intializing data for matrix multiplication C=A*B for matrix\n\n"
           " A(%i*%i) and matrix B(%i*%i)\n",
           M, P, P, N);
    alpha = 1.0;
    beta = 0.0;

    auto vA = std::vector<REAL>(N * P);
    auto vB = std::vector<REAL>(N * P);
    auto vC = std::vector<REAL>(N * P);

    A = &vA[0];
    B = &vB[0];
    C = &vC[0];

    printf("Intializing matrix data\n\n");
    size = M * P;
    for (i = 0; i < size; ++i)
    {
        A[i] = (REAL)(i + 1);
    }
    size = N * P;
    for (i = 0; i < size; ++i)
    {
        B[i] = (REAL)(i - 1);
    }

    size = M * N;
    for (j = 0; j < size; ++j)
    {
        C[j] = 0.0;
    }

    std::chrono::time_point<std::chrono::system_clock> start, end;

    start = std::chrono::system_clock::now();
    for (r = 0; r < LOOP_COUNT; ++r)
    {
        matmul<REAL>(A,B,C,M,P,N,alpha,beta);
        // cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, P, alpha, A, P, B, N, beta, C, N);
        // multiply matrices with cblas_dgemm;
    }
    end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end - start;

    double s_elapsed = elapsed_seconds.count() / LOOP_COUNT;
    double gflops = 2.0 * double(N) * double(M) * double(P) / (s_elapsed * 1.e9);
    std::cout << "time (ms) :" << (s_elapsed * 1000) << " gflops=" << gflops << std::endl;

    return ;
}

int main()
{
    std::cout << "testing double precision :" << std::endl;
    bench<double>(10000);
    std::cout << "testing single precision :" << std::endl;
    bench<float>(10000);
}

It produces the following results on my machine (would be the same on M1 pro):

testing double precision :
Initializing data for matrix multiplication C=A*B for matrix
 A(10000*10000) and matrix B(10000*10000)
time (ms) :3555.68 gflops=562.48

testing single precision :
Initializing data for matrix multiplication C=A*B for matrix
 A(10000*10000) and matrix B(10000*10000)
time (ms) :967.148 gflops=2067.94

I gave a (short) try to @staticfloat’s gist here https://gist.github.com/staticfloat/ce81a163807633748e414e2f3c628062 but I did not get the same performance yet (for some reason it launches the multithreaded version).

I guess that direct calls to Accelerate functions on Apple Silicon should be available soon (see https://github.com/JuliaLang/julia/issues/42312) :wink:
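In the meantime, a minimal sketch of what such a direct call could look like from Julia (the framework path and the LP64 cblas_dgemm signature here are my assumptions; AppleAccelerateLinAlgWrapper.jl, used further down, does this more carefully):

# Hypothetical: call Accelerate's cblas_dgemm directly via ccall.
const libaccelerate = "/System/Library/Frameworks/Accelerate.framework/Accelerate"

function accelerate_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
    M, K = size(A)
    K2, N = size(B)
    @assert K == K2 && size(C) == (M, N)
    # 102 = CblasColMajor, 111 = CblasNoTrans; computes C = 1.0*A*B + 0.0*C
    ccall((:cblas_dgemm, libaccelerate), Cvoid,
          (Cint, Cint, Cint, Cint, Cint, Cint,
           Float64, Ptr{Float64}, Cint, Ptr{Float64}, Cint,
           Float64, Ptr{Float64}, Cint),
          102, 111, 111, M, N, K, 1.0, A, M, B, K, 0.0, C, M)
    return C
end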

I don’t understand why Apple has included this (these?) undocumented AMX coprocessor(s). There is already the Neural Engine and the GPU sharing the same memory. Maybe this coprocessor is just a smart use of the Neural Engine.


You mean you achieved this result (3.5 s) with a single thread on an M1 MacBook Pro? Really unbelievable. What about the multi-threaded result?

Yes, this result is single-threaded: it does not use the CPU cores but a coprocessor.
It seems that the M1’s results are half those of the M1 Pro/Max, so maybe there are two coprocessors on the latest Apple SoCs.

On one hand it is super impressive (no fan), but on the other hand it does not scale with the number of cores.

If you have an apple silicon machine you can try the posted C++ code by yourself :wink:

I have to buy an M1 Pro to test it :grinning:
But I really don’t like C++; I prefer to stick to Julia.


I wouldn’t be so impressed by the M1; it’s more comparable to Intel’s 11th gen in generic workloads. BTW, are you sure you enabled performance mode on your laptop? My 11800H finishes that multiplication in 4.15 s using MKL, but the default OpenBLAS takes a shameful 19 s, surely suboptimal.


The i9-11900K is not a laptop chip, it’s a desktop chip, which is supposed to be far more performant than the i7-11800H, but in fact it’s the opposite. Sad :disappointed:

OK, finally I managed to use @elrod’s AppleAccelerateLinAlgWrapper.jl package to get the same results:

using BenchmarkTools, AppleAccelerateLinAlgWrapper
N = 10000
a, b, c = rand(N,N), rand(N,N), rand(N,N);
t = @belapsed AppleAccelerateLinAlgWrapper.gemm($a, $b);
gflops = 2N^3 / (t * 1.e9)
@show t, gflops
(t, gflops) = (3.6196095, 552.545792577901)
a, b, c = rand(Float32,N,N), rand(Float32,N,N), rand(Float32,N,N);
t = @belapsed AppleAccelerateLinAlgWrapper.gemm($a, $b);
gflops = 2N^3 / (t * 1.e9)
@show t, gflops
(t, gflops) = (0.998861834, 2002.2789257958573)

Note that these results are obtained with the native ARM Julia 1.7.1 build (the x86 build is much slower).
I think this is actually quite different from @Seif_Shebl’s results because it is single-threaded (no spinning fan and a temperature below 50 °C). In addition, gemm is CPU-bound, but the memory bandwidth of this machine is incredible.
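To see why this gemm is compute-bound rather than bandwidth-bound, a rough estimate (assuming each matrix crosses the memory bus only once, which is a simplification):

N     = 10_000
flops = 2 * N^3                    # ≈ 2.0e12 floating-point operations
bytes = 3 * N^2 * sizeof(Float64)  # ≈ 2.4e9 bytes for A, B and C
flops / bytes                      # ≈ 833 flops per byte of traffic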

IMHO, the problem with this laptop is definitely not the performance but the software ecosystem: native ARM Julia is not really an option yet (but the situation is improving pretty fast), so I have to keep using the x86 Julia binaries. The same is true for GPGPU, because the Julia/Metal ecosystem is way less mature than CUDA.jl (or AMDGPU.jl or oneAPI.jl). This is why I would not recommend it for intensive Julia programmers, although I very much enjoy programming and experimenting with no noise at all! Last, if like me you are a Linux (or Windows) user, the Apple keyboard design choices may drive you mad: why did they make [, {, and | second-class characters!?


Try the Julia nightly.
Segfaults have been fixed: https://github.com/JuliaLang/julia/pull/43664
Thread freezing has been fixed: https://github.com/JuliaLang/julia/pull/43418
