Current OpenBLAS Versions (January 2022) do not support Intel gen 11 performantly?

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libopenblas64_.dll

julia> A=randn(10000,10000);

julia> @elapsed A*A

julia> @elapsed A*A

julia> using MKL

julia> @elapsed A*A

julia> @elapsed A*A



Something is weird there. OpenBLAS is usually slower, but that is way more than I’ve ever seen. Usually it’s less than 2x.

More info here:

julia> versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, icelake-client)

Yes, weird. Maybe it’s not optimized for the 11th gen CPUs?


               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.7.1 (2021-12-22)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> import LinearAlgebra
julia> import BenchmarkTools

julia> LinearAlgebra.BLAS.get_config()
└ [ILP64]

julia> A = randn(10000, 10000);
julia> BenchmarkTools.@btime( $A*$A );
  6.572 s (2 allocations: 762.94 MiB)

julia> using MKL
julia> LinearAlgebra.BLAS.get_config()
└ [ILP64]

julia> BenchmarkTools.@btime( $A*$A );
  5.539 s (2 allocations: 762.94 MiB)

Seems reasonable to me?

How many threads does BLAS use? You can check with BLAS.get_num_threads().

8, the default value :grinning:
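For reference, a minimal way to check and adjust the BLAS thread count (the value 8 here is just an example matching this machine’s core count, not a recommendation):

```julia
using LinearAlgebra

# Query how many threads the currently loaded BLAS library will use
println(BLAS.get_num_threads())

# Optionally override it, e.g. to match the number of physical cores
BLAS.set_num_threads(8)
```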

The 9th gen CPUs are older and may have better support.

As I see, Julia 1.7 has:

# OpenBLAS
OPENBLAS_VER := 0.3.13

And from Releases · xianyi/OpenBLAS · GitHub:

  • v0.3.19: latest (Dec 19, 2021 release)
  • v0.3.16: “Fixed cpu type autodetection for Intel Tiger Lake”
  • v0.3.14: “Added CPUID autodetection for Intel Rocket Lake and Tiger Lake cpus”

11th generation Intel Core == Tiger Lake (mobile) / Rocket Lake (desktop)

IMHO / my suggestion

check the Julia Nightly builds
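To see which BLAS build a given Julia actually loaded, you can inspect the libblastrampoline config directly; a small sketch (field names are from the `LinearAlgebra.BLAS.LBTConfig` struct in Julia 1.7):

```julia
using LinearAlgebra

# Inspect which BLAS/LAPACK libraries libblastrampoline has loaded
cfg = BLAS.get_config()
for lib in cfg.loaded_libs
    # each entry reports the library path and its integer interface (:ilp64 or :lp64)
    println(lib.libname, " => ", lib.interface)
end
```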


That may be true, but that doesn’t mean that “OpenBLAS sucks!”


I agree - a more charitable and less inflammatory title might be “OpenBLAS is slow when using an older version on modern CPUs”.


I updated the title to be more informative and help future searches find this thread.


Even with MKL, my 11th gen i9 is slower than your 9th gen i7 in this bench. F**k intel!

There’s a reason the i9-11900k was reviewed as a “waste of sand” Pathetic: Intel Core i9-11900K CPU Review & Benchmarks: Gaming, Power, Production - YouTube

Edit: I slightly misremembered. It was the i9-10900K that was a “waste of sand”. The i9-11900K was the followup that was worse.


No, it was the 11900K that Gamers Nexus called “a waste of sand”.
The 10900K was a 10-core Skylake chip.
The 11900K was an 8-core Rocket Lake.

I’m a fan of AVX512, so I’m inclined to disagree and would take the 8-core Rocket Lake over the 10-core Skylake.

However, Rocket Lake has only a single 512-bit FMA unit (actually, it’s two 256-bit units working together). This was also the case for certain Skylake-X chips, Ice Lake client, Tiger Lake, and Alder Lake before AVX512 was disabled altogether.
As such, you probably aren’t going to get better GEMM performance: running AVX2, you use ports 0 and 1 to each do one 256-bit FMA per cycle, for 2x256 bits of FMA/cycle.
With AVX512, they work together to do 1x512 bits of FMA/cycle.

Most Skylake-X and Ice Lake server chips have 2x512 bits. These are the chips that do well on matrix multiply.
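To make the port arithmetic above concrete, here is a back-of-the-envelope single-core peak estimate (the clock speed is an assumed round number, not a measured one):

```julia
# Back-of-the-envelope double-precision peak for one core.
# In throughput terms, 1x512-bit FMA/cycle (Rocket Lake AVX512)
# equals 2x256-bit FMA/cycle (AVX2 on ports 0 and 1).
fma_bits_per_cycle = 512
doubles_per_cycle  = fma_bits_per_cycle ÷ 64   # 8 Float64 lanes
flops_per_double   = 2                         # an FMA counts as multiply + add
clock_ghz          = 5.0                       # assumed boost clock, not measured
peak_gflops = clock_ghz * doubles_per_cycle * flops_per_double
# 80.0 GFLOPS per core at the assumed 5.0 GHz
```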

Even on chips with just one 512-bit FMA unit, it is easier to achieve peak performance with AVX512, however.
Execution is just one part of the pipeline: decoding one instruction is faster than decoding two, scheduling one vs. two, etc.

However, GEMM is optimized well enough that it tends to be bottlenecked by execution.
There is some discussion here, where I found that my Tiger Lake CPU reached >99% of the theoretical peak performance, while IIRC the M1 was in the low 90% area.
The M1 has the same execution capability, but in the form of 4x 128 bit = 512. It’s just much harder to decode and schedule 4 instructions than 1.

While clock-for-clock theoretical peak performance of the 11900K and 9700K are equal, it should be easier to get close to the former’s peak. Aside from being easier to schedule fewer instructions, the 11900K has better out-of-order capabilities, can fit 4x the data into named registers, has a 50% larger L1 data cache, and a 100% larger L2 cache.
Still, MKL is very good at getting close to 100% peak at large sizes, so it being more difficult to do so for the 9700K doesn’t really imply that MKL isn’t getting close enough anyway.
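A rough way to measure how close a given machine gets to its peak is to time a GEMM and convert to GFLOPS (an n×n matrix multiply does about 2n³ flops; n is kept small here to finish quickly):

```julia
using LinearAlgebra

n = 2_000
A = randn(n, n)
A * A                        # warm up: force compilation and thread spin-up
t = @elapsed A * A
gflops = 2 * n^3 / t / 1e9   # achieved GFLOPS; compare against the theoretical peak
```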

So clock speed is probably more important for GEMM. I’d have thought the 11900K’s clock speed is higher. Maybe the 9700K is overclocked and the 11900K isn’t? Or perhaps there are differences in cooling/thermal throttling?

Could also be that MKL is badly tuned for Rocket Lake. Maybe it treats Rocket Lake like Skylake-X, even though Skylake-X has twice the L2 cache (and 4x larger than Skylake’s), and can thus use much larger L2 blocks. Unlikely, but things like it are possible.

How does Octavian.jl compare on both?
It reads hardware data to generate specific code, but that code isn’t as well optimized, so it’s likely to have an easier time being fast on the 11900K.


Professional analysis. Do you mean, say, the i9-9980XE is good for matrix multiplication?

The Octavian result for my cpu is here:

julia> Threads.nthreads()

julia> A=randn(10000,10000);

julia> using Octavian

julia> @elapsed matmul(A,A)

julia> @elapsed matmul(A,A)

julia> @elapsed matmul(A,A)

Yes. The 9980XE and 10980XE are more or less the same CPU.

julia> using LinearAlgebra; BLAS.set_num_threads(18);

julia> A=randn(10000,10000);

julia> @elapsed A*A

julia> @elapsed A*A

julia> using MKL

julia> @elapsed A*A

julia> @elapsed A*A

julia> versioninfo()
Julia Version 1.7.2-pre.0
Commit 3f024fd0ab* (2021-12-23 18:27 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz


So, is the 10980XE the best CPU for this job so far?

No. Intel 12th gen is faster, and the fastest will be an AMD Epyc server CPU, since those have 8-channel memory and up to 64 cores. (I’m not sure if we have good benchmarks on AWS’s Graviton chips, but I would expect those to do very well also.)