Current OpenBLAS Versions (January 2022) do not support Intel gen 11 performantly?

photor · January 24, 2022, 6:06am

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libopenblas64_.dll

julia> A=randn(10000,10000);

julia> @elapsed A*A
31.8050192

julia> @elapsed A*A
32.776766

julia> using MKL

julia> @elapsed A*A
6.3647406

julia> @elapsed A*A
6.3448374

[Screenshot: CPU-Z]

Oscar_Smith · January 24, 2022, 6:17am

Something is weird there. OpenBLAS is usually slower, but that gap is far larger than any I’ve seen. Usually it’s less than 2x.
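To put the timings above on a common scale, they can be converted to throughput. This is a minimal sketch, assuming the usual 2n^3 floating-point operation count for a dense n-by-n matrix product; the helper name `gemm_gflops` is made up for illustration and is not part of any library.

```julia
# Hypothetical helper: achieved GFLOP/s for an n-by-n dense matmul,
# assuming the standard 2n^3 floating-point operation count.
gemm_gflops(n, seconds) = 2 * float(n)^3 / seconds / 1e9

n = 10_000
openblas = gemm_gflops(n, 31.8050192)   # first OpenBLAS timing reported above
mkl      = gemm_gflops(n, 6.3448374)    # second MKL timing reported above

println("OpenBLAS: ", round(openblas; digits = 1), " GFLOP/s")
println("MKL:      ", round(mkl; digits = 1), " GFLOP/s")
println("MKL / OpenBLAS speedup: ", round(mkl / openblas; digits = 2), "x")
```

On these timings the gap is about 5x, well outside the "less than 2x" range.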

photor · January 24, 2022, 6:22am

More info here:

julia> versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, icelake-client)

photor · January 24, 2022, 6:24am

Yes, weird. Maybe it is not optimized for the 11th gen CPUs?

GregVernon · January 24, 2022, 6:27am

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.7.1 (2021-12-22)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> import LinearAlgebra
julia> import BenchmarkTools

julia> LinearAlgebra.BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libopenblas64_.so

julia> A = randn(10000, 10000);
julia> BenchmarkTools.@btime( $A*$A );
  6.572 s (2 allocations: 762.94 MiB)

julia> using MKL
julia> LinearAlgebra.BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libmkl_rt.so

julia> BenchmarkTools.@btime( $A*$A );
  5.539 s (2 allocations: 762.94 MiB)

Seems reasonable to me?

goerch · January 24, 2022, 6:43am

How many threads does BLAS use? You can check with BLAS.get_num_threads().
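A minimal sketch of that check, using only the `LinearAlgebra` standard library. Note that the BLAS thread count is separate from `Threads.nthreads()`, which counts Julia's own threads.

```julia
using LinearAlgebra

# Number of threads the active BLAS backend currently uses.
blas_threads = BLAS.get_num_threads()

# Logical CPU threads visible to Julia (this count includes hyperthreads).
logical_threads = Sys.CPU_THREADS

println("BLAS threads: ", blas_threads, "  logical CPU threads: ", logical_threads)

# The count can be set explicitly; here the current value is simply re-applied.
BLAS.set_num_threads(blas_threads)
```

Comparing the two numbers shows whether BLAS is using one thread per physical core or per logical thread, which can matter for GEMM timings.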

photor · January 24, 2022, 7:07am

8, the default value

photor · January 24, 2022, 7:09am

The 9th gen cpus are older and may have better support

ImreSamu · January 24, 2022, 7:14am

As far as I can see, Julia 1.7 ships with:

# OpenBLAS
OPENBLAS_VER := 0.3.13

And from the OpenBLAS release notes (Releases · OpenMathLib/OpenBLAS · GitHub):

v0.3.19 : latest ( Dec 19, 2021 release )
v0.3.16 : “fixed cpu type autodetection for Intel Tiger Lake”
v0.3.14 : "Added CPUID autodetection for Intel Rocket Lake and Tiger Lake cpus"

11th generation Intel Core == Tiger Lake (mobile) or Rocket Lake (desktop, e.g. the i9-11900K)

IMHO / my suggestion

check the Julia Nightly builds

The current Julia Master is OPENBLAS_VER := 0.3.17
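One way to see what the bundled OpenBLAS detected at run time is sketched below. It assumes that `BLAS.openblas_get_config`, an unexported and undocumented helper, is present in the running Julia version and that OpenBLAS is the active backend; the code falls back gracefully if either assumption fails.

```julia
using LinearAlgebra

# Which BLAS libraries are currently loaded through libblastrampoline.
println(BLAS.get_config())

# ASSUMPTION: `BLAS.openblas_get_config` exists in this Julia version and
# OpenBLAS is the active backend. The returned string is expected to include
# the OpenBLAS version and the core type it selected.
openblas_info = if isdefined(BLAS, :openblas_get_config)
    try
        String(BLAS.openblas_get_config())
    catch
        nothing   # e.g. OpenBLAS is not the active backend
    end
else
    nothing
end

println(something(openblas_info, "OpenBLAS config string not available"))
```

If the reported core type does not match the CPU family, that would be consistent with the autodetection gaps listed in the release notes above.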

GregVernon · January 24, 2022, 3:46pm

That may be true, but that doesn’t mean that “OpenBLAS sucks!”

nilshg · January 24, 2022, 3:50pm

I agree - a more charitable and less inflammatory title might be “OpenBLAS is slow when using an older version on modern CPUs”

ChrisRackauckas · January 24, 2022, 3:51pm

I updated the title to be more informative and help future searches find this thread.

photor · January 25, 2022, 1:33am

Even with MKL, my 11th gen i9 is slower than your 9th gen i7 in this bench. F**k intel!

Oscar_Smith · January 25, 2022, 1:34am

There’s a reason the i9-11900k was reviewed as a “waste of sand” (video: “Pathetic: Intel Core i9-11900K CPU Review & Benchmarks: Gaming, Power, Production”, YouTube).

Edit: I slightly misremembered. It was the i9-10900k that was a “waste of sand”. The i9-11900k was the followup that was worse.

Elrod · January 25, 2022, 5:43am

No, it was the 11900K that Gamers Nexus called “a waste of sand”.
The 10900K was a 10 core Skylake chip.
The 11900K was an 8 core Rocket lake.

I’m a fan of AVX512, so I’m inclined to disagree and would take the 8 core rocket lake over the 10 core skylake.

However, rocket lake has only a single 512 bit FMA unit (actually, it’s two 256 bit units working together). This was also the case for certain Skylake-X chips, Ice Lake client, Tiger Lake, and Alder Lake before AVX512 was disabled altogether.
As such, you probably aren’t going to get better gemm performance: running AVX2, you use ports 0 and 1 to each do one 256 bit FMA per cycle, for 2x256 bits of FMA/cycle.
With AVX512, they work together to do 1x512 bits of FMA/cycle.

Most Skylake-X and Ice Lake server have 2x512 bits. These are the chips that do well on matrix multiply.
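The FMA-width argument above can be turned into a rough peak-throughput estimate. This is only a sketch: `peak_gflops` is a hypothetical helper, and the clock speeds are assumed nominal base clocks, not measured sustained clocks under AVX load.

```julia
# Hypothetical helper: theoretical peak double-precision GFLOP/s.
# fma_bits_per_cycle is the total FMA width per core per cycle, in bits.
# Each 64-bit lane performs one fused multiply-add, i.e. 2 flops, per cycle.
function peak_gflops(cores, ghz, fma_bits_per_cycle)
    lanes = fma_bits_per_cycle ÷ 64
    return cores * ghz * lanes * 2
end

# ASSUMED clocks: 3.5 GHz and 3.0 GHz are nominal base clocks chosen for
# illustration; real sustained clocks will differ.
rocket_lake = peak_gflops(8, 3.5, 512)        # 8 cores, 1x512 bits of FMA per cycle
skylake_x   = peak_gflops(18, 3.0, 2 * 512)   # 18-core Skylake-X, 2x512 bits per cycle

println("8-core, 1x512:  ", rocket_lake, " GFLOP/s")
println("18-core, 2x512: ", skylake_x, " GFLOP/s")
```

Note that 2x256 bits and 1x512 bits give the same number here, which is the point being made: on these chips AVX512 does not raise the theoretical GEMM peak, it only makes the peak easier to reach.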

Even on chips with just 1, it is easier to achieve peak performance with AVX512, however.
Execution is just one part of the pipeline.
Decoding one instruction is faster than decoding two, scheduling one vs two, etc…

However, GEMM is optimized well enough that it tends to be bottlenecked by execution.
There is some discussion here, where I found that my Tiger Lake CPU reached >99% of the theoretical peak performance, while IIRC the M1 was in the low 90% area.
The M1 has the same execution capability, but in the form of 4x 128 bit = 512. It’s just much harder to decode and schedule 4 instructions than 1.

While the clock-for-clock theoretical peak performance of the 11900K and 9700K is equal, it should be easier to get close to the former’s peak. Aside from being easier to schedule fewer instructions, the 11900K has better out of order capabilities, can fit 4x the data into named registers, has 50% larger L1 data cache, and 100% larger L2 cache.
Still, MKL is very good at getting close to 100% peak at large sizes, so it being more difficult to do so for the 9700K doesn’t really imply that MKL isn’t getting close enough anyway.

So clock speed is probably more important for GEMM. I’d have thought the 11900K’s clock speed is higher. Maybe the 9700K is overclocked and the 11900K isn’t? Or perhaps there are differences in cooling/thermal throttling?

Could also be that MKL is badly tuned for rocket lake. Maybe it treats rocket lake like Skylake-X, even though Skylake-X has twice the L2 cache (and 4x larger than Skylake’s), and can thus use much larger L2 blocks. Unlikely, but things like it are possible.

How does Octavian.jl compare on both?
It reads hardware data to generate specific code, but that code isn’t as well optimized, so it’s likely to have an easier time being fast on the 11900K.

photor · January 25, 2022, 6:57am

Professional analysis. Do you mean that, say, the i9-9980XE is good for matrix multiplication?

The Octavian result for my cpu is here:

julia> Threads.nthreads()
8

julia> A=randn(10000,10000);

julia> using Octavian

julia> @elapsed matmul(A,A)
19.5469914

julia> @elapsed matmul(A,A)
8.8592322

julia> @elapsed matmul(A,A)
8.8841109
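The first `matmul` timing above is much larger than the next two. In Julia the first call to a function typically includes compilation time, which is likely what happened here. A minimal sketch of separating warm-up from measurement follows; `timed_after_warmup` is a hypothetical helper, and the matrix is deliberately small so the sketch runs quickly.

```julia
using LinearAlgebra

# Hypothetical helper: call f once to trigger compilation, then return the
# minimum wall-clock time over `reps` further runs.
function timed_after_warmup(f, args...; reps = 3)
    f(args...)   # warm-up call; result discarded
    return minimum(@elapsed(f(args...)) for _ in 1:reps)
end

A = randn(200, 200)   # small size chosen only to keep the example fast
t = timed_after_warmup(*, A, A)
println("best of 3 runs: ", t, " s")
```

`BenchmarkTools.@btime`, used earlier in the thread, handles warm-up and repetition automatically and is usually the more convenient choice.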

Elrod · January 25, 2022, 8:10am

Yes. The 9980XE and 10980XE are more or less the same CPU.

julia> using LinearAlgebra; BLAS.set_num_threads(18);

julia> A=randn(10000,10000);

julia> @elapsed A*A
2.266004585

julia> @elapsed A*A
1.924472689

julia> using MKL

julia> @elapsed A*A
1.573432146

julia> @elapsed A*A
1.189004237

julia> versioninfo()
Julia Version 1.7.2-pre.0
Commit 3f024fd0ab* (2021-12-23 18:27 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz

photor · January 25, 2022, 8:50am

impressive!

photor · January 26, 2022, 1:07am

So the 10980XE is the best CPU for this job so far?

Oscar_Smith · January 26, 2022, 1:14am

No. Intel 12th gen is faster, and the fastest will be an AMD Epyc server CPU since those have 8 channel memory and up to 64 cores. (I’m not sure if we have good benchmarks on AWS’s graviton chips, but I would expect those to do very well also).
