What is the current state of multi-threaded BLAS in Julia?

Hi, I'm wondering about the current state of multi-threaded BLAS in Julia.

I remember looking into several options about a year ago for accelerating dense matrix multiplication and sparse solves in some PDE code I was working on. At the time, my impression was that OpenBLAS's multi-threaded scaling was quite poor. I tried it on my laptop (AMD Ryzen 7530) as well as on the Supercloud (single node), and in both cases dense multiplication with 8 OpenBLAS threads was barely faster than with 1.
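For reference, a minimal sketch of the kind of benchmark described above: timing a dense matrix multiply at 1 vs. 8 BLAS threads. The matrix size and thread counts here are illustrative choices, not the exact setup from the original tests.

```julia
using LinearAlgebra

# Time one dense GEMM at a given BLAS thread count and report approximate GFLOP/s.
function gemm_gflops(nthreads; n = 2000)
    BLAS.set_num_threads(nthreads)
    A, B = randn(n, n), randn(n, n)
    mul!(similar(A), A, B)            # warm-up call (compilation, thread spin-up)
    t = @elapsed mul!(similar(A), A, B)
    return 2n^3 / t / 1e9             # a dense GEMM does ~2n^3 flops
end

gf1 = gemm_gflops(1)
gf8 = gemm_gflops(8)
println("1 thread:  $(round(gf1; digits = 1)) GFLOP/s")
println("8 threads: $(round(gf8; digits = 1)) GFLOP/s (speedup ×$(round(gf8 / gf1; digits = 2)))")
```

If 8 threads report nearly the same GFLOP/s as 1, you are seeing the poor scaling described above.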

I recall seeing better results with MKL on the Supercloud's Intel CPUs, and I had the impression that no AMD-specific BLAS library would give performance comparable to MKL.

The reason I am asking is that I am speccing out a high-performance scientific workstation for our research lab, and I am wondering whether choosing an Intel CPU over an AMD one will give any performance benefit with Julia. My impression is that Intel with MKL will probably give the best multi-threaded BLAS performance - does that sound accurate?

1 Like

You can control the number of BLAS threads with an environment variable (`OPENBLAS_NUM_THREADS`) or at runtime. You can also use MKL.jl. There is also Octavian.jl.
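Concretely, the thread count can be set at runtime via `LinearAlgebra.BLAS`, and MKL.jl (once installed with `] add MKL`) swaps the BLAS backend through libblastrampoline. A sketch:

```julia
# Shell alternative (must be set before Julia starts):
#   export OPENBLAS_NUM_THREADS=8
using LinearAlgebra

BLAS.set_num_threads(8)       # runtime control of BLAS threads
@show BLAS.get_num_threads()

# using MKL                   # uncomment (after `] add MKL`) to switch the backend to MKL
@show BLAS.get_config()       # reports which BLAS library is currently loaded
```

Note that `using MKL` takes effect for subsequent BLAS calls in the same session; no rebuild of Julia is needed.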

though do be aware that Octavian.jl will soon be deprecated

2 Likes

AMD Zen 4 has AVX512, while on the Intel side only server CPUs do. If you're restricted to desktop-level parts, I'd go AMD if you're interested in any sort of linear algebra.
Although, their FMA throughputs are going to match (64 bytes of FMA per clock cycle), so it isn't that big a deal.

With server-class parts (which includes some, but not all, workstation CPUs), the Intel CPUs that support AVX512 can do 128 bytes of FMA per clock cycle.

4 Likes

But they reduce their clock frequency when doing so, so in most cases there is no advantage over AMD…

BLIS has good multi-threading performance on AMD. The easiest way to try it out is via BLISBLAS.jl (GitHub: JuliaLinearAlgebra/BLISBLAS.jl, the BLIS pendant of MKL.jl), or, with more access to the C API, through BLIS.jl (GitHub: JuliaLinearAlgebra/BLIS.jl, a low-level Julia wrapper for the BLIS typed interface).
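Trying BLISBLAS.jl looks roughly like this (a sketch, assuming the package has been installed with `] add BLISBLAS`):

```julia
using LinearAlgebra
using BLISBLAS               # swaps the default BLAS for BLIS via libblastrampoline

@show BLAS.get_config()      # BLIS should now appear among the loaded libraries

A, B = randn(1000, 1000), randn(1000, 1000)
C = A * B                    # dense GEMM is now dispatched to BLIS
```

As with MKL.jl, loading the package is enough; subsequent BLAS calls in the session go through BLIS.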

2 Likes

That’s not really true. Ice Lake and newer don’t really drop clock speeds, while older CPUs like Skylake-X and Cascade Lake, which do drop clock speeds, are still far faster than AMD when doing a decent job of leveraging AVX512.

MKL and other BLAS libraries do a good job leveraging AVX512.
A lot of @turbo code does, too (though LoopVectorization.jl is deprecated as of Julia 1.11).
Otherwise, unless you have tested, assume that AVX512 either isn’t being used or is being used badly.
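For context, the kind of @turbo code mentioned above looks like this microkernel (a minimal sketch; the function name is illustrative, and LoopVectorization.jl is deprecated on Julia ≥ 1.11):

```julia
using LoopVectorization   # provides @turbo

# Naive triple-loop GEMM; @turbo vectorizes and unrolls it,
# using AVX512 when the CPU supports it.
function gemm_turbo!(C, A, B)
    @turbo for m in axes(A, 1), n in axes(B, 2)
        acc = zero(eltype(C))
        for k in axes(A, 2)
            acc += A[m, k] * B[k, n]
        end
        C[m, n] = acc
    end
    return C
end

A, B = randn(64, 64), randn(64, 64)
C = gemm_turbo!(similar(A), A, B)
maximum(abs, C - A * B)   # should be near machine epsilon
</imports-placeholder>
```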

1 Like

Thank you all for your inputs. It isn’t entirely clear which option is best, although going with Intel + MKL seems like a pretty safe bet.

OK, thanks for this advice. One thing is that I definitely want a ‘desktop’ CPU, since the workstation will mostly run ‘small’ jobs that aren’t massively parallelizable, so we want a CPU with excellent single-thread performance (i.e. not Xeons). If the jobs were highly parallelizable, we would just put them on an HPC cluster.