What is the current state of multi-threaded BLAS in Julia?

Hi, I'm wondering about the current state of multi-threaded BLAS in Julia.

I remember looking into several options about a year ago for accelerating dense matrix multiplication and sparse solves in some PDE code I was working on. At the time, my impression was that OpenBLAS's multi-threaded scaling was quite poor. I tried it on my laptop (AMD Ryzen 7530) as well as on the Supercloud (single node), and in both cases dense multiplication with 8 OpenBLAS threads was barely faster than with 1.
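For reference, a minimal sketch of the kind of benchmark described above: timing a dense matrix multiply at 1 vs. 8 BLAS threads. The matrix size and thread counts here are illustrative choices, not the exact setup from the original tests.

```julia
using LinearAlgebra

# Time one dense GEMM at a given BLAS thread count and report approximate GFLOP/s.
function gemm_gflops(nthreads; n = 2000)
    BLAS.set_num_threads(nthreads)
    A, B = randn(n, n), randn(n, n)
    mul!(similar(A), A, B)            # warm-up call (compilation, thread spin-up)
    t = @elapsed mul!(similar(A), A, B)
    return 2n^3 / t / 1e9             # a dense GEMM does ~2n^3 flops
end

gf1 = gemm_gflops(1)
gf8 = gemm_gflops(8)
println("1 thread:  $(round(gf1; digits = 1)) GFLOP/s")
println("8 threads: $(round(gf8; digits = 1)) GFLOP/s (speedup ×$(round(gf8 / gf1; digits = 2)))")
```

If 8 threads report nearly the same GFLOP/s as 1, you are seeing the poor scaling described above.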

I recall seeing better results with MKL on the Supercloud's Intel CPUs, and I had the impression that no AMD-specific BLAS library would give performance comparable to MKL.

The reason I am asking is that I am speccing out a high-performance scientific workstation for our research lab, and I am wondering whether choosing an Intel CPU over an AMD one will give any performance benefit with Julia. My impression is that Intel with MKL will probably give the best multi-threaded BLAS performance - does that sound accurate?

1 Like

You can control the number of BLAS threads with an environment variable (`OPENBLAS_NUM_THREADS`) or at runtime. You can also use MKL.jl. There is also Octavian.jl.
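Concretely, the thread count can be set at runtime via `LinearAlgebra.BLAS`, and MKL.jl (once installed with `] add MKL`) swaps the BLAS backend through libblastrampoline. A sketch:

```julia
# Shell alternative (must be set before Julia starts):
#   export OPENBLAS_NUM_THREADS=8
using LinearAlgebra

BLAS.set_num_threads(8)       # runtime control of BLAS threads
@show BLAS.get_num_threads()

# using MKL                   # uncomment (after `] add MKL`) to switch the backend to MKL
@show BLAS.get_config()       # reports which BLAS library is currently loaded
```

Note that `using MKL` takes effect for subsequent BLAS calls in the same session; no rebuild of Julia is needed.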

though do be aware that Octavian.jl will soon be deprecated

2 Likes

AMD Zen 4 has AVX512, while on the Intel side only server CPUs do. If you're restricted to desktop-level parts, I'd go AMD if you're interested in any sort of linear algebra.
Although, their FMA throughputs are going to match (64 bytes of FMA per clock cycle), so it isn't that big a deal.

With server-class parts (which includes some, but not all, workstation CPUs), the Intel CPUs that support AVX512 can do 128 bytes of FMA per clock cycle.

4 Likes

But they reduce their clock frequency when doing so, so in most cases there is no advantage over AMD…

BLIS has good multi-threading performance on AMD. The easiest way to try it out is via BLISBLAS.jl (GitHub: JuliaLinearAlgebra/BLISBLAS.jl, the BLIS pendant of MKL.jl), or, with more access to the C API, through BLIS.jl (GitHub: JuliaLinearAlgebra/BLIS.jl, a low-level Julia wrapper for the BLIS typed interface).
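Trying BLISBLAS.jl looks roughly like this (a sketch, assuming the package has been installed with `] add BLISBLAS`):

```julia
using LinearAlgebra
using BLISBLAS               # swaps the default BLAS for BLIS via libblastrampoline

@show BLAS.get_config()      # BLIS should now appear among the loaded libraries

A, B = randn(1000, 1000), randn(1000, 1000)
C = A * B                    # dense GEMM is now dispatched to BLIS
```

As with MKL.jl, loading the package is enough; subsequent BLAS calls in the session go through BLIS.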

2 Likes

That’s not really true. Ice Lake and newer don’t really drop clock speeds, while older CPUs like Skylake-X and Cascade Lake, which do drop clock speeds, are still far faster than AMD when doing a decent job of leveraging AVX512.

MKL and other BLAS libraries do a good job leveraging AVX512.
A lot of @turbo code does, too (though LoopVectorization.jl is deprecated as of Julia 1.11).
Otherwise, unless you have tested, assume that AVX512 either isn’t being used or is being used badly.
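For context, the kind of @turbo code mentioned above looks like this microkernel (a minimal sketch; the function name is illustrative, and LoopVectorization.jl is deprecated on Julia ≥ 1.11):

```julia
using LoopVectorization   # provides @turbo

# Naive triple-loop GEMM; @turbo vectorizes and unrolls it,
# using AVX512 when the CPU supports it.
function gemm_turbo!(C, A, B)
    @turbo for m in axes(A, 1), n in axes(B, 2)
        acc = zero(eltype(C))
        for k in axes(A, 2)
            acc += A[m, k] * B[k, n]
        end
        C[m, n] = acc
    end
    return C
end

A, B = randn(64, 64), randn(64, 64)
C = gemm_turbo!(similar(A), A, B)
maximum(abs, C - A * B)   # should be near machine epsilon
</imports-placeholder>
```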

1 Like

Thank you all for your inputs. It isn’t entirely clear which option is best, although going with Intel + MKL seems like a pretty safe bet.

OK, thanks for this advice. One thing is that I definitely want a ‘desktop’ CPU, since the workstation will mostly run ‘small’ jobs that aren’t massively parallelizable, so we want a CPU with excellent single-thread performance (i.e. not Xeons). If the jobs were highly parallelizable, we would just put them on an HPC cluster.