I have been benchmarking some code and coming across all sorts of questions (see my other thread for in-depth details, I will post a progress update soon). However, I have an OpenBLAS question that I believe merits its own thread.
To summarize, the problem at hand can be found here (very rough code). It is a PDE-derived ODE system solved with DifferentialEquations.jl, very much a real-world problem rather than a synthetic benchmark. Here, I am benchmarking the `DifferentialEquations.solve` call, which makes heavy use of BLAS according to my profiles (shown in the other thread if you're curious).
Basically, check out src/example.jl. The table below shows the minimum time in ms.
| Julia threads ↓ \ BLAS threads → | 16 | 8 | 1 |
| --- | --- | --- | --- |
| 16 | 750 | 1505 | 538 |
| 1 | 900 | 700 | 694 |
In the table, the first row corresponds to `Threads.nthreads() == 16` and the first column corresponds to setting `BLAS.set_num_threads(16)`.
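For context, each cell was produced roughly like the sketch below. The toy problem is just a stand-in for the actual ODEProblem built in src/example.jl, and `Rodas5` is an arbitrary stiff-solver choice for illustration, not necessarily what I use:

```julia
using DifferentialEquations, BenchmarkTools, LinearAlgebra

# Stand-in problem (the real one lives in src/example.jl): a stiff linear ODE,
# so that solve() actually hits BLAS/LAPACK through the implicit solver.
A = randn(100, 100) - 100I
f!(du, u, p, t) = mul!(du, A, u)
prob = ODEProblem(f!, rand(100), (0.0, 1.0))

# Julia threads are fixed at startup (e.g. `julia -t 16`);
# the BLAS thread count can be changed at runtime:
BLAS.set_num_threads(1)

@btime solve($prob, Rodas5());   # the table reports the minimum time of a call like this
```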
It is surprising to many people that 16 Julia threads with 1 OpenBLAS thread is the fastest option, but this has already been discussed extensively here by @carstenbauer (the whole thread is worth reading if you do any serious OpenBLAS calls).
One important thing to note concerns power and CPU usage:
- Scenario 1 (16 Julia threads, 1 BLAS thread): 60 W and 300% CPU utilization.
- Scenario 2 (16 Julia threads, 16 BLAS threads): 190 W and 1600% CPU utilization. It takes 25% longer than Scenario 1 and is even slower than the sequential code.
Questions and Discussion
1. What is the meaning of the special case OPEN_BLAS_NUM_THREADS=1? What exactly is going on here, and why do we need this when running multithreaded code? Does this mean that if multithreading doesn't benefit my code much, I should "de-multithread" it so I can run OpenBLAS with 16 threads and the rest of the code with 1 thread? I'm confused.
2. Why is my CPU usage low when I use 1 OpenBLAS thread? The code still runs multithreaded according to htop, but only utilizes a small percentage of the available computational resources.
3. Why in the world is OpenBLAS with 16 threads using so much computing power while running slower than OPEN_BLAS_NUM_THREADS=1? Is it running identical computations on 16 threads and then synchronizing, and that's why it's slower?
4. Most importantly, how do I get my code to fully use my computer's resources? 1600% usage and a very hot CPU are what I'm looking for.
System Details:
- Julia 1.7.1 compiled with OpenBLAS 0.3.19
- i7-12700K (8 cores / 16 threads) overclocked to 5.1 GHz, with E-cores turned off and AVX/AVX2/AVX-512 support enabled. Cache overclocked to 4.2 GHz for good measure.
- Overclocked DDR5 memory for good measure as well (64 GB 5200 CL36)
Out of curiosity, can you run the same comparison with MKL?
It tends to be much better at threading than OpenBLAS, especially at small sizes.
OpenBLAS is known to get much better performance single-threaded than multithreaded for `lu` on matrices below a fairly large size, on CPUs with many threads.
Given that OpenBLAS regressed even with single-threaded Julia, it doesn't seem related to any potential conflict between Julia and BLAS threads.
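If you want to see that effect in isolation, outside of the ODE solve, a quick check along these lines should show it (matrix size picked arbitrarily):

```julia
using LinearAlgebra, BenchmarkTools

A = rand(200, 200)        # smallish factorization; size chosen arbitrarily

BLAS.set_num_threads(1)
@btime lu($A);            # single-threaded OpenBLAS

BLAS.set_num_threads(16)
@btime lu($A);            # multithreaded OpenBLAS
```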
Also, how well does it do with E-cores enabled?
Have you tried that? Alder Lake only has one 512-bit FMA unit (technically, it's actually the two 256-bit units from ports 0 and 1 working together, assuming it's implemented the same way as in previous generations), so GEMM itself won't benefit much, unlike on the chips with two 512-bit FMA units.
Done, it is indeed slightly better (~13% comparing 16/1 OpenBLAS to 8/8 MKL). I added more data points out of curiosity.
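In case it is useful to anyone, this is roughly how the switch looks (a sketch; on 1.7, MKL.jl swaps the BLAS/LAPACK backend at runtime via libblastrampoline):

```julia
using MKL                # swap the default BLAS/LAPACK backend for MKL
using LinearAlgebra

BLAS.get_config()        # should now list an MKL library instead of libopenblas
BLAS.set_num_threads(8)  # this is what "MKL threads" refers to in the table below
```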
| Julia threads ↓ \ MKL threads → | 16 | 8 | 2 | 1 |
| --- | --- | --- | --- | --- |
| 16 | 11647 (1600%) | 11281 (1600%) | 519 (600%) | 495 (300%) |
| 8 | 467 (900%) | 467 (900%) | 477 (300%) | 475 (200%) |
| 1 | 637 (1600%) | 637 (800%) | 650 (200%) | 663 (100%) |
So far, the best option is 8 Julia threads and 8 MKL threads. The CPU usage is shown in parentheses, and as you can see, those numbers are quite wonky as well. I am genuinely confused. Is there even a way to get to 1600% CPU usage "properly" (i.e., without sacrificing performance because I'm doing something wrong)? I just want to utilize my whole machine!
I will benchmark this with E-cores tomorrow morning; I will roll back the overclock and use the CPU in its default state with E-cores enabled until I can overclock them. Thank you for the explanation, I wasn't aware of the limitations of Alder Lake.
Honestly, I don't even know how to add AVX to my code besides your `@avx` macro, which currently doesn't work for me due to branching in the loop.
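For what it's worth, one general trick (independent of LoopVectorization, and only applicable when the branch can be written as a select) is to express the branch with `ifelse` so that `@simd` at least has a chance to vectorize the loop. The kernel below is made up, purely for illustration:

```julia
# Hypothetical branchy kernel ...
function clamp_sqrt_branchy!(y, x)
    @inbounds for i in eachindex(x, y)
        if x[i] > 0
            y[i] = sqrt(x[i])
        else
            y[i] = 0.0
        end
    end
    return y
end

# ... and a branchless rewrite. Note that `ifelse` evaluates both arms,
# hence the `abs` guard so the discarded sqrt can't throw a DomainError.
function clamp_sqrt_branchless!(y, x)
    @inbounds @simd for i in eachindex(x, y)
        y[i] = ifelse(x[i] > 0, sqrt(abs(x[i])), 0.0)
    end
    return y
end
```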
Disclaimer: I haven’t had the time to look into the details of your computation. So take the following as a general note and a bit of speculation.
Depending on the computation / workload at hand, "utilizing the whole machine" doesn't necessarily mean 1600% CPU usage (if we take it that CPU usage is indicative of performance, which isn't necessarily the case either). For example, if (the major part of) your computation is memory-bound (which is very often the case in scientific codes), you can add threads and enable more aggressive SIMD etc. as much as you want without much changing the utilization of your cores or the overall performance of your code. After all, that is what it means to be memory-bound rather than compute-bound. (See here for a hopefully instructive example.)
For example, just briefly looking at your numbers here, I see that changing the number of MKL threads in the single-Julia-thread case doesn't change the timings much. This could indicate that the BLAS computations are memory-bound and that the achievable memory bandwidth is already saturated with, say, 8 threads.
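One way to check this is to time a bandwidth-bound BLAS call (e.g. a gemv that streams a large matrix) at a few thread counts and see whether the timings plateau. A rough sketch, with arbitrary sizes:

```julia
using LinearAlgebra, BenchmarkTools

A = rand(8_000, 8_000); x = rand(8_000); y = similar(x)   # ~512 MB matrix

for n in (1, 2, 4, 8, 16)
    BLAS.set_num_threads(n)
    t = @belapsed mul!($y, $A, $x)   # gemv streams the whole matrix, so it is memory-bound
    println(n, " BLAS thread(s): ", round(t * 1e3; digits = 2), " ms")
end
```

If the timings stop improving well before 16 threads, the extra threads are mostly waiting on memory.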
(BTW, I guess your table is a bit misleading, since for 16 Julia threads and 16 MKL threads there will actually be 16*16 MKL threads in total. I guess by "MKL threads" you actually mean `MKL_NUM_THREADS`, which, for multithreaded Julia, is very much different. This is likely the reason why you see such bad performance with increasing `MKL_NUM_THREADS` for 16 Julia threads: you're extensively oversubscribing your cores.)
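You can always query both settings to see how many threads you are actually asking for:

```julia
using LinearAlgebra

Threads.nthreads()       # Julia threads, fixed at startup (julia -t N)
BLAS.get_num_threads()   # threads each BLAS call may use
# With MKL and multithreaded Julia, up to the product of the two can be
# active at once, which is how you end up oversubscribing 16 cores.
```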
This has already been mentioned by me and @jpsamaroo in the other discourse thread you linked. Basically, `OPENBLAS_NUM_THREADS=1` (you have a typo there) is special because it makes OpenBLAS computations run on the respective calling (Julia) thread. For `OPENBLAS_NUM_THREADS>1` the behavior changes qualitatively in the sense that OpenBLAS will create its own pool of OpenBLAS threads, which it will use to run BLAS computations triggered by any of the Julia threads (there is only a single pool of OpenBLAS threads, irrespective of how many Julia threads you have). Hence, assuming `Threads.nthreads() == 16`, setting `OPENBLAS_NUM_THREADS=1` will effectively make all your BLAS computations run on all of the Julia threads (16), whereas setting `OPENBLAS_NUM_THREADS=2` will make all your BLAS computations run on only 2 separate OpenBLAS threads. That's why you see such horrible performance for your 16/8 case, for example.
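In code, the recommended setup therefore looks roughly like this (a toy example, with independent factorizations standing in for whatever per-thread work your application actually does):

```julia
using LinearAlgebra

BLAS.set_num_threads(1)   # same effect as OPENBLAS_NUM_THREADS=1

# Toy stand-in for "independent chunks of work that each call into BLAS":
# with 1 BLAS thread, each lu! runs on the Julia thread that calls it,
# so the Julia-level parallelism isn't funneled through a shared OpenBLAS pool.
mats = [rand(500, 500) for _ in 1:64]
Threads.@threads for M in mats
    lu!(M)
end
```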
As for your other question, in general, multithreading your computation with Julia threads (if possible) and using `OPENBLAS_NUM_THREADS=1` should be better than using only a single Julia thread and `OPENBLAS_NUM_THREADS=16`. The main point is that you can parallelize your specific application much more effectively than OpenBLAS, which can only parallelize the BLAS parts. However, as with every "rule of thumb", there are exceptions and it can depend on the computation at hand. (BTW, in your case, the rule of thumb seems to hold: compare 16/1 (538) to 1/16 (900).)
Note that the code we’re talking about is mostly using BLAS1, not BLAS2 or BLAS3 calls. That greatly decreases the potential of multithreading. It’s pretty inherent to the algorithm.
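If you want to see the difference yourself, comparing a level-1 and a level-3 call at 1 vs. 16 BLAS threads makes it fairly clear (a sketch with arbitrary sizes):

```julia
using LinearAlgebra, BenchmarkTools

x = rand(10^7); y = rand(10^7)                                    # BLAS1 operands
A = rand(2_000, 2_000); B = rand(2_000, 2_000); C = similar(A)    # BLAS3 operands

for n in (1, 16)
    BLAS.set_num_threads(n)
    t1 = @belapsed BLAS.axpy!(2.0, $x, $y)   # BLAS1: y = 2x + y, O(n) flops for O(n) memory traffic
    t3 = @belapsed mul!($C, $A, $B)          # BLAS3: O(n^3) flops for O(n^2) memory traffic
    println(n, " thread(s): axpy ", round(t1 * 1e3; digits = 2),
            " ms, gemm ", round(t3 * 1e3; digits = 2), " ms")
end
```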
Thank you for explaining this to me so clearly. I hadn't thought about that; I always thought memory bandwidth issues were rarer than this. What you said seems to describe my code perfectly.
While looking this up, I found this helpful post as well.
To test this for my code, I ran a little experiment. I set my RAM to different speeds: 4000CL36, 5200CL36, and 6000CL36. For each configuration I benchmarked the read/write speeds and latencies as well as two simulations. The results are below.
| | 4000CL36 64 GB (Default) | 5200CL36 64 GB (OC) | 6000CL36 32 GB (XMP) |
| --- | --- | --- | --- |
| μ (s) | 0.194 | 0.192 | 0.193 |
| β (s) | 10.9 | 10.74 | 10.885 |
| Read (GB/s) | 62 | 81 | 92 |
| Write (GB/s) | 59 | 76 | 83 |
| Latency (ns) | 93 | 75 | 71 |
As you can see, the increase in RAM speed (which improves bandwidth, if I understand read/write speeds properly) does not improve the code's performance. I am not sure what's going on; I am probably benchmarking the wrong memory metric.
I’d seen it mentioned in the other thread but honestly didn’t understand it completely, so thank you for this excellent explanation. Thank you for answering everything else as well.
Here, P-cores OC = 5.1 GHz, E-cores OC = 4.0 GHz. μ is a non-oscillating solution with a simulation run for 100 time units, and β is an oscillating solution run for 10,000 time units. Julia is run with 20 threads (16 from the P-cores and 4 from the E-cores) and I use `BLAS.set_num_threads(1)`.
As you can see, the P-cores-only version runs about 30% faster, which is non-negligible. However, there may very well be a better combination of Julia threads/MKL threads for this hybrid architecture, but I can't quite wrap my head around it.
Also, haven’t we concluded that the calculation is memory-bound anyway? So throwing more cores at it shouldn’t accelerate it. This might be a bad test.