Consider an MWE where some moderately heavy calculations are performed in a parallel loop. Here, I simulate the calculations by computing eigen:
using LinearAlgebra, Random
BLAS.set_num_threads(1) # turn off BLAS parallelisation
Random.seed!(1)
function testp(M, n)
    Threads.@threads for _ in 1:n
        eigen(M)
    end
end

for sz in (100, 7000) # call testp twice: first with a small matrix, then with a large one
    M = rand(sz, sz)
    print("$sz: ")
    @time testp(M, Threads.nthreads())
end
The first call to testp with a 100x100 matrix is intended to compile the function, while the second call is a “real” calculation, which takes quite some time.
The problem is that when testp is called for the second time, not all CPU cores are utilised, and utilisation is inconsistent between Julia launches.
Here are the results of a sample run on an M2 Pro Mac with 8 performance cores and 32 GB RAM, launched as julia -t 8:
100: 0.479790 seconds (1.28 M allocations: 89.085 MiB, 1.87% gc time, 794.34% compilation time)
7000: 480.801438 seconds (1.02 k allocations: 11.741 GiB, 0.02% gc time)
During the 7000x7000 calculation, only two cores out of eight were utilised.
I restart Julia and repeat the calculation:
100: 0.488900 seconds (1.29 M allocations: 90.483 MiB, 2.78% gc time, 782.40% compilation time)
7000: 289.529241 seconds (265 allocations: 11.741 GiB, 0.02% gc time)
This time, during the 7000x7000 calculation, only two cores were utilised at first, but after ~2 minutes five more cores kicked in and remained in use until the end.
If I change the respective line of code to for sz in (7000, 7000), all eight cores might be utilised in a fresh Julia session:
7000: 182.412394 seconds (1.28 M allocations: 11.825 GiB, 0.02% gc time, 2.02% compilation time)
7000: 182.639939 seconds (265 allocations: 11.741 GiB, 0.01% gc time)
But another fresh run gives the following:
7000: 278.456918 seconds (1.29 M allocations: 11.826 GiB, 0.01% gc time, 1.31% compilation time) [CPU utilisation: starts at 7 cores, after 2.5 min only 3 cores]
7000: 384.224213 seconds (265 allocations: 11.741 GiB, 0.03% gc time) [CPU utilisation: starts at 2 cores, after 2 min -- 4 cores, after 2 more min -- 5 cores]
To see for yourself, you could simply run the above MWE and check CPU utilisation when the 7000^2 calculation starts (no need to wait for the calculation to finish).
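If it helps with diagnosing, here is a rough tracing variant of testp (the name testp_traced and the bookkeeping are mine, purely for illustration): it records which thread picked up each iteration and how long the eigen call took, so uneven scheduling shows up in the printout without watching a system monitor.

using LinearAlgebra

# Rough diagnostic: log the thread id and the duration of each iteration.
# With the default :dynamic schedule (Julia 1.8+) iterations may migrate
# between threads, so the recorded id only shows where the iteration started.
function testp_traced(M, n)
    tids  = Vector{Int}(undef, n)
    times = Vector{Float64}(undef, n)
    Threads.@threads for i in 1:n
        tids[i]  = Threads.threadid()
        times[i] = @elapsed eigen(M)
    end
    for i in 1:n
        println("iteration $i: thread $(tids[i]), $(round(times[i]; digits = 1)) s")
    end
end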
Let me emphasise several points:
- The issue does not appear if smaller matrices are used, e.g. 3000^2 instead of 7000^2. In the case of smaller matrices, all cores are always utilised.
- The issue persists even if more work is supplied for the threads, i.e. if the second argument to testp is 100 * Threads.nthreads() instead of 1 * Threads.nthreads().
- I could replicate the issue when using matrix multiplication M * M * M instead of eigen (see the sketch after this list).
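For reference, the matrix-multiplication variant of the MWE I used for that last point looks roughly like this (testp_mul is my name for it, chosen only for this post):

using LinearAlgebra

# Same structure as testp, but with chained matrix multiplication as the
# per-iteration workload instead of an eigendecomposition.
function testp_mul(M, n)
    Threads.@threads for _ in 1:n
        M * M * M
    end
end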
I also tested the above MWE on an Intel i7-1260P (32 GB RAM) running both Windows 11 and Linux; the results were of the same nature.
I am still trying to figure out whether this is related to linear algebra operations or not. At least OpenBLAS is likely not to blame, because I can replicate the results with AppleAccelerate and MKL.
In my real-world problem, I am solving ODE systems in a multithreaded loop on a 32/64-core AMD Threadripper, with only 2 cores utilised (and plenty of free RAM).
So why is the CPU utilisation so inconsistent?
Any feedback is highly appreciated!
versioninfo()
Julia Version 1.10.1
Commit 7790d6f0641 (2024-02-13 20:41 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin22.4.0)
CPU: 12 Ă— Apple M2 Pro
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
JULIA_EDITOR = code
JULIA_NUM_THREADS = 8