Consider an MWE where some moderately heavy calculations are performed in a parallel loop. Here, I simulate the calculations by computing eigen:
using LinearAlgebra, Random
BLAS.set_num_threads(1) # turn off BLAS parallelisation
Random.seed!(1)
function testp(M, n)
    Threads.@threads for _ in 1:n
        eigen(M)
    end
end

for sz in (100, 7000) # call testp twice: first with a small matrix, then with a large one
    M = rand(sz, sz)
    print("$sz: ")
    @time testp(M, Threads.nthreads())
end
The first call to testp with a 100x100 matrix is intended to compile the function, while the second call is a “real” calculation, which takes quite some time.
The problem is that when testp is called for the second time, not all CPU cores are utilised, and utilisation is inconsistent between Julia launches.
Here are the results of a sample run on an M2 Pro Mac with 8 performance cores and 32 GB RAM, launched as julia -t 8:
100: 0.479790 seconds (1.28 M allocations: 89.085 MiB, 1.87% gc time, 794.34% compilation time)
7000: 480.801438 seconds (1.02 k allocations: 11.741 GiB, 0.02% gc time)
During the 7000x7000 calculation, only two cores out of eight were utilised.
I restart Julia and repeat the calculation:
100: 0.488900 seconds (1.29 M allocations: 90.483 MiB, 2.78% gc time, 782.40% compilation time)
7000: 289.529241 seconds (265 allocations: 11.741 GiB, 0.02% gc time)
This time, during the 7000x7000 calculation, only two cores were utilised at first, but after ~2 minutes five more cores kicked in and remained in use until the end.
If I change the respective line of code to for sz in (7000, 7000), all eight cores might be utilised in a fresh Julia session:
7000: 182.412394 seconds (1.28 M allocations: 11.825 GiB, 0.02% gc time, 2.02% compilation time)
7000: 182.639939 seconds (265 allocations: 11.741 GiB, 0.01% gc time)
But another fresh run gives the following:
7000: 278.456918 seconds (1.29 M allocations: 11.826 GiB, 0.01% gc time, 1.31% compilation time) [CPU utilisation: starts at 7 cores, after 2.5 min only 3 cores]
7000: 384.224213 seconds (265 allocations: 11.741 GiB, 0.03% gc time) [CPU utilisation: starts at 2 cores, after 2 min -- 4 cores, after 2 more min -- 5 cores]
To see for yourself, you could simply run the above MWE and check CPU utilisation when the 7000^2 calculation starts (no need to wait for the calculation to finish).
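If it helps with diagnosing, here is a rough tracing variant of testp (the name testp_traced and the bookkeeping are mine, purely for illustration): it records which thread picked up each iteration and how long the eigen call took, so uneven scheduling shows up in the printout without watching a system monitor.

using LinearAlgebra

# Rough diagnostic: log the thread id and the duration of each iteration.
# With the default :dynamic schedule (Julia 1.8+) iterations may migrate
# between threads, so the recorded id only shows where the iteration started.
function testp_traced(M, n)
    tids  = Vector{Int}(undef, n)
    times = Vector{Float64}(undef, n)
    Threads.@threads for i in 1:n
        tids[i]  = Threads.threadid()
        times[i] = @elapsed eigen(M)
    end
    for i in 1:n
        println("iteration $i: thread $(tids[i]), $(round(times[i]; digits = 1)) s")
    end
end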
Let me emphasise several points:
- The issue does not appear if smaller matrices are used, e.g. 3000^2 instead of 7000^2. In the case of smaller matrices, all cores are always utilised.
- The issue persists even if more work is supplied for the threads, i.e. if the second argument to testp is 100 * Threads.nthreads() instead of 1 * Threads.nthreads().
- I could replicate the issue when using matrix multiplication M * M * M instead of eigen (see the sketch after this list).
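For reference, the matrix-multiplication variant of the MWE I used for that last point looks roughly like this (testp_mul is my name for it, chosen only for this post):

using LinearAlgebra

# Same structure as testp, but with chained matrix multiplication as the
# per-iteration workload instead of an eigendecomposition.
function testp_mul(M, n)
    Threads.@threads for _ in 1:n
        M * M * M
    end
end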
I also tested the above MWE on an Intel i7-1260P (32 GB RAM) running both Windows 11 and Linux; the results were of the same nature.
I am still trying to figure out whether this is related to linear algebra operations or not. At least OpenBLAS is likely not to blame, because I can replicate the results with AppleAccelerate and MKL.
In my real-world problem, I am solving ODE systems in a multithreaded loop on a 32/64-core AMD Threadripper, with only 2 cores utilised (and plenty of free RAM).
So why is the CPU utilisation so inconsistent?
Any feedback is highly appreciated!
versioninfo()
Julia Version 1.10.1
Commit 7790d6f0641 (2024-02-13 20:41 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin22.4.0)
CPU: 12 Ă— Apple M2 Pro
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
JULIA_EDITOR = code
JULIA_NUM_THREADS = 8