@threads - machine stops using allocated cores during run

Dear all,

I have a function nested inside another function that I can run in parallel, and I am using the ‘@threads’ macro for that. Calling that outside function takes several hours and, at the beginning, works as expected: all intended threads are used on my machine.

Sometimes, however, at a seemingly random iteration, I can see that not all allocated threads are used anymore; instead, only one thread is used until the outside function is finished. A simplified MWE:

function inside_fun(n)
    return randn(n)
end
function outside_fun(timesteps::Int64, n::Int64)
    for iter in 1:timesteps
        Base.Threads.@threads for i in 1:n   # distribute the inner iterations across threads
            inside_fun(10)
        end
    end
    return nothing
end
outside_fun(10, 5)
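
For reference, a variant of this MWE that records which thread handles work in each outer iteration would make the drop to a single thread directly visible. This is only a sketch; outside_fun_logged and the used counter are made up for illustration:

function outside_fun_logged(timesteps::Int64, n::Int64)
    for iter in 1:timesteps
        used = zeros(Int, Base.Threads.nthreads())   # per-thread iteration counter
        Base.Threads.@threads for i in 1:n
            used[Base.Threads.threadid()] += 1       # each thread only touches its own slot
            inside_fun(10)
        end
        println("iteration $iter used $(count(>(0), used)) of $(Base.Threads.nthreads()) threads")
    end
    return nothing
end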

I have tried to google this problem, but unfortunately without success.

(1) What is the reason that at some point Julia does not use all allocated threads anymore with the @threads macro?
(2) Also, is there a way to fix this and “force” Julia to continue using all allocated threads on my machine?

You could simply be witnessing the last (n-1) calls finishing up. When I call your outside_fun(10_000_000, 5) (the only way to make it run long enough to measure with htop), the CPU stays pegged at 400% until the last second, then drops to 300% and then to ~0%. Alternatively, it could be the GC trying to clean up? It’s hard to tell without more information. If you run it under a benchmarking tool, does it report a lot of time spent in GC?
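For example, something along these lines (just a sketch, with arbitrary call counts) would show the GC fraction directly:

# Rough check of how much of the runtime goes to garbage collection.
stats = @timed outside_fun(10_000, 5)
println("total time: ", stats.time, " s")
println("GC time:    ", stats.gctime, " s (",
        round(100 * stats.gctime / stats.time; digits = 1), " %)")

BenchmarkTools.@benchmark would also report a GC percentage alongside the timings.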

Thank you for your answer!

You could simply be witnessing the last (n-1) calls finishing up.

No, it often happens with more than a third of the outside iterations still left. The function above was just an example, because the actual functions involve several packages and I don’t think anyone would want to look through all of that.

Alternatively, it could be the GC trying to clean up?

This could definitely be the case. It might also be that memory fills up at some point before the GC kicks in. Could this cause the behaviour I described above?

If your memory were full, you would usually see either a crash or a dramatic slowdown as the system starts using swap (your disk as memory). If the behaviour persists indefinitely while the iterations still make progress, it seems unlikely to be GC (but perhaps someone with more domain expertise can offer an insight).
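
If you want to rule that out, you could log memory pressure once per outer iteration; log_memory below is a hypothetical helper, not something your code already has:

# Sketch: call this once per outer iteration to watch for shrinking free RAM.
function log_memory(iter)
    free_gb = Sys.free_memory() / 2^30
    live_gb = Base.gc_live_bytes() / 2^30
    println("iter $iter: free RAM ≈ $(round(free_gb; digits = 2)) GiB, ",
            "GC-tracked heap ≈ $(round(live_gb; digits = 2)) GiB")
end

If free RAM stays roughly constant while the slowdown happens, swapping is probably not the culprit.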

What seems more likely is that one of the “several packages” is hitting a bottleneck, but again, without more information I cannot say with confidence. What type of problem are you solving? ODEs? Matrix factorizations? Which packages are you using?

I ran your script with 10 replaced by 1000000, and watched what julia was doing using perf, with the command

sudo perf top -p $(pidof julia)

The output

  62.68%  libjulia-internal.so.1.7  [.] get_next_task
   2.89%  libjulia-internal.so.1.7  [.] jl_task_get_next
   2.58%  libjulia-internal.so.1.7  [.] jl_process_events
   1.87%  [kernel]                  [k] delay_halt_mwaitx

tells me that most of your code is spending its time in the overhead of switching between the different threads’ tasks. This is with 16 threads.

Maybe your MWE doesn’t represent your actual code, or maybe the threading overhead really is degrading your performance that much; it’s hard to tell. I would recommend running perf top, htop, and iotop to see if your system can tell you more.
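
If the scheduling overhead itself turns out to be the problem, one option is to put @threads on the outer loop so that each task does far more work per scheduling event. This is only a sketch and assumes the outer iterations are actually independent, which may not hold in your real code:

function outside_fun_chunked(timesteps::Int64, n::Int64)
    # One long-lived task per chunk of outer iterations,
    # instead of a fresh batch of tiny tasks every timestep.
    Base.Threads.@threads for iter in 1:timesteps
        for i in 1:n
            inside_fun(10)
        end
    end
    return nothing
end

With timesteps in the millions and n = 5, this replaces millions of short-lived threaded regions with a handful of long-running chunks, so time spent in get_next_task should drop sharply.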