Multithreaded calculation slows down to 2 threads, debug how?

Hi all,

I’m running Julia on an AMD EPYC machine with 128 threads to solve a specific numerical problem. The code calculates a 2D coordinate grid and then loops over all the coordinates via Threads.@threads for (a,b) in point_array … Inside the loop, DifferentialEquations is used to solve an ODE system, and the solution is fitted with various functions using LsqFit.

Recently (after the update to Julia 1.9.4, but also after changes in my code) the program starts running with all 128 threads as expected, but soon afterwards only 2 threads are busy, and this does not change for several days (if at all).

How can I debug this problem?

At the moment I’m re-running things with @profile - any other ideas?
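For reference, a minimal sketch of how such a profiling run might be set up with the Profile standard library. The function busy_work is a hypothetical stand-in for the real workload, and the sampling delay here is much shorter than the one second mentioned later in the thread, only so that a short demo collects some samples.

```julia
using Profile

# Hypothetical stand-in for the real ODE solve + fit.
function busy_work(n)
    s = 0.0
    for k in 1:n
        s += sin(k) * cos(k)
    end
    return s
end

Profile.clear()                         # drop samples from earlier runs
Profile.init(n = 10^6, delay = 0.01)    # buffer size and sampling interval in seconds
@profile busy_work(10^7)                # collect samples while this call runs
Profile.print(format = :flat, mincount = 5)  # flat listing, hiding rarely hit lines
```

In a threaded run, the interesting part of the report is where the two remaining busy threads spend their time.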

Best, Andreas

How do you check this?

Linux system load tools, e.g. htop …

Is there any reason why 2 instances out of the whole bunch would take longer to run? Are there conditional statements inside of that for loop? Is the problem size constant for all iterations?

Is there any reason why 2 instances out of the whole bunch would take longer to run?

No obvious reason.

Are there conditional statements inside of that for loop?

Yes, the content of the loop is rather complex.

Is the problem size constant for all iterations?

Is numerically solving differential equations a constant-size problem? And doing least-squares fits? Probably not.

However… even if the threading code encounters two cases where the loop content takes much longer, shouldn’t other terminating threads lead to new threads being started?

No, because Threads.@threads for makes all of the threads wait at the end until the whole loop completes. If two iterations of the loop (handled by 2 threads) are much slower, the other threads sit idle until those two finish.

The only way around this is if your loop iterations are internally multi-threaded, in which case the scheduler can dynamically re-assign idle threads to help work on the slow iterations.

No, because Threads.@threads for makes all of the threads wait at the end until the whole loop completes. If two iterations of the loop (handled by 2 threads) are much slower, the other threads sit idle until those two finish.

Sorry, I don’t fully get this. Imagine this situation:

  • Threads.@threads for needs to go over a list of, say, 500 points, and we have 128 threads.
  • It starts running with 128 threads, handling 128 points.
  • Of these 128, 2 take much longer, while the others finish.

Does it really wait for the 2 until it starts a next 128 point block?

No, but once it finishes all 500 points except for 2, the threads sit and wait for those 2 to complete.
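As far as I understand the current implementation, Threads.@threads hands each thread a contiguous chunk of the iteration range up front, so a thread that finishes its chunk early does not pick up points from a slower chunk. One way to test whether load imbalance is the issue is to spawn one task per point instead, so idle threads pick up whatever is still queued. This is a rough sketch of that alternative, not something from the original code: work, run_all and points are hypothetical names, and work is only a cheap stand-in for the ODE solve and fit.

```julia
using Base.Threads

# Hypothetical stand-in for the real per-point work (ODE solve + fit).
work(a, b) = sum(sin(a * k) * cos(b * k) for k in 1:1000)

function run_all(points)
    results = Vector{Float64}(undef, length(points))
    # One task per point; @sync waits until every spawned task has finished.
    @sync for (i, (a, b)) in enumerate(points)
        Threads.@spawn results[i] = work(a, b)
    end
    return results
end

# Assumed 11 x 11 grid, flattened into a vector of (a, b) tuples.
points = [(a, b) for a in 0.0:0.1:1.0 for b in 0.0:0.1:1.0]
results = run_all(points)
```

If the run stays busy on all threads with this variant, the slowdown is probably uneven per-point cost rather than a scheduler problem.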

No, but once it finishes all 500 points except for 2, the threads sit and wait for those 2 to complete.

That’s not what is happening though. I have a control output which writes and syncs a message whenever 10 points are finished. That occurs eventually, but only after a long wait with only 2 threads busy.

(Small code piece fenced off by a SpinLock, which increases a global counter and checks modulo 10 …)
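A progress counter of this kind might look roughly like the sketch below. The names progress_lock, finished and report_done! are hypothetical, not taken from the original code. Note that the documentation for SpinLock advises using it only around short, non-blocking code (no I/O), so this sketch only increments the counter under the lock and does the printing and flushing outside of it.

```julia
using Base.Threads

const progress_lock = SpinLock()   # hypothetical name
const finished = Ref(0)            # hypothetical global counter

function report_done!()
    # Keep the critical section tiny: only bump the counter under the lock.
    count = lock(progress_lock) do
        finished[] += 1
    end
    # Print and flush outside the lock, since I/O may block or yield.
    if count % 10 == 0
        println("finished ", count, " points")
        flush(stdout)
    end
    return count
end

# Demo: 25 "points" finishing on whatever threads are available.
@threads for i in 1:25
    report_done!()
end
```

With several threads the messages can appear slightly out of order, but the counter itself stays consistent.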

Maybe try Threads.@threads :static for to use the static scheduler — maybe the default dynamic scheduler is getting stuck on some code that never yields and it’s not able to migrate threads to keep them busy?
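A small sketch of what the :static variant looks like, with a toy loop body in place of the real one. The array owner is a hypothetical helper that records which thread handled each index; with :static each thread gets one fixed, contiguous block of indices and tasks do not migrate.

```julia
using Base.Threads

n = 4 * nthreads()
owner = zeros(Int, n)   # hypothetical: thread id that handled each index

# :static pins one contiguous chunk of 1:n to each thread.
@threads :static for i in 1:n
    owner[i] = threadid()
end

println(owner)
```

Note that :static cannot be nested inside another @threads loop.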

Will try.

In the meantime I compared Julia 1.8.5 and Julia 1.9.4 with the same code and parameters on my side:

  • Julia 1.8.5: average 69.7s per parameter point
  • Julia 1.9.4: average 132.5s per parameter point

(This is the total runtime divided by the number of points: same machine, same JULIA_NUM_THREADS=128, same code, 176 points.)

Profiling runs of Julia 1.8 and 1.9 (with sampling once per second) can be found here:
https://www.akhuettel.de/~huettel/tmp/j-th-20231121/