Multithreading for nested for loops

Hi everyone,

I am trying to understand the performance of @threads for nested for-loops.

Suppose I have three layers of loops and access to 100 threads for my computation. Each calculation depends only on its specific (i, j, k) and is unrelated to the others. There are two ways I can think of to perform the calculation.

Version 1:

using Base.Threads
df = zeros(15*18*20, 4)
@threads for i in 1:15
  for j in 1:18
    for k in 1:20
      id = (k-1)*(18*15) + (j-1)*15 + i
      # Some computation here that assigns the results to df[id, :]
    end
  end
end

Version 2:

using Base.Threads
df = zeros(15*18*20, 4)
@threads for id in 1:(15*18*20)
  i = Int(ceil(id/(18*20)))
  j = Int(ceil((id - (i-1)*18*20)/20))
  k = Int(id - (i - 1)*18*20 - (j - 1)*20)
  # Some computation here that assigns the results based on (i, j, k) to df[id, :]
end

In an example that I tried, I find that version 2 is much slower than version 1, but I do not quite understand why.

I thought both methods were equivalent, and that version 2 could even be faster because it can use all 100 threads at once (id runs from 1 to 5,400), whereas version 1 can use at most 15 of them, since i only runs from 1 to 15.

My questions are as follows:

  1. Is version 1 generally preferable to version 2? I am not sure whether this result is specific to my example or whether I have done something wrong.
  2. If version 1 is better than version 2, is there any advice on where I should put @threads? I found that putting it on the “j-loop” is faster than putting it on the “i-loop”, which I don’t quite understand, since I read in some other posts that it should not matter.

Thanks!!

Your version 2 should generally be better. You are probably (?) doing some computation for which the order of array accesses, or memory access patterns more generally, is causing a slowdown.

Also, if your computation allocates memory, or if each iteration is very fast, then spawning 100 threads may be detrimental.
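
As a generic illustration of the access-order point (a sketch, not a diagnosis of your code; write_rows! and write_cols! are just illustrative names): Julia arrays are column-major, so filling df row by row, as df[id, :] does, strides through memory, while filling it column by column is contiguous.

df = zeros(15*18*20, 4)

function write_rows!(df)
    for id in axes(df, 1)
        df[id, :] .= 1.0   # strided: each row write touches all four columns
    end
end

function write_cols!(df)
    for col in axes(df, 2)
        df[:, col] .= 1.0  # contiguous: each column is one block of memory
    end
end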

3 Likes

There are three key questions:

  • How many CPU cores does your machine have?
  • How much work does each loop iteration do? In version 1, each task will do at least 18 * 20 = 360 iterations, while in version 2 a task does at most 5400 / 100 = 54 iterations, and possibly fewer. Maybe the latter is not enough to offset the overhead of task spawning and synchronization? (See the measurement sketch after this list.)
  • Does the computation within the loop body allocate intermediate arrays? If so, the speedup with multithreading will usually saturate at a fairly small number of parallel tasks and eventually decrease due to the implicit synchronization happening every time the garbage collector runs. (Julia 1.10 will come with a multithreaded garbage collector that may improve on this.)
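
To probe the overhead question from the second bullet, here is one way to time both loop shapes, a sketch assuming BenchmarkTools is available and using cheap as a deliberately trivial, hypothetical stand-in for the real per-id work:

using Base.Threads, BenchmarkTools

cheap(i, j, k) = i + j + k  # hypothetical stand-in for the real computation

function v1!(out)
    @threads for i in 1:15
        for j in 1:18, k in 1:20
            out[(k-1)*(18*15) + (j-1)*15 + i] = cheap(i, j, k)
        end
    end
end

function v2!(out)
    @threads for id in eachindex(out)
        i = Int(ceil(id / (18*20)))             # same arithmetic as version 2
        j = Int(ceil((id - (i-1)*18*20) / 20))
        k = id - (i-1)*18*20 - (j-1)*20
        out[id] = cheap(i, j, k)
    end
end

out = zeros(Int, 15*18*20)
@btime v1!($out)
@btime v2!($out)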

As an aside, you can get away without the index arithmetic in version 2 as follows:

# collect is needed because @threads requires an indexable collection
@threads for (i, j, k) in collect(Iterators.product(1:15, 1:18, 1:20))
    ...
end

5 Likes

The fourth key question: are you putting your code in a function? Running a multi-threaded loop in global scope is not good for performance.

A more minor point: the index arithmetic in version 2 does a bunch of floating-point divisions followed by rounding and conversion. Use integer division (such as div) directly. Since you do not show the actual main code, I can’t really tell whether this matters at all, but it is better style anyway.
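
For example, the version-2 mapping (with i the slowest-varying index) can be written with divrem; a sketch:

i, r = divrem(id - 1, 18*20)    # i in 0:14, r in 0:359
j, k = divrem(r, 20)            # j in 0:17, k in 0:19
i, j, k = i + 1, j + 1, k + 1   # shift back to 1-based indices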

If you want good multi-threading performance, optimize the single-threaded performance first, following the official performance tips: https://docs.julialang.org/en/v1/manual/performance-tips/

3 Likes

I see, thanks. My computation happens to be such that the larger the indices (i.e., larger i, j, or k), the longer the computation takes, so the time needed is not constant across indices.

Thanks, this is helpful!

  • I am running it on a server, so I have access to 100 CPU cores.
  • That’s a good point (it seems similar to what @lmiq suggested): the computation takes longer as i/j/k get larger, so in version 2 I would expect the first ids to finish faster. Would you suggest making each thread do approximately the same amount of computational work?
  • Just to be clear: do you mean that creating arrays inside the loops would reduce the speed gains?
  • Thanks for your suggestion for the index arithmetic!

Thanks!!

  • Yes, I used functions in the “some computation here” part.
  • Also, thanks for the suggestion about integer division; I will keep it in mind in the future.

I meant that the outer loop, with the @threads annotation, should be inside a function. It’s also important to avoid global variables, so make sure to pass them as input arguments to this outer function.
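
For instance, a minimal sketch with the actual computation elided (fill_df! is just an illustrative name):

using Base.Threads

function fill_df!(df)   # df is passed in as an argument, not read as a global
    @threads for id in 1:size(df, 1)
        # compute (i, j, k) from id and write the results to df[id, :]
    end
    return df
end

df = fill_df!(zeros(15*18*20, 4))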

2 Likes

Without further knowledge about your exact computation, these are two possible patterns you can try:

If the workload of each item is large enough, this may suffice to provide good load balancing:

using Base.Threads
function f!(df)
    work_items = collect(Iterators.product(1:15, 1:18, 1:20))
    @sync for ind in work_items
        @spawn begin
            i, j, k = ind   # unpack the (i, j, k) work item
            # do something that writes the result for (i, j, k) into df
        end
    end
end

Otherwise, you can try the following, which splits the workload into larger chunks scattered over the items (useful since you say that the workload increases with the indices):

using ChunkSplitters
using Base.Threads
function f!(df; nchunks=nthreads())
    work_items = collect(Iterators.product(1:15, 1:18, 1:20))
    @sync for (item_range, _) in chunks(work_items, nchunks, :scatter)
        @spawn for ind in item_range
            i, j, k = work_items[ind]   # item_range indexes into work_items
            # do something that writes the result for (i, j, k) into df
        end
    end
end

This will split the work items into nchunks chunks scattered over the indices, so that each chunk gets roughly the same mix of early (cheap) and late (expensive) indices (see the ChunkSplitters.jl documentation).
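
For example, calling f!(df; nchunks = 4 * nthreads()) creates more chunks than threads, which typically gives the scheduler more room to balance uneven work; the factor of 4 here is only an illustrative choice.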

3 Likes

I see

Yes, that’s what I did: I run a script that passes variables to the function that contains all the loops above (including the one with the @threads annotation).

I see, thanks! I will give both solutions a try and see how they go. I didn’t include the actual computation because it involves many other functions and I am not sure how to formulate a suitable MWE that represents it.

Would your experience be that both of your solutions typically work better than just using @threads?

Well, “typically” is too strong a word. There are problems for which these alternatives provide better load balancing or workload distribution. Here is a discussion related to that: Home · ChunkSplitters.jl

1 Like

Thanks for the link!

Yes, and quite significantly. This is something many users have had to learn the hard way. Hopefully, the issue will be less severe starting with Julia 1.10, but here’s a summary of my current understanding as a regular user, as of Julia 1.9:

Every time your code creates a new array or otherwise allocates new memory, it first asks the garbage collector (GC) whether it’s time to sweep up objects that are no longer reachable and release their memory. If the GC says yes, the sweep is performed before your code can resume. Such a check-in with the GC is called a safepoint (safepoints are inserted in a few other cases as well, and you can insert one manually by placing GC.safepoint() in your code). If only a single thread is running, that’s all there is to it.

However, if multiple threads are running and one of them hits a safepoint where the GC decides to do a sweep, execution will wait until all other threads also hit safepoints before the sweep is performed and execution can resume. In other words, the GC needs to “stop the world” in order to perform a sweep, and it can’t just stop the threads in their tracks, it must wait for each of them to check in at a safepoint. Thus, if the number of threads is large, they may end up spending most of their time waiting on each other.

(Moreover, if some threads allocate and others don’t, the wait can be indefinite and potentially lead to a deadlock, see Multithreaded program hangs without explict GC.gc() - #5 by danielwe. The solution is to insert safepoints by hand in the non-allocating tasks.)
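
For illustration, a minimal sketch of that workaround (the loop bodies are hypothetical placeholders):

using Base.Threads

@sync begin
    @spawn begin                    # allocating task: hits safepoints on its own
        for _ in 1:10_000
            zeros(1_000)
        end
    end
    @spawn for n in 1:10^9          # non-allocating task: check in manually
        n % 10_000 == 0 && GC.safepoint()
        # ... non-allocating work ...
    end
end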

If you have many long-running tasks that need to allocate memory but don’t need to access each other’s memory, and you have a large number of CPU cores to parallelize across, it’s often more efficient to use multiprocessing (e.g., Distributed.jl) rather than multithreading.
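
For completeness, a minimal sketch of the multiprocessing route with pmap, assuming a hypothetical per-item function compute(i, j, k) that returns the four result values:

using Distributed
addprocs(8)   # one worker per available core; adjust to your machine

@everywhere compute(i, j, k) = [i, j, k, i*j*k]   # hypothetical placeholder

work_items = vec(collect(Iterators.product(1:15, 1:18, 1:20)))
rows = pmap(t -> compute(t...), work_items)
df = permutedims(reduce(hcat, rows))   # 5400 × 4 matrix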

2 Likes