# Multithreading for nested for loops

Hi everyone,

I am trying to understand the performance of `@threads` for nested for-loops.

Suppose I have three layers of loops, and I have access to 100 threads for my computation. Each calculation depends only on the specific `(i, j, k)`, and the calculations are unrelated to each other. There are two ways that I can think of to perform the calculation.

Version 1:

```julia
using Base.Threads

df = zeros(15*18*20, 4)
@threads for i in 1:15
    for j in 1:18
        for k in 1:20
            id = (k-1)*(18*15) + (j-1)*15 + i
            # Some computation here that assigns the results to df[id, :]
        end
    end
end
```

Version 2:

```julia
using Base.Threads

df = zeros(15*18*20, 4)
@threads for id in 1:15*18*20
    i = Int(ceil(id/(18*20)))
    j = Int(ceil((id - (i-1)*18*20)/20))
    k = id - (i-1)*18*20 - (j-1)*20   # 20, not 18, since k runs over 1:20
    # Some computation here that assigns the results based on (i, j, k) to df[id, :]
end
```

In an example that I tried, I find that version 2 is much slower than version 1, but I do not quite understand why.

I thought both methods were equivalent, and version 2 should potentially be faster because it runs jobs using all 100 threads at once (`id` runs from 1 to 5,400), whereas `i` only runs from 1 to 15 in version 1.
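As a sanity check on version 2's index arithmetic (written here with integer `cld` division, and with `(j - 1)*20` in the last line since `k` runs over 20 values), the mapping from `id` back to `(i, j, k)` should visit every triple exactly once:

```julia
# Recover (i, j, k) from the linear index and check the mapping is a
# bijection onto 1:15 × 1:18 × 1:20.
seen = Set{NTuple{3,Int}}()
for id in 1:15*18*20
    i = cld(id, 18*20)                  # cld = integer ceiling division
    j = cld(id - (i - 1)*18*20, 20)
    k = id - (i - 1)*18*20 - (j - 1)*20
    push!(seen, (i, j, k))
end
length(seen) == 15*18*20                # every (i, j, k) hit exactly once
```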

My questions are as follows:

1. Is the implementation in version 1 generally preferable to version 2? I am not sure if this is specific to my example, or whether I have done something wrong.
2. If version 1 is better than version 2, is there any advice on where I should put `@threads`? I found that putting it on the "`j`-loop" is better (faster) than putting it on the "`i`-loop". I am not exactly sure why, because I read in some other posts that it should not matter.

Thanks!!

Your version 2 should generally be better. You are probably (?) doing some computation in which the order of array or memory accesses is causing some slowdown.

Also, if your computation is allocating something, or if the computation is too fast, then spawning 100 threads may be detrimental.

3 Likes

There are three key questions:

• How many CPU cores does your machine have?
• How much work does each loop iteration do? In version 1, each task will do at least 18 * 20 = 360 iterations, while in version 2 a task is doing at most 5400 / 100 = 54 iterations, and possibly fewer. Maybe the latter is not enough to offset the overhead of task spawning and synchronization?
• Does the computation within the loop body allocate intermediate arrays? If so, the speedup with multithreading will usually saturate at a fairly small number of parallel tasks and eventually decrease due to the implicit synchronization happening every time the garbage collector runs. (Julia 1.10 will come with a multithreaded garbage collector that may improve on this.)
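To put rough numbers on the second point (a sketch, assuming 100 threads and a roughly even split of iterations):

```julia
total = 15 * 18 * 20            # 5400 work items overall
# Version 1: @threads on the i-loop creates at most 15 tasks, each
# running the full inner j-k sweep.
per_task_v1 = 18 * 20           # 360 iterations per task
# Version 2: @threads over the flattened id range splits 5400 items
# across (up to) 100 threads.
per_task_v2 = cld(total, 100)   # at most 54 iterations per task
(per_task_v1, per_task_v2)      # (360, 54)
```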

As an aside, you can get away without the index arithmetic in version 2 as follows:

```julia
@threads for (i, j, k) in collect(Iterators.product(1:15, 1:18, 1:20))
    # ...
end
```
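For reference, `collect` materializes the product into a 15×18×20 array of `(i, j, k)` tuples, which `@threads` then splits by linear index:

```julia
# Iterators.product is lazy; collect turns it into a concrete array of tuples.
work_items = collect(Iterators.product(1:15, 1:18, 1:20))
size(work_items)     # (15, 18, 20)
first(work_items)    # (1, 1, 1)
```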
5 Likes

The fourth key question: are you putting your code in a function? Running a multi-threaded loop in global scope is not good for performance.

A more minor point: the index arithmetic in version 2 does a bunch of floating-point divisions followed by rounding and conversion. Use integer division (such as `div`) directly. Since you do not show the actual main code, I can't really tell if this matters at all, but it is better style anyway.
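For example, `cld` (ceiling division on integers) gives the same results as the `Int(ceil(...))` round trip through floating point; a quick check over the full index range:

```julia
# Compare the floating-point ceil/convert version against integer cld
# for every id in the flattened range.
ok = all(1:15*18*20) do id
    Int(ceil(id / (18*20))) == cld(id, 18*20)
end
ok   # true
```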

If you want good multi-threading performance, optimize the single-threaded performance first, following the official performance tips: https://docs.julialang.org/en/v1/manual/performance-tips/

3 Likes

I see, thanks. It happens that in my computation the larger the indices (i.e., larger `i`, `j`, or `k`), the longer the computation takes, so the time needed is not constant across the indices.

• I am running it on the server so I have access to 100 CPU cores.
• That's a good point (similar to what @lmiq suggested): the computation takes longer as `i`/`j`/`k` get larger, so in version 2 I would expect the first `id`s to finish faster. Would you suggest making each thread do approximately the same amount of computational work?
• Just to be clear here: do you mean that creating arrays down in the loops would reduce speed gains?
• Thanks for your suggestion for the index arithmetic!

Thanks!!

• Yes, I used functions in the "some computation here" part.
• Also thanks for your suggestion for integer division, I will keep this in mind in the future.

I meant that the outer loop, with the `@threads` annotation, should be inside a function. It's also important to avoid global variables, so make sure to pass them as input arguments to this outer function.
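A minimal sketch of that pattern (the loop body here is a placeholder, not the actual computation):

```julia
using Base.Threads

# Everything the loop needs arrives as an argument; no globals inside.
function fill_df!(df, ni, nj, nk)
    @threads for i in 1:ni
        for j in 1:nj, k in 1:nk
            id = (k - 1)*(nj*ni) + (j - 1)*ni + i
            df[id, 1] = i + j + k   # placeholder for the real computation
        end
    end
    return df
end

df = zeros(15*18*20, 4)
fill_df!(df, 15, 18, 20)
```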

2 Likes

Without further knowledge about your exact computation, these are two possible patterns you can try:

If the workload is large enough for each item, this may be enough to provide good load balancing:

```julia
using Base.Threads

function f!(df)
    work_items = collect(Iterators.product(1:15, 1:18, 1:20))
    @sync for ind in work_items
        @spawn begin
            # do something
        end
    end
end
```

Otherwise, you can try the following to split the workload into larger chunks, scattered over the items, since you say that the workload increases with the indexes:

```julia
using Base.Threads
using ChunkSplitters

work_items = collect(Iterators.product(1:15, 1:18, 1:20))
nchunks = Threads.nthreads()  # or however many chunks you want
@sync for (item_range, _) in chunks(work_items, nchunks, :scatter)
    @spawn for ind in item_range
        # do something
    end
end
```

This will split the work items into `nchunks` chunks scattered over the indexes, so that each chunk gets roughly the same mix of early and late indexes (see here).
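To illustrate what `:scatter` means, here is a hand-rolled version of the same idea (not the ChunkSplitters implementation itself): each chunk takes every `nchunks`-th item, so cheap early indexes and expensive late indexes end up mixed in every chunk:

```julia
# Hand-rolled "scatter" chunking: 10 work items into 3 chunks.
n, nchunks = 10, 3
scatter_chunks = [collect(c:nchunks:n) for c in 1:nchunks]
scatter_chunks   # [[1, 4, 7, 10], [2, 5, 8], [3, 6, 9]]
```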

3 Likes

I see

Yes, that's what I did: I run a script that passes variables to the function that contains all the loops above (including the one with the `@threads` annotation).

I see, thanks! I will give both solutions a try and see how they go. I didn't include the actual computations because they involve many other functions and I am not sure how to formulate a suitable MWE to represent my computation.

In your experience, do both of these solutions typically work better than just using `@threads`?

Well, "typically" is too strong a word. There are problems for which these alternatives provide better load balancing or workload distribution. Here is a related discussion: Home · ChunkSplitters.jl

1 Like

Every time your code creates a new array or otherwise allocates new memory, it first asks the garbage collector (GC) whether it's time to sweep up objects that are no longer in scope and release their memory. If the GC says yes, the sweep is performed before your code can resume. Such a check-in with the GC is called a safepoint (safepoints are inserted in a few other cases as well, and you can insert one manually by placing `GC.safepoint()` in your code). If only a single thread is running, that's all there is to it.
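A small sketch of why per-iteration allocation matters (hypothetical functions, measuring with `@allocated`):

```julia
function sum_alloc(n)
    s = 0.0
    for _ in 1:n
        tmp = zeros(100)       # fresh allocation every iteration
        s += sum(tmp)
    end
    return s
end

function sum_prealloc(n)
    s = 0.0
    tmp = zeros(100)           # one buffer, reused across iterations
    for _ in 1:n
        fill!(tmp, 0.0)
        s += sum(tmp)
    end
    return s
end

sum_alloc(10); sum_prealloc(10)          # compile first
a1 = @allocated sum_alloc(1_000)
a2 = @allocated sum_prealloc(1_000)
a1 > a2                                  # the allocating version hits the allocator/GC far more
```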