# Nested parallel loops with @spawn

Hi all,

I’m trying to choose the best way of parallelizing my code given the current options of `@spawn` and `@threads`. A toy model for what I’m trying to do is the following:

```julia
function solve_model_serial!(A, Nu, Nx, Ny)
    for j in 1:Ny, i in 1:Nx
        A_ij = view(A, :, i, j)
        solve_kernel!(A_ij, Nu)
    end
    nothing
end

function solve_kernel!(A_ij, Nu)
    for a in 1:Nu
        A_ij[a] = 42*a + rand()
    end
    nothing
end
```

I’ve experimented with the following options:

```julia
function solve_model_threads1!(A, Nu, Nx, Ny)
    Threads.@threads for j in 1:Ny
        for i in 1:Nx
            A_ij = view(A, :, i, j)
            solve_kernel!(A_ij, Nu)
        end
    end
    nothing
end

function solve_model_spawn1!(A, Nu, Nx, Ny)
    t = @spawn for j in 1:Ny
        for i in 1:Nx
            A_ij = view(A, :, i, j)
            solve_kernel!(A_ij, Nu)
        end
    end
    wait(t)
    nothing
end

function solve_model_spawn2!(A, Nu, Nx, Ny)
    t1 = @spawn for j in 1:Ny
        @spawn for i in 1:Nx
            A_ij = view(A, :, i, j)
            solve_kernel!(A_ij, Nu)
        end
    end
    wait(t1)
    nothing
end

function solve_model_spawn3!(A, Nu, Nx, Ny)
    @sync for j in 1:Ny, i in 1:Nx
        @spawn begin
            A_ij = view(A, :, i, j)
            solve_kernel!(A_ij, Nu)
        end
    end
    nothing
end
```

These are the results I’m getting with 4 threads:

```julia
Nu = 32
Nx = 600
Ny = 600

A = zeros(Nu, Nx, Ny)

> @time solve_model_serial!(A, Nu, Nx, Ny)
  0.146789 seconds
> @time solve_model_threads1!(A, Nu, Nx, Ny)
  0.040972 seconds
> @time solve_model_spawn1!(A, Nu, Nx, Ny)
  0.080115 seconds
> @time solve_model_spawn2!(A, Nu, Nx, Ny)
  0.001354 seconds
> @time solve_model_spawn3!(A, Nu, Nx, Ny)
  0.365831 seconds
```

From these, it seems `solve_model_spawn2!` is clearly the best option, but I don’t fully understand why… It’s actually crazy how much faster it runs when compared with the serial version (and also with the other options). In particular, why is `solve_model_spawn3!` so slow?

Is `solve_model_spawn2!` the best way to handle such cases of nested independent loops? Are there any other (potentially better) options I’m forgetting?

I think `solve_model_spawn2!` is fast because it does not wait until the inner tasks are finished. See:

```julia
julia> t1 = Threads.@spawn begin
           global t2 = Threads.@spawn begin
               sleep(1.0)
               println("DONE")
           end
       end

julia> wait(t1)  # this does not wait for t2

julia> wait(t2)
DONE
```

As you can see, using raw `@spawn` directly is pretty dangerous, since it does not give you structured concurrency. Using `@threads` is OK, but I think a map/filter/reduce framework for multithreading is a better option (e.g., KissThreading.jl or Transducers.jl; that’s my library).
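For illustration, the map-style pattern can be sketched with raw tasks; `tmap` here is a hypothetical helper, not the actual API of KissThreading.jl or Transducers.jl:

```julia
# Hypothetical sketch of a threaded map: spawn one task per element,
# then fetch every result, so no task can outlive the call.
function tmap(f, xs)
    tasks = [Threads.@spawn f(x) for x in xs]
    return map(fetch, tasks)
end

tmap(x -> x^2, 1:4)  # == [1, 4, 9, 16]
```

Because `fetch` is called on every task before `tmap` returns, the waiting problem above cannot occur.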


Oh, so `solve_model_spawn2!` wasn’t waiting for the inner task to finish before actually returning from the function? For my case, since the loops are independent, the inner tasks have no dependence on the outer tasks, but I do care that both finish before the function returns, obviously. So I’m now trying this:

```julia
function solve_model_spawn2!(A, Nu, Nx, Ny)
    t1 = @spawn for j in 1:Ny
        global t2 = @spawn for i in 1:Nx
            A_ij = view(A, :, i, j)
            solve_kernel!(A_ij, Nu)
        end
    end
    wait(t1)
    wait(t2)
    nothing
end

> @time solve_model_spawn2!(A, Nu, Nx, Ny)
  0.003231 seconds
```

which is slightly slower than before, but still crazy fast. Would this be the optimal way of implementing this? This seems to be a fairly vanilla example of independent nested loops, so I was a bit surprised that I could not find any such example…

`t1` only spawns a single task, and `t2` is overwritten on each iteration, so as far as the real computation goes you’re only waiting on the very last iteration.


I may be mistaken, but that as well seems to not do what you expect.
As `t2` is global, it gets overwritten by each newly spawned task. So it doesn’t wait until all `Ny` tasks have finished; it waits until one arbitrary one does (the one that wrote to `t2` just before `wait(t2)` is invoked). And that one finishes pretty quickly, I guess, as it only does `1/Ny`th of the total workload.

So what can I do to make sure that both loops finish before the function returns?

You can either collect each `Task` returned by `@spawn` into an array and then `foreach(wait, tasks)`, or enclose all the `@spawn`ed tasks in a `@sync` block like you did in `solve_model_spawn3!`.
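A sketch of the first option, assuming the toy model from above (`solve_model_collect!` is just an illustrative name):

```julia
using Base.Threads: @spawn

# toy kernel repeated here so the snippet is self-contained
function solve_kernel!(A_ij, Nu)
    for a in 1:Nu
        A_ij[a] = 42*a + rand()
    end
    nothing
end

function solve_model_collect!(A, Nu, Nx, Ny)
    tasks = Task[]
    for j in 1:Ny
        t = @spawn for i in 1:Nx
            solve_kernel!(view(A, :, i, j), Nu)
        end
        push!(tasks, t)
    end
    foreach(wait, tasks)  # block until every per-column task has finished
    nothing
end
```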

The really strange thing to me in your results is that `solve_model_spawn1!` was significantly faster than the serial version, since it only spawns a single task and then immediately `wait`s on it. That makes me think you’re getting significant delays due to compilation. You should use `BenchmarkTools` to make sure your timings are meaningful.

I see, so something like this:

```julia
function solve_model_spawn4!(A, Nu, Nx, Ny)
    @sync begin
        @spawn for j in 1:Ny
            @spawn for i in 1:Nx
                A_ij = view(A, :, i, j)
                solve_kernel!(A_ij, Nu)
            end
        end
    end
    nothing
end
```

Here are the execution times when running with 4 threads (now using `@btime`):

```julia
> @btime solve_model_serial!($A, Nu, Nx, Ny)
  61.374 ms (360000 allocations: 21.97 MiB)
> @btime solve_model_spawn1!($A, Nu, Nx, Ny)
  62.880 ms
> @btime solve_model_spawn4!($A, Nu, Nx, Ny)
  22.104 ms
> @btime solve_model_threads1!($A, Nu, Nx, Ny)
  19.983 ms
```

These times seem to make more sense, but it also shows that `@spawn` does not perform better than `@threads` for this type of usage… Would this be a fair conclusion? Or should I expect a different behaviour as the number of threads increases?

And also, `solve_model_threads1!` leaves me a bit unsatisfied, as it does nothing for the inner loop, which is also fully parallelizable… Is there a way to (efficiently) parallelize that as well (similar, say, to the `collapse` clause in OpenMP)?
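For what it’s worth, one way to approximate OpenMP’s `collapse` is to flatten the two loops into a single iteration space, e.g. with `CartesianIndices`, and let `@threads` split that. A sketch (the function name is made up):

```julia
using Base.Threads: @threads

# toy kernel repeated here so the snippet is self-contained
function solve_kernel!(A_ij, Nu)
    for a in 1:Nu
        A_ij[a] = 42*a + rand()
    end
    nothing
end

function solve_model_collapsed!(A, Nu, Nx, Ny)
    # flatten the (i, j) double loop into one index range so that
    # @threads distributes all Nx*Ny iterations, not just the outer Ny
    @threads for idx in CartesianIndices((Nx, Ny))
        i, j = Tuple(idx)
        solve_kernel!(view(A, :, i, j), Nu)
    end
    nothing
end
```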

Again, I want to emphasize that enclosing an entire for-loop inside a `@spawn` simply runs all iterations of the loop serially inside the single task created by `@spawn`. So writing something like

```julia
@sync begin
    @spawn for i = 1:10
        dostuff(i)
    end
end
```

is equivalent to simply writing:

```julia
for i = 1:10
    dostuff(i)
end
```

In particular this means that `solve_model_spawn4!` above is only getting parallelism on the outer loop rather than the inner, and `solve_model_spawn1!` is getting no parallelism whatsoever.

If you want to run all iterations of both the inner and outer loop in parallel, you want something like:

```julia
function solve_model_spawn4!(A, Nu, Nx, Ny)
    @sync for j in 1:Ny
        for i in 1:Nx
            @spawn begin
                A_ij = view(A, :, i, j)
                solve_kernel!(A_ij, Nu)
            end
        end
    end
    nothing
end
```

For cases where `@threads` fits well (basically a map over every iteration of a loop), I wouldn’t expect to see any performance difference from reimplementing `@threads` using `@spawn`.


I see… But then, isn’t it odd that this gives the worst performance of all?

```julia
function solve_model_spawn5!(A, Nu, Nx, Ny)
    @sync for j in 1:Ny
        for i in 1:Nx
            @spawn begin
                A_ij = view(A, :, i, j)
                solve_kernel!(A_ij, Nu)
            end
        end
    end
    nothing
end
```

it’s much worse than the serial version:

```julia
> @btime solve_model_spawn5!($A, Nu, Nx, Ny)
  237.690 ms
> @btime solve_model_serial!($A, Nu, Nx, Ny)
  69.839 ms
```

Not really. The task that you’re giving each thread is so simple that I imagine the cost of switching tasks dominates. All the normal problems that come with shared-memory parallelism still apply in Julia. See also https://stackoverflow.com/questions/59078305/multi-threaded-parallelism-performance-problem-with-fibonacci-sequence-in-julia and https://stackoverflow.com/questions/52593588/julia-why-doesnt-shared-memory-multi-threading-give-me-a-speedup

EDIT: okay fine. Not all normal problems that come with shared-memory parallelism still apply with Julia; just most.

It’s not surprising. `@spawn` has a rather high cost, so you should not create `Nx * Ny` of them. Rather, the correct approach is to split the iterations into chunks and then execute each chunk in a `@spawn`. In fact, this is what `@threads` does, and there is a PR to re-implement `@threads` using `@spawn`.

As I said, using `@spawn` correctly is hard, so you should be using `@threads` for such high-level code. Better still, if the run time of each `solve_kernel!` can vary, use a threaded map from a package with a better scheduling strategy; `@threads` uses static scheduling, so it’s not ideal in that case.
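The chunking idea can be sketched roughly like this (the helper name and the one-chunk-per-thread split are illustrative, not necessarily what the `@threads` PR does):

```julia
using Base.Threads: @spawn, nthreads

# toy kernel repeated here so the snippet is self-contained
function solve_kernel!(A_ij, Nu)
    for a in 1:Nu
        A_ij[a] = 42*a + rand()
    end
    nothing
end

function solve_model_chunked!(A, Nu, Nx, Ny)
    # split the outer range into roughly nthreads() chunks, paying the
    # @spawn overhead once per chunk instead of once per (i, j) pair
    @sync for js in Iterators.partition(1:Ny, cld(Ny, nthreads()))
        @spawn for j in js, i in 1:Nx
            solve_kernel!(view(A, :, i, j), Nu)
        end
    end
    nothing
end
```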

Thanks for all your input. My takeaway is to stick with `@threads` for this type of thing.