Nested parallel loops with @spawn

Hi all,

I’m trying to choose the best way of parallelizing my code given the current options of @spawn and @threads. A toy model for what I’m trying to do is the following:

function solve_model_serial!(A, Nu, Nx, Ny)
    for j in 1:Ny, i in 1:Nx
        A_ij = view(A, :, i, j)
        solve_kernel!(A_ij, Nu)
    end
    nothing
end

function solve_kernel!(A_ij, Nu)
    for a in 1:Nu
        A_ij[a] = 42*a + rand()
    end
    nothing
end

I’ve experimented with the following options:

function solve_model_threads1!(A, Nu, Nx, Ny)
    @threads for j in 1:Ny
        for i in 1:Nx
            A_ij = view(A, :, i, j)
            solve_kernel!(A_ij, Nu)
        end
    end
    nothing
end

function solve_model_spawn1!(A, Nu, Nx, Ny)
    t = @spawn for j in 1:Ny
        for i in 1:Nx
            A_ij = view(A, :, i, j)
            solve_kernel!(A_ij, Nu)
        end
    end
    wait(t)
    nothing
end

function solve_model_spawn2!(A, Nu, Nx, Ny)
    t1 = @spawn for j in 1:Ny
        @spawn for i in 1:Nx
            A_ij = view(A, :, i, j)
            solve_kernel!(A_ij, Nu)
        end
    end
    wait(t1)
    nothing
end

function solve_model_spawn3!(A, Nu, Nx, Ny)
    @sync for j in 1:Ny, i in 1:Nx
        @spawn begin
            A_ij = view(A, :, i, j)
            solve_kernel!(A_ij, Nu)
        end
    end
    nothing
end

These are the results I’m getting with 4 threads:

Nu = 32
Nx = 600
Ny = 600

A = zeros(Nu, Nx, Ny)

> @time solve_model_serial!(A, Nu, Nx, Ny)
  0.146789 seconds
> @time solve_model_threads1!(A, Nu, Nx, Ny)
  0.040972 seconds
> @time solve_model_spawn1!(A, Nu, Nx, Ny)
  0.080115 seconds
> @time solve_model_spawn2!(A, Nu, Nx, Ny)
  0.001354 seconds
> @time solve_model_spawn3!(A, Nu, Nx, Ny)
  0.365831 seconds

From these, it seems solve_model_spawn2! is clearly the best option, but I don’t fully understand why… It’s actually crazy how much faster it runs when compared with the serial version (and also with the other options). In particular, why is solve_model_spawn3! so slow?

Is solve_model_spawn2! the best way to handle such cases of nested independent loops? Are there any other (potentially better) options I’m forgetting?

I think solve_model_spawn2! is fast because it does not wait until the inner tasks are finished. See:

julia> t1 = Threads.@spawn begin
           global t2
           t2 = Threads.@spawn begin
               sleep(1.0)
               println("DONE")
           end
       end
       wait(t1) # this does not wait t2
       t2
Task (runnable) @0x00007f0eaff61ae0

julia> wait(t2)
DONE

As you can see, using raw @spawn directly is pretty dangerous, since it does not give you structured concurrency. Using @threads is OK, but I think using a map/filter/reduce framework for multithreading is a better option (e.g., KissThreading.jl, Transducers.jl (that’s my library)).
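For comparison, here is a sketch of how wrapping the same nested spawns in a @sync block restores the waiting behavior: @sync waits for all tasks spawned lexically inside it, including nested ones (the Ref flag is just for illustration):

```julia
using Base.Threads  # for @spawn

done = Ref(false)
# @sync waits for *all* tasks spawned lexically inside the block,
# including the task spawned by the inner @spawn:
@sync begin
    @spawn begin
        @spawn begin
            sleep(0.1)
            done[] = true
        end
    end
end
@assert done[]  # the inner task is guaranteed to have finished here
```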


Oh, so solve_model_spawn2! wasn’t waiting for the inner task to finish before actually returning from the function? For my case, since the loops are independent, the inner tasks have no dependence on the outer tasks, but I do care that both finish before the function returns, obviously. So I’m now trying this:

function solve_model_spawn2!(A, Nu, Nx, Ny)
    t1 = @spawn for j in 1:Ny
        global t2 = @spawn for i in 1:Nx
            A_ij = view(A, :, i, j)
            solve_kernel!(A_ij, Nu)
        end
    end
    wait(t1)
    wait(t2)
    nothing
end
> @time solve_model_spawn2!(A, Nu, Nx, Ny)
  0.003231 seconds

which is slightly slower than before, but still crazy fast. Would this be the optimal way of implementing this? This seems to be a fairly vanilla example of independent nested loops, so I was a bit surprised that I could not find any such example…

t1 only spawns a single task, and t2 is overwritten on each iteration, so as far as the real computation goes, you’re only waiting on the very last iteration.


I may be mistaken, but that also seems not to do what you expect.
Since t2 is global, it gets overwritten by each newly spawned task. So it doesn’t wait until all Ny tasks have finished; it waits until one arbitrary one does (whichever wrote to t2 just before wait(t2) was invoked). And that one finishes pretty quickly, I guess, as it only does 1/Ny-th of the total workload.

So what can I do to make sure that both loops finish before the function returns?

You can either collect each Task returned by @spawn into an array and then foreach(wait, tasks), or enclose all the @spawned tasks in a @sync block, like you did in solve_model_spawn3!.
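The task-collecting version might look like the following sketch (the function name solve_model_tasks! is hypothetical, and solve_kernel! is reproduced from earlier in the thread):

```julia
using Base.Threads  # for @spawn

# kernel from earlier in the thread
function solve_kernel!(A_ij, Nu)
    for a in 1:Nu
        A_ij[a] = 42*a + rand()
    end
    nothing
end

function solve_model_tasks!(A, Nu, Nx, Ny)
    tasks = Task[]
    for j in 1:Ny
        t = @spawn for i in 1:Nx
            solve_kernel!(view(A, :, i, j), Nu)
        end
        push!(tasks, t)  # keep every task, not just the last one
    end
    foreach(wait, tasks)  # block until all Ny tasks have finished
    nothing
end
```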

The really strange thing to me in your results is that solve_model_spawn1! was significantly faster than the serial version, since it only spawns a single task and then immediately waits for it. That makes me think you’re getting significant delays due to compilation. You should use BenchmarkTools to make sure your timings are meaningful.

I see, so something like this:

function solve_model_spawn4!(A, Nu, Nx, Ny)
    @sync begin
        @spawn for j in 1:Ny
            @spawn for i in 1:Nx
                A_ij = view(A, :, i, j)
                solve_kernel!(A_ij, Nu)
            end
        end
    end
    nothing
end

Here are the execution times when running with 4 threads (now using @btime):

> @btime solve_model_serial!($A, Nu, Nx, Ny)
 61.374 ms (360000 allocations: 21.97 MiB)
> @btime solve_model_spawn1!($A, Nu, Nx, Ny)
 62.880 ms
> @btime solve_model_spawn4!($A, Nu, Nx, Ny)
 22.104 ms
> @btime solve_model_threads1!($A, Nu, Nx, Ny)
 19.983 ms

These times seem to make more sense, but it also shows that @spawn does not perform better than @threads for this type of usage… Would this be a fair conclusion? Or should I expect a different behaviour as the number of threads increases?

And also, solve_model_threads1! leaves me a bit unsatisfied, as it does nothing for the inner loop, which is also fully parallelizable… Is there a way to (efficiently) parallelize this as well (similar, say, to the collapse clause in OpenMP)?
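For instance, I imagine a collapse-style version could flatten both loops into a single index, something like this untested sketch (solve_model_collapse! is a made-up name; solve_kernel! is as before):

```julia
using Base.Threads  # for @threads

# kernel from earlier in the thread
function solve_kernel!(A_ij, Nu)
    for a in 1:Nu
        A_ij[a] = 42*a + rand()
    end
    nothing
end

function solve_model_collapse!(A, Nu, Nx, Ny)
    # flatten the (i, j) iteration space so @threads splits
    # all Nx*Ny iterations across threads, like OpenMP's collapse
    idx = CartesianIndices((Nx, Ny))
    @threads for k in 1:length(idx)
        i, j = Tuple(idx[k])
        solve_kernel!(view(A, :, i, j), Nu)
    end
    nothing
end
```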

Again, I want to emphasize that enclosing an entire for-loop inside a @spawn simply runs all iterations of the loop serially inside the single task created by @spawn. So writing something like

@sync begin
    @spawn for i=1:10
        dostuff(i)
    end
end

is equivalent to simply writing:

for i=1:10
    dostuff(i)
end

In particular this means that solve_model_spawn4! above is only getting parallelism on the outer loop rather than the inner, and solve_model_spawn1! is getting no parallelism whatsoever.

If you want to run all iterations of both the inner and outer loop in parallel, you want something like:

function solve_model_spawn4!(A, Nu, Nx, Ny)
    @sync for j in 1:Ny
        for i in 1:Nx
            @spawn begin
                A_ij = view(A, :, i, j)
                solve_kernel!(A_ij, Nu)
            end
        end
    end
    nothing
end

For cases where @threads fits well (basically a map over every iteration of a loop), I wouldn’t expect to see any performance difference from reimplementing @threads using @spawn.


I see… But then, isn’t it odd that this gives the worst performance of all?

function solve_model_spawn5!(A, Nu, Nx, Ny)
    @sync for j in 1:Ny
        for i in 1:Nx
            @spawn begin
                A_ij = view(A, :, i, j)
                solve_kernel!(A_ij, Nu)
            end
        end
    end
    nothing
end

it’s much worse than the serial version:

> @btime solve_model_spawn5!($A, Nu, Nx, Ny)
  237.690 ms
> @btime solve_model_serial!($A, Nu, Nx, Ny)
  69.839 ms

Not really. The task that you’re giving each thread is so simple that I imagine the cost of switching tasks dominates. All the normal problems that come with shared-memory parallelism still apply in Julia. See also https://stackoverflow.com/questions/59078305/multi-threaded-parallelism-performance-problem-with-fibonacci-sequence-in-julia and https://stackoverflow.com/questions/52593588/julia-why-doesnt-shared-memory-multi-threading-give-me-a-speedup

EDIT: okay fine. Not all normal problems that come with shared-memory parallelism still apply with Julia; just most.

It’s not surprising. @spawn has a rather high cost, so you should not create Nx * Ny of them. Rather, the correct approach is to split the iterations into chunks and then execute each chunk in a @spawn. In fact, this is what @threads does, and there is a PR to re-implement @threads using @spawn:

As I said, using @spawn is hard, so you should be using @threads for such high-level code. More ideally, if you think the run-time of each solve_kernel! can vary, it’d be better to use a threaded map from some package with a better scheduling strategy. @threads uses static scheduling, so it’s not ideal in such a case.

Thanks for all your input. My takeaway is to stick with @threads for this type of thing.