# Testing nested parallelization with @distributed and @spawn/@threads

I know this is a popular topic, but I wanted to check my understanding of nested parallelization with `@distributed`, `@spawn`, and `@threads`.

Here's a working example of a few different strategies:

```julia
using LinearAlgebra
import Distributed
using BenchmarkTools

# @distributed across workers, @threads across the local cores
function test(storemat)
    matlength = length(storemat)
    veclength = length(storemat[1])

    @sync Distributed.@distributed for j = 1:matlength
        Threads.@threads for i = 1:veclength
            storemat[j][i] = i * j  # simple fill as stand-in work
        end
    end
end

# serial outer loop, @threads inner loop
function test2(storemat)
    matlength = length(storemat)
    veclength = length(storemat[1])

    for j = 1:matlength
        Threads.@threads for i = 1:veclength
            storemat[j][i] = i * j
        end
    end
end

# @threads outer loop, serial inner loop
function test3(storemat)
    matlength = length(storemat)
    veclength = length(storemat[1])

    Threads.@threads for j = 1:matlength
        for i = 1:veclength
            storemat[j][i] = i * j
        end
    end
end

# @distributed outer loop, serial inner loop
function test4(storemat)
    matlength = length(storemat)
    veclength = length(storemat[1])

    @sync Distributed.@distributed for j = 1:matlength
        for i = 1:veclength
            storemat[j][i] = i * j
        end
    end
end

# fully serial baseline
function test5(storemat)
    matlength = length(storemat)
    veclength = length(storemat[1])

    for j = 1:matlength
        for i = 1:veclength
            storemat[j][i] = i * j
        end
    end
end

# @distributed outer loop, @spawn for the inner loop
function test6(storemat)
    matlength = length(storemat)
    veclength = length(storemat[1])

    @sync Distributed.@distributed for j = 1:matlength
        K = Threads.@spawn for i = 1:veclength
            storemat[j][i] = i * j
        end
        wait(K)
    end
end

# @spawn for the outer loop, @threads inner loop
function test7(storemat)
    matlength = length(storemat)
    veclength = length(storemat[1])

    K = Threads.@spawn for j = 1:matlength
        Threads.@threads for i = 1:veclength
            storemat[j][i] = i * j
        end
    end
    wait(K)
end

matlength = 1000
veclength = 1000
storemat = [Array{Float64,1}(undef, veclength) for j = 1:matlength]

@btime test(storemat)  # 4.402 s (135841 allocations: 497.96 MiB)
@btime test2(storemat) # 88.807 ms (102119 allocations: 8.82 MiB)
@btime test3(storemat) # 107.910 μs (101 allocations: 9.00 KiB)
@btime test4(storemat) # 3.898 s (135903 allocations: 497.95 MiB)
@btime test5(storemat) # 327.020 μs (0 allocations: 0 bytes)
@btime test6(storemat) # 3.903 s (138338 allocations: 498.01 MiB)
@btime test7(storemat) # 1.366 ms (6005 allocations: 469.19 KiB)
```

The times were taken from a run on a cluster with 8 nodes and 8 cores per node requested (64 workers total in Julia). I did not use SharedArrays in this example because it threw the error `ERROR: LoadError: SystemError: shm_open() failed for /jl25628797yr6QSLMDKnb3MV0yim: Too many open files`. Yes, I did generate a list of the nodes and how many cores were requested on each. The Slurm command for that was

```shell
srun hostname -s > nodenames.txt
```

and then a simple script to put the output into a list like

```text
8 host1
8 host2
8 host3
...
```

which is then passed to Julia as `julia --machinefile nodes`.
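For reference, the "simple script" step can be a one-liner; this is only a sketch, assuming `nodenames.txt` holds one hostname per allocated core (as `srun hostname -s` produces) and that the output file is named `nodes` as above:

```shell
# Sort the per-core hostname list, count repeats per host, and emit
# "count hostname" lines matching the layout shown above.
sort nodenames.txt | uniq -c | awk '{print $1, $2}' > nodes
```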

My questions are:

1. Have I understood correctly that `@distributed` should be used across different nodes, and `@threads` or `@spawn` across the cores on each node?
2. What kind of performance improvement should be expected from future versions of these functions?
3. On more complicated examples (mainly copying values and matrix multiplies), `@threads` outperforms `@spawn` for me. Is that a general rule in others' experience?
4. Are there other general tips to make this more efficient?