I know this is a popular topic, but I wanted to check my understanding of nested parallelization with @distributed, @spawn, and @threads.
Here is a working example comparing a few different strategies:
using LinearAlgebra
import Distributed
using BenchmarkTools

# @distributed over the outer loop, @threads over the inner loop on each worker
function test(storemat)
    matlength = length(storemat)
    veclength = length(storemat[1])
    @sync Distributed.@distributed for j = 1:matlength
        Threads.@threads for i = 1:veclength
            storemat[j][i] = Threads.threadid()
        end
    end
end

# @threads over the inner loop only
function test2(storemat)
    matlength = length(storemat)
    veclength = length(storemat[1])
    for j = 1:matlength
        Threads.@threads for i = 1:veclength
            storemat[j][i] = Threads.threadid()
        end
    end
end

# @threads over the outer loop only
function test3(storemat)
    matlength = length(storemat)
    veclength = length(storemat[1])
    Threads.@threads for j = 1:matlength
        for i = 1:veclength
            storemat[j][i] = Threads.threadid()
        end
    end
end

# @distributed over the outer loop only
function test4(storemat)
    matlength = length(storemat)
    veclength = length(storemat[1])
    @sync Distributed.@distributed for j = 1:matlength
        for i = 1:veclength
            storemat[j][i] = Threads.threadid()
        end
    end
end

# serial baseline
function test5(storemat)
    matlength = length(storemat)
    veclength = length(storemat[1])
    for j = 1:matlength
        for i = 1:veclength
            storemat[j][i] = Threads.threadid()
        end
    end
end

# @distributed over the outer loop, one @spawn task per inner loop, waited on each iteration
function test6(storemat)
    matlength = length(storemat)
    veclength = length(storemat[1])
    @sync Distributed.@distributed for j = 1:matlength
        K = Threads.@spawn for i = 1:veclength
            storemat[j][i] = Threads.threadid()
        end
        wait(K)
    end
end

# @spawn for the outer loop and per inner loop; note only the outer task is
# waited on, so the inner tasks may still be running when the function returns
function test7(storemat)
    matlength = length(storemat)
    veclength = length(storemat[1])
    K = Threads.@spawn for j = 1:matlength
        Threads.@spawn for i = 1:veclength
            storemat[j][i] = Threads.threadid()
        end
    end
    wait(K)
end

matlength = 1000
veclength = 1000
storemat = [Array{Float64,1}(undef, veclength) for j = 1:matlength]
@btime test(storemat) #4.402 s (135841 allocations: 497.96 MiB)
@btime test2(storemat) #88.807 ms (102119 allocations: 8.82 MiB)
@btime test3(storemat) #107.910 μs (101 allocations: 9.00 KiB)
@btime test4(storemat) #3.898 s (135903 allocations: 497.95 MiB)
@btime test5(storemat) #327.020 μs (0 allocations: 0 bytes)
@btime test6(storemat) #3.903 s (138338 allocations: 498.01 MiB)
@btime test7(storemat) #1.366 ms (6005 allocations: 469.19 KiB)
The times were taken from a run on a cluster with 8 nodes and 8 cores per node requested (64 total requested in Julia). I did not use SharedArrays in this example because it threw the error ERROR: LoadError: SystemError: shm_open() failed for /jl25628797yr6QSLMDKnb3MV0yim: Too many open files.
Yes, I did generate a list of the nodes and how many cores were requested on each. The Slurm command for that was
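For reference, what I was attempting with SharedArrays was roughly the following sketch (hypothetical names; as far as I understand, a SharedArray is only shared between processes on the same host, which is presumably part of my problem on a multi-node run):
import Distributed
using SharedArrays

# Sketch: replace the vector of vectors with a 2D SharedArray so that writes
# made on (local) workers are visible to the master process.
function test_shared(S)
    matlength = size(S, 2)
    veclength = size(S, 1)
    @sync Distributed.@distributed for j = 1:matlength
        for i = 1:veclength
            S[i, j] = Threads.threadid()
        end
    end
end

S = SharedArray{Float64}(veclength, matlength)  # same sizes as above
test_shared(S)  # only sensible when all workers are on one shared-memory host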
srun hostname -s > nodenames.txt
and then a simple script (sketched below) to collapse the output into a list like
8 host1
8 host2
8 host3
...
which is then passed to julia --machinefile nodes.
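The "simple script" step was along these lines, a minimal Julia sketch (the filenames nodenames.txt and nodes are just the ones used above):
# Hypothetical helper: count how many times each host appears in the srun
# output and write "count host" lines to the machine file.
counts = Dict{String,Int}()
for host in readlines("nodenames.txt")
    counts[host] = get(counts, host, 0) + 1
end
open("nodes", "w") do io
    for host in sort(collect(keys(counts)))
        println(io, "$(counts[host]) $host")
    end
end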
My questions are:
- Have I understood correctly that @distributed should be used across different nodes, and @threads or @spawn for the different cores on a given node?
- What kind of performance increase should be expected from future versions of these functions?
- In more complicated examples (mainly copying values and matrix multiplies), @threads outperforms @spawn for me. Is that a general rule in others' experience?
- Are there other general tips to make this more efficient?