Testing nested parallelization with @distributed and @spawn/@threads

I know this is a popular topic, but I wanted to check my understanding of nested parallelization with @distributed, @spawn, and @threads.

Here’s a working example of a few different strategies:

using LinearAlgebra
import Distributed
using BenchmarkTools

# Strategy 1: @distributed over the workers for the outer loop, @threads for the inner loop
function test(storemat)
  matlength = length(storemat)
  veclength = length(storemat[1])

  @sync Distributed.@distributed for j = 1:matlength
    Threads.@threads for i = 1:veclength
      storemat[j][i] = Threads.threadid()
    end
  end
end

# Strategy 2: serial outer loop, @threads for the inner loop
function test2(storemat)
  matlength = length(storemat)
  veclength = length(storemat[1])

  for j = 1:matlength
    Threads.@threads for i = 1:veclength
      storemat[j][i] = Threads.threadid()
    end
  end
end

# Strategy 3: @threads for the outer loop, serial inner loop
function test3(storemat)
  matlength = length(storemat)
  veclength = length(storemat[1])

  Threads.@threads for j = 1:matlength
    for i = 1:veclength
      storemat[j][i] = Threads.threadid()
    end
  end
end

# Strategy 4: @distributed over the workers for the outer loop, serial inner loop
function test4(storemat)
  matlength = length(storemat)
  veclength = length(storemat[1])

  @sync Distributed.@distributed for j = 1:matlength
    for i = 1:veclength
      storemat[j][i] = Threads.threadid()
    end
  end
end

# Strategy 5: fully serial baseline
function test5(storemat)
  matlength = length(storemat)
  veclength = length(storemat[1])

  for j = 1:matlength
    for i = 1:veclength
      storemat[j][i] = Threads.threadid()
    end
  end
end

# Strategy 6: @distributed over the workers, @spawn (then wait) for the inner loop
function test6(storemat)
  matlength = length(storemat)
  veclength = length(storemat[1])

  @sync Distributed.@distributed for j = 1:matlength
    K = Threads.@spawn for i = 1:veclength
      storemat[j][i] = Threads.threadid()
    end
    wait(K)
  end
end

# Strategy 7: @spawn for the outer loop, @spawn for the inner loops
# (note: only the outer task is waited on; the inner tasks are not)
function test7(storemat)
  matlength = length(storemat)
  veclength = length(storemat[1])

  K = Threads.@spawn for j = 1:matlength
    Threads.@spawn for i = 1:veclength
      storemat[j][i] = Threads.threadid()
    end
  end
  wait(K)
end


matlength = 1000
veclength = 1000
storemat = [Array{Float64,1}(undef, veclength) for j = 1:matlength]

@btime test(storemat) #4.402 s (135841 allocations: 497.96 MiB)
@btime test2(storemat) #88.807 ms (102119 allocations: 8.82 MiB)
@btime test3(storemat) #107.910 μs (101 allocations: 9.00 KiB)
@btime test4(storemat) #3.898 s (135903 allocations: 497.95 MiB)
@btime test5(storemat) #327.020 μs (0 allocations: 0 bytes)
@btime test6(storemat) #3.903 s (138338 allocations: 498.01 MiB)
@btime test7(storemat) #1.366 ms (6005 allocations: 469.19 KiB)
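
For reference, a quick check of how many worker processes and threads are actually available looks like this (just a sketch; threads are set per process, e.g. via JULIA_NUM_THREADS):

# Sanity-check sketch: confirm how many worker processes and threads are available.
@show Distributed.nworkers()   # number of Distributed worker processes
@show Threads.nthreads()       # threads on the master process
for p in Distributed.workers()
  # threads available on each worker process
  @show Distributed.remotecall_fetch(Threads.nthreads, p)
end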

The times were taken from a run on a cluster with 8 nodes and 8 cores per node requested (64 cores total in Julia). I did not use SharedArrays in this example because it threw an error: ERROR: LoadError: SystemError: shm_open() failed for /jl25628797yr6QSLMDKnb3MV0yim: Too many open files. (A rough sketch of what that variant would have looked like is included below, after the machine-file details.) Yes, I did generate a list of the nodes and the number of cores requested on each; the command for that with Slurm was

srun hostname -s > nodenames.txt

and then a simple script to put the output into a list like

8 host1
8 host2
8 host3
...

which is then used to start Julia with julia --machinefile nodes
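
The SharedArrays variant I had in mind was roughly along these lines (only a sketch, not the exact code that hit the shm_open error; the flattened matrix shape and the name test_shared are just for illustration):

using SharedArrays

# Rough sketch (illustrative only). A SharedArray is backed by shared memory
# visible to worker processes on the same host, so writes made inside a
# @distributed loop are visible to the master afterwards.
S = SharedArray{Float64}(veclength, matlength)

function test_shared(S)
  @sync Distributed.@distributed for j = 1:size(S, 2)
    for i = 1:size(S, 1)
      S[i, j] = Threads.threadid()
    end
  end
end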

My questions are:

  1. Have I understood correctly that @distributed should be used to spread work across nodes, and @threads or @spawn across the cores of a given node?
  2. What kind of performance improvements should be expected from future versions of these macros?
  3. In more complicated examples (mainly copying values and matrix multiplications), @threads outperforms @spawn for me. Is that a general rule in others’ experience?
  4. Are there other general tips to make this more efficient?