Suppose I have two functions, with one nested inside the other. I have N
processors for the computation. Is it possible to distribute the outer function over n1
processors and distribute each call of the inside function over (Nn1)/n1
processors?
Well, youâ€™ll need (N  n1)/n1
to be an integer, which is equivalent to the fact that n1
must divide N
.
But youâ€™ll also need that n1 + (Nn1)/n1 <= N
, so that youâ€™ll have enough processors. For that, consider that N = (q+1)n1
, and note that this inequality is always true as soon as n >= 1
, which is given.
So you just need n1
to divide N
. However, note that a huge number of processors, q(n11)
to be precise, will not be used.
Suppose N=100, n1=10
so that each call of the inner function is distributed over 9 processors. Is there a way to nest this parallelization in Julia?
You shouldnâ€™t forget to specify that n1>0
. Not all integers work here.
0
does not divide N
More seriously though, what you seem to want is clearly doable, you might take a look at the manual there : MultiThreading Â· The Julia Language, remembering that Threads.@threads
ed loops can be nested.
Thanks. I am already threading the inner function and distributing the outer one. I am wondering if I can distribute both functions, instead of threading the inner one. The reason Iâ€™m interested in this is because the inner function actually has yet another nested function which I would like to thread. So the set up is something like f(x, \ g(x, y, \ h(x,y,z))). I would like to distribute f over x on n1
processors. For each x i then want to distribute g over y on (Nn1)/n1
processors. And finally for each x,y I want to thread h over z say over 10 threads.
It would probably help a lot to show us a MWE, if you can.
Here is the MWE showing three nested functions. innerfn
is threaded, outerfn
is distributed. I would like the forloop in middlefn
to be also distributed.
using Random, LinearAlgebra, Distributions, Optim, StatsBase
using Optim: converged, maximum, maximizer, minimizer, iterations
using DelimitedFiles
using Statistics
using SharedArrays
using Optim, NLSolversBase, Random
using LinearAlgebra: diag
Random.seed!(1234)
## define fake data
struct makedata
data::Array{Int64}
end
data = Vector{makedata}(undef, 1000)
for t = 1:1000
data[t] = makedata(rand(1:100, 10, 20))
end
function innerfn(x)
sumobj = zeros(size(x,1))
Threads.@threads for i = 1:size(x,1)
sumobj .+= x[:, i]
end
return sumobj
end
function middlefn(x)
results_interm = zeros(Float64, getindex(size(x.data), 1))
for i = 1:maximum(x.data[:,1]) # i.e., for i=1:100
dum_mat = ones(Int64, size(x.data))
dum_mat[findall(x.data[:,1] .>= i), :] .= 0
results_interm .+= innerfn(dum_mat)
end
return sum(results_interm, dims=1)
end
function outerfn(x)
res = pmap(middlefn, x)
# do something with the results e.g., construct a likelihood to input into Optim
end
outerfn(data) # In practice I would embed outerfn() inside an optim statement
donâ€™t worry about parallelism yet. first you should deal with the fact that your code is way slower than it should be. I recommend reading the performance tools and specifically about @views
I tried sticking @views
before a number of statements but it didnâ€™t lead to any massive gains in time. Allocations did fall though. Am I calling @views
wrong below/is this what you had in mind?
I imagine the gains from this wonâ€™t be so large that I no longer need to parallelize.
using Random, LinearAlgebra, Distributions, Optim, StatsBase
using Optim: converged, maximum, maximizer, minimizer, iterations
using DelimitedFiles
using Statistics
using SharedArrays
using Optim, NLSolversBase, Random
using LinearAlgebra: diag
Random.seed!(1234)
## define fake data
struct makedata
data::Array{Int64}
end
data = Vector{makedata}(undef, 1000)
for t = 1:1000
data[t] = makedata(rand(1:100, 10, 20))
end
function innerfn(x)
sumobj = zeros(size(x,1))
Threads.@threads for i = 1:size(x,1)
@views sumobj .+= x[:, i]
end
return sumobj
end
function middlefn(x)
results_interm = zeros(Float64, getindex(size(x.data), 1))
for i = 1:maximum(x.data[:,1]) # i.e., for i=1:100
dum_mat = ones(Int64, size(x.data))
dum_mat[findall(x.data[:,1] .>= i), :] .= 0
@views results_interm .+= innerfn(dum_mat)
end
return sum(results_interm, dims=1)
end
function outerfn(x)
res = pmap(middlefn, x)
# do something with the results e.g., construct a likelihood to input into Optim
end
outerfn(data) # In practice I would embed outerfn() inside an optim statement
In middlefn
you should make a view when indexing x.data
. The @views
on the results_interm
assignment line does nothing. @views
affects indexing in its argument expression.
You also have your threading at the wrong level. Moving the threads out of innerfn
and into middlefn
should dramatically reduce synchronization overhead.
Thanks. Changing where I place @views
did help as did moving the threading to middlefn
.
However this does not answer the original question about distributing twice. Is it possible to actually distribute in a nested fashion instead of threading the inner function? If so how would you do it?
I think the best thing you might do is :

Remove threading & distribution, and reduce size of example while we work on timings.

Refactor your code as a set of nested loops, in one and only one function. This will allow you to see the whole of it at once.

Write ALL loops : do not use x .+= y but write the full loop. This will allow you to see if there are loops that you can merge together, and reordering your loops is the best way to see if you are doing several times the same computations.

Try to reuse temporary vectors as much as possible : no need to recreate
sumobj
each time, reuse the same one. Creating a vector is a slow operation, and therefore should not be done inside the loops, while modifying the values of an existing vector is much faster. There are other vectors like that that might be superfluous. 
using BenchmarkTools; @btime outerfn(data)
to time your computation to be sure that you are working in the right direction.
Removing all distributed and threading in your code, on my laptop, this gives :
148.379 ms (1374076 allocations: 223.82 MiB)
My bet is that you might win a 100x factor at minimum if we rewrite your code correctly.
I tried to do it for you, but it is hard to see what you really want to compute, as the dummyfication of your MWE is not really clear (what do you relaly want ? what is dummy stuff that actually replace an important computtion, and what is just dummy stuff that can be removed ? I cant tell iâ€™m sorryâ€¦)
Once the code is â€śoptimalâ€ť on one core, if it is still too slow for you, we might start discussing distribution and threading again.