# How to do a nested parallelization of two nested functions?

Suppose I have two functions, with one nested inside the other. I have `N` processors for the computation. Is it possible to distribute the outer function over `n1` processors and distribute each call of the inside function over `(N-n1)/n1` processors?

Well, you’ll need `(N - n1)/n1` to be an integer, which is equivalent to requiring that `n1` divides `N`.

But you’ll also need `n1 + (N-n1)/n1 <= N`, so that you have enough processors. To see this, write `N = (q+1)n1`; the inequality becomes `n1 + q <= (q+1)n1`, i.e. `q(n1-1) >= 0`, which is always true as soon as `n1 >= 1`, which is given.

So you just need `n1` to divide `N`. However, note that a huge number of processors, `q(n1-1)` to be precise, will not be used.

Suppose `N=100, n1=10` so that each call of the inner function is distributed over 9 processors. Is there a way to nest this parallelization in Julia?

You shouldn’t forget to specify that `n1>0`. Not all integers work here.


`0` does not divide `N`. More seriously though, what you seem to want is clearly doable; you might take a look at the manual here: Multi-Threading · The Julia Language, remembering that `Threads.@threads`-ed loops can be nested.
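On recent Julia versions, nesting `@threads` loops is a minimal sketch of that idea; the task-based scheduler interleaves all iterations across the available threads (the `i * j` body is just a placeholder for real work):

```julia
# Each outer iteration spawns its own threaded inner loop; every (i, j)
# pair writes a distinct cell, so there is no data race here.
results = zeros(4, 4)
Threads.@threads for i in 1:4
    Threads.@threads for j in 1:4
        results[i, j] = i * j   # placeholder for real work
    end
end
```

Note that nesting `@threads` does not multiply the thread count: all loops share the same pool of `Threads.nthreads()` threads.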

Thanks. I am already threading the inner function and distributing the outer one. I am wondering if I can distribute both functions, instead of threading the inner one. The reason I’m interested in this is that the inner function actually contains yet another nested function which I would like to thread. So the setup is something like `f(x, g(x, y, h(x, y, z)))`. I would like to distribute `f` over `x` on `n1` processors. For each `x` I then want to distribute `g` over `y` on `(N-n1)/n1` processors. And finally, for each `x, y`, I want to thread `h` over `z`, say over 10 threads.

It would probably help a lot to show us a MWE, if you can.

Here is the MWE showing three nested functions. `innerfn` is threaded, `outerfn` is distributed. I would like the `for` loop in `middlefn` to also be distributed.

```julia
using Distributed
using Random, LinearAlgebra, Distributions, Optim, StatsBase, NLSolversBase
using Optim: converged, maximum, maximizer, minimizer, iterations
using DelimitedFiles
using Statistics
using SharedArrays
using LinearAlgebra: diag

Random.seed!(1234)

## define fake data
struct makedata
    data::Array{Int64}
end
data = Vector{makedata}(undef, 1000)
for t = 1:1000
    data[t] = makedata(rand(1:100, 10, 20))
end

function innerfn(x)
    sumobj = zeros(size(x, 1))
    Threads.@threads for i = 1:size(x, 2)
        sumobj .+= x[:, i]
    end
    return sumobj
end

function middlefn(x)
    results_interm = zeros(Float64, size(x.data, 1))
    for i = 1:maximum(x.data[:, 1])  # i.e., for i = 1:100
        dum_mat = ones(Int64, size(x.data))
        dum_mat[findall(x.data[:, 1] .>= i), :] .= 0

        results_interm .+= innerfn(dum_mat)
    end
    return sum(results_interm, dims=1)
end

function outerfn(x)
    res = pmap(middlefn, x)
    # do something with the results, e.g., construct a likelihood to input into Optim
end

outerfn(data) # In practice I would embed outerfn() inside an optim statement
```

Don’t worry about parallelism yet. First you should deal with the fact that your code is way slower than it should be. I recommend reading the Performance Tips section of the manual, and specifically the part about `@views`.


I tried sticking `@views` before a number of statements but it didn’t lead to any massive gains in time. Allocations did fall though. Am I calling `@views` wrong below/is this what you had in mind?

I imagine the gains from this won’t be so large that I no longer need to parallelize.

```julia
using Distributed
using Random, LinearAlgebra, Distributions, Optim, StatsBase, NLSolversBase
using Optim: converged, maximum, maximizer, minimizer, iterations
using DelimitedFiles
using Statistics
using SharedArrays
using LinearAlgebra: diag

Random.seed!(1234)

## define fake data
struct makedata
    data::Array{Int64}
end
data = Vector{makedata}(undef, 1000)
for t = 1:1000
    data[t] = makedata(rand(1:100, 10, 20))
end

function innerfn(x)
    sumobj = zeros(size(x, 1))
    Threads.@threads for i = 1:size(x, 2)
        @views sumobj .+= x[:, i]
    end
    return sumobj
end

function middlefn(x)
    results_interm = zeros(Float64, size(x.data, 1))
    for i = 1:maximum(x.data[:, 1])  # i.e., for i = 1:100
        dum_mat = ones(Int64, size(x.data))
        dum_mat[findall(x.data[:, 1] .>= i), :] .= 0

        @views results_interm .+= innerfn(dum_mat)
    end
    return sum(results_interm, dims=1)
end

function outerfn(x)
    res = pmap(middlefn, x)
    # do something with the results, e.g., construct a likelihood to input into Optim
end

outerfn(data) # In practice I would embed outerfn() inside an optim statement
```

In `middlefn` you should make the view when indexing `x.data`. The `@views` on the `results_interm` line does nothing: `@views` only affects indexing expressions that appear inside its argument, and there is no slicing on that line.
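To illustrate the placement point with a hypothetical array `A`: `@views` has to wrap the expression where the indexing actually happens.

```julia
A = rand(10, 20)

# This allocates a fresh vector holding a copy of the column:
col = A[:, 1]

# Wrapping the indexing expression in @views gives a non-allocating
# view into A instead; same values, no copy:
@views colv = A[:, 1]

# A line with no indexing in it gains nothing from @views:
# @views results_interm .+= innerfn(dum_mat)   # <- the no-op pattern
```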


You also have your threading at the wrong level. Moving the threads out of `innerfn` and into `middlefn` should dramatically reduce synchronization overhead.
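A sketch of what that move might look like, based on my reading of the MWE (names and the scalar return value are my own; each iteration writes its own slot of `partials`, so there is no race on the accumulator):

```julia
# Thread the middle loop instead of the innermost one.
function middlefn_threaded(data::AbstractMatrix{<:Integer})
    M = maximum(@view data[:, 1])
    partials = [zeros(size(data, 1)) for _ in 1:M]
    Threads.@threads for i in 1:M
        dum_mat = ones(Int, size(data))
        dum_mat[findall(@view(data[:, 1]) .>= i), :] .= 0
        partials[i] .= vec(sum(dum_mat, dims=2))  # == innerfn(dum_mat), serial
    end
    return sum(sum(partials))  # scalar instead of the MWE's 1x1 matrix
end

middlefn_threaded(rand(1:100, 10, 20))
```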

Thanks. Changing where I place `@views` did help as did moving the threading to `middlefn`.

However, this does not answer the original question about distributing twice. Is it possible to actually distribute in a nested fashion instead of threading the inner function? If so, how would you do it?
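Not an answer from this thread, but one way to sketch two-level distribution is to split `workers()` into pools: an outer `WorkerPool` for the outer `pmap`, and a disjoint slice of inner workers handed to each call of the middle function. Everything below (`middlefn`, the `y -> y^2` payload, the worker counts) is a placeholder, and this relies on the default all-to-all worker topology so that workers can `pmap` to each other.

```julia
using Distributed
addprocs(6)                      # e.g. N = 6 workers on this machine

n1 = 2                           # outer workers
outer_ids = workers()[1:n1]
inner_ids = workers()[n1+1:end]
# Give each outer slot its own disjoint slice of inner workers.
inner_slices = [inner_ids[k:n1:end] for k in 1:n1]

@everywhere using Distributed
@everywhere function middlefn(x, my_inner_ids)
    # Inner level: distribute over x on this call's private pool.
    pmap(y -> y^2, WorkerPool(my_inner_ids), x)
end

chunks = [collect(1.0:5.0) for _ in 1:4]
slices = [inner_slices[mod1(k, n1)] for k in 1:length(chunks)]

# Outer level: distribute chunks over the outer pool, pairing each
# chunk with an inner-worker slice.
results = pmap((x, s) -> middlefn(x, s), WorkerPool(outer_ids),
               chunks, slices)
```

Whether this beats the simpler distribute-outer / thread-inner setup depends on the machine; the pool bookkeeping adds communication overhead of its own.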

I think the best thing you might do is:

• Remove threading & distribution, and reduce the size of the example while we work on timings.

• Refactor your code as a set of nested loops, in one and only one function. This will allow you to see the whole of it at once.

• Write ALL the loops: do not use `x .+= y`, but write the full loop. This will let you see whether there are loops you can merge, and reordering your loops is the best way to spot whether you are doing the same computations several times.

• Try to reuse temporary vectors as much as possible: there is no need to recreate `sumobj` each time; reuse the same one. Creating a vector is a slow operation and therefore should not be done inside the loops, while modifying the values of an existing vector is much faster. There are other vectors like that which might be superfluous.

• `using BenchmarkTools; @btime outerfn(data)` to time your computation, to be sure that you are working in the right direction.
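To make the middle bullets concrete, here is what the MWE’s `middlefn` might collapse to once everything is written as explicit loops. This is a sketch based on my reading of the dummy logic: every row of `dum_mat` is either all ones or all zeros, so `innerfn`’s column sum contributes exactly `ncol` per surviving row, and no matrix ever needs to be allocated.

```julia
function middlefn_loops(data::AbstractMatrix{<:Integer})
    nrow, ncol = size(data)
    M = maximum(@view data[:, 1])
    total = 0
    for i in 1:M, r in 1:nrow
        # Row r of dum_mat survives (all ones) exactly when data[r, 1] < i,
        # contributing ncol to the final sum; otherwise it contributes 0.
        if data[r, 1] < i
            total += ncol
        end
    end
    return total
end
```

From here it is one more step to the closed form `ncol * sum(M .- data[:, 1])`, with no loop at all, which is presumably where most of the speedup would come from.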

Removing all distribution and threading from your code, on my laptop, this gives:

```
  148.379 ms (1374076 allocations: 223.82 MiB)
```

My bet is that you might gain at least a 100x factor if we rewrite your code correctly.

I tried to do it for you, but it is hard to see what you really want to compute, as the dummyfication of your MWE is not really clear (what do you really want? What is dummy stuff that replaces an important computation, and what is just dummy stuff that can be removed? I can’t tell, I’m sorry…).

Once the code is “optimal” on one core, if it is still too slow for you, we might start discussing distribution and threading again.