I have made that splitting procedure long time ago, so it took me some time to remember how it works. It is indeed slightly cryptic
Idea is simple though. We have following nested loops
for i in 1:N
for j in i+1:N
# Some operation
end
end
and we are assuming, that operation
inside this loop take the same time, irrelevant of index. Since each task has some time to get scheduled we want to have bare minimum of the tasks spawned. It leads us to the task of splitting external i
loop in k
chunks (where k
is the number of threads), such that amount of operations in each chunk is equal.
Now, it can be solved exactly, but I am making another assumption, that N >> 1
and we can throw away discrete nature of i
and change it to continuous variable x
. Then number of internal operations for each x
is just N - x
(one can easily see that, taking into account linearity of the problem and the fact that when x = 0
number of elements in internal loop is N
and when x = N
number of elements is 0
.
In this approach, number of operations is \int_{0}^{N} (N - x) dx = \frac{N^2}{2} and each chunk should contain \frac{N^2}{2k} elements. For chunk number m, number of elements in all chunks from 1 to m is given by
\int_0^{t_m} (N - x) dx = \frac{N^2}{2k}m
where t_m is the point where we should split another chunk.
After integration and substitution we obtain
N t_m - \frac{t_m^2}{2} = \frac{N^2}{2k}m
Solving this equation we get
t_m = N (1 \pm \sqrt{1 - \frac{m}{k}})
and in order to satisfy boundary conditions t_0 = 0, t_k = N we choose minus sign.
As a last and to be honest unnecessary step I change the definition of m to m \rightarrow k - m and the answer is
t_m = N (1 - \sqrt{\frac{m}{k}}), \forall{m = k..0}
Chunks are just intervals (t_m, t_{m - 1}]
Of course it can be done more accurate and this assumptions are more imbalanced for small numbers of N, but multithreading for small N is meaningless anyway and deviations related to scheduling and other things will nullify more exact solutions (or at least that’s what I believe in, could be wrong).
–
One small note, the acceleration loop is going to
N
instead ofN-1
(line 107):
I do not quite get it, since it is written as a loop from 1 to N, I do not see any N - 1
.