Are "optimized" single thread algorithms typically different than "optimized" multithread implementations?

In a very, very general sense: the runtime of a parallelized program is dominated by the portion of the work that can’t be parallelized. Why is this?

Suppose the total runtime T can be split into two parts: the parallelizable part T_{parallel} (abbreviated T_p) and the non-parallelizable (or serial) part T_{serial} (abbreviated T_s), so that T = T_s + T_p. As you increase the number of threads/processes working on your program, the parallel part shrinks towards 0 (since that work is divided among all processes/threads), leaving only T = T_s. To see this, define p = T_p / T, i.e. the proportion of the total runtime that can be parallelized, and rewrite the above as:

T = T_s + T_p = (1-p)T + pT
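
As a quick sanity check of that identity, here's a minimal sketch in Python (the total runtime of 100 s and p = 0.75 are made-up numbers, purely for illustration):

```python
import math

T = 100.0  # hypothetical total single-threaded runtime in seconds
p = 0.75   # hypothetical fraction of that runtime which is parallelizable

T_s = (1 - p) * T  # serial part, (1-p)T  -> 25 s
T_p = p * T        # parallelizable part, pT -> 75 s

assert math.isclose(T_s + T_p, T)  # T = (1-p)T + pT holds
```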

If we now throw n parallel processors at the problem, we get

(1-p)T + \frac{p}{n}T

We can see that as n \to \infty, the first term stays fixed at (1-p)T = T_s, while the second term tends to 0, so the total runtime tends to T_s: in the limit, the non-parallelizable portion dominates the runtime.

This is called Amdahl’s Law.
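
Here's a small sketch of how the two terms behave as n grows, again with made-up numbers (T = 100 s, p = 0.75):

```python
# Amdahl's law: runtime with n workers is T(n) = (1-p)*T + (p/n)*T.
T = 100.0  # hypothetical total single-threaded runtime in seconds
p = 0.75   # hypothetical parallelizable fraction

def runtime(n):
    """Predicted runtime on n workers under Amdahl's law."""
    return (1 - p) * T + (p / n) * T

for n in (1, 2, 4, 16, 256, 1_000_000):
    print(f"n = {n:>7}: T(n) = {runtime(n):7.3f} s, speedup = {T / runtime(n):.2f}x")

# The runtime approaches the serial floor (1-p)*T = 25 s: no matter how many
# workers you add, the speedup never exceeds 1/(1-p) = 4x with these numbers.
```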

The trouble is, of course, that p depends heavily on the algorithm you want to compute: some algorithms are easily parallelizable (i.e. have a large p, close to 1), while others are very serial in nature (i.e. have a small p, close to 0). Regardless of which it is, it's pretty much always worthwhile to start by optimizing the single-threaded implementation of the parallelizable work itself, since that's the part that benefits from parallelization, and it's the code every thread/process ends up running.
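
One way to see that last point, still with the same made-up numbers: if you first make the parallelizable work itself s times faster through plain single-threaded optimization, that improvement multiplies with whatever thread count you later throw at it. A rough sketch of this model:

```python
# Made-up model: serial part T_s stays as-is; the parallelizable part T_p is
# first sped up by a factor s (better single-threaded code), then split over n workers.
T = 100.0  # hypothetical total runtime in seconds
p = 0.75   # hypothetical parallelizable fraction
T_s = (1 - p) * T
T_p = p * T

def runtime(n, s=1.0):
    """Runtime on n workers if the parallelizable work was also made s times faster."""
    return T_s + T_p / (s * n)

for n in (1, 4, 16):
    print(f"n = {n:2}: as-is {runtime(n):6.2f} s, "
          f"with 2x faster inner work {runtime(n, s=2.0):6.2f} s")

# The single-threaded optimization (s = 2) pays off at every thread count,
# because it speeds up exactly the code that each of the n workers runs.
```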
