Multithreading a for loop

Well, this has been the big question in HPC for the last 20 years… You have to increase temporal and spatial data locality in order to minimize the stress on RAM bandwidth and make use of your cache hierarchy. If you consider your simple loop, there is not enough work to do compared to the number of reads and writes to memory.
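
For reference, a memory-bound threaded loop typically looks something like the sketch below (a made-up stand-in; your actual loop may differ):

using Base.Threads

# One addition per element against three memory accesses: the threads spend
# most of their time waiting on RAM rather than computing.
function add!(x, y, z)
    @threads for i in eachindex(x, y, z)
        @inbounds x[i] = y[i] + z[i]
    end
    return x
end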

The classical example is matrix*matrix multiplication, which you could implement as a succession of scalar products and obtain awful performance. In this case there is a lot of work to do (N^3)
compared to the memory footprint (N^2), and a multilevel block implementation allows you to reach peak performance.
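
To make the blocking idea concrete, here is a minimal one-level sketch (blocked_mul! and the block size bs = 64 are illustrative assumptions, not the real BLAS kernel, which uses several levels of blocking plus SIMD micro-kernels):

# Single level of cache blocking for C = A*B (square N×N matrices assumed).
function blocked_mul!(C, A, B; bs = 64)
    n = size(A, 1)
    fill!(C, zero(eltype(C)))
    for jj in 1:bs:n, kk in 1:bs:n, ii in 1:bs:n
        # work on a bs×bs tile that fits in cache before moving on
        for j in jj:min(jj + bs - 1, n), k in kk:min(kk + bs - 1, n), i in ii:min(ii + bs - 1, n)
            @inbounds C[i, j] += A[i, k] * B[k, j]
        end
    end
    return C
end

In practice you would simply call mul!(C, A, B) or A*B, which implement these ideas far more carefully.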

#1 You have to increase the ratio (work to do)/(RAM data read+write), called arithmetic intensity or computational density in the roofline model.
The most common way of doing so is to fuse operations and avoid redundant reads and writes.
For example, replace

@. x = y + z # 2R + 1W
@. x += w # 2R + 1W

with

@. x = y + z + w # 3R + 1W

In this dummy example you increase the temporal data locality: the fused version moves 4 values per element through memory instead of 6, because x is no longer written, read back, and written again.
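
If you want to see the effect on your machine, a quick check could look like this (assuming the BenchmarkTools package is installed; the array length 10^7 is arbitrary):

using BenchmarkTools
n = 10^7
x = zeros(n); y = rand(n); z = rand(n); w = rand(n)
@btime (@. $x = $y + $z)       # 2 reads + 1 write per element
@btime (@. $x += $w)           # 2 reads + 1 write per element
@btime (@. $x = $y + $z + $w)  # 3 reads + 1 write per element (fused)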

#2 If you operate on a machine with several memory channels (e.g. a multi-socket NUMA node), initializing the data in parallel (using the same threads that will later work on it) may help to ensure that your different pieces of data are not all stored behind a single physical memory controller.
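
A minimal sketch of that idea, assuming a Linux-style first-touch policy (the function name alloc_init is made up); the :static schedule keeps the chunk-to-thread mapping fixed so that later :static loops work on the same chunks:

using Base.Threads

# The page is physically allocated on the NUMA node of the thread that writes
# it first, so initializing in parallel spreads pages over memory controllers.
function alloc_init(n)
    x = Vector{Float64}(undef, n)      # reserved, but pages not yet touched
    @threads :static for i in eachindex(x)
        @inbounds x[i] = 0.0           # first touch by the "owning" thread
    end
    return x
end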
