First, you should almost never write code like what we wrote. It's difficult to read and prone to mistakes. Usually, the compiler is smart enough to do this automatically when it's permissible to do so. This can be blocked by things like the fact that floating-point math is non-associative and you explicitly wrote the loop to add them first to last. The compiler won't ignore that normally, but you can mark the loop `@simd` or use `@fastmath` (WARNING: `@fastmath` can have all sorts of other consequences and should only be used very carefully) to tell it you don't actually care about the order it adds them in.
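For example, a plain summation loop can be marked like this (a minimal sketch; `mysum` is just a name for illustration):

```julia
# Hypothetical example: @simd tells the compiler it may reorder the
# floating-point additions, which is what permits vectorization here.
function mysum(x::Vector{Float64})
    s = 0.0
    @simd for i in eachindex(x)
        @inbounds s += x[i]
    end
    return s
end
```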
The `LoopVectorization` package's `@turbo` macro is usually better at vectorizing loops than the normal compiler. It's my usual recommendation, but I didn't have it installed and didn't want to fiddle with it if it didn't do what I wanted. I hand-wrote that monstrosity because it was simple and a proof of concept, just to see what level of speed could be had.
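If you do have it installed, usage looks roughly like this (a sketch, assuming `LoopVectorization` is in your environment):

```julia
using LoopVectorization

# Hypothetical example: @turbo takes over the vectorization and
# unrolling decisions for this loop.
function mysum_turbo(x::Vector{Float64})
    s = 0.0
    @turbo for i in eachindex(x)
        s += x[i]
    end
    return s
end
```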
There are two major factors that can greatly accelerate a loop. Both are about doing the same work with fewer instructions. Your CPU has instructions for adding a pair of `Float64`s, for example. But most contemporary general-purpose CPUs also have Single Instruction Multiple Data (SIMD), aka vectorized, instructions. Vectorized instructions are like scalar instructions except that they operate on more data at once. For example, my processor also has instructions for adding 2 or 4 `Float64` (or 4 or 8 `Float32`) pairs at a time (some recent processors can double this again). There are similar instructions for loading data and other arithmetic. So, instead of loading a `Float64` and adding it to an accumulator, my computer can choose to load 4 adjacent `Float64`s and add them to 4 different accumulators. That's an easy 4x speedup right there (although it usually requires a little extra work at the end to add up the 4 accumulators).
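In scalar Julia, that idea looks something like the following sketch (hypothetical `sum4`; the compiler does this with SIMD registers rather than named variables, but the structure is the same). For simplicity it assumes the length is a multiple of 4:

```julia
# Four independent accumulators let the additions proceed in parallel.
# Assumes length(x) is a multiple of 4.
function sum4(x::Vector{Float64})
    s1 = s2 = s3 = s4 = 0.0
    @inbounds for i in 1:4:length(x)
        s1 += x[i]
        s2 += x[i + 1]
        s3 += x[i + 2]
        s4 += x[i + 3]
    end
    return (s1 + s2) + (s3 + s4)  # the extra work at the end
end
```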
Look at the assembly from above. `vfmadd231pd` is a "fused multiply-add" instruction. It takes one number, multiplies it by a second, and adds the result to a third. This means it is faster and has less roundoff error than a multiply instruction followed by an add. The `pd` suffix stands for "packed double", which is to say it operates on a vector of `Float64` values, and the `ymm` registers mean that it is doing 4 per instruction. The remaining 3 instructions increment the loop counter, check whether it's finished, and loop back to the top if it's not.
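In Julia source, `muladd(a, b, c)` is the usual way to give the compiler permission to fuse the multiply and add. A dot-product sketch (hypothetical `mydot`) that typically compiles down to `vfmadd231pd` on a machine with FMA support:

```julia
# muladd(a, b, c) computes a*b + c and may fuse into a single fma instruction.
function mydot(x::Vector{Float64}, y::Vector{Float64})
    s = 0.0
    @inbounds @simd for i in eachindex(x, y)
        s = muladd(x[i], y[i], s)
    end
    return s
end

# You can inspect the generated assembly yourself (output is CPU-dependent):
# @code_native debuginfo=:none mydot(rand(8), rand(8))
```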
Now to your specific question. Unrolling is another optimization; it addresses the fact that, in a simple loop, you can spend the majority of the time just doing loop overhead instead of the actual calculations you wanted. Imagine that we instead used only one `fma` and then looped back. We would have 1 `fma` that processed 1 number and then 3 instructions to do the looping, so 1 number multiplied-and-accumulated per 4 instructions. Using a vectorized operation gets us up to 4 numbers per 4 instructions. But this actual code uses 4 vectorized `fma`s in a row, so it processes 16 numbers per 7 instructions (not all instructions take the same time to process, but I'm ignoring that for now). This technique is called "loop unrolling."
At the very best, we could have `N+3` instructions unrolled to process `4N` numbers per actual loop iteration. Two things stand in the way. First, we only have so many registers (e.g. `ymm0`) to store numbers in the processor. If we run out of those, we get a "register spill" and have to start using the stack, which is part of main memory and will slow us down considerably. Second, this code block can only process numbers in blocks of 16. If we want to add up 60 numbers, we run this loop 3 times to cover the first 48 numbers but then have 12 numbers left over that we must handle. Usually, this "tail" is processed using a boring loop that only handles 1 number per iteration. Recall that this tail processes numbers about 9x slower than our unrolled section (1 number per 4 instructions versus 16 per 7), so those last 12 numbers can easily take twice as long as the first 48. Unrolling also takes more instructions, which can have negative effects on cache performance.
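A sketch of the full pattern, unrolled blocks plus tail (extending the hypothetical `sum4` from above):

```julia
# Unrolled blocks of 4, then a scalar tail loop for the leftovers.
function sum4_tail(x::Vector{Float64})
    n = length(x)
    last_block = n - n % 4            # e.g. 48 when n == 60
    s1 = s2 = s3 = s4 = 0.0
    @inbounds for i in 1:4:last_block
        s1 += x[i]; s2 += x[i + 1]; s3 += x[i + 2]; s4 += x[i + 3]
    end
    s = (s1 + s2) + (s3 + s4)
    @inbounds for i in (last_block + 1):n  # the slow, 1-per-iteration tail
        s += x[i]
    end
    return s
end
```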
So vectorized and unrolled loops can be very fast, but the more efficient they get, the bigger the tails they can leave, and those tails can be very slow to process. The compiler seems to like 4x unrolling for simple loops, so that's what I replicated here (16 = vectorize by 4 * unroll by 4).
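If you want to see the tail cost on your own machine, something like this works (assuming `BenchmarkTools` is installed; exact timings depend on your CPU):

```julia
using BenchmarkTools

x60 = rand(60)  # 48 "fast" numbers plus a 12-number tail
x48 = rand(48)  # no tail at all
@btime sum4_tail($x60)
@btime sum4_tail($x48)
```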