[ANN] LoopVectorization

I’ll try to start working on proper documentation, which will hopefully make @avx and its performance more transparent.
It’s worth noting here however that LoopVectorization uses a LoopSet struct as its internal representation of a set of loops.
This struct contains little more than a(n unordered!) collection of the loops, as well as a collection of operations performed in the loop. These operations contain their dependencies on other operations and on loops.

@avx on loops builds a LoopSet, which it then converts into an expression.
@avx on broadcasts swaps materialize(!) for vmaterialize(!), which is a generated function that creates a LoopSet, and then converts it into an expression.
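For concreteness, here is a minimal usage sketch of both entry points (the function and array names are placeholders, and this assumes a version of LoopVectorization where @avx accepts both loop and broadcast expressions):

```julia
using LoopVectorization

# @avx on a loop: the macro builds a LoopSet from this loop nest
# at macro-expansion time, then emits an optimized expression.
function mydot(a, b)
    s = zero(eltype(a))
    @avx for i ∈ eachindex(a)
        s += a[i] * b[i]
    end
    s
end

# @avx on a broadcast: materialize(!) is swapped for vmaterialize(!),
# whose generated function builds the LoopSet from the broadcast's type.
a = rand(100); b = rand(100)
c = @avx a .+ b
```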

The LoopSet does not contain information on the original order of the loops, or on where the operations were written (order still matters insofar as it defines dependencies between operations).

However, the orders in which the loops can eventually be placed can be restricted. Consider:

    for m ∈ 1:M, n ∈ 1:N
        Cmn = zero(eltype(C))
        for k ∈ 1:K
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end

Here, because Cmn is redefined as a constant for every (m, n) pair, that is taken as a hint that it depends on the m and n loops.
Right now, that forces the m and n loops to be exterior to wherever Cmn is ultimately placed.
Additionally, Cmn is then summed over k, so the k loop depends on Cmn and must be internal with respect to Cmn’s definition.

Thus it only considers two possible orders (exterior to interior): m, n, k and n, m, k.
One of these is hopefully pretty good.

If instead Cmn = C[m,n], it would not infer the restriction that the m and n loops must be external to that operation. It would instead assume that, if it reorders the loops, it will keep rereading the correct value from memory.
Similarly, if we want to reorder the loops ourselves, we have to replace the Cmn = zero(eltype(C)) with a C .= 0 external to all the loops, and switch to using Cmn = C[m,n] instead.

That is a transformation we are capable of, but that I haven’t implemented in LoopVectorization yet.
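Written out by hand, that transformation might look something like this (a sketch; the function name is made up):

```julia
# Reorderable by hand: the accumulator lives in C itself, so the k loop
# may legally be placed anywhere relative to the m and n loops.
function mulCAB_reorderable!(C, A, B)
    M, N = size(C); K = size(A, 2)
    C .= 0                      # zero C once, external to all the loops
    for k ∈ 1:K, n ∈ 1:N       # any ordering of k, n, m is now valid
        for m ∈ 1:M
            C[m,n] += A[m,k] * B[k,n]   # Cmn = C[m,n], updated in place
        end
    end
    C
end
```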

All that said, given the unrestricted version of the loop, what does @avx want to do with it?

julia> using LoopVectorization: LoopSet, choose_order
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]

julia> nkm_q = :(for n in 1:N, k in 1:K 
                       Bkn = B[k,n]
                       for m in 1:M
                               C[m,n] += A[m,k] * Bkn
                       end
               end);

julia> nkm_ls = LoopSet(nkm_q);

julia> choose_order(nkm_ls)
([:n, :m, :k], :m, 4, 4)

This happens to be one of the two possible orders of the more restricted version.
Note, the integers will change depending on your CPU architecture; the symbols probably won’t.

What this means is that the loop order it chose is (exterior to interior): n, m, k.
The second :m means that SIMD vectors are loaded from the m loop. That is, adjacent elements along the m axis will be loaded into adjacent SIMD lanes.
If a number doesn’t depend on the m loop (like B[k,n]), it will instead be broadcast across a vector, so each element is equal.
I call this the vectorized loop. The vectorized loop is unrolled by the vector width.

The reason that the nkm loop is faster with base Julia/@simd is because that allows the m loop to be vectorized, while the mnk loop cannot.

The @avx macro can load SIMD vectors along the m axis regardless of the loop order it actually chooses.

The two integers, in this case (U, T) = (4, 4), refer to how the loops are unrolled. If T == -1, that means the outermost loop is unrolled by U (or U * vector width, if it is also the vectorized loop).
Otherwise, the outermost loop is unrolled by T, and the second outermost loop is unrolled by U (or U * vector width).
What this unrolling means in this example is that it will:

  1. Create U*T vectors of Cmn.
  2. It will start looping over k. On each iteration, it will…
  3. Load U vectors from A.
  4. for t = 1,…,T, it will load an element from B, and update the U vectors from Cmn that correspond to that t (using the U vectors from A).
  5. Finish looping, and store the Cmn vectors into C.
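The steps above can be sketched as a scalar analogue for U = T = 2 (plain scalars stand in where the real kernel holds SIMD vectors, so the register-tiling pattern is visible without any intrinsics):

```julia
# One 2x2 tile of the microkernel; scalars stand in for SIMD vectors.
function kernel_2x2!(C, A, B, m, n)
    # 1. Create U*T accumulators for Cmn.
    c11 = c21 = c12 = c22 = zero(eltype(C))
    # 2. Loop over k; on each iteration...
    @inbounds for k ∈ axes(A, 2)
        # 3. U loads from A (contiguous vector loads in the real kernel).
        a1 = A[m,   k]
        a2 = A[m+1, k]
        # 4. For each t, load one element of B (a broadcast in the real
        #    kernel) and update the U accumulators for that t.
        b1 = B[k, n];   c11 = muladd(a1, b1, c11); c21 = muladd(a2, b1, c21)
        b2 = B[k, n+1]; c12 = muladd(a1, b2, c12); c22 = muladd(a2, b2, c22)
    end
    # 5. Finish looping, and store the accumulators into C.
    @inbounds begin
        C[m,n] = c11; C[m+1,n] = c21; C[m,n+1] = c12; C[m+1,n+1] = c22
    end
    C
end
```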

So on each iteration of k, you perform a total of U+T loads (U vector loads from A, T broadcasts to fill a register with a single value from B). However, you perform a total of U * T multiplications and additions, without any of the multiplications-and-additions depending on any others, so that they can all take advantage of out of order execution.
That is, it should be able to keep the CPU’s fma (fused multiply-add) units nearly 100% busy.

This requires a total of U*T registers for Cmn, U registers for A, and 1 register for B, or U*T + U + 1. For U = T = 4, that is 21 registers.
Computers with only AVX2 have 16 floating point registers per core, so choose_order should instead return U = 3; T = 4, yielding exactly 16.
It has to require fewer registers than the CPU has; otherwise it would have to do a bunch of extra loads and stores on each k iteration, cratering the compute-to-load ratio.
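The bookkeeping from the last two paragraphs, written as arithmetic:

```julia
# Per-k-iteration cost model for a U x T register tile.
loads(U, T) = U + T         # U vector loads from A + T broadcasts from B
fmas(U, T)  = U * T         # independent fused multiply-adds
regs(U, T)  = U*T + U + 1   # Cmn accumulators + A vectors + 1 B register

loads(4, 4), fmas(4, 4), regs(4, 4)  # (8, 16, 21): fine with AVX-512's 32 registers
regs(3, 4)                           # 16: fits AVX2's 16 registers exactly
```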

So, @PetrKryslUCSD, @avx does almost the same thing with either loop, because it doesn’t pay too much attention to the actual order you wrote them in. The reason mulCAB1! is slower is because C .= 0 is slower than not doing that, and then within the loops, loading values from C[m,n] is also slightly slower than simply assigning 0.

In both cases, it is able to vectorize the m loop, while @simd is only capable of doing that if the m loop is the most interior loop.

Wow, that’s fantastic!
Performance-wise, it is much simpler than trying to get a macro (which doesn’t see types!) to be able to do something fast.

The ratio of performance of your complex to real example was about 5.5.
I just released a new version of LoopVectorization (v"0.3.2") that improved the code generation for your mul_add_avx!, which now gives me a factor of about 4.2.
This is still behind the ideal 4, but much closer!
Mind rerunning your benchmarks?

The problem was that it did not realize it could place the load from C[i,j] in C[i,j] += factor * ΔCᵢⱼ after the loop, because I had naively been placing all loads prior to any interior loops.
This hurt because it then counted that as an additional occupied register (changing the values of U and T).
Adding a check to see if it is allowed to delay operations until after the loop, and then doing so if allowed, largely solved the problem.

OpenBLAS and MKL seem to have ratios of <4. IIRC, they default to 4m, so I’m guessing that’s because of BLAS overhead being amortized.

I wouldn’t say it aims to replace Cartesian, which aims to make it easier to write generic loops.
LoopVectorization aims to make loops you have written faster.

EDIT: I updated the opening post with more benchmarks.
