I’ll try to start working on proper documentation, which will hopefully make `@avx` and its performance more transparent.
It’s worth noting here however that LoopVectorization uses a `LoopSet` struct as its internal representation of a set of loops. This struct contains little more than a(n unordered!) collection of the loops, as well as a collection of operations performed in the loop. These operations contain their dependencies on other operations and on loops.

`@avx` on loops builds a `LoopSet`, which it then converts into an expression. `@avx` on broadcasts swaps `materialize(!)` for `vmaterialize(!)`, which is a generated function that creates a `LoopSet` and then converts it into an expression.
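To make the two entry points concrete, here is a minimal sketch of both uses (the array sizes and names are just for illustration):

```julia
using LoopVectorization

A = rand(100, 100); B = rand(100, 100);
C = similar(A); D = similar(A);

# `@avx` on loops: the macro builds a LoopSet from this loop nest
# and converts it into an expression.
@avx for m ∈ 1:100, n ∈ 1:100
    Cmn = zero(eltype(C))
    for k ∈ 1:100
        Cmn += A[m,k] * B[k,n]
    end
    C[m,n] = Cmn
end

# `@avx` on broadcasts: materialize! is swapped for vmaterialize!.
@avx @. D = A + B * C
```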
The `LoopSet` does not contain information on the original order of the loops, or on where the operations were written (order still matters insofar as it defines dependencies between operations). However, the orders the loops can eventually be placed in can be restricted. Writing:
```julia
for m ∈ 1:M, n ∈ 1:N
    Cmn = zero(eltype(C))
    for k ∈ 1:K
        Cmn += A[m,k] * B[k,n]
    end
    C[m,n] = Cmn
end
```
Here, because `Cmn` is redefined as a constant for every `m` and `n`, it is taken as a hint to depend on the `m` and `n` loops. Right now, that forces the `m` and `n` loops to be exterior to wherever `Cmn` is ultimately placed. Additionally, `Cmn` is then summed over `k`; the `k` loop thus depends on `Cmn`, and must be internal with respect to `Cmn`’s definition. Thus it only considers two possible orders (exterior to interior): `m`, `n`, `k` and `n`, `m`, `k`. One of these is hopefully pretty good.
If instead `Cmn = C[m,n]`, it would not infer the restriction that the `m` and `n` loops must be external to that operation. It will instead assume that if it reorders, it will keep rereading the correct value from memory. Similarly, if we want to reorder the loops ourselves, we have to replace the `Cmn = zero(eltype(C))` with a `C .= 0` external to all the loops, and switch to using `Cmn = C[m,n]` instead. That is a transformation we are capable of, but that I haven’t implemented in LoopVectorization yet.
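Concretely, the hand-transformed version would look something like this (a sketch, with `M`, `N`, `K` the matching dimensions):

```julia
C .= 0                      # zero-initialize outside all the loops
for m ∈ 1:M, n ∈ 1:N
    Cmn = C[m,n]            # read the accumulator back from memory
    for k ∈ 1:K
        Cmn += A[m,k] * B[k,n]
    end
    C[m,n] = Cmn
end
```

With `Cmn` defined by a load from `C[m,n]`, no loop is forced to be exterior to it, so the loops can be freely reordered.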
All that said, given the unrestricted version of the loop, what does `@avx` want to do with it?
```julia
julia> using LoopVectorization: LoopSet, choose_order
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]

julia> nkm_q = :(for n in 1:N, k in 1:K
           Bkn = B[k,n]
           for m in 1:M
               C[m,n] += A[m,k] * Bkn
           end
       end);

julia> nkm_ls = LoopSet(nkm_q);

julia> choose_order(nkm_ls)
([:n, :m, :k], :m, 4, 4)
```
This happens to be one of the two possible orders of the more restricted version. Note, the integers will change depending on your CPU architecture. The symbols probably won’t.
What this means is that the loop order it chose is (exterior to interior): `n`, `m`, `k`. The second element, `:m`, means that SIMD vectors are loaded from the `m` loop. That is, adjacent elements along the `m` axis will be loaded into adjacent SIMD lanes. If a number doesn’t depend on the `m` loop (like `B[k,n]`), it will instead be broadcast across a vector, so each element is equal. I call this the vectorized loop. The vectorized loop is unrolled by the vector width.
The reason that the `nkm` loop is faster with base Julia/`@simd` is that it allows the `m` loop to be vectorized, while the `mnk` loop cannot be. The `@avx` macro can load SIMD vectors along the `m` axis regardless of the loop order it actually chooses.
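For comparison, here is a sketch of such a base-Julia/`@simd` version, with the vectorizable `m` loop innermost (it assumes `C` has been zeroed beforehand):

```julia
# nkm order: @simd can only vectorize the innermost loop, so m goes inside.
function mul_nkm!(C, A, B)
    M, N = size(C)
    K = size(A, 2)
    @inbounds for n ∈ 1:N, k ∈ 1:K
        Bkn = B[k, n]
        @simd for m ∈ 1:M
            C[m, n] += A[m, k] * Bkn
        end
    end
    C
end
```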
The two integers, in this case `(U, T) = 4, 4`, refer to how the loops are unrolled. If `T == -1`, that means the outermost loop is unrolled by `U` (or `U` * vector width, if it is also the vectorized loop). Otherwise, the outermost loop is unrolled by `T`, and the second outermost loop is unrolled by `U` (or `U` * vector width).
What this unrolling means in this example is that it will (see the sketch after this list):

- Create `U*T` vectors of `Cmn`.
- Start looping over `k`. On each iteration, it will:
  - Load `U` vectors from `A`.
  - For `t = 1,…,T`, load an element from `B`, and update the `U` vectors of `Cmn` that correspond to that `t` (using the `U` vectors from `A`).
- Finish looping, and store the `Cmn` vectors into `C`.
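Here is a minimal scalar sketch of that microkernel, scaled down to `U = T = 2` for brevity (the names are illustrative; the real generated code uses SIMD vectors along `m` and, here, `U = T = 4`):

```julia
# Scalar model of the register-blocked microkernel: each c/a/b variable
# stands in for one register (a full SIMD vector in the real kernel).
function kernel_2x2!(C, A, B, m, n, K)
    z = zero(eltype(C))
    c11 = z; c21 = z; c12 = z; c22 = z   # U*T accumulators for Cmn
    @inbounds for k ∈ 1:K
        a1 = A[m,   k]                   # U loads from A
        a2 = A[m+1, k]
        b1 = B[k, n  ]                   # T loads from B (broadcasts, in the vector case)
        b2 = B[k, n+1]
        c11 = muladd(a1, b1, c11)        # U*T independent fmas per k iteration
        c21 = muladd(a2, b1, c21)
        c12 = muladd(a1, b2, c12)
        c22 = muladd(a2, b2, c22)
    end
    @inbounds begin                      # store the Cmn accumulators into C
        C[m, n  ] = c11; C[m+1, n  ] = c21
        C[m, n+1] = c12; C[m+1, n+1] = c22
    end
    C
end
```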
So on each iteration of `k`, you perform a total of `U + T` loads (`U` vector loads from `A`, `T` broadcasts to fill a register with a single value from `B`). However, you perform a total of `U * T` multiplications and additions, without any of the multiplications-and-additions depending on any others, so that they can all take advantage of out-of-order execution. That is, it should be able to keep the CPU’s fma (fused multiply-add) units nearly 100% busy.
This requires a total of `U*T` registers for `Cmn`, `U` registers for `A`, and 1 register for `B`, or `U*T + U + 1`. For `U = T = 4`, that is 21 registers. Computers with only AVX2 have 16 floating point registers per core, so `choose_order` should instead return `U = 3; T = 4`, yielding exactly 16. It has to require fewer registers than the CPU has, otherwise it would have to do a bunch of extra loads and stores on each `k` iteration, cratering the compute / load ratio.
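The register arithmetic is easy to check (this helper is just for illustration, not part of the library):

```julia
# Registers needed: U*T accumulators + U vectors from A + 1 from B.
registers_needed(U, T) = U*T + U + 1

registers_needed(4, 4)  # 21 -- too many for the 16 registers of AVX2
registers_needed(3, 4)  # 16 -- fits exactly
```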
So, @PetrKryslUCSD, `@avx` does almost the same thing with either loop, because it doesn’t pay too much attention to the actual order you wrote them in. The reason `mulCAB1!` is slower is that `C .= 0` is slower than not doing that, and then within the loops, loading values from `C[m,n]` is also slightly slower than simply assigning zero. In both cases, it is able to vectorize the `m` loop, while `@simd` is only capable of doing that if the `m` loop is the most interior loop.
Wow, that’s fantastic!
Performance-wise, it is much simpler than trying to get a macro (which doesn’t see types!) to be able to do something fast.
The performance ratio of your complex example to the real one was about 5.5. I just released a new version of LoopVectorization (v"0.3.2") that improved the code generation for your `mul_add_avx!`, which now gives me a factor of about 4.2. This is still behind the ideal 4, but much closer! Mind rerunning your benchmarks?
The problem was that it did not realize it could place the load from `C[i,j]` in `C[i,j] += factor * ΔCᵢⱼ` after the loop, because I had naively been placing all loads prior to any interior loops. This hurt because it then counted that as an additional occupied register (changing the values of `U` and `T`). Adding a check to see if it is allowed to delay operations until after the loop, and then doing so if allowed, largely solved the problem.
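To illustrate the pattern (a rough reconstruction; the exact body of `mul_add_avx!` may differ), the load of `C[i,j]` below is only needed once the `k` loop has finished:

```julia
# The accumulator ΔCᵢⱼ only needs C[i,j] after the k loop completes,
# so that load can be delayed until after the interior loop.
@avx for i ∈ 1:M, j ∈ 1:N
    ΔCᵢⱼ = zero(eltype(C))
    for k ∈ 1:K
        ΔCᵢⱼ += A[i,k] * B[k,j]
    end
    C[i,j] += factor * ΔCᵢⱼ
end
```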
OpenBLAS and MKL seem to have ratios of <4. IIRC, they default to 4m, so I’m guessing that’s because of BLAS overhead being amortized.
I wouldn’t say it aims to replace Cartesian, which aims to make it easier to write generic loops.
LoopVectorization aims to make loops you have written faster.
EDIT: I updated the opening post with more benchmarks.