Yes, it does this.
Taking a matrix multiply example:
using LoopVectorization, BenchmarkTools, Cthulhu
# Cthulhu is for looking at the generated code
function AmulB!(C, A, B)
    # `indices` is like a broadcasted `axes` that checks they're equal
    @turbo for n = indices((C,B), 2), m = indices((C,A), 1)
        Cmn = zero(eltype(C))
        for k = indices((A,B), (2,1))
            Cmn += A[m,k]*B[k,n]
        end
        C[m,n] = Cmn
    end
end
M,K,N = 128,128,128; T=Float64;A=rand(T,M,K);B=rand(T,K,N);C0=A*B;C1=similar(C0);
t = @belapsed AmulB!($C1,$A,$B); 2e-9M*K*N/t # GFLOPS
C0 ≈ C1
@descend AmulB!(C1,A,B)
On an Intel 9940X (with AVX512), I get
julia> t = @belapsed AmulB!($C1,$A,$B); 2e-9M*K*N/t # GFLOPS
106.23870314083081
julia> C0 ≈ C1
true
This is a high percentage of peak flops, but only because the matrices are small: @turbo does register-level tiling only, with no cache-level blocking, so performance falls off once the arrays outgrow the caches.
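For context on what "peak" means here: a Skylake-X core has two 512-bit FMA units, i.e. 32 Float64 flops per cycle. A rough back-of-the-envelope check (the AVX-512 clock below is an assumed value, not a measurement; it throttles and varies by chip and load):

fma_units       = 2                        # 512-bit FMA ports per Skylake-X core
lanes           = 8                        # Float64 lanes per 512-bit register
flops_per_fma   = 2                        # one multiply + one add
flops_per_cycle = fma_units * lanes * flops_per_fma    # 32

avx512_ghz  = 3.8                          # assumed sustained AVX-512 clock, GHz
peak_gflops = avx512_ghz * flops_per_cycle # ≈ 121.6 single-core GFLOPS
106.2 / peak_gflops                        # ≈ 0.87 of that estimate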
Using Cthulhu to descend into _turbo_!, turn off debuginfo, and show native code, you'll see this inner loop:
L432:
vbroadcastsd zmm28, qword ptr [r8]
vmovupd zmm29, zmmword ptr [rdi]
vmovupd zmm30, zmmword ptr [rdi + 64]
vmovupd zmm27, zmmword ptr [rdi + 128]
prefetcht0 byte ptr [rdi + rbx]
prefetcht0 byte ptr [rdi + rbx + 64]
prefetcht0 byte ptr [rdi + rbx + 128]
vfmadd231pd zmm18, zmm29, zmm28 # zmm18 = (zmm29 * zmm28) + zmm18
vfmadd231pd zmm9, zmm30, zmm28 # zmm9 = (zmm30 * zmm28) + zmm9
vbroadcastsd zmm31, qword ptr [r8 + rax]
vfmadd231pd zmm0, zmm27, zmm28 # zmm0 = (zmm27 * zmm28) + zmm0
vfmadd231pd zmm19, zmm29, zmm31 # zmm19 = (zmm29 * zmm31) + zmm19
vfmadd231pd zmm10, zmm30, zmm31 # zmm10 = (zmm30 * zmm31) + zmm10
vfmadd231pd zmm1, zmm27, zmm31 # zmm1 = (zmm27 * zmm31) + zmm1
vbroadcastsd zmm28, qword ptr [r8 + 2*rax]
vfmadd231pd zmm20, zmm29, zmm28 # zmm20 = (zmm29 * zmm28) + zmm20
vfmadd231pd zmm11, zmm30, zmm28 # zmm11 = (zmm30 * zmm28) + zmm11
vfmadd231pd zmm2, zmm27, zmm28 # zmm2 = (zmm27 * zmm28) + zmm2
vbroadcastsd zmm28, qword ptr [r8 + rcx]
vfmadd231pd zmm21, zmm29, zmm28 # zmm21 = (zmm29 * zmm28) + zmm21
vfmadd231pd zmm12, zmm30, zmm28 # zmm12 = (zmm30 * zmm28) + zmm12
vfmadd231pd zmm3, zmm27, zmm28 # zmm3 = (zmm27 * zmm28) + zmm3
vbroadcastsd zmm28, qword ptr [r8 + 4*rax]
vfmadd231pd zmm22, zmm29, zmm28 # zmm22 = (zmm29 * zmm28) + zmm22
vfmadd231pd zmm13, zmm30, zmm28 # zmm13 = (zmm30 * zmm28) + zmm13
vfmadd231pd zmm4, zmm27, zmm28 # zmm4 = (zmm27 * zmm28) + zmm4
vbroadcastsd zmm28, qword ptr [r8 + r15]
vfmadd231pd zmm23, zmm29, zmm28 # zmm23 = (zmm29 * zmm28) + zmm23
vfmadd231pd zmm14, zmm30, zmm28 # zmm14 = (zmm30 * zmm28) + zmm14
vfmadd231pd zmm5, zmm27, zmm28 # zmm5 = (zmm27 * zmm28) + zmm5
vbroadcastsd zmm28, qword ptr [r8 + 2*rcx]
vfmadd231pd zmm24, zmm29, zmm28 # zmm24 = (zmm29 * zmm28) + zmm24
vfmadd231pd zmm15, zmm30, zmm28 # zmm15 = (zmm30 * zmm28) + zmm15
vbroadcastsd zmm31, qword ptr [r8 + r12]
vfmadd231pd zmm6, zmm27, zmm28 # zmm6 = (zmm27 * zmm28) + zmm6
vfmadd231pd zmm25, zmm29, zmm31 # zmm25 = (zmm29 * zmm31) + zmm25
vfmadd231pd zmm16, zmm30, zmm31 # zmm16 = (zmm30 * zmm31) + zmm16
vfmadd231pd zmm7, zmm27, zmm31 # zmm7 = (zmm27 * zmm31) + zmm7
vbroadcastsd zmm28, qword ptr [r8 + 8*rax]
vfmadd231pd zmm26, zmm28, zmm29 # zmm26 = (zmm28 * zmm29) + zmm26
vfmadd231pd zmm17, zmm28, zmm30 # zmm17 = (zmm28 * zmm30) + zmm17
vfmadd231pd zmm8, zmm27, zmm28 # zmm8 = (zmm27 * zmm28) + zmm8
add rdi, r11
add r8, 8
cmp rdi, r10
jbe L432
Note, this loop features:
a. 3 contiguous loads (vmovupd), each loading 8 contiguous elements. It effectively loads A[m .+ (1:24), k].
b. 9 broadcast loads (vbroadcastsd). This loads B[k, n .+ (1:9)]; each load fills a separate vector register.
c. 27 fma instructions. These perform the outer-product update C[m .+ (1:24), n .+ (1:9)] += A[m .+ (1:24), k] * B[k, n .+ (1:9)] (see the sketch after this list). Note we have 0 loads or stores to C within the innermost loop; those have been hoisted out.
d. A few miscellaneous instructions: prefetches to hide latency (the hardware prefetchers tend not to handle strided access patterns like matrix A's well), incrementing our loop counters/array pointers, a cmp to check whether we're done with the loop, and the conditional jump if not.
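In array notation, each trip through that loop is one rank-1 update of a 24x9 block of C held entirely in registers (a sketch of what the instructions compute, not code LV emits; Ctile is a hypothetical name standing in for the 27 zmm accumulators):

# One k iteration of the register tile; nothing here touches C's memory.
Ctile .+= A[m .+ (1:24), k] .* B[k, n .+ (1:9)]'   # 3 vector loads, 9 broadcasts, 27 fmas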
So it's doing a non-square tile, 24x9. This tile is non-square either way you look at it: by total number of elements (24x9) or by number of registers (3x9).
LV solves an optimization problem to come up with this solution. It's constrained by the number of architectural registers (32 vector registers with AVX512; the hardware generally has far more physical vector registers, e.g. 168 on the Skylake-X system I tested on, or 384 on the upcoming Zen 5, but only 32 have architectural names we can actually use in the assembly). The cost function tries to minimize the cost of the loads (and stores, but there aren't any here); contiguous loads are treated as more expensive than broadcast loads, which is why we end up with only 3 contiguous vs 9 broadcast. This consumes 31 = 3*9 + 3 + 1 vector registers. Because our unrollings are nested, we're forced to carry the inner-most unrolling (m) across the outer-most, which is why we pay the full cost of 3 registers for it. The register holding the outer-unrolled broadcast gets to be reused; you can see that LLVM reused zmm28 and zmm31 for this (it could've reused only one of the two, but there's no reason not to use all 32 registers here).
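To give a flavor of that tradeoff, here is a toy version of it (the weights and the search below are made up for illustration; LV's actual cost model is considerably more detailed): choose unroll factors (mu, nu) that minimize load cost per fma, subject to the tile fitting in the 32 architectural registers.

# Registers needed for an (mu, nu) tile: mu*nu accumulators for C,
# mu contiguous vectors of A, and 1 register for the current broadcast of B.
regs(mu, nu) = mu*nu + mu + 1

# Load cost per fma; the weights are illustrative, but contiguous loads are
# charged more than broadcast loads, as described above.
cost(mu, nu; w_contig = 2.0, w_bcast = 1.0) = (w_contig*mu + w_bcast*nu) / (mu*nu)

tiles = [(mu, nu) for mu in 1:8, nu in 1:16 if regs(mu, nu) <= 32]
argmin(t -> cost(t...), tiles)   # -> (3, 9) under these assumptions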
I have been working on cache modeling lately, and think I now have a reasonably good handle on it. With that model, a naive triple loop should in the future be able to do much better than Octavian.jl, which was guided by an extremely poor cache model and understanding of how caches work.
TLDR: LV does not unroll or vectorize the inner loop if it can help it. For a matrix multiply, it can help it (note that the inner k loop is neither unrolled nor vectorized). It uses a non-square tile of registers as fast memory.
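Schematically, the kernel it picks here has this shape (a hand-written sketch assuming 24 divides M and 9 divides N, not LV's actual generated code, which also handles the remainders):

function amulb_sketch!(C, A, B)            # hypothetical name, illustration only
    M, K, N = size(A, 1), size(A, 2), size(B, 2)
    for n in 1:9:N, m in 1:24:M            # n unrolled 9x; m vectorized and unrolled 3x
        acc = zeros(eltype(C), 24, 9)      # 27 register accumulators in the real code
        for k in 1:K                       # k: plain loop, neither unrolled nor vectorized
            acc .+= @view(A[m:m+23, k]) .* transpose(@view(B[k, n:n+8]))
        end
        C[m:m+23, n:n+8] .= acc            # stores to C hoisted out of the k loop
    end
    return C
end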