[ANN] LoopVectorization

I’ve released LoopVectorization v0.8.1 today (and v0.8.0 yesterday).
The 0.8-series introduced a breaking change:

It now assumes that none of the iterables are empty. That means @avx for i in 1:0; end is invalid, but @avx for I in CartesianIndices(()); end is still fine, because the latter iterator is not actually empty:

julia> isempty(1:0)
true

julia> isempty(CartesianIndices(()))
false

If you want LoopVectorization to emit a check, you can pass check_empty=true, and it will only run the code if the iterables are in fact not empty.

julia> using LoopVectorization, Test

julia> function mysum_checked(x)
           s = zero(eltype(x))
           @avx check_empty=true for i ∈ eachindex(x)
               s += x[i]
           end
           s
       end
mysum_checked (generic function with 1 method)

julia> function mysum_unchecked(x)
           s = zero(eltype(x))
           @avx for i ∈ eachindex(x)
               s += x[i]
           end
           s
       end
mysum_unchecked (generic function with 1 method)

julia> x = fill(9999, 100, 10, 10);

julia> @test mysum_checked(x) == mysum_unchecked(x) == sum(x)
Test Passed

julia> xv = view(x, :, 1:0, :);

julia> @test iszero(mysum_checked(xv))
Test Passed

julia> @test iszero(mysum_unchecked(xv))
Test Failed at REPL[26]:1
  Expression: iszero(mysum_unchecked(xv))
ERROR: There was an error during testing

Maybe checking should be the default, as the check takes extremely little time. But a few folks on Zulip thought leaving the checks off was reasonable.
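
If you would rather not opt into check_empty, you can of course guard the call yourself; mysum_guarded below is just a hypothetical wrapper around the unchecked version defined above:

julia> mysum_guarded(x) = isempty(x) ? zero(eltype(x)) : mysum_unchecked(x)
mysum_guarded (generic function with 1 method)

julia> @test iszero(mysum_guarded(xv))
Test Passed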

LoopVectorization 0.8 also included a major overhaul of how the resulting expression is emitted, with a focus on improving both performance and the quality of the generated assembly.
Matrix-multiplication performance is now very good:

[matrix-multiplication benchmark plot]
While you’d still need to add extra (cache-blocking) loops and pack data for a real matrix multiplication function, the raw macrokernel does extremely well on my system with AVX512.
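
For reference, the kind of macrokernel being benchmarked is essentially a naive triple loop under @avx, roughly like this (a sketch of the shape of the kernel, not the exact benchmark code):

julia> using LoopVectorization

julia> function gemm_kernel!(C, A, B)
           @avx for m ∈ axes(A, 1), n ∈ axes(B, 2)
               Cmn = zero(eltype(C))
               for k ∈ axes(A, 2)
                   Cmn += A[m, k] * B[k, n]
               end
               C[m, n] = Cmn
           end
           C
       end
gemm_kernel! (generic function with 1 method)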

[Brief aside: If you have any idea how to make all the colors easier to tell apart, please chime in on this issue or in this thread.]

I may have to tune it a bit for AVX2. I haven’t tested it there, but I think I am currently underestimating a cost for those architectures. Were I to make it more accurate, it may make a different decision that might improve performance there. “May”, “might” – I don’t know until I test.

Improvements in the assembly include eliminating a lot of redundant blocks and unnecessary addressing calculations.
Here is a sample of the generated assembly on Julia master with LLVM 10 (disclaimer: there are still a few unnecessary instructions with LLVM 6, 8, and 9; I didn’t check LLVM 7 because Julia’s bundled LLVM jumped from 6 to 8 between versions 1.3 and 1.4):

L1024:
        prefetcht0      byte ptr [r11 + r14]
        vbroadcastsd    zmm30, qword ptr [r8]
        vmovups zmm29, zmmword ptr [r11]
        vmovups zmm28, zmmword ptr [r11 + 64]
        vmovupd zmm27, zmmword ptr [r11 + 128]
        prefetcht0      byte ptr [r11 + r14 + 64]
        prefetcht0      byte ptr [r11 + r14 + 128]
        vfmadd231pd     zmm26, zmm30, zmm29 # zmm26 = (zmm30 * zmm29) + zmm26
        vfmadd231pd     zmm23, zmm30, zmm28 # zmm23 = (zmm30 * zmm28) + zmm23
        vbroadcastsd    zmm31, qword ptr [r8 + rdi]
        vfmadd231pd     zmm17, zmm30, zmm27 # zmm17 = (zmm30 * zmm27) + zmm17
        vfmadd231pd     zmm25, zmm31, zmm29 # zmm25 = (zmm31 * zmm29) + zmm25
        vfmadd231pd     zmm21, zmm31, zmm28 # zmm21 = (zmm31 * zmm28) + zmm21
        vfmadd231pd     zmm14, zmm31, zmm27 # zmm14 = (zmm31 * zmm27) + zmm14
        vbroadcastsd    zmm30, qword ptr [r8 + 2*rdi]
        vfmadd231pd     zmm24, zmm30, zmm29 # zmm24 = (zmm30 * zmm29) + zmm24
        vfmadd231pd     zmm19, zmm30, zmm28 # zmm19 = (zmm30 * zmm28) + zmm19
        vfmadd231pd     zmm11, zmm30, zmm27 # zmm11 = (zmm30 * zmm27) + zmm11
        vbroadcastsd    zmm30, qword ptr [r8 + r9]
        vfmadd231pd     zmm22, zmm30, zmm29 # zmm22 = (zmm30 * zmm29) + zmm22
        vfmadd231pd     zmm16, zmm30, zmm28 # zmm16 = (zmm30 * zmm28) + zmm16
        vfmadd231pd     zmm8, zmm30, zmm27 # zmm8 = (zmm30 * zmm27) + zmm8
        vbroadcastsd    zmm30, qword ptr [r8 + 4*rdi]
        vfmadd231pd     zmm20, zmm30, zmm29 # zmm20 = (zmm30 * zmm29) + zmm20
        vfmadd231pd     zmm13, zmm30, zmm28 # zmm13 = (zmm30 * zmm28) + zmm13
        vfmadd231pd     zmm6, zmm30, zmm27 # zmm6 = (zmm30 * zmm27) + zmm6
        vbroadcastsd    zmm30, qword ptr [r8 + r15]
        vfmadd231pd     zmm18, zmm30, zmm29 # zmm18 = (zmm30 * zmm29) + zmm18
        vfmadd231pd     zmm10, zmm30, zmm28 # zmm10 = (zmm30 * zmm28) + zmm10
        vfmadd231pd     zmm4, zmm30, zmm27 # zmm4 = (zmm30 * zmm27) + zmm4
        vbroadcastsd    zmm30, qword ptr [r8 + r12]
        vfmadd231pd     zmm15, zmm30, zmm29 # zmm15 = (zmm30 * zmm29) + zmm15
        vfmadd231pd     zmm7, zmm30, zmm28 # zmm7 = (zmm30 * zmm28) + zmm7
        vbroadcastsd    zmm31, qword ptr [r8 + rbp]
        vfmadd231pd     zmm2, zmm30, zmm27 # zmm2 = (zmm30 * zmm27) + zmm2
        vfmadd231pd     zmm12, zmm31, zmm29 # zmm12 = (zmm31 * zmm29) + zmm12
        vfmadd231pd     zmm5, zmm31, zmm28 # zmm5 = (zmm31 * zmm28) + zmm5
        vfmadd231pd     zmm1, zmm31, zmm27 # zmm1 = (zmm31 * zmm27) + zmm1
        vbroadcastsd    zmm30, qword ptr [r8 + 8*rdi]
        vfmadd231pd     zmm9, zmm30, zmm29 # zmm9 = (zmm30 * zmm29) + zmm9
        vfmadd231pd     zmm3, zmm30, zmm28 # zmm3 = (zmm30 * zmm28) + zmm3
        vfmadd231pd     zmm0, zmm30, zmm27 # zmm0 = (zmm30 * zmm27) + zmm0
        add     r11, r10
        add     r8, 8
        cmp     r11, rdx
        jbe     L1024
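
If you want to inspect the assembly of your own @avx loops, the usual REPL tooling works; for example, with the hypothetical gemm_kernel! sketch from above (what you get will of course depend on your CPU and LLVM version):

julia> A = rand(144, 144); B = rand(144, 144); C = similar(A);

julia> @code_native debuginfo=:none gemm_kernel!(C, A, B)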

To see how far this has come, compare the above graphic with the opening post!
