I’ve released LoopVectorization v0.8.1 today (and v0.8.0 yesterday). The 0.8 series introduced a breaking change: it now assumes that none of the iterables are empty. That means for i in 1:0; end is invalid, but for I in CartesianIndices(()); end is still fine:
julia> isempty(1:0)
true
julia> isempty(CartesianIndices(()))
false
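The asymmetry comes from how Julia defines these iterables: a zero-dimensional CartesianIndices contains exactly one element (the empty CartesianIndex()), while 1:0 contains none. A quick check in plain Julia, independent of LoopVectorization:

```julia
# CartesianIndices(()) is zero-dimensional: it contains exactly one element,
# the empty index CartesianIndex(), so a loop over it runs its body once.
ci = CartesianIndices(())
@assert length(ci) == 1
@assert first(ci) == CartesianIndex()

# 1:0, by contrast, is genuinely empty: a loop over it never runs its body.
@assert isempty(1:0)
@assert length(1:0) == 0
```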
If you want LoopVectorization to emit a check, you can pass check_empty=true, and it will only run the code if the iterables are in fact not empty:
julia> using LoopVectorization, Test
julia> function mysum_checked(x)
           s = zero(eltype(x))
           @avx check_empty=true for i ∈ eachindex(x)
               s += x[i]
           end
           s
       end
mysum_checked (generic function with 1 method)
julia> function mysum_unchecked(x)
           s = zero(eltype(x))
           @avx for i ∈ eachindex(x)
               s += x[i]
           end
           s
       end
mysum_unchecked (generic function with 1 method)
julia> x = fill(9999, 100, 10, 10);
julia> @test mysum_checked(x) == mysum_unchecked(x) == sum(x)
Test Passed
julia> xv = view(x, :, 1:0, :);
julia> @test iszero(mysum_checked(xv))
Test Passed
julia> @test iszero(mysum_unchecked(xv))
Test Failed at REPL[26]:1
Expression: iszero(mysum_unchecked(xv))
ERROR: There was an error during testing
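For intuition, check_empty=true behaves roughly as if the vectorized loop were wrapped in an isempty guard, falling through with the initial value when an iterable is empty. A sketch of that idea in plain Julia (this is illustrative, not the code the macro actually generates; an ordinary Julia loop doesn't need the guard, but it models the check the vectorized code requires):

```julia
# Illustrative sketch: guard the loop with an emptiness test, so the
# fast path (which assumes non-empty iterables) is never entered on an
# empty iterable. Not the actual code @avx emits.
function mysum_guarded(x)
    s = zero(eltype(x))
    if !isempty(x)          # the conceptual check that check_empty=true enables
        for i in eachindex(x)
            s += x[i]
        end
    end
    s
end

x  = fill(9999, 100, 10, 10)
xv = view(x, :, 1:0, :)     # empty view, as in the failing test above
@assert mysum_guarded(x) == sum(x)
@assert iszero(mysum_guarded(xv))
```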
Maybe checking should be the default, as the check takes extremely little time. But on Zulip, a few folks thought skipping the checks by default was reasonable.
LoopVectorization 0.8 had a major overhaul of how the resulting expression is emitted, focused on improving both performance and the quality of the generated assembly.
Matrix-multiplication performance is now very good:
While you’d still need to add extra loops and pack for a real matrix multiplication function, the raw macrokernel does extremely well on my system with AVX512.
[Brief aside: If you have any idea how to make all the colors easier to tell apart, please chime in on this issue or in this thread.]
I may have to tune it a bit for AVX2. I haven’t tested it there, but I think I am currently underestimating a cost for those architectures. Were I to make it more accurate, it may make a different decision that might improve performance there. “May”, “might” – I don’t know until I test.
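For context, the macrokernel benchmarked above is along the lines of the standard @avx GEMM loop nest from LoopVectorization’s README. Here is a sketch (the sizes below are arbitrary placeholders, and it assumes LoopVectorization is installed), not the benchmark code itself:

```julia
using LoopVectorization

# Raw macrokernel: no packing and no extra blocking loops, just the
# @avx loop nest. C, A, and B are assumed to have matching dimensions.
function gemm_kernel!(C, A, B)
    @avx for n ∈ axes(C, 2), m ∈ axes(C, 1)
        Cmn = zero(eltype(C))
        for k ∈ axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    C
end

M = K = N = 72                  # placeholder sizes
A = rand(M, K); B = rand(K, N); C = similar(A, M, N);
@assert gemm_kernel!(C, A, B) ≈ A * B
```

The inner loop over k is a dot-product accumulation; it is this nest that LoopVectorization unrolls and vectorizes into the broadcast/FMA pattern shown in the assembly below.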
Improvements in the assembly include eliminating a lot of redundant blocks and unnecessary addressing calculations.
Sample of the generated assembly on Julia master with LLVM 10 (disclaimer: there are still a few unnecessary instructions with LLVM 6, 8, and 9; I didn’t check 7 because Julia skipped from 6 to 8 with 1.3 → 1.4):
L1024:
prefetcht0 byte ptr [r11 + r14]
vbroadcastsd zmm30, qword ptr [r8]
vmovups zmm29, zmmword ptr [r11]
vmovups zmm28, zmmword ptr [r11 + 64]
vmovupd zmm27, zmmword ptr [r11 + 128]
prefetcht0 byte ptr [r11 + r14 + 64]
prefetcht0 byte ptr [r11 + r14 + 128]
vfmadd231pd zmm26, zmm30, zmm29 # zmm26 = (zmm30 * zmm29) + zmm26
vfmadd231pd zmm23, zmm30, zmm28 # zmm23 = (zmm30 * zmm28) + zmm23
vbroadcastsd zmm31, qword ptr [r8 + rdi]
vfmadd231pd zmm17, zmm30, zmm27 # zmm17 = (zmm30 * zmm27) + zmm17
vfmadd231pd zmm25, zmm31, zmm29 # zmm25 = (zmm31 * zmm29) + zmm25
vfmadd231pd zmm21, zmm31, zmm28 # zmm21 = (zmm31 * zmm28) + zmm21
vfmadd231pd zmm14, zmm31, zmm27 # zmm14 = (zmm31 * zmm27) + zmm14
vbroadcastsd zmm30, qword ptr [r8 + 2*rdi]
vfmadd231pd zmm24, zmm30, zmm29 # zmm24 = (zmm30 * zmm29) + zmm24
vfmadd231pd zmm19, zmm30, zmm28 # zmm19 = (zmm30 * zmm28) + zmm19
vfmadd231pd zmm11, zmm30, zmm27 # zmm11 = (zmm30 * zmm27) + zmm11
vbroadcastsd zmm30, qword ptr [r8 + r9]
vfmadd231pd zmm22, zmm30, zmm29 # zmm22 = (zmm30 * zmm29) + zmm22
vfmadd231pd zmm16, zmm30, zmm28 # zmm16 = (zmm30 * zmm28) + zmm16
vfmadd231pd zmm8, zmm30, zmm27 # zmm8 = (zmm30 * zmm27) + zmm8
vbroadcastsd zmm30, qword ptr [r8 + 4*rdi]
vfmadd231pd zmm20, zmm30, zmm29 # zmm20 = (zmm30 * zmm29) + zmm20
vfmadd231pd zmm13, zmm30, zmm28 # zmm13 = (zmm30 * zmm28) + zmm13
vfmadd231pd zmm6, zmm30, zmm27 # zmm6 = (zmm30 * zmm27) + zmm6
vbroadcastsd zmm30, qword ptr [r8 + r15]
vfmadd231pd zmm18, zmm30, zmm29 # zmm18 = (zmm30 * zmm29) + zmm18
vfmadd231pd zmm10, zmm30, zmm28 # zmm10 = (zmm30 * zmm28) + zmm10
vfmadd231pd zmm4, zmm30, zmm27 # zmm4 = (zmm30 * zmm27) + zmm4
vbroadcastsd zmm30, qword ptr [r8 + r12]
vfmadd231pd zmm15, zmm30, zmm29 # zmm15 = (zmm30 * zmm29) + zmm15
vfmadd231pd zmm7, zmm30, zmm28 # zmm7 = (zmm30 * zmm28) + zmm7
vbroadcastsd zmm31, qword ptr [r8 + rbp]
vfmadd231pd zmm2, zmm30, zmm27 # zmm2 = (zmm30 * zmm27) + zmm2
vfmadd231pd zmm12, zmm31, zmm29 # zmm12 = (zmm31 * zmm29) + zmm12
vfmadd231pd zmm5, zmm31, zmm28 # zmm5 = (zmm31 * zmm28) + zmm5
vfmadd231pd zmm1, zmm31, zmm27 # zmm1 = (zmm31 * zmm27) + zmm1
vbroadcastsd zmm30, qword ptr [r8 + 8*rdi]
vfmadd231pd zmm9, zmm30, zmm29 # zmm9 = (zmm30 * zmm29) + zmm9
vfmadd231pd zmm3, zmm30, zmm28 # zmm3 = (zmm30 * zmm28) + zmm3
vfmadd231pd zmm0, zmm30, zmm27 # zmm0 = (zmm30 * zmm27) + zmm0
add r11, r10
add r8, 8
cmp r11, rdx
jbe L1024
To see how far this has come, compare the above graphic with the opening post!