[ANN] LoopVectorization

That “dude” was me :stuck_out_tongue:. One of two contributions I made to MatlabFileExchange before discovering Julia.

The version in Colors.jl is much better than the Matlab one (and yes I had a hand in it, but the person who made it truly awesome was @dcjones).

@Elrod, your progress with LoopVectorization is mind-blowing!

32 Likes

MLIR is LLVM’s way of raising the white flag and saying it’s not the right level to do this kind of stuff either. LoopVectorization has a bright future, hopefully integrated into Julia’s compiler at some point. That might allow it to operate at a higher level and leave the architecture-specific stuff to LLVM.

Alternatively we could embrace MLIR, but Chris’ progress with LV is such that we might do better by having control over this stuff ourselves. @vchuravy is diving deep with MLIR, so we should have both approaches covered and will be able to make an informed decision.

22 Likes

And at that time I made it into GMT, but I don’t remember anymore where (deep in it) it landed. :smile:

I can confirm that LoopVectorization generates an essentially perfect microkernel for Haswell on Julia 1.4, too.

Julia code:

julia> function mymul!(C, A, B, α, β)
           @avx for m in axes(A, 1), n in axes(B, 2)
               ABmn = zero(eltype(C))
               for k in axes(A, 2)
                   ABmn += A[m, k] * B[k, n]
               end
               Cmn = C[m, n]
               C[m, n] = α * ABmn + β * Cmn
           end
           return C
       end

Hot loop from the generated assembly:

L784:
        vbroadcastsd    -8(%rcx,%rdx,8), %ymm2
        vmovupd -64(%r11), %ymm3
        vmovdqu -32(%r11), %ymm0
        vmovdqu (%r11), %ymm1
        vfmadd231pd     %ymm3, %ymm2, %ymm15 ## ymm15 = (ymm2 * ymm3) + ymm15
        vfmadd231pd     %ymm0, %ymm2, %ymm14 ## ymm14 = (ymm2 * ymm0) + ymm14
        vfmadd231pd     %ymm1, %ymm2, %ymm13 ## ymm13 = (ymm2 * ymm1) + ymm13
        vbroadcastsd    -8(%rsi,%rdx,8), %ymm2
        vfmadd231pd     %ymm3, %ymm2, %ymm12 ## ymm12 = (ymm2 * ymm3) + ymm12
        vfmadd231pd     %ymm0, %ymm2, %ymm11 ## ymm11 = (ymm2 * ymm0) + ymm11
        vfmadd231pd     %ymm1, %ymm2, %ymm10 ## ymm10 = (ymm2 * ymm1) + ymm10
        vbroadcastsd    -8(%r13,%rdx,8), %ymm2
        vfmadd231pd     %ymm3, %ymm2, %ymm4 ## ymm4 = (ymm2 * ymm3) + ymm4
        vfmadd231pd     %ymm0, %ymm2, %ymm9 ## ymm9 = (ymm2 * ymm0) + ymm9
        vfmadd231pd     %ymm1, %ymm2, %ymm8 ## ymm8 = (ymm2 * ymm1) + ymm8
        vbroadcastsd    -8(%rbp,%rdx,8), %ymm2
        vfmadd231pd     %ymm3, %ymm2, %ymm7 ## ymm7 = (ymm2 * ymm3) + ymm7
        vfmadd231pd     %ymm0, %ymm2, %ymm6 ## ymm6 = (ymm2 * ymm0) + ymm6
        vfmadd231pd     %ymm1, %ymm2, %ymm5 ## ymm5 = (ymm2 * ymm1) + ymm5
        addq    $1, %rdx
        addq    %rdi, %r11
        cmpq    %rax, %rdx
        jl      L784
5 Likes

I’m happy with that. I’d like to make it more robust.
As an extreme stop-gap, it would also be possible to define check_args to simply return false (regardless of what the arguments are) so that it uses an @inbounds @fastmath fallback loop instead.
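
In rough terms, that stop-gap would amount to the package defining something like this (a hypothetical sketch, not the actual source):

check_args(args...) = false  # unconditionally report the arguments as unsupported, so @avx always takes the fallback loop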

This may be the better approach for those platforms, since the cost modeling currently assumes more or less the same instruction costs regardless of architecture.
Things have been reasonably consistent since Haswell, and Ryzen seems close enough as well.

I wanted to describe how it works, but yes, in those examples 100% of the benefit comes from the fallback loops.
Dot product:

julia> function mydot(a, b)
       s = zero(eltype(a))
       @inbounds @simd for i ∈ eachindex(a,b)
       s += a[i]*b[i]
       end
       s
       end
mydot (generic function with 1 method)

julia> using LoopVectorization, BenchmarkTools

julia> function mydotavx(a, b)
       s = zero(eltype(a))
       @avx for i ∈ eachindex(a,b)
       s += a[i]*b[i]
       end
       s
       end
mydotavx (generic function with 1 method)

julia> x128 = rand(128); y128 = rand(128);

julia> @btime mydot($x128, $y128)
  7.993 ns (0 allocations: 0 bytes)
31.004598810919788

julia> @btime mydotavx($x128, $y128)
  8.737 ns (0 allocations: 0 bytes)
31.004598810919788

julia> code_native(mydot, Tuple{Vector{Float64},Vector{Float64}}, syntax=:intel, debuginfo=:none)

julia> code_native(mydotavx, Tuple{Vector{Float64},Vector{Float64}}, syntax=:intel, debuginfo=:none)

Let’s look at the hot loops. mydot (@inbounds @simd):

L128:
        vmovupd zmm4, zmmword ptr [rcx + 8*rdi]
        vmovupd zmm5, zmmword ptr [rcx + 8*rdi + 64]
        vmovupd zmm6, zmmword ptr [rcx + 8*rdi + 128]
        vmovupd zmm7, zmmword ptr [rcx + 8*rdi + 192]
        vfmadd231pd     zmm0, zmm4, zmmword ptr [rdx + 8*rdi] # zmm0 = (zmm4 * mem) + zmm0
        vfmadd231pd     zmm1, zmm5, zmmword ptr [rdx + 8*rdi + 64] # zmm1 = (zmm5 * mem) + zmm1
        vfmadd231pd     zmm2, zmm6, zmmword ptr [rdx + 8*rdi + 128] # zmm2 = (zmm6 * mem) + zmm2
        vfmadd231pd     zmm3, zmm7, zmmword ptr [rdx + 8*rdi + 192] # zmm3 = (zmm7 * mem) + zmm3
        add     rdi, 32
        cmp     rsi, rdi
        jne     L128

mydotavx (@avx):

L112:
        vmovupd zmm4, zmmword ptr [rdx + 8*rdi]
        vmovupd zmm5, zmmword ptr [rdx + 8*rdi + 64]
        vmovupd zmm6, zmmword ptr [rdx + 8*rdi + 128]
        vmovupd zmm7, zmmword ptr [rdx + 8*rdi + 192]
        vfmadd231pd     zmm0, zmm4, zmmword ptr [rax + 8*rdi] # zmm0 = (zmm4 * mem) + zmm0
        vfmadd231pd     zmm3, zmm5, zmmword ptr [rax + 8*rdi + 64] # zmm3 = (zmm5 * mem) + zmm3
        vfmadd231pd     zmm1, zmm6, zmmword ptr [rax + 8*rdi + 128] # zmm1 = (zmm6 * mem) + zmm1
        vfmadd231pd     zmm2, zmm7, zmmword ptr [rax + 8*rdi + 192] # zmm2 = (zmm7 * mem) + zmm2
        add     rdi, 32
        cmp     rdi, rcx
        jl      L112

There are two primary differences I noticed in what happens when the hot loop runs:

  1. The former is aligned to a 32-byte boundary, and the latter to a 16-byte boundary. IIRC, my instruction decoder works in 32-byte chunks, but this is probably okay.
  2. Once the jump to the start of the loop isn’t taken, the first example immediately sums the vector. The LoopVectorization version instead has to determine the correct cleanup approach and branch accordingly.

Do either of these points support:

I don’t think so, but the former (to the extent it matters) may be evidence that @llvm.expect is not enough to communicate which loop is hot.
Point 2 is thanks to LLVM’s poor tail-handling. If that improved, this advantage would likely disappear.

We pick the same unrolling approach there, but LLVM basically always picks either 1 or 4, which sounds more like a heuristic than a model. Not knowing the internals, I can’t say for sure.
LoopVectorization uses the latency vs. throughput of the inner loop to decide.
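
To make the idea concrete, here is a rough sketch of that calculation with assumed Skylake-like instruction characteristics (the numbers and variable names are illustrative, not LoopVectorization’s internals):

# A chained reduction needs roughly latency / reciprocal-throughput independent
# accumulators to keep the FMA units busy.
fma_latency = 4                # cycles before an FMA result can be reused
fma_recip_throughput = 0.5     # two FMA ports => one result every 0.5 cycles
naccumulators = ceil(Int, fma_latency / fma_recip_throughput)  # == 8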

Other examples, like just a sum or a self-dot, are cases where LLVM also picks 4, but LoopVectorization picks 8. It takes too long for that to pay off:

julia> @btime mysum($x128)
  6.043 ns (0 allocations: 0 bytes)
61.21314858020399

julia> @btime mysumavx($x128)
  7.227 ns (0 allocations: 0 bytes)
61.213148580204

julia> x256 = rand(256); y256 = rand(256);

julia> @btime mysum($x256)
  9.175 ns (0 allocations: 0 bytes)
122.63179874691728

julia> @btime mysumavx($x256)
  9.570 ns (0 allocations: 0 bytes)
122.6317987469173

julia> x512 = rand(512); y512 = rand(512);

julia> @btime mysum($x512)
  16.802 ns (0 allocations: 0 bytes)
245.33365652096614

julia> @btime mysumavx($x512)
  14.459 ns (0 allocations: 0 bytes)
245.33365652096614

So I should probably get it to adjust this behavior, as it won’t be much longer before it becomes memory bound. Maybe they factor memory bandwidth into it.

Come again?

julia> X128 = rand(128, 128); x128discontig = view(X128, 1, :);

julia> @btime mydot($x128discontig, $y128)
  99.258 ns (0 allocations: 0 bytes)
31.142610557563987

julia> @btime mydotavx($x128discontig, $y128)
  69.848 ns (0 allocations: 0 bytes)
31.14261055756398

What I think LLVM does here is emit a check for whether x128discontig is actually contiguous, vectorize if it is, and give up if it is not.

julia> X2 = rand(1,128); xactuallycontig = view(X2, 1, :);

julia> @btime mydotavx($xactuallycontig, $y128)
  30.494 ns (0 allocations: 0 bytes)
32.523836125552

julia> @btime mydot($xactuallycontig, $y128)
  15.140 ns (0 allocations: 0 bytes)
32.52383612555201

I believe it’s slower than when using x128 because of this issue, but I am not sure.

LoopVectorization does not do loop versioning.
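
For reference, a minimal Julia-level sketch of what loop versioning means here (illustrative only; this is neither LLVM’s actual transformation nor anything LoopVectorization emits): a runtime check on the strides selects between a unit-stride fast path and a generic fallback.

# Illustrative loop versioning, assuming strided vectors so `stride` is defined.
function dot_versioned(a, b)
    s = zero(eltype(a))
    if stride(a, 1) == 1 && stride(b, 1) == 1     # runtime check
        @inbounds @simd for i in eachindex(a, b)  # fast unit-stride version
            s += a[i] * b[i]
        end
    else                                          # give up: plain scalar loop
        @inbounds for i in eachindex(a, b)
            s += a[i] * b[i]
        end
    end
    return s
end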

Which approach to handling unknown stride do you prefer:

  1. Assume it is not one, and optimize that case.
  2. Emit a check, make it fast if it is 1, give up otherwise.

I think 1. is obviously a better choice.
Folks shouldn’t use views that have an Int as the first index if they don’t have to. They should vec or reshape as necessary.
I’d bet that in the large majority of calls featuring views with dynamic leading strides, the view will in fact be discontiguous; otherwise, the caller is making a mistake.
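
For example, echoing the earlier X2 = rand(1, 128) case (a hedged illustration; the variable names are just for this sketch):

X2 = rand(1, 128)
xview = view(X2, 1, :)   # leading stride is dynamic (it just happens to be 1)
xvec  = vec(X2)          # a Vector{Float64}: unit stride, known at compile time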

I am skeptical of that claim. It struggles with things as basic as pointer offsetting, unless I generate code lining up with what I want the asm to look like.

Okay. No edge effects, we’ll make everything a clean 128:

using BenchmarkTools, LoopVectorization
function mygemm!(C, A, B)
    @inbounds @fastmath for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
        Cmn = zero(eltype(C))
        for k ∈ 1:size(A,2)
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end

function mygemmavx!(C, A, B)
    @avx for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
        Cmn = zero(eltype(C))
        for k ∈ 1:size(A,2)
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end

M = K = N = 128;

C1 = Matrix{Float64}(undef, M, N); A = randn(M, K); B = randn(K, N);
C2 = similar(C1); C3 = similar(C1);

This yields:

julia> @benchmark mygemmavx!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     34.779 μs (0.00% GC)
  median time:      34.910 μs (0.00% GC)
  mean time:        34.965 μs (0.00% GC)
  maximum time:     100.442 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mygemm!($C2, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.548 ms (0.00% GC)
  median time:      1.552 ms (0.00% GC)
  mean time:        1.552 ms (0.00% GC)
  maximum time:     1.636 ms (0.00% GC)
  --------------
  samples:          3222
  evals/sample:     1

Over 44 times faster.

I was actually a little dishonest here. While LLVM wasn’t suffering from edge effects in this example, LoopVectorization actually was because its modeling picked different unrolling factors (not 32).

But we can help LLVM do a little better by reordering loops ourselves:

julia> function mygemm_better_order!(C, A, B)
           fill!(C, 0)
           @inbounds @fastmath for k ∈ axes(B,1), n ∈ axes(C,2), m ∈ axes(C,1)
               C[m,n] += A[m,k] * B[k,n]
           end
       end
mygemm_better_order! (generic function with 1 method)

julia> @benchmark mygemm_better_order!($C2, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.072 ms (0.00% GC)
  median time:      1.096 ms (0.00% GC)
  mean time:        1.096 ms (0.00% GC)
  maximum time:     1.126 ms (0.00% GC)
  --------------
  samples:          4562
  evals/sample:     1

Now it’s only 30 times slower.

Now, let’s talk a little about memory layouts again.

julia> @benchmark mygemm_better_order!($C2, $A', $B')
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.303 ms (0.00% GC)
  median time:      1.346 ms (0.00% GC)
  mean time:        1.340 ms (0.00% GC)
  maximum time:     1.432 ms (0.00% GC)
  --------------
  samples:          3732
  evals/sample:     1

julia> @benchmark mygemmavx!($C2, $A', $B')
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     41.487 μs (0.00% GC)
  median time:      42.472 μs (0.00% GC)
  mean time:        42.487 μs (0.00% GC)
  maximum time:     103.311 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

LoopVectorization suffered less than a 20% performance degradation from these transposes, vs the 30% degradation of the LLVM version.
Given that LoopVectorization had a lot more to lose here from vectorization failure, but lost less in % performance anyway (stating the obvious: it still vectorized the loops)…

23 Likes

What do you consider to be the immediate term? Do you believe that fixes to these problems will appear in LLVM within a year or two? I think even if LLVM were to improve to a point where this package was totally unnecessary by one year from now, this effort would be far from wasted.

Even if LLVM can produce the same loops, doing so can be tricky and Julia doesn’t always make it easy to communicate exactly what we want to LLVM.

In general, I’m quite skeptical of the idea of relying on LLVM to do these sorts of optimizations well. We have very little control over the development and priorities of the LLVM team, and we have very few people willing and able to contribute directly to LLVM.

I’d much rather have something that we have direct control of and access to from the Julia compiler, which is why it’d be valuable to have things like Add `AbstractInterpreter` to parameterize compilation pipeline by staticfloat · Pull Request #33955 · JuliaLang/julia · GitHub, so that LoopVectorization.jl can be implemented at a level that’s less awkward than the current macro + generated function approach.

6 Likes

Hmm, how is MLIR related?

Also, integrating into the compiler after all the type inference is done is something very different. I do believe we want more optimization there, mainly because of memory model differences. However, for all the other ones, I don’t see why it is harder to pass information from arbitrary Julia types to the LoopVectorizer than to LLVM. All the information mentioned below is information that LLVM is clearly aware of, but it fails to make the right decisions based on that information. And that’s what I meant by doing the thing at the right level. At least from the examples currently given, the pointer arithmetic is all very simple, and AFAICT in no case is LLVM generating worse code due to an inability to figure out what the code is doing.

What do you mean by pointer offset?

OK, so yes, this is another case, in addition to the edge handling. I’m not surprised at all that LLVM struggles with strided memory access. I’m pretty disappointed by LLVM’s AVX512 performance. I don’t have the hardware yet, so I’m not worrying too much. However, again, I don’t see how LLVM is missing information here. It can even figure out at runtime whether the stride is 1, so I believe figuring out a different stride isn’t going to be a problem.

By wasting effort (or whatever I said above) I did not mean wasting time figuring out the best code to generate for each case. That is the good part of the workaround; in fact, since that is exactly where LLVM’s problems are, it is the workaround itself. What I mean is that,

  1. Bending over backwards for the limitations of doing this at the current level shouldn’t be encouraged.
  2. The lessons learned here (which are all about code generation, not about figuring out information, AFAICT) should be reported to LLVM, since they should be of general interest. Otherwise this is going to be duplicated effort.
  3. Related to the previous point: there are existing efforts on this for sure. I assume Polly has something to say about this, and MLIR would have at least something to say regarding heterogeneous architectures, though I’m not convinced it fundamentally helps on traditional CPUs.

Again, in general I agree, hence I’m not proposing removing optimizations from type inference. For vectorization on simple types, I have not seen such a case yet.

And again, this is another statement I would agree with in general. However, I strongly believe vectorized loops are an area that many major groups using LLVM are interested in, so I don’t really see a conflict of interest here. Contributing directly to LLVM is also not necessary; we have many passes ourselves and we can certainly add more.

Yes, that WILL BE the right level. If this were implemented as a compiler stage AFTER type inference, as I already mentioned above, I’d have very little issue with it, except that many lessons should still be communicated to LLVM. It is exactly the current implementation, and the intrinsic limitations one has to work with there, that I’m talking about (point 1 above).

3 Likes

Let me start by saying that I definitely agree with you (and Mason, and Tim Holy, and everyone else I’ve talked about this with).
The nice thing about the generated function approach is that it does work at the level of a lot of Julia-abstractions we’re all familiar with reasoning about. But I’m sure we can preserve that.

And there’s a lot more than “the nice thing” to be gained by doing it at a lower level.
Because of course you’re right that there is more information available. It would be easier to support more types than simply StridedArray{T} where {T <: Base.HWReal} and a few extra additions like Adjoint and PermutedDimsArray without bailing out. Just being able to inline / perform scalar replacement of aggregates as necessary to turn code into operations on native types would be a boon.
Currently, LoopVectorization uses check_args to bail out on (ideally) anything violating those assumptions:

julia> using LoopVectorization

julia> x = rand(Complex{Float64}, 4); y = rand(4);

julia> LoopVectorization.check_args(x), LoopVectorization.check_args(y)
(false, true)

But compile times would also be much better. A 32-bit build of Julia runs LoopVectorization’s test suite several times faster than the 64-bit build, because the 32-bit build only has 8 floating point registers to work with*. That severely limits the amount of unrolling it can do without spilling registers – and that “severely” reduces the amount of work Julia and LLVM have to do to compile.
Inference should return the same result for everything that gets unrolled, and there are obvious patterns in the indices that the compiler shouldn’t have to work to unravel…

  • But I’m not really sure why. It strikes me as weird to see zmm registers used, but only zmm0 through zmm7.

Three examples:

  1. Matrix multiplication
    Before I made the changes, the assembly for the loop body here had a large number of addq, leaq, and sometimes even movq instructions to calculate the addresses of the broadcasted elements.

In the code generated now, the rdi register contains the stride of the array B in bytes, while r9 = 3 * rdi.
The loads from B now look like:

vbroadcastsd    zmm30, qword ptr [r8] # Load from B[k,n]
vbroadcastsd    zmm31, qword ptr [r8 + rdi] # Load from B[k,n+1]
vbroadcastsd    zmm30, qword ptr [r8 + 2*rdi] # Load from B[k,n+2]
vbroadcastsd    zmm30, qword ptr [r8 + r9] # Load from B[k,n+3]
vbroadcastsd    zmm30, qword ptr [r8 + 4*rdi] # Load from B[k,n+4]
vbroadcastsd    zmm30, qword ptr [r8 + r15] # Load from B[k,n+5]
vbroadcastsd    zmm30, qword ptr [r8 + r12] # Load from B[k,n+6]
vbroadcastsd    zmm31, qword ptr [r8 + rbp] # Load from B[k,n+7]
vbroadcastsd    zmm30, qword ptr [r8 + 8*rdi] # Load from B[k,n+8]

Before, it was closer to this instead:

vbroadcastsd    zmm30, qword ptr [r8] # Load from B[k,n]
r8 += rdi
vbroadcastsd    zmm31, qword ptr [r8] # Load from B[k,n+1]
r8 += rdi
vbroadcastsd    zmm30, qword ptr [r8] # Load from B[k,n+2]
r8 += rdi
vbroadcastsd    zmm30, qword ptr [r8] # Load from B[k,n+3]
r8 += rdi
vbroadcastsd    zmm30, qword ptr [r8] # Load from B[k,n+4]
r8 += rdi
vbroadcastsd    zmm30, qword ptr [r8] # Load from B[k,n+5]
r8 += rdi
vbroadcastsd    zmm30, qword ptr [r8] # Load from B[k,n+6]
r8 += rdi
vbroadcastsd    zmm31, qword ptr [r8] # Load from B[k,n+7]
r8 += rdi
vbroadcastsd    zmm30, qword ptr [r8] # Load from B[k,n+8]

What it looks like now still isn’t perfect. I think the B[k,n+6] line should be:

vbroadcastsd    zmm30, qword ptr [r8 + 2*r9] # Load from B[k,n+6]

But at least it isn’t doing these calculations within the loop anymore.

LLVM 8 and 9 also needed me to actually inttoptr the pointers into i16*, i32*, and i64* and use these as the base for getelementptr operations to get the 2*, 4*, and 8* to appear. LLVM 10 did this on its own. I don’t think I checked LLVM 6 before the fix.

  2. Convolutions
using OffsetArrays, Cthulhu, LoopVectorization
Cthulhu.CONFIG.asm_syntax = :intel
function avxgeneric!(out, A, kern)
    @avx for I in CartesianIndices(out)
        tmp = zero(eltype(out))
        for J in CartesianIndices(kern)
            tmp += A[I+J]*kern[J]
        end
        out[I] = tmp
    end
    out
end
A = rand(100, 100);
kern = OffsetArray(rand(3, 3), -1:1, -1:1);
out = OffsetArray(view(similar(A, size(A) .+ 32), (1:98) .+ 32, (1:98) .+ 32), 1, 1); 
@descend avxgeneric!(out, A, kern)

For a while, it was generating code like this (sorry for the AT&T syntax, this is an old copy/paste before I switched; memory addressing is much easier to read IMO with Intel syntax):

L1824:
        leaq    (%r12,%rdi), %rbx
        vmovupd (%r15,%rdi,8), %zmm9
        vmovupd (%rax,%rdi,8), %zmm10
        vbroadcastsd    (%rcx,%rdi,8), %zmm11
        vfmadd231pd     (%rbp,%rdi,8), %zmm11, %zmm5 # zmm5 = (zmm11 * mem) + zmm5
        vbroadcastsd    (%rsi,%rdi,8), %zmm12
        vfmadd231pd     %zmm9, %zmm12, %zmm5 # zmm5 = (zmm12 * zmm9) + zmm5
        vbroadcastsd    (%r14,%rdi,8), %zmm13
        vfmadd231pd     %zmm10, %zmm13, %zmm5 # zmm5 = (zmm13 * zmm10) + zmm5
        vmovupd (%rdx,%rdi,8), %zmm14
        vfmadd231pd     %zmm9, %zmm11, %zmm7 # zmm7 = (zmm11 * zmm9) + zmm7
        vfmadd231pd     %zmm10, %zmm12, %zmm7 # zmm7 = (zmm12 * zmm10) + zmm7
        vfmadd231pd     %zmm14, %zmm13, %zmm7 # zmm7 = (zmm13 * zmm14) + zmm7
        vpbroadcastq    %rbx, %ymm9
        vpaddq  %ymm8, %ymm9, %ymm9
        vmovq   %xmm9, %rbx
        vmovupd (%r13,%rbx,8), %zmm15
        vfmadd231pd     %zmm10, %zmm11, %zmm6 # zmm6 = (zmm11 * zmm10) + zmm6
        vfmadd231pd     %zmm14, %zmm12, %zmm6 # zmm6 = (zmm12 * zmm14) + zmm6
        vpextrq $1, %xmm9, %rbx
        vfmadd231pd     %zmm15, %zmm13, %zmm6 # zmm6 = (zmm13 * zmm15) + zmm6
        vmovupd (%r13,%rbx,8), %zmm10
        vfmadd231pd     %zmm14, %zmm11, %zmm4 # zmm4 = (zmm11 * zmm14) + zmm4
        vfmadd231pd     %zmm15, %zmm12, %zmm4 # zmm4 = (zmm12 * zmm15) + zmm4
        vfmadd231pd     %zmm10, %zmm13, %zmm4 # zmm4 = (zmm13 * zmm10) + zmm4
        vextracti128    $1, %ymm9, %xmm0
        vmovq   %xmm0, %rbx
        vmovupd (%r13,%rbx,8), %zmm9
        vfmadd231pd     %zmm15, %zmm11, %zmm3 # zmm3 = (zmm11 * zmm15) + zmm3
        vfmadd231pd     %zmm10, %zmm12, %zmm3 # zmm3 = (zmm12 * zmm10) + zmm3
        vpextrq $1, %xmm0, %rbx
        vfmadd231pd     %zmm9, %zmm13, %zmm3 # zmm3 = (zmm13 * zmm9) + zmm3
        vfmadd231pd     %zmm10, %zmm11, %zmm2 # zmm2 = (zmm11 * zmm10) + zmm2
        vfmadd231pd     %zmm9, %zmm12, %zmm2 # zmm2 = (zmm12 * zmm9) + zmm2
        vfmadd231pd     (%r13,%rbx,8), %zmm13, %zmm2 # zmm2 = (zmm13 * mem) + zmm2
        incq    %rdi
        cmpq    %rdi, %r10
        jne     L1824

In the above, LLVM calculates pointer offsets in a ymm register, and then uses expensive extract instructions to load them:

vpextrq $1, %xmm0, %rbx

After I made a few tweaks, it actually started using zmm registers to calculate even more pointer offsets, making the code even slower.
I did not use SIMD intrinsics for any of the indexing calculations; that was LLVM’s ingenuity.

This example needs more work, because I still have the series of add instructions I managed to get rid of in the matrix-multiplication example, but the add instructions are a major improvement over what it was:

L1376:
        lea     rax, [rsi + rbp]
        vmovupd zmm7, zmmword ptr [rbp + rsi]
        vmovupd zmm8, zmmword ptr [rsi + rax]
        add     rax, rsi
        vmovupd zmm9, zmmword ptr [rsi + rax]
        add     rax, rsi
        vmovupd zmm10, zmmword ptr [rsi + rax]
        add     rax, rsi
        vmovupd zmm11, zmmword ptr [rsi + rax]
        add     rax, rsi
        vmovupd zmm12, zmmword ptr [rsi + rax]
        add     rax, rsi
        vmovupd zmm13, zmmword ptr [rsi + rax]
        add     rax, rsi
        vmovupd zmm14, zmmword ptr [rsi + rax]
        add     rax, rsi
        vbroadcastsd    zmm15, qword ptr [rdx]
        vbroadcastsd    zmm16, qword ptr [rdx + r15]
        vfmadd231pd     zmm4, zmm15, zmmword ptr [rbp] # zmm4 = (zmm15 * mem) + zmm4
        vfmadd231pd     zmm4, zmm16, zmm7 # zmm4 = (zmm16 * zmm7) + zmm4
        vbroadcastsd    zmm17, qword ptr [rdx + 2*r15]
        vbroadcastsd    zmm18, qword ptr [rdx + r13]
        vfmadd231pd     zmm4, zmm17, zmm8 # zmm4 = (zmm17 * zmm8) + zmm4
        vfmadd231pd     zmm4, zmm18, zmm9 # zmm4 = (zmm18 * zmm9) + zmm4
        vfmadd231pd     zmm6, zmm15, zmm7 # zmm6 = (zmm15 * zmm7) + zmm6
        vfmadd231pd     zmm6, zmm16, zmm8 # zmm6 = (zmm16 * zmm8) + zmm6
        vfmadd231pd     zmm6, zmm17, zmm9 # zmm6 = (zmm17 * zmm9) + zmm6
        vfmadd231pd     zmm6, zmm18, zmm10 # zmm6 = (zmm18 * zmm10) + zmm6
        vfmadd231pd     zmm5, zmm15, zmm8 # zmm5 = (zmm15 * zmm8) + zmm5
        vfmadd231pd     zmm5, zmm16, zmm9 # zmm5 = (zmm16 * zmm9) + zmm5
        vfmadd231pd     zmm5, zmm17, zmm10 # zmm5 = (zmm17 * zmm10) + zmm5
        vfmadd231pd     zmm5, zmm18, zmm11 # zmm5 = (zmm18 * zmm11) + zmm5
        vfmadd231pd     zmm3, zmm15, zmm9 # zmm3 = (zmm15 * zmm9) + zmm3
        vfmadd231pd     zmm3, zmm16, zmm10 # zmm3 = (zmm16 * zmm10) + zmm3
        vfmadd231pd     zmm3, zmm17, zmm11 # zmm3 = (zmm17 * zmm11) + zmm3
        vfmadd231pd     zmm3, zmm18, zmm12 # zmm3 = (zmm18 * zmm12) + zmm3
        vfmadd231pd     zmm2, zmm15, zmm10 # zmm2 = (zmm15 * zmm10) + zmm2
        vfmadd231pd     zmm2, zmm16, zmm11 # zmm2 = (zmm16 * zmm11) + zmm2
        vfmadd231pd     zmm2, zmm17, zmm12 # zmm2 = (zmm17 * zmm12) + zmm2
        vfmadd231pd     zmm2, zmm18, zmm13 # zmm2 = (zmm18 * zmm13) + zmm2
        vfmadd231pd     zmm1, zmm15, zmm11 # zmm1 = (zmm15 * zmm11) + zmm1
        vfmadd231pd     zmm1, zmm16, zmm12 # zmm1 = (zmm16 * zmm12) + zmm1
        vfmadd231pd     zmm1, zmm17, zmm13 # zmm1 = (zmm17 * zmm13) + zmm1
        vfmadd231pd     zmm1, zmm18, zmm14 # zmm1 = (zmm18 * zmm14) + zmm1
        vfmadd231pd     zmm0, zmm15, zmm12 # zmm0 = (zmm15 * zmm12) + zmm0
        vfmadd231pd     zmm0, zmm16, zmm13 # zmm0 = (zmm16 * zmm13) + zmm0
        vfmadd231pd     zmm0, zmm17, zmm14 # zmm0 = (zmm17 * zmm14) + zmm0
        vfmadd231pd     zmm0, zmm18, zmmword ptr [rsi + rax] # zmm0 = (zmm18 * mem) + zmm0
        add     rdx, 8
        add     rbp, 8
        cmp     rdi, r8
        lea     rdi, [rdi + 1]
        jl      L1376

This is about 30% faster in the benchmark I ran than when it computed offsets in ymm registers, and 50% faster than the zmm-offset version, even though the amount of unrolling is now actually less suited to the benchmark. Actually, the above loop doesn’t get run at all in the benchmark, but instead some cleanup loop I’d have to dig for.

That regression was so severe that I never released a LoopVectorization version generating it.

  3. Repeated (e.g., diagonal) indices: for i in 1:N; A[i,i]; end
using LinearAlgebra, LoopVectorization
function logdettriangle(B::Union{LowerTriangular,UpperTriangular})
    A = parent(B) # using a triangular matrix would fall back to the default loop.
    ld = zero(eltype(A))
    @avx for n ∈ axes(A,1)
        ld += log(A[n,n])
    end
    ld
end

This is what it looked like before, and this after I made those changes.

LLVM would calculate the pointer offset from the strides inside the loop.
I now handle this via A[i,i] -> Adiagonalview[i], where stride(Adiagonalview, 1) == stride(A,1) + stride(A,2), which reduces addressing calculations by enough to make a notable difference vs the cost of a SIMD-log.
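
In ordinary Julia code, that rewrite corresponds to something like the following (a hedged illustration, not LoopVectorization’s internal representation):

# A[i,i] becomes a 1-D view whose stride is the sum of the two original strides,
# so only one index calculation is needed per iteration.
A = rand(8, 8)
combined_stride = stride(A, 1) + stride(A, 2)              # == 9 for an 8×8 matrix
Adiagonalview = view(vec(A), 1:combined_stride:length(A))
@assert Adiagonalview == [A[n, n] for n in axes(A, 1)]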

This particular issue may be in part because it has to use vectors for the discontiguous offsets into A, so for A[i,i], we have something like ptr + <0,1,2,3,4,5,6,7> * st + stride(A,2) * <0,1,2,3,4,5,6,7> * st where st = sizeof(eltype(A)), and LLVM might be hesitant to optimize the vectors / realize that

<0,1,2,3,4,5,6,7> * st * i + stride(A,2) * <0,1,2,3,4,5,6,7> * st * i
== (<0,1,2,3,4,5,6,7> * st * (1 + stride(A,2))) * i

and that from there it can avoid the calculations inside the loop, and just keep vpaddq-ing <0,1,2,3,4,5,6,7> * st * (1 + stride(A,2)) onto the pointer vector instead.

LLVM sometimes figured this out, but not always. Flattening repeated indices into one with combined stride made it more reliable.

If you like to waste time with micro-optimizations as much as I do, AVX512 is fun because of the extra options it provides and the large possible reward when you properly leverage it (that large reward being just as much a function of the compiler’s poor AVX512 performance, as you note, as it is of the larger vectors; the opportunity cost of hand-optimizing / autogenerating optimized code is whatever the compiler would have done).

I used to have Clang-polly in the benchmarks, but I removed it because the performance was bad, it took a long time to compile using Polly, and then once I wiped my hard drive I decided rebuilding Clang with Polly from source instead of just using my distro’s Clang (which wasn’t built with Polly) wasn’t worthwhile. The only benchmark that seemed to improve with Polly was A * B, but it didn’t improve to a level beyond how the Intel compilers performed, and they have vastly better compilation times.
None of the transposed cases were notably better than regular Clang.

I’m looking forward to trying MLIR, whether through Brutus, Flang, or some other convenient frontend.

I think Flang is spearheaded by NVIDIA, so perhaps that is the primary goal of their lowering to MLIR.
I should look into it more than I have, but there has at least been discussion of an AVX512 dialect and this paper got a pretty good gemm on AVX2.

5 Likes

Maybe another example I could have added is that LoopVectorization will (under certain conditions) use asm call instead of @llvm.muladd to specifically use the vfmadd231 instruction on CPUs with the FMA3 instruction set.
vfmadd231 basically does:
c = a * b + c
While @llvm.muladd would often choose to use vfmadd213 instead:
b = a * b + c
and then add an extra vmovapd instruction to set c equal to b.

This has now been fixed on LLVM master, after I asked for help getting rid of those vmovapds online. That means on Julia versions with an LLVM using that commit, I can quit using asm call. I think that’ll be starting with LLVM 11?

So at least one of the optimizations here has already led to an improvement in LLVM.

6 Likes

I am surprised that LLVM left such optimizations on the table, though for most (all?) of them, it again seems that LLVM does know what the offsets are but didn’t manage to emit optimal code for them. This can certainly be added to the cases where better codegen is needed, but I don’t think this info has ever been hidden from the LLVM IR.

I like to do it when it matters for me ;-p… So yes when I use AVX512 capable hardware as my main computer.

As far as I’m concerned, you are very welcome to submit that patch to be included in our LLVM build. I basically did that for all of the few LLVM performance-related issues that I’ve fixed/experienced, and usually before they were merged upstream. There’s no need to wait for a new LLVM release to be used.


And I’ll also add that doing this kind of pattern matching on LLVM IR is not hard. In fact, I’ve always found it to be easier than working with Julia IR. The switch to SSA for Julia IR has certainly helped (though that’s not the level this package uses), but LLVM still has WAY more utilities to do all of these analyses.

3 Likes

This presentation lays out the relationship and motivation better than I could: 2020-02-26 - CGO 2020 Talk - Google Slides. MLIR is being pushed by core developers of LLVM, which is what I meant by “raising the white flag” on solving all problems at the level of LLVM.

There’s interest in turning LoopVectorization into a compiler pass, see Add a compiler pass for iterate(::CartesianIndices) by timholy · Pull Request #35074 · JuliaLang/julia · GitHub. I hope to get back to that “real soon” (invalidations and Revise have dominated lately…). But honestly, what Chris is doing at the package level will all be useful once we pave the way for integrating at a deeper level, so there’s every reason to be excited about seeing his work continue to move forward.

16 Likes

Sure, I knew that they want MLIR because the current IR can’t solve some problems. I don’t see how that applies here, though. There’s nothing language-specific here, and there’s nothing that LLVM isn’t already aware of. Moving to a new IR might make existing code easier to (re)write, but that does not seem to be the limitation here. Again, everything new here seems to be emitting better code given information that is already known.

  1. I’ve already said that working on better code generation isn’t what I’m complaining about. Bending over backwards to fit the intrinsic limitations of the current implementation is.
  2. Similarly, if this were implemented on LLVM IR, nothing would be wasted either, and it would even have a much broader audience. (In that sense, it is wasted effort, since it duplicates the LLVM work and the work of other people. It’s very unfortunate, but I have less of a direct issue with that.)
1 Like

For my own education, since I know very little about the Julia compiler…

I would imagine that Julia takes code and then attempts to rewrite it into a form which LLVM can then optimize, much like a lot of Scheme compilers try to emit very boring C code which is really easy to optimize because it kind of looks like assembly written in C. Does Julia emit some form of intermediate language, and then LLVM works on compiling that secondary thing? I’ll make another guess and suppose that the secondary thing is in fact a processor-independent assembly language of some sort.

LoopVectorization, it seems, is a macro which attempts to take your code and, because it’s a macro, can actually emit assembly directly, bypassing LLVM entirely?

And so LoopVectorization as a compiler pass would do what? Actually emit assembly, or emit that intermediate code for LLVM (assuming I’m right about that)?

You should probably watch JuliaCon 2017 | AoT or JIT: How Does Julia Work? | Jameson Nash - YouTube; it’s a pretty detailed view of Julia’s compiler.

4 Likes

A small addition to @Oscar_Smith’s excellent recommendation (short answer: Julia’s compiler emits LLVM IR and then calls LLVM passes): Base.llvmcall means that the line between “ordinary Julia code” and “emitting assembly” (via LLVM) is blurry.

1 Like

While all the information is available, I think it’s easier to make those sorts of optimizations at the same time as you perform loop transformations, because the patterns and relations are clear at that point; e.g., if you’re unrolling a loop in which you’re loading from a strided array, you know that some of the pointer offsets are 2x, 4x, and possibly 8x larger than a base, and thus you can use the 2x, 4x, or 8x multipliers available for addressing (and that this is a better use for them than sizeof(eltype(pointer)), as LLVM normally likes to do).
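
As a concrete (hedged) illustration of that addressing point, with made-up numbers:

# Unrolling a strided load by 8 gives offsets that are small multiples of the
# dynamic byte stride s. x86 addressing covers base + index*scale for scale in
# (1, 2, 4, 8), so the 1s, 2s, and 4s offsets come for free, while e.g. 3s needs
# a precomputed register (the r9 = 3*rdi seen in the matrix-multiplication example).
s = 8 * 128                                          # hypothetical column stride in bytes
offsets = s .* (0:7)                                 # offsets touched by an 8-fold unroll
free    = [k * s for k in 0:7 if k in (0, 1, 2, 4)]  # reachable with scaled addressing
extra   = setdiff(offsets, free)                     # need an extra register or an add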

Some of LLVM’s mistakes are truly bizarre.
How could it possibly not figure this one out?
This happens on LLVM 8 and 9 (and I think 6), but not with LLVM 10 (meaning no use reporting it, as it’s already been fixed):

L976:
        prefetcht0      byte ptr [r10 + rax - 128]
        vbroadcastsd    zmm0, qword ptr [r9]
        vmovups zmm31, zmmword ptr [rdi]
        vmovups zmm30, zmmword ptr [r10 + 64]
        vmovupd zmm29, zmmword ptr [r10 + 128]
        prefetcht0      byte ptr [r10 + rax - 64]
        prefetcht0      byte ptr [r10 + rax]
        vfmadd231pd     zmm28, zmm0, zmm31 # zmm28 = (zmm0 * zmm31) + zmm28
        vfmadd231pd     zmm25, zmm0, zmm30 # zmm25 = (zmm0 * zmm30) + zmm25
        vbroadcastsd    zmm1, qword ptr [rbp + rsi - 8]
        vfmadd231pd     zmm19, zmm0, zmm29 # zmm19 = (zmm0 * zmm29) + zmm19
        vfmadd231pd     zmm27, zmm1, zmm31 # zmm27 = (zmm1 * zmm31) + zmm27
        vfmadd231pd     zmm23, zmm1, zmm30 # zmm23 = (zmm1 * zmm30) + zmm23
        vfmadd231pd     zmm16, zmm1, zmm29 # zmm16 = (zmm1 * zmm29) + zmm16
        vbroadcastsd    zmm0, qword ptr [r9 + 2*rsi]
        vfmadd231pd     zmm26, zmm0, zmm31 # zmm26 = (zmm0 * zmm31) + zmm26
        vfmadd231pd     zmm21, zmm0, zmm30 # zmm21 = (zmm0 * zmm30) + zmm21
        vfmadd231pd     zmm13, zmm0, zmm29 # zmm13 = (zmm0 * zmm29) + zmm13
        vbroadcastsd    zmm0, qword ptr [rbp + r11 - 8]
        vfmadd231pd     zmm24, zmm0, zmm31 # zmm24 = (zmm0 * zmm31) + zmm24
        vfmadd231pd     zmm18, zmm0, zmm30 # zmm18 = (zmm0 * zmm30) + zmm18
        vfmadd231pd     zmm10, zmm0, zmm29 # zmm10 = (zmm0 * zmm29) + zmm10
        vbroadcastsd    zmm0, qword ptr [r9 + 4*rsi]
        vfmadd231pd     zmm22, zmm0, zmm31 # zmm22 = (zmm0 * zmm31) + zmm22
        vfmadd231pd     zmm15, zmm0, zmm30 # zmm15 = (zmm0 * zmm30) + zmm15
        vfmadd231pd     zmm8, zmm0, zmm29 # zmm8 = (zmm0 * zmm29) + zmm8
        vbroadcastsd    zmm0, qword ptr [rbp + rcx - 8]
        vfmadd231pd     zmm20, zmm0, zmm31 # zmm20 = (zmm0 * zmm31) + zmm20
        vfmadd231pd     zmm12, zmm0, zmm30 # zmm12 = (zmm0 * zmm30) + zmm12
        vfmadd231pd     zmm6, zmm0, zmm29 # zmm6 = (zmm0 * zmm29) + zmm6
        vbroadcastsd    zmm0, qword ptr [r9 + 2*r11]
        vfmadd231pd     zmm17, zmm0, zmm31 # zmm17 = (zmm0 * zmm31) + zmm17
        vfmadd231pd     zmm9, zmm0, zmm30 # zmm9 = (zmm0 * zmm30) + zmm9
        vpbroadcastq    zmm1, qword ptr [rbp + r15 - 8]
        vfmadd231pd     zmm4, zmm0, zmm29 # zmm4 = (zmm0 * zmm29) + zmm4
        vfmadd231pd     zmm14, zmm1, zmm31 # zmm14 = (zmm1 * zmm31) + zmm14
        vfmadd231pd     zmm7, zmm1, zmm30 # zmm7 = (zmm1 * zmm30) + zmm7
        vfmadd231pd     zmm3, zmm1, zmm29 # zmm3 = (zmm1 * zmm29) + zmm3
        vpbroadcastq    zmm0, qword ptr [r9 + 8*rsi]
        vfmadd231pd     zmm11, zmm0, zmm31 # zmm11 = (zmm0 * zmm31) + zmm11
        vfmadd231pd     zmm5, zmm0, zmm30 # zmm5 = (zmm0 * zmm30) + zmm5
        vfmadd231pd     zmm2, zmm0, zmm29 # zmm2 = (zmm0 * zmm29) + zmm2
        add     r10, rdx
        mov     r9, rbp
        add     rbp, 8
        mov     rdi, r10
        cmp     r10, rbx
        jbe     L976

Filtering away the parts I’m not talking about:

        vbroadcastsd    zmm0, qword ptr [r9]
        vbroadcastsd    zmm1, qword ptr [rbp + rsi - 8]
        vbroadcastsd    zmm0, qword ptr [r9 + 2*rsi]
        vbroadcastsd    zmm0, qword ptr [rbp + r11 - 8]
        vbroadcastsd    zmm0, qword ptr [r9 + 4*rsi]
        vbroadcastsd    zmm0, qword ptr [rbp + rcx - 8]
        vbroadcastsd    zmm0, qword ptr [r9 + 2*r11]
        vpbroadcastq    zmm1, qword ptr [rbp + r15 - 8]
        vpbroadcastq    zmm0, qword ptr [r9 + 8*rsi]
        add     r10, rdx
        mov     r9, rbp
        add     rbp, 8

rbp == r9 + 8.
Every time it uses rbp, it subtracts 8.
Once it’s done with all the loads, before repeating the loop, it sets r9 = rbp, and then adds 8 to rbp.

It seems that LLVM is obviously aware of the relationship here.
Yet, why doesn’t it eliminate rbp?

The most bewildering case of bad code gen I came across is that, if it is possible for a loop to have 0 iterations, LLVM likes to generate code that:

  1. zero initializes all the registers
  2. checks if the loop will iterate at least once. If it does not, it jumps to the appropriate clean up. If the loop will iterate, it continues and…
  3. zero initializes all the registers that have just been zero initialized and not used since said xor-ing to 0.

This looks really bad when the number of registers being xor-ed twice is something like 25 or 27.
As a workaround, LoopVectorization emits non-unrolled loops of the form

cond = true
while cond
    #do stuff
    cond = # check whether another iteration is needed; (`cond || break` leads to worse code gen)
end

(It can check to make sure they’re not empty outside if you pass @avx check_empty=true.)
But this trick doesn’t work when loops are unrolled, as I won’t guarantee that any particular one of them will iterate at least once.

I’d still have to carry the code managing asm call until I drop support for Julia versions that don’t carry that patch.
But asm call is definitely a hack that it would be better to be rid of.
I have a fair bit of logic deciding whether or not to use the asm version. On AVX512, LLVM can optimize the @llvm.muladd version by combining it with select statements or with loads/broadcasts.
So I have to check if it seems likely LLVM may want to do that, to avoid missing those opportunities.

I’ll have to look at other PRs adding LLVM patches to Julia to see what it’d take.

If Julia 1.6 is to be the next LTS, it would be worth it for me to get it in there.

Also, regarding performance on multiples of 32, I now have:

julia> @btime mydot($x128, $y128)
  7.992 ns (0 allocations: 0 bytes)
34.751488939009256

julia> @btime mydotavx($x128, $y128)
  8.483 ns (0 allocations: 0 bytes)
34.75148893900926

Versus what it was in the recent post:

julia> @btime mydot($x128, $y128)
  7.993 ns (0 allocations: 0 bytes)
31.004598810919788

julia> @btime mydotavx($x128, $y128)
  8.737 ns (0 allocations: 0 bytes)
31.004598810919788

This comes from only calculating the mask if we need it. Some of the remaining difference is probably because, after the loop, the LLVM version immediately reduces the vector to a scalar, while mine checks for the appropriate cleanup loop.
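
Roughly, the idea is (a hedged sketch of the approach, not the actual implementation):

# Only build a tail mask when the trip count is not a multiple of the vector width.
N, W = 128, 8                      # trip count and SIMD width (Float64 on AVX512)
tail = N % W                       # leftover iterations after the full vectors
if tail != 0
    mask = (0x01 << tail) - 0x01   # e.g. 0b00000111 if three elements remain
    # ... masked load/store for the final partial vector ...
end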

At these sizes, the selfdot and sum examples are also helped by only unrolling by 4 instead of 8. As the size range where 8 is better is probably small (it won’t be long before they’re memory bound), I may limit the maximum unrolling of such reductions to 4.
I’ll have to think about it.

I’d have to look into that, but it of course makes sense that this would be true given how many passes there are that would want to know about these things.
I know close to 0 about LLVM’s internals.

@purplishrock, here is an answer to how things work (note also that LoopVectorization has some developer docs):
The macro generates a call to the generated function _avx_!:

julia> using LoopVectorization

julia> @macroexpand @avx for i ∈ eachindex(a,b)
       s += a[i]*b[i]
       end
quote
    begin
        var"##loopi#592" = LoopVectorization.maybestaticrange(eachindex(a, b))
        var"##i_loop_lower_bound#593" = LoopVectorization.maybestaticfirst(var"##loopi#592")
        var"##i_loop_upper_bound#594" = LoopVectorization.maybestaticlast(var"##loopi#592")
    end
    if LoopVectorization.check_args(a, b)
        var"##vptr##_a" = LoopVectorization.stridedpointer(a)
        var"##vptr##_b" = LoopVectorization.stridedpointer(b)
        local var"##s_0"
        begin
            $(Expr(:gc_preserve, :(var"##s_0" = LoopVectorization._avx_!(Val{(0, 0, 0)}(), Tuple{:LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02), :LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x03), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000001, 0x0000000000000000, 0x0000000000010203, LoopVectorization.compute, 0x00, 0x03)}, Tuple{LoopVectorization.ArrayRefStruct{:a,Symbol("##vptr##_a")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:b,Symbol("##vptr##_b")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000)}, Tuple{0, Tuple{4}, Tuple{3}, Tuple{}, Tuple{}, Tuple{}, Tuple{}}, Tuple{:i}, (var"##i_loop_lower_bound#593":var"##i_loop_upper_bound#594",), var"##vptr##_a", var"##vptr##_b", s)), :a, :b))
        end
        s = LoopVectorization.reduced_add(var"##s_0", s)
    else
        $(Expr(:inbounds, true))
        local var"#79#val" = for i = eachindex(a, b)
                    #= REPL[52]:2 =#
                    s = Base.FastMath.add_fast(s, Base.FastMath.mul_fast(a[i], b[i]))
                end
        $(Expr(:inbounds, :pop))
        var"#79#val"
    end
end

The function _avx_! is given everything it needs to know to reconstruct a summary of the loop’s structure, encoded in type information.
The macro LoopVectorization.@avx_debug gives us this summary:

julia> a = rand(147); b = rand(147);

julia> s = 0.0;

julia> ls = LoopVectorization.@avx_debug for i ∈ eachindex(a,b)
       s += a[i]*b[i]
       end;
OPS = Tuple{:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02),:LoopVectorization,:LOOPCONSTANTINSTRUCTION,LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x03),:LoopVectorization,:vfmadd_fast,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000001, 0x0000000000000000, 0x0000000000010203, LoopVectorization.compute, 0x00, 0x03)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:a,Symbol("##vptr##_a")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:b,Symbol("##vptr##_b")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000)}
AM = Tuple{0,Tuple{4},Tuple{3},Tuple{},Tuple{},Tuple{},Tuple{}}
LPSYM = Tuple{:i}
LB = Tuple{VectorizationBase.StaticLowerUnitRange{1}}
vargs = (VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f5936f90050, ()), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f5936f905a0, ()), 0.0)

julia> typeof(ls)
LoopVectorization.LoopSet

The LoopSet contains all the operations involved in the loop, and their dependencies on each other and on the individual loops. This is used for estimating the cost of evaluating the loop set following different loop orders and unrolling factors (the estimate is less sophisticated than MCA, but uses estimates of instruction throughput, register utilization, etc.).
It will only choose to unroll up to 2 loops:

julia> LoopVectorization.choose_order(ls)
([:i], :i, Symbol("##undefined##"), :i, 4, -1)

The returned arguments are loop order (in a vector), the first and second unrolled loop (second is Symbol("##undefined##")), which loop is vectorized, and then the unrolling factors for the unrolled loops (-1 is a sentinel value for not unrolled).

Then, given these values, it emits a Julia Expr object. You can call LoopVectorization.lower(ls) explicitly, or the show method will print it if you simply enter:

julia> ls

This is largely reduced to a sequence of llvmcalls and control flow; the main loop:

julia> @code_typed mydotavx(a, b)
6 ┄─ %40  = φ (#5 => %37, #7 => %116)::NTuple{8,VecElement{Float64}}
│    %41  = φ (#5 => %37, #7 => %116)::NTuple{8,VecElement{Float64}}
│    %42  = φ (#5 => %37, #7 => %116)::NTuple{8,VecElement{Float64}}
│    %43  = φ (#5 => %37, #7 => %116)::NTuple{8,VecElement{Float64}}
│    %44  = φ (#5 => %37, #7 => %116)::NTuple{8,VecElement{Float64}}
│    %45  = φ (#5 => %39, #7 => %118)::NTuple{8,VecElement{Float64}}
│    %46  = φ (#5 => %39, #7 => %118)::NTuple{8,VecElement{Float64}}
│    %47  = φ (#5 => %39, #7 => %118)::NTuple{8,VecElement{Float64}}
│    %48  = φ (#5 => %39, #7 => %118)::NTuple{8,VecElement{Float64}}
│    %49  = φ (#5 => %35, #7 => %114)::NTuple{8,VecElement{Float64}}
│    %50  = φ (#5 => %35, #7 => %114)::NTuple{8,VecElement{Float64}}
│    %51  = φ (#5 => %35, #7 => %114)::NTuple{8,VecElement{Float64}}
│    %52  = φ (#5 => %35, #7 => %114)::NTuple{8,VecElement{Float64}}
│    %53  = φ (#5 => %35, #7 => %114)::NTuple{8,VecElement{Float64}}
│    %54  = φ (#5 => %35, #7 => %114)::NTuple{8,VecElement{Float64}}
│    %55  = φ (#5 => %33, #7 => %112)::NTuple{8,VecElement{Float64}}
│    %56  = φ (#5 => %33, #7 => %112)::NTuple{8,VecElement{Float64}}
│    %57  = φ (#5 => %33, #7 => %112)::NTuple{8,VecElement{Float64}}
│    %58  = φ (#5 => %33, #7 => %112)::NTuple{8,VecElement{Float64}}
│    %59  = φ (#5 => %33, #7 => %112)::NTuple{8,VecElement{Float64}}
│    %60  = φ (#5 => %33, #7 => %112)::NTuple{8,VecElement{Float64}}
│    %61  = φ (#5 => %33, #7 => %112)::NTuple{8,VecElement{Float64}}
│    %62  = φ (#5 => 0, #7 => %120)::Int64
│    %63  = Base.llvmcall::Core.IntrinsicFunction
│    %64  = (%63)("%res = sub nsw i64 %0, %1\nret i64 %res", VectorizationBase.Int, Tuple{Int64,Int64}, %3, 31)::Int64
│    %65  = Base.slt_int(%62, %64)::Bool
└───        goto #8 if not %65
7 ── %67  = Base.llvmcall::Core.IntrinsicFunction
│    %68  = (%67)("%res = mul nsw i64 %0, %1\nret i64 %res", VectorizationBase.Int, Tuple{Int64,Int64}, 8, %62)::Int64
│    %69  = Base.llvmcall::Core.IntrinsicFunction
│    %70  = (%69)(("!1 = !{!\"noaliasdomain\"}\n!2 = !{!\"noaliasscope\", !1}\n!3 = !{!2}\n!4 = !{!\"jtbaa\"}\n!5 = !{!6, !6, i64 0, i64 0}\n!6 = !{!\"jtbaa_arraybuf\", !4, i64 0}\n", "%typptr = inttoptr i64 %0 to i8*\n%offsetptr = getelementptr inbounds i8, i8* %typptr, i64 %1\n%ptr = bitcast i8* %offsetptr to <8 x double>*\n%res = load <8 x double>, <8 x double>* %ptr, align 8, !alias.scope !3, !tbaa !5\nret <8 x double> %res"), NTuple{8,VecElement{Float64}}, Tuple{Ptr{Float64},Int64}, %29, %68)::NTuple{8,VecElement{Float64}}
│    %71  = Base.llvmcall::Core.IntrinsicFunction
│    %72  = (%71)("%res = add nsw i64 %0, %1\nret i64 %res", VectorizationBase.Int, Tuple{Int64,Int64}, 8, %62)::Int64
│    %73  = Base.llvmcall::Core.IntrinsicFunction
│    %74  = (%73)("%res = mul nsw i64 %0, %1\nret i64 %res", VectorizationBase.Int, Tuple{Int64,Int64}, 8, %72)::Int64
│    %75  = Base.llvmcall::Core.IntrinsicFunction
│    %76  = (%75)(("!1 = !{!\"noaliasdomain\"}\n!2 = !{!\"noaliasscope\", !1}\n!3 = !{!2}\n!4 = !{!\"jtbaa\"}\n!5 = !{!6, !6, i64 0, i64 0}\n!6 = !{!\"jtbaa_arraybuf\", !4, i64 0}\n", "%typptr = inttoptr i64 %0 to i8*\n%offsetptr = getelementptr inbounds i8, i8* %typptr, i64 %1\n%ptr = bitcast i8* %offsetptr to <8 x double>*\n%res = load <8 x double>, <8 x double>* %ptr, align 8, !alias.scope !3, !tbaa !5\nret <8 x double> %res"), NTuple{8,VecElement{Float64}}, Tuple{Ptr{Float64},Int64}, %29, %74)::NTuple{8,VecElement{Float64}}
│    %77  = Base.llvmcall::Core.IntrinsicFunction
│    %78  = (%77)("%res = add nsw i64 %0, %1\nret i64 %res", VectorizationBase.Int, Tuple{Int64,Int64}, 16, %62)::Int64
│    %79  = Base.llvmcall::Core.IntrinsicFunction
│    %80  = (%79)("%res = mul nsw i64 %0, %1\nret i64 %res", VectorizationBase.Int, Tuple{Int64,Int64}, 8, %78)::Int64
│    %81  = Base.llvmcall::Core.IntrinsicFunction
│    %82  = (%81)(("!1 = !{!\"noaliasdomain\"}\n!2 = !{!\"noaliasscope\", !1}\n!3 = !{!2}\n!4 = !{!\"jtbaa\"}\n!5 = !{!6, !6, i64 0, i64 0}\n!6 = !{!\"jtbaa_arraybuf\", !4, i64 0}\n", "%typptr = inttoptr i64 %0 to i8*\n%offsetptr = getelementptr inbounds i8, i8* %typptr, i64 %1\n%ptr = bitcast i8* %offsetptr to <8 x double>*\n%res = load <8 x double>, <8 x double>* %ptr, align 8, !alias.scope !3, !tbaa !5\nret <8 x double> %res"), NTuple{8,VecElement{Float64}}, Tuple{Ptr{Float64},Int64}, %29, %80)::NTuple{8,VecElement{Float64}}
│    %83  = Base.llvmcall::Core.IntrinsicFunction
│    %84  = (%83)("%res = add nsw i64 %0, %1\nret i64 %res", VectorizationBase.Int, Tuple{Int64,Int64}, 24, %62)::Int64
│    %85  = Base.llvmcall::Core.IntrinsicFunction
│    %86  = (%85)("%res = mul nsw i64 %0, %1\nret i64 %res", VectorizationBase.Int, Tuple{Int64,Int64}, 8, %84)::Int64
│    %87  = Base.llvmcall::Core.IntrinsicFunction
│    %88  = (%87)(("!1 = !{!\"noaliasdomain\"}\n!2 = !{!\"noaliasscope\", !1}\n!3 = !{!2}\n!4 = !{!\"jtbaa\"}\n!5 = !{!6, !6, i64 0, i64 0}\n!6 = !{!\"jtbaa_arraybuf\", !4, i64 0}\n", "%typptr = inttoptr i64 %0 to i8*\n%offsetptr = getelementptr inbounds i8, i8* %typptr, i64 %1\n%ptr = bitcast i8* %offsetptr to <8 x double>*\n%res = load <8 x double>, <8 x double>* %ptr, align 8, !alias.scope !3, !tbaa !5\nret <8 x double> %res"), NTuple{8,VecElement{Float64}}, Tuple{Ptr{Float64},Int64}, %29, %86)::NTuple{8,VecElement{Float64}}
│    %89  = Base.llvmcall::Core.IntrinsicFunction
│    %90  = (%89)("%res = mul nsw i64 %0, %1\nret i64 %res", VectorizationBase.Int, Tuple{Int64,Int64}, 8, %62)::Int64
│    %91  = Base.llvmcall::Core.IntrinsicFunction
│    %92  = (%91)(("!1 = !{!\"noaliasdomain\"}\n!2 = !{!\"noaliasscope\", !1}\n!3 = !{!2}\n!4 = !{!\"jtbaa\"}\n!5 = !{!6, !6, i64 0, i64 0}\n!6 = !{!\"jtbaa_arraybuf\", !4, i64 0}\n", "%typptr = inttoptr i64 %0 to i8*\n%offsetptr = getelementptr inbounds i8, i8* %typptr, i64 %1\n%ptr = bitcast i8* %offsetptr to <8 x double>*\n%res = load <8 x double>, <8 x double>* %ptr, align 8, !alias.scope !3, !tbaa !5\nret <8 x double> %res"), NTuple{8,VecElement{Float64}}, Tuple{Ptr{Float64},Int64}, %31, %90)::NTuple{8,VecElement{Float64}}
│    %93  = Base.llvmcall::Core.IntrinsicFunction
│    %94  = (%93)("%res = add nsw i64 %0, %1\nret i64 %res", VectorizationBase.Int, Tuple{Int64,Int64}, 8, %62)::Int64
│    %95  = Base.llvmcall::Core.IntrinsicFunction
│    %96  = (%95)("%res = mul nsw i64 %0, %1\nret i64 %res", VectorizationBase.Int, Tuple{Int64,Int64}, 8, %94)::Int64
│    %97  = Base.llvmcall::Core.IntrinsicFunction
│    %98  = (%97)(("!1 = !{!\"noaliasdomain\"}\n!2 = !{!\"noaliasscope\", !1}\n!3 = !{!2}\n!4 = !{!\"jtbaa\"}\n!5 = !{!6, !6, i64 0, i64 0}\n!6 = !{!\"jtbaa_arraybuf\", !4, i64 0}\n", "%typptr = inttoptr i64 %0 to i8*\n%offsetptr = getelementptr inbounds i8, i8* %typptr, i64 %1\n%ptr = bitcast i8* %offsetptr to <8 x double>*\n%res = load <8 x double>, <8 x double>* %ptr, align 8, !alias.scope !3, !tbaa !5\nret <8 x double> %res"), NTuple{8,VecElement{Float64}}, Tuple{Ptr{Float64},Int64}, %31, %96)::NTuple{8,VecElement{Float64}}
│    %99  = Base.llvmcall::Core.IntrinsicFunction
│    %100 = (%99)("%res = add nsw i64 %0, %1\nret i64 %res", VectorizationBase.Int, Tuple{Int64,Int64}, 16, %62)::Int64
│    %101 = Base.llvmcall::Core.IntrinsicFunction
│    %102 = (%101)("%res = mul nsw i64 %0, %1\nret i64 %res", VectorizationBase.Int, Tuple{Int64,Int64}, 8, %100)::Int64
│    %103 = Base.llvmcall::Core.IntrinsicFunction
│    %104 = (%103)(("!1 = !{!\"noaliasdomain\"}\n!2 = !{!\"noaliasscope\", !1}\n!3 = !{!2}\n!4 = !{!\"jtbaa\"}\n!5 = !{!6, !6, i64 0, i64 0}\n!6 = !{!\"jtbaa_arraybuf\", !4, i64 0}\n", "%typptr = inttoptr i64 %0 to i8*\n%offsetptr = getelementptr inbounds i8, i8* %typptr, i64 %1\n%ptr = bitcast i8* %offsetptr to <8 x double>*\n%res = load <8 x double>, <8 x double>* %ptr, align 8, !alias.scope !3, !tbaa !5\nret <8 x double> %res"), NTuple{8,VecElement{Float64}}, Tuple{Ptr{Float64},Int64}, %31, %102)::NTuple{8,VecElement{Float64}}
│    %105 = Base.llvmcall::Core.IntrinsicFunction
│    %106 = (%105)("%res = add nsw i64 %0, %1\nret i64 %res", VectorizationBase.Int, Tuple{Int64,Int64}, 24, %62)::Int64
│    %107 = Base.llvmcall::Core.IntrinsicFunction
│    %108 = (%107)("%res = mul nsw i64 %0, %1\nret i64 %res", VectorizationBase.Int, Tuple{Int64,Int64}, 8, %106)::Int64
│    %109 = Base.llvmcall::Core.IntrinsicFunction
│    %110 = (%109)(("!1 = !{!\"noaliasdomain\"}\n!2 = !{!\"noaliasscope\", !1}\n!3 = !{!2}\n!4 = !{!\"jtbaa\"}\n!5 = !{!6, !6, i64 0, i64 0}\n!6 = !{!\"jtbaa_arraybuf\", !4, i64 0}\n", "%typptr = inttoptr i64 %0 to i8*\n%offsetptr = getelementptr inbounds i8, i8* %typptr, i64 %1\n%ptr = bitcast i8* %offsetptr to <8 x double>*\n%res = load <8 x double>, <8 x double>* %ptr, align 8, !alias.scope !3, !tbaa !5\nret <8 x double> %res"), NTuple{8,VecElement{Float64}}, Tuple{Ptr{Float64},Int64}, %31, %108)::NTuple{8,VecElement{Float64}}
│    %111 = Base.llvmcall::Core.IntrinsicFunction
│    %112 = (%111)("%prod = fmul nnan ninf nsz arcp contract <8 x double> %0, %1\n%res = fadd fast <8 x double> %prod, %2\nret <8 x double> %res\n", NTuple{8,VecElement{Float64}}, Tuple{NTuple{8,VecElement{Float64}},NTuple{8,VecElement{Float64}},NTuple{8,VecElement{Float64}}}, %70, %92, %55)::NTuple{8,VecElement{Float64}}
│    %113 = Base.llvmcall::Core.IntrinsicFunction
│    %114 = (%113)("%prod = fmul nnan ninf nsz arcp contract <8 x double> %0, %1\n%res = fadd fast <8 x double> %prod, %2\nret <8 x double> %res\n", NTuple{8,VecElement{Float64}}, Tuple{NTuple{8,VecElement{Float64}},NTuple{8,VecElement{Float64}},NTuple{8,VecElement{Float64}}}, %76, %98, %49)::NTuple{8,VecElement{Float64}}
│    %115 = Base.llvmcall::Core.IntrinsicFunction
│    %116 = (%115)("%prod = fmul nnan ninf nsz arcp contract <8 x double> %0, %1\n%res = fadd fast <8 x double> %prod, %2\nret <8 x double> %res\n", NTuple{8,VecElement{Float64}}, Tuple{NTuple{8,VecElement{Float64}},NTuple{8,VecElement{Float64}},NTuple{8,VecElement{Float64}}}, %82, %104, %40)::NTuple{8,VecElement{Float64}}
│    %117 = Base.llvmcall::Core.IntrinsicFunction
│    %118 = (%117)("%prod = fmul nnan ninf nsz arcp contract <8 x double> %0, %1\n%res = fadd fast <8 x double> %prod, %2\nret <8 x double> %res\n", NTuple{8,VecElement{Float64}}, Tuple{NTuple{8,VecElement{Float64}},NTuple{8,VecElement{Float64}},NTuple{8,VecElement{Float64}}}, %88, %110, %45)::NTuple{8,VecElement{Float64}}
│    %119 = Base.llvmcall::Core.IntrinsicFunction
│    %120 = (%119)("%res = add nsw i64 %0, %1\nret i64 %res", VectorizationBase.Int, Tuple{Int64,Int64}, 32, %62)::Int64
└───        goto #6

If there’s more than one loop, the expression probably won’t be inlined, so I’d recommend Cthulhu for introspection.
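
For example, a minimal sketch of that workflow, assuming Cthulhu.jl is available (the two-loop kernel below is only an illustration, not code from this thread):

using LoopVectorization, Cthulhu

# Illustrative two-loop reduction; with nested loops the generated inner call is
# typically not inlined, so a plain @code_typed won't show its body.
function mysum2d(A)
    s = zero(eltype(A))
    @avx for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    s
end

A = rand(64, 64);
@descend mysum2d(A)   # interactively descend into the non-inlined calls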

As a compiler pass, LoopVectorization would similarly take nested loops, build its internal representation, analyze it to estimate which method of evaluating the loop will generally yield the best runtime performance (under the assumption that the loops are hot), and then emit the corresponding IR. Ideally, this should be locally opt-in via pragmas/macros/some form of decorator, as it is now.

I think we all agree that this would be better off integrated into a compiler. A better question is whether it should work with LLVM, or with Julia IR.
An incentive for the latter is that it is still my intention to have this support AD, i.e. use the LoopSet’s understanding of relationships between variables to create the reverse passes for loops.
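
To make that concrete, here is a hand-written sketch (plain Julia, not anything LoopVectorization generates today) of what the reverse pass for a dot-product loop s = Σᵢ a[i]*b[i] amounts to: given the adjoint of the output, each input’s adjoint is just that adjoint times the other input.

# Hand-written sketch of a reverse pass for s = sum(a[i] * b[i]); given the output
# adjoint s̄, accumulate the input adjoints. A LoopSet-driven AD could emit something
# like this automatically, but the code below is only an illustration.
function dot_reverse!(ā, b̄, a, b, s̄)
    @inbounds @simd for i in eachindex(a, b)
        ā[i] += s̄ * b[i]   # ∂s/∂a[i] = b[i]
        b̄[i] += s̄ * a[i]   # ∂s/∂b[i] = a[i]
    end
    return ā, b̄
end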

3 Likes

This also means that if you want the fix, it should be just a bisect away.

I’ve honestly not seen that one before, and it should certainly be reported. I have seen quite a few cases where it handles the empty case/loop entry better than what I can easily come up with…

As early as possible is always good.

It’s very easy to pick up, at least the LLVM IR part. Machine IR is probably not harder, though it is somewhat harder to dump and look at, IMHO…
And back to the point of wasted effort. Sadly, most of the time you spent debugging and working around the three issues mentioned here is wasted in the long term, and some even not that long term, since an LLVM patch could be included in the next/previous Julia release. I almost always find it much more productive to report and even fix the issue at the right level first, before developing a temporary workaround. This is more true the more issues you have to work around, which is the whole point of this package. With so many individual issues you’ve hit so far (and apparently many fixed already), it should be very easy for you to debug the issue and look at the fix in LLVM, and it should get you started on fixing those in LLVM in no time.

1 Like

Turning off the “workaround” within LoopVectorization:

L1160:
        mov     rcx, qword ptr [rsp - 56]
        lea     rbx, [rdx + rcx]
        vxorpd  xmm27, xmm27, xmm27
        vxorps  xmm25, xmm25, xmm25
        vxorpd  xmm22, xmm22, xmm22
        vxorpd  xmm19, xmm19, xmm19
        vxorpd  xmm16, xmm16, xmm16
        vxorpd  xmm13, xmm13, xmm13
        vxorpd  xmm10, xmm10, xmm10
        vxorpd  xmm7, xmm7, xmm7
        vxorpd  xmm4, xmm4, xmm4
        vxorps  xmm28, xmm28, xmm28
        vxorpd  xmm24, xmm24, xmm24
        vxorpd  xmm21, xmm21, xmm21
        vxorpd  xmm18, xmm18, xmm18
        vxorpd  xmm15, xmm15, xmm15
        vxorpd  xmm12, xmm12, xmm12
        vxorpd  xmm9, xmm9, xmm9
        vxorpd  xmm6, xmm6, xmm6
        vxorpd  xmm3, xmm3, xmm3
        vxorps  xmm26, xmm26, xmm26
        vxorpd  xmm23, xmm23, xmm23
        vxorpd  xmm20, xmm20, xmm20
        vxorpd  xmm17, xmm17, xmm17
        vxorpd  xmm14, xmm14, xmm14
        vxorpd  xmm11, xmm11, xmm11
        vxorpd  xmm8, xmm8, xmm8
        vxorpd  xmm5, xmm5, xmm5
        vxorpd  xmm2, xmm2, xmm2
        cmp     r8, rbx
        mov     rcx, qword ptr [rsp - 112]
        ja      L896
        vxorpd  xmm2, xmm2, xmm2
        mov     rsi, qword ptr [rsp + 8]
        mov     rdi, qword ptr [rsp - 40]
        mov     rbp, rdx
        vxorpd  xmm5, xmm5, xmm5
        vxorpd  xmm8, xmm8, xmm8
        vxorpd  xmm11, xmm11, xmm11
        vxorpd  xmm14, xmm14, xmm14
        vxorpd  xmm17, xmm17, xmm17
        vxorpd  xmm20, xmm20, xmm20
        vxorpd  xmm23, xmm23, xmm23
        vxorps  xmm26, xmm26, xmm26
        vxorpd  xmm3, xmm3, xmm3
        vxorpd  xmm6, xmm6, xmm6
        vxorpd  xmm9, xmm9, xmm9
        vxorpd  xmm12, xmm12, xmm12
        vxorpd  xmm15, xmm15, xmm15
        vxorpd  xmm18, xmm18, xmm18
        vxorpd  xmm21, xmm21, xmm21
        vxorpd  xmm24, xmm24, xmm24
        vxorps  xmm28, xmm28, xmm28
        vxorpd  xmm4, xmm4, xmm4
        vxorpd  xmm7, xmm7, xmm7
        vxorpd  xmm10, xmm10, xmm10
        vxorpd  xmm13, xmm13, xmm13
        vxorpd  xmm16, xmm16, xmm16
        vxorpd  xmm19, xmm19, xmm19
        vxorpd  xmm22, xmm22, xmm22
        vxorps  xmm25, xmm25, xmm25
        vxorpd  xmm27, xmm27, xmm27
        nop     dword ptr [rax + rax]
L1488: # main loop starts here

After the 27 vxor instructions, if the ja L896 branch isn’t taken, it xors them all again.
The jump is also unlikely to be taken (which I attempted to communicate with @llvm.expect.i1(condition)), meaning the typical case is double-xoring.

I also would not call all of this a workaround. LLVM itself seems averse to nested-loop optimizations, and (in my tests) Polly generally did not help.
The simple single loops primarily benefit either from (a) the “workaround” for LLVM’s slow cleanup, or (b) using SIMD special functions, but the nested-loop cases also involve loop switching and register blocking (see the sketch below).
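
To make the terms concrete, here is a plain-Julia sketch of the two ideas (several independent accumulators for a reduction, plus an explicit cleanup loop for the remainder). It is only an illustration, not the code LoopVectorization actually emits:

# Hypothetical sketch: "register blocking" of a reduction via independent accumulators
# (which is why so many registers get zeroed above), plus a scalar cleanup loop for
# the remainder. Assumes length(a) == length(b).
function dot_blocked(a::AbstractVector{Float64}, b::AbstractVector{Float64})
    s1 = s2 = s3 = s4 = 0.0          # independent accumulators -> independent registers
    n = length(a)
    i = 1
    @inbounds while i + 3 <= n       # main loop: 4 elements per iteration
        s1 += a[i]     * b[i]
        s2 += a[i + 1] * b[i + 1]
        s3 += a[i + 2] * b[i + 2]
        s4 += a[i + 3] * b[i + 3]
        i += 4
    end
    s = (s1 + s2) + (s3 + s4)
    @inbounds while i <= n           # cleanup loop: the part LLVM handles slowly
        s += a[i] * b[i]
        i += 1
    end
    return s
end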

EDIT:
I wonder what sort of benchmarks could convince LLVM that it’s worth handling cleanup loops differently?

julia> using Random, LoopVectorization, BenchmarkTools

julia> function mydotavx(a, b, N)
           s = zero(eltype(a))
           @avx for i ∈ 1:N
               s += a[i]*b[i]
           end
           s
       end
mydotavx (generic function with 2 methods)

julia> function mydot(a, b, N)
           s = zero(eltype(a))
           @inbounds @simd for i ∈ 1:N
               s += a[i]*b[i]
           end
           s
       end
mydot (generic function with 2 methods)

julia> a = rand(147); b = rand(147);

julia> Ns = shuffle!(((1:4096) .% 32) .+ 1); Ns'
1×4096 LinearAlgebra.Adjoint{Int64,Array{Int64,1}}:
 13  6  30  8  30  13  7  22  21  27  19  14  18  17  6  25  19  21  14  30  24  5  27  3  24  14  26  24  19  19  29  14  3  23  31  28  22  19  24  7  24  10  1  18  29  20  9  29  18  …  16  24  6  29  26  10  9  21  10  22  4  4  18  27  17  29  13  6  6  6  11  24  10  14  18  26  14  10  24  26  13  6  20  12  6  2  9  26  31  1  24  4  21  29  22  11  7  6  12  24  12

julia> @btime sum(n -> mydot($a, $b, n), $Ns)
  50.925 μs (0 allocations: 0 bytes)
16944.929071850365

julia> @btime sum(n -> mydotavx($a, $b, n), $Ns)
  30.677 μs (0 allocations: 0 bytes)
16944.929071850365

julia> Ns = shuffle!(((1:4096) .% 64) .+ 1); Ns'
1×4096 LinearAlgebra.Adjoint{Int64,Array{Int64,1}}:
 16  10  6  57  12  35  3  14  1  24  31  4  38  50  42  11  20  39  20  25  45  32  31  28  54  45  54  32  59  28  5  20  25  41  59  15  43  48  7  12  63  35  31  30  63  7  38  21  20  …  57  10  49  30  36  15  43  38  36  50  24  37  27  47  38  32  38  46  45  8  23  23  39  51  37  24  12  22  35  14  19  18  57  43  62  32  43  20  29  8  57  59  31  18  10  46  7  60

julia> @btime sum(n -> mydot($a, $b, n), $Ns)
  64.971 μs (0 allocations: 0 bytes)
32688.631383205167

julia> @btime sum(n -> mydotavx($a, $b, n), $Ns)
  40.321 μs (0 allocations: 0 bytes)
32688.631383205167

julia> Ns = shuffle!(((1:4096) .% 96) .+ 1); Ns'
1×4096 LinearAlgebra.Adjoint{Int64,Array{Int64,1}}:
 14  20  45  58  43  63  64  67  85  18  15  26  54  72  36  56  30  25  8  80  81  3  42  16  6  71  35  96  59  24  62  83  27  95  41  32  22  20  16  70  56  23  5  53  70  90  30  66  36  …  86  10  28  13  75  48  1  56  93  85  29  39  45  42  75  93  16  76  16  51  79  8  64  14  70  76  51  43  92  56  37  24  32  15  76  29  19  23  46  61  58  96  87  39  27  22  59

julia> @btime sum(n -> mydot($a, $b, n), $Ns)
  70.463 μs (0 allocations: 0 bytes)
47345.96836115388

julia> @btime sum(n -> mydotavx($a, $b, n), $Ns)
  48.614 μs (0 allocations: 0 bytes)
47345.96836115388

julia> Ns = shuffle!(((1:4096) .% 128) .+ 1); Ns'
1×4096 LinearAlgebra.Adjoint{Int64,Array{Int64,1}}:
 66  81  3  121  119  81  70  3  115  51  32  47  45  57  108  61  7  2  98  115  8  17  65  14  43  1  29  78  91  90  29  20  96  125  116  105  94  117  124  103  14  81  69  103  87  107  …  120  72  104  47  62  70  48  86  16  65  98  105  37  91  17  79  127  13  46  52  42  27  53  102  7  81  104  31  84  28  61  104  86  119  64  32  62  44  105  97  94  103  28  127

julia> @btime sum(n -> mydot($a, $b, n), $Ns)
  78.189 μs (0 allocations: 0 bytes)
62195.6519006811

julia> @btime sum(n -> mydotavx($a, $b, n), $Ns)
  52.912 μs (0 allocations: 0 bytes)
62195.65190068109
2 Likes