[ANN] LoopVectorization

I’m happy with that. I’d like to make it more robust.
As an extreme stop-gap, it would also be possible to define check_args to simply return false (regardless of what the arguments are) so that it uses an @inbounds @fastmath fallback loop instead.
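
To make that concrete, here is a hand-written sketch (mine, not the actual macro expansion) of the kind of guard the check_args mechanism implies: if check_args rejects the arguments, or is defined to always return false, a plain @inbounds @fastmath loop runs instead of the vectorized kernel.

using LoopVectorization

# Conceptual sketch only; the real @avx macro inserts a guard like this itself.
function mydot_guarded(a, b)
    s = zero(eltype(a))
    if LoopVectorization.check_args(a, b)   # arguments the vectorizer can handle?
        @avx for i ∈ eachindex(a, b)        # LoopVectorization-generated kernel
            s += a[i] * b[i]
        end
    else
        @inbounds @fastmath for i ∈ eachindex(a, b)   # fallback loop
            s += a[i] * b[i]
        end
    end
    s
end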

This may be the better approach for those platforms, since the cost modeling currently assumes more or less the same instruction costs regardless of architecture.
Things have been reasonably consistent since Haswell, and Ryzen seems close enough as well.

I wanted to describe how it works, but yes, in those examples 100% of the benefit comes from the fallback loops.
Dot product:

julia> function mydot(a, b)
           s = zero(eltype(a))
           @inbounds @simd for i ∈ eachindex(a,b)
               s += a[i]*b[i]
           end
           s
       end
mydot (generic function with 1 method)

julia> using LoopVectorization, BenchmarkTools

julia> function mydotavx(a, b)
           s = zero(eltype(a))
           @avx for i ∈ eachindex(a,b)
               s += a[i]*b[i]
           end
           s
       end
mydotavx (generic function with 1 method)

julia> x128 = rand(128); y128 = rand(128);

julia> @btime mydot($x128, $y128)
  7.993 ns (0 allocations: 0 bytes)
31.004598810919788

julia> @btime mydotavx($x128, $y128)
  8.737 ns (0 allocations: 0 bytes)
31.004598810919788

julia> code_native(mydot, Tuple{Vector{Float64},Vector{Float64}}, syntax=:intel, debuginfo=:none)

julia> code_native(mydotavx, Tuple{Vector{Float64},Vector{Float64}}, syntax=:intel, debuginfo=:none)

Let's look at the hot loops. First, mydot (@inbounds @simd):

L128:
        vmovupd zmm4, zmmword ptr [rcx + 8*rdi]
        vmovupd zmm5, zmmword ptr [rcx + 8*rdi + 64]
        vmovupd zmm6, zmmword ptr [rcx + 8*rdi + 128]
        vmovupd zmm7, zmmword ptr [rcx + 8*rdi + 192]
        vfmadd231pd     zmm0, zmm4, zmmword ptr [rdx + 8*rdi] # zmm0 = (zmm4 * mem) + zmm0
        vfmadd231pd     zmm1, zmm5, zmmword ptr [rdx + 8*rdi + 64] # zmm1 = (zmm5 * mem) + zmm1
        vfmadd231pd     zmm2, zmm6, zmmword ptr [rdx + 8*rdi + 128] # zmm2 = (zmm6 * mem) + zmm2
        vfmadd231pd     zmm3, zmm7, zmmword ptr [rdx + 8*rdi + 192] # zmm3 = (zmm7 * mem) + zmm3
        add     rdi, 32
        cmp     rsi, rdi
        jne     L128

mydotavx (@avx):

L112:
        vmovupd zmm4, zmmword ptr [rdx + 8*rdi]
        vmovupd zmm5, zmmword ptr [rdx + 8*rdi + 64]
        vmovupd zmm6, zmmword ptr [rdx + 8*rdi + 128]
        vmovupd zmm7, zmmword ptr [rdx + 8*rdi + 192]
        vfmadd231pd     zmm0, zmm4, zmmword ptr [rax + 8*rdi] # zmm0 = (zmm4 * mem) + zmm0
        vfmadd231pd     zmm3, zmm5, zmmword ptr [rax + 8*rdi + 64] # zmm3 = (zmm5 * mem) + zmm3
        vfmadd231pd     zmm1, zmm6, zmmword ptr [rax + 8*rdi + 128] # zmm1 = (zmm6 * mem) + zmm1
        vfmadd231pd     zmm2, zmm7, zmmword ptr [rax + 8*rdi + 192] # zmm2 = (zmm7 * mem) + zmm2
        add     rdi, 32
        cmp     rdi, rcx
        jl      L112

There are two primary differences I noticed in what happens when the hot loop runs:

  1. The former is aligned to a 32-byte boundary, and the latter to a 16-byte boundary. IIRC, my instruction decoder works in 32-byte chunks, but this is probably okay.
  2. Once the jump back to the start of the loop isn't taken, the first example immediately sums up the accumulator vectors. The LoopVectorization version instead has to determine the correct cleanup approach and branches (see the sketch after this list).
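
To make point 2 concrete, here is a hand-written sketch (mine, not compiler output) of the simpler finish the @simd version gets: combine the partial accumulators, then run a short scalar loop over whatever elements are left. The @avx version instead branches to pick among several cleanup strategies before it can finish.

# Illustrative only; `partials` stands in for the four zmm accumulators above,
# and `i` for the first index the unrolled hot loop did not process.
function finish_dot(partials, a, b, i)
    s = sum(partials)                 # horizontal reduction of the accumulators
    @inbounds for j ∈ i:length(a)     # simple scalar cleanup for the remainder
        s += a[j] * b[j]
    end
    s
end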

Does either of these points support that claim?

I don’t think so, but the former (to the extent it matters) may be evidence that @llvm.expect is not enough to communicate which loop is hot.
Point 2 is thanks to LLVM’s poor tail handling. If that improves, this advantage is likely to disappear.

We pick the same unrolling approach there.
But LLVM basically always picks either 1 or 4, which sounds more like a heuristic than a model. Not knowing the internals, though, I can’t say for sure.
LoopVectorization uses latency vs throughput of the inner loop to decide.
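
As a rough illustration of that calculation (the cycle counts below are approximate Skylake-ish figures I am assuming, not values pulled from LoopVectorization's tables): to keep the FMA units busy you want about latency ÷ reciprocal-throughput independent accumulators in flight.

# Illustrative numbers only.
fma_latency  = 4.0   # cycles before an accumulator can be reused
recip_tp_dot = 1.0   # dot product: two loads per fma limit it to ~1 fma/cycle
recip_tp_sum = 0.5   # sum: one load per add allows ~2 adds/cycle

nunroll(latency, recip_tp) = ceil(Int, latency / recip_tp)

nunroll(fma_latency, recip_tp_dot)   # 4 independent (vector) accumulators
nunroll(fma_latency, recip_tp_sum)   # 8 independent (vector) accumulators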

Other examples, like a plain sum or a self-dot, are cases where LLVM also picks 4 but LoopVectorization picks 8. It takes too long for that extra unrolling to pay off.
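
mysum and mysumavx aren't shown above; a minimal sketch of what they presumably look like (analogous to mydot/mydotavx):

function mysum(a)
    s = zero(eltype(a))
    @inbounds @simd for i ∈ eachindex(a)
        s += a[i]
    end
    s
end

function mysumavx(a)
    s = zero(eltype(a))
    @avx for i ∈ eachindex(a)
        s += a[i]
    end
    s
end

With those definitions, the benchmarks: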

julia> @btime mysum($x128)
  6.043 ns (0 allocations: 0 bytes)
61.21314858020399

julia> @btime mysumavx($x128)
  7.227 ns (0 allocations: 0 bytes)
61.213148580204

julia> x256 = rand(256); y256 = rand(256);

julia> @btime mysum($x256)
  9.175 ns (0 allocations: 0 bytes)
122.63179874691728

julia> @btime mysumavx($x256)
  9.570 ns (0 allocations: 0 bytes)
122.6317987469173

julia> x512 = rand(512); y512 = rand(512);

julia> @btime mysum($x512)
  16.802 ns (0 allocations: 0 bytes)
245.33365652096614

julia> @btime mysumavx($x512)
  14.459 ns (0 allocations: 0 bytes)
245.33365652096614

So I should probably get it to adjust this behavior, as it won’t be much longer before the loop becomes memory bound. Maybe LLVM factors memory bandwidth into its model.
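
To put a rough number on "not much longer" (my own back-of-the-envelope figures, not measurements): 512 Float64s is only 4 KiB, which still fits comfortably in a typical 32 KiB L1 data cache; a few more doublings and the sum will be limited by how fast memory can feed the units rather than by how many accumulators are in flight.

# Back-of-the-envelope only; the cache size is assumed, not queried.
n = 512
bytes = n * sizeof(Float64)   # 4096 bytes = 4 KiB
l1_bytes = 32 * 1024          # a common L1 data-cache size
bytes < l1_bytes              # true: still core-bound rather than memory-bound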

Come again?

julia> X128 = rand(128, 128); x128discontig = view(X128, 1, :);

julia> @btime mydot($x128discontig, $y128)
  99.258 ns (0 allocations: 0 bytes)
31.142610557563987

julia> @btime mydotavx($x128discontig, $y128)
  69.848 ns (0 allocations: 0 bytes)
31.14261055756398

What I think LLVM does here is emit a check for whether x128discontig is actually contiguous, vectorize if it is, and give up if it is not.
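
Written out by hand, my reading of that strategy would look roughly like this (a sketch, not LLVM's actual output):

# Runtime "loop versioning" for an unknown stride: check, then pick a version.
function mydot_versioned(a, b)
    s = zero(eltype(a))
    if stride(a, 1) == 1                          # actually contiguous at runtime?
        @inbounds @simd for i ∈ eachindex(a, b)   # vectorized version
            s += a[i] * b[i]
        end
    else
        @inbounds for i ∈ eachindex(a, b)         # give up: plain scalar loop
            s += a[i] * b[i]
        end
    end
    s
end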

julia> X2 = rand(1,128); xactuallycontig = view(X2, 1, :);

julia> @btime mydotavx($xactuallycontig, $y128)
  30.494 ns (0 allocations: 0 bytes)
32.523836125552

julia> @btime mydot($xactuallycontig, $y128)
  15.140 ns (0 allocations: 0 bytes)
32.52383612555201

I believe it’s slower than when using x128 because of this issue, but I am not sure.

LoopVectorization does not do loop versioning.

Which approach to handling an unknown stride would you prefer:

  1. Assume it is not one, and optimize that case.
  2. Emit a check, make it fast if it is 1, give up otherwise.

I think 1. is obviously a better choice.
Folks shouldn’t use views that have an Int as the first index if they don’t have to; they should vec or reshape as necessary.
I’d bet that in the large majority of calls featuring views with a dynamic leading stride, the view will in fact be discontiguous; otherwise, the caller is probably making a mistake.
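
For example (my own illustration, not from the post): a view that fixes the first index has a dynamic leading stride even when the parent array happens to make it 1, whereas vec or reshape gives an ordinary Vector whose unit stride is known statically.

X2 = rand(1, 128)
xview = view(X2, 1, :)   # dynamic leading stride (here it happens to be 1)
xvec  = vec(X2)          # contiguous Vector; stride-1 layout known at compile time
stride(xview, 1), stride(xvec, 1)   # (1, 1), but only the latter is static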

I am skeptical of that claim. It struggles with things as basic as pointer offsetting unless I generate code that lines up with what I want the asm to look like.

Okay. No edge effects, we’ll make everything a clean 128:

using BenchmarkTools, LoopVectorization
function mygemm!(C, A, B)
    @inbounds @fastmath for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
        Cmn = zero(eltype(C))
        for k ∈ 1:size(A,2)
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end

function mygemmavx!(C, A, B)
    @avx for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
        Cmn = zero(eltype(C))
        for k ∈ 1:size(A,2)
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end

M = K = N = 128;

C1 = Matrix{Float64}(undef, M, N); A = randn(M, K); B = randn(K, N);
C2 = similar(C1); C3 = similar(C1);

This yields:

julia> @benchmark mygemmavx!($C1, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     34.779 μs (0.00% GC)
  median time:      34.910 μs (0.00% GC)
  mean time:        34.965 μs (0.00% GC)
  maximum time:     100.442 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mygemm!($C2, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.548 ms (0.00% GC)
  median time:      1.552 ms (0.00% GC)
  mean time:        1.552 ms (0.00% GC)
  maximum time:     1.636 ms (0.00% GC)
  --------------
  samples:          3222
  evals/sample:     1

Over 44 times faster.

I was actually a little dishonest here. While LLVM wasn’t suffering from edge effects in this example, LoopVectorization actually was, because its modeling picked different unrolling factors (not 32).

But we can help LLVM do a little better by reordering loops ourselves:

julia> function mygemm_better_order!(C, A, B)
           fill!(C, 0)
           @inbounds @fastmath for k ∈ axes(B,1), n ∈ axes(C,2), m ∈ axes(C,1)
               C[m,n] += A[m,k] * B[k,n]
           end
       end
mygemm_better_order! (generic function with 1 method)

julia> @benchmark mygemm_better_order!($C2, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.072 ms (0.00% GC)
  median time:      1.096 ms (0.00% GC)
  mean time:        1.096 ms (0.00% GC)
  maximum time:     1.126 ms (0.00% GC)
  --------------
  samples:          4562
  evals/sample:     1

Now it’s only 30 times slower.

Now, let’s talk a little about memory layouts again.
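
As a reminder of what the ' is doing in the calls below (my own illustration): A' is a lazy Adjoint wrapper, so no data is copied and every access walks the parent array along the other dimension.

using LinearAlgebra

A = randn(3, 3)
A' isa Adjoint        # true: a wrapper, not a copy
A'[1, 2] == A[2, 1]   # true: indices are swapped on access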

julia> @benchmark mygemm_better_order!($C2, $A', $B')
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.303 ms (0.00% GC)
  median time:      1.346 ms (0.00% GC)
  mean time:        1.340 ms (0.00% GC)
  maximum time:     1.432 ms (0.00% GC)
  --------------
  samples:          3732
  evals/sample:     1

julia> @benchmark mygemmavx!($C2, $A', $B')
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     41.487 μs (0.00% GC)
  median time:      42.472 μs (0.00% GC)
  mean time:        42.487 μs (0.00% GC)
  maximum time:     103.311 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

LoopVectorization suffered less than a 20% performance degradation from these transposes (34.8 μs → 41.5 μs minimum), vs the roughly 22% degradation of the LLVM version (1.072 ms → 1.303 ms).
Given that LoopVectorization had a lot more to lose here from a vectorization failure, yet lost less in % terms anyway (stating the obvious: it still vectorized the loops)…
