[ANN] LoopVectorization

Very interesting! Thank you!

I have been using Cartesian (to write generic tensor operations) on floats and other types (automatic differentiation). Would you say that LoopVectorization aims at replacing Cartesian? Any comments on how both packages compare?

Here’s a little demo of using StructArrays with LoopVectorization for those interested. This is a demo gemm kernel that does slightly better than BLAS (LinearAlgebra.mul!) for Float64 inputs:

using LoopVectorization, LinearAlgebra, StructArrays, BenchmarkTools, Test
BLAS.set_num_threads(1); @show BLAS.vendor()

const MatrixFInt64 = Union{Matrix{Float64}, Matrix{Int}}

function mul_avx!(C::MatrixFInt64, A::MatrixFInt64, B::MatrixFInt64)
    z = zero(eltype(C))
    @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        Cᵢⱼ = z
        for k ∈ 1:size(A,2)
            Cᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] = Cᵢⱼ
    end
end

M, K, N = 50, 51, 52
A = randn(M, K); B = randn(K, N)
C1 = Matrix{Float64}(undef, M, N);
C2 = similar(C1);

@btime mul_avx!($C1, $A, $B)
@btime mul!(    $C2, $A, $B)
@test C1 ≈ C2

which gives the output

BLAS.vendor() = :openblas64
  5.594 μs (0 allocations: 0 bytes)
  8.741 μs (0 allocations: 0 bytes)
Test Passed

Now, we use StructArrays to extend this to complex matrices:

function mul_add_avx!(C::MatrixFInt64, A::MatrixFInt64, B::MatrixFInt64, factor=1)
    z = zero(eltype(C))
    @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        ΔCᵢⱼ = z
        for k ∈ 1:size(A,2)
            ΔCᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] += factor * ΔCᵢⱼ
    end
end

const StructMatrixComplexFInt64 = Union{StructArray{ComplexF64,2}, StructArray{Complex{Int},2}}

function mul_avx!(C::StructMatrixComplexFInt64, A::StructMatrixComplexFInt64, B::StructMatrixComplexFInt64)
    mul_avx!(  C.re, A.re, B.re)       # C.re = A.re * B.re
    mul_add_avx!(C.re, A.im, B.im, -1) # C.re = C.re - A.im * B.im
    mul_avx!(  C.im, A.re, B.im)       # C.im = A.re * B.im
    mul_add_avx!(C.im, A.im, B.re)     # C.im = C.im + A.im * B.re
end

A  = StructArray(randn(ComplexF64, M, K)); 
B  = StructArray(randn(ComplexF64, K, N));
C1 = StructArray(Matrix{ComplexF64}(undef, M, N)); 
C2 = collect(similar(C1));

@btime mul_avx!($C1, $A, $B)
@btime mul!(    $C2, $(collect(A)), $(collect(B))) # collect turns the StructArray into a regular Array
@test  collect(C1) ≈ C2

giving

  30.696 μs (0 allocations: 0 bytes)
  27.582 μs (0 allocations: 0 bytes)
Test Passed

so we go from beating OpenBLAS to doing slightly worse when we switch to complex matrices.

It seems that users won’t necessarily be able to use this approach for any given problem, but it works well at least for things like matrix multiply / tensor contractions. On the library side, I think it’s possible that StructArrays could be integrated into the @avx macro in such a way that all this works normally on complex and dual numbers, but that’s more speculative.


Edit: With the new v0.3.2 update, the codegen for the mul_add_avx! function was improved, so the timing for the complex struct matrix went from 30.696 µs to 22.988 µs, now beating OpenBLAS.

9 Likes

Have you tried the 3m method? That should be faster for these sizes.
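For reference, a 3m-style complex multiply reuses the real kernels from the post above, trading one of the four real multiplications for a few extra additions. A minimal sketch, assuming StructArray inputs as above and caller-provided scratch matrices; the name mul3m_avx! and the scratch arguments T1, T2, T3 are just for illustration:

function mul3m_avx!(C, A, B, T1, T2, T3)
    @. T1 = A.re + A.im
    @. T2 = B.re + B.im
    mul_avx!(T3,   T1,   T2)     # P3 = (A.re + A.im) * (B.re + B.im)
    mul_avx!(C.re, A.re, B.re)   # P1 = A.re * B.re
    mul_avx!(C.im, A.im, B.im)   # P2 = A.im * B.im (stored temporarily in C.im)
    @. T3   = T3 - C.re - C.im   # P3 - P1 - P2 = A.re*B.im + A.im*B.re
    @. C.re = C.re - C.im        # real part: P1 - P2
    @. C.im = T3                 # imaginary part
    return C
end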

Curious effect when using @avx: This code for the matrix-matrix multiplication has a slight edge without @avx

function mulCAB1!(C, A, B)
	M, N = size(C); K = size(B,1)
	C .= 0
	for n in 1:N, k in 1:K 
		Bkn = B[k,n]
		for m in 1:M
			C[m,n] += A[m,k] * Bkn
		end
	end
	return C
end

over this code

function mulCAB2!(C, A, B)
	M, N = size(C); K = size(B,1)
	z = zero(eltype(C))
	for m ∈ 1:M, n ∈ 1:N 
		Cmn = z
	    for k ∈ 1:K
	        Cmn += A[m,k] * B[k,n]
	    end
	    C[m,n] = Cmn
	end
	return C
end

which is faster when the @avx macro is used (in front of the outermost loop; obviously either both functions have it or neither one does).
The first version is usually recommended as the fastest in CS books.

Of course, without the @avx macro both functions are an order of magnitude slower than when instrumented with the macro. I tested this for matrices of size 64, 32, 16, and 8. BLAS was faster only for size 64.

1 Like

I’ll try to start working on proper documentation, which will hopefully make @avx and its performance more transparent.
It’s worth noting here however that LoopVectorization uses a LoopSet struct as its internal representation of a set of loops.
This struct contains little more than a(n unordered!) collection of the loops, as well as a collection of operations performed in the loop. These operations contain their dependencies on other operations and on loops.

@avx on loops builds a LoopSet, which it then converts into an expression.
@avx on broadcasts swaps materialize(!) for vmaterialize(!), which is a generated function that creates a LoopSet, and then converts it into an expression.
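For example, something like the following goes through that broadcast path (a minimal sketch; the arrays a, b, c are just placeholders):

using LoopVectorization
a = rand(100); b = rand(100); c = similar(a);
@avx @. c = a * b + a   # materialize! is swapped for vmaterialize!, which builds a LoopSet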

The LoopSet does not contain information on the original order of the loops, or where the operations were written (order still matters insofar as it defines dependencies between operations).

However, the orders the loops can eventually be placed in can be restricted. Writing:

    for m ∈ 1:M, n ∈ 1:N 
        Cmn = z
        for k ∈ 1:K
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end

Here, because Cmn is redefined as a constant for every m and n, it is taken as a hint to depend on the m and n loops.
Right now, that forces the m and n loops to be exterior to where Cmn is ultimately placed.
Additionally, Cmn is then summed over k, so the k loop depends on Cmn and must be interior with respect to Cmn's definition.

Thus it only considers two possible orders (exterior to interior): m, n, k and n, m, k.
One of these is hopefully pretty good.

If instead we wrote Cmn = C[m,n], it would not infer the restriction that the m and n loops must be external to that operation. It would instead assume that, if it reorders, it will keep rereading the correct value from memory.
Similarly, if we want to reorder the loops ourselves, we have to replace the Cmn = zero(eltype(C)) with a C .= 0 external to all the loops, and switch to using Cmn = C[m,n] instead.

That is a transformation we are capable of, but that I haven’t implemented in LoopVectorization yet.
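Written out by hand, that transformed version would look roughly like this (a minimal sketch; the function name mulCAB_unrestricted! is just for illustration):

function mulCAB_unrestricted!(C, A, B)
    M, N = size(C); K = size(B, 1)
    C .= 0                    # zero C up front instead of initializing an accumulator
    @avx for m ∈ 1:M, n ∈ 1:N
        Cmn = C[m,n]          # load, rather than redefine, so no ordering restriction is inferred
        for k ∈ 1:K
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
    return C
end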

All that said, given the unrestricted version of the loop, what does @avx want to do with it?

julia> using LoopVectorization: LoopSet, choose_order
[ Info: Precompiling LoopVectorization [bdcacae8-1622-11e9-2a5c-532679323890]

julia> nkm_q = :(for n in 1:N, k in 1:K 
                       Bkn = B[k,n]
                       for m in 1:M
                               C[m,n] += A[m,k] * Bkn
                       end
               end);

julia> nkm_ls = LoopSet(nkm_q);

julia> choose_order(nkm_ls)
([:n, :m, :k], :m, 4, 4)

This happens to be one of the two possible orders of the more restricted version.
Note that the integers will change depending on your CPU architecture; the symbols probably won't.

What this means is that the loop order it chose is (exterior to interior): n, m, k.
The second :m in the returned tuple means that SIMD vectors are loaded from the m loop. That is, adjacent elements along the m axis will be loaded into adjacent SIMD lanes.
If a number doesn't depend on the m loop (like B[k,n]), it will instead be broadcast across a vector, so each element is equal.
I call this the vectorized loop. The vectorized loop is unrolled by the vector width.

The reason that the nkm loop is faster with base Julia/@simd is that this order allows the m loop to be vectorized, while the mnk order does not.

The @avx macro can load SIMD vectors along the m axis regardless of the loop order it actually chooses.

The two integers, in this case (U, T) = (4, 4), refer to how the loops are unrolled. If T == -1, that means the outermost loop is unrolled by U (or U * vector width, if it is also the vectorized loop).
Otherwise, the outermost loop is unrolled by T, and the second outermost loop is unrolled by U (or U * vector width).
What this unrolling means in this example is that it will:

  1. Create U*T vectors of Cmn.
  2. It will start looping over k. On each iteration, it will…
  3. Load U vectors from A.
  4. for t = 1,…,T, it will load an element from B, and update the U vectors from Cmn that correspond to that t (using the U vectors from A).
  5. Finish looping, and store the Cmn vectors into C.

So on each iteration of k, you perform a total of U+T loads (U vector loads from A, T broadcasts to fill a register with a single value from B). However, you perform a total of U * T multiplications and additions, without any of the multiplications-and-additions depending on any others, so that they can all take advantage of out of order execution.
That is, it should be able to keep the CPU’s fma (fused multiply-add) units nearly 100% busy.

This requires a total of U*T registers for Cmn, U registers for A, and 1 register for B, or U*T + U + 1. For U = T = 4, that is 21 registers.
Computers with only AVX2 have 16 floating point registers per core, so choose_order should instead return U = 3; T = 4, yielding exactly 16.
It has to require no more registers than the CPU has; otherwise it would have to do a bunch of extra loads and stores on each k iteration, cratering the compute-to-load ratio.
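To make the pattern concrete, here is a hand-written scalar illustration of the register tiling described above for U = T = 2. This is not LoopVectorization's actual generated code; in the real kernel the A loads and the accumulators are SIMD vectors along m, and the B loads are broadcasts:

function tile_2x2!(C, A, B, m, n)
    K = size(A, 2)
    # U*T = 4 accumulators that stay in registers
    C11 = zero(eltype(C)); C12 = zero(eltype(C))
    C21 = zero(eltype(C)); C22 = zero(eltype(C))
    @inbounds for k ∈ 1:K
        A1 = A[m, k];  A2 = A[m+1, k]       # U = 2 loads from A
        B1 = B[k, n];  B2 = B[k, n+1]       # T = 2 loads (broadcasts) from B
        C11 = muladd(A1, B1, C11); C12 = muladd(A1, B2, C12)   # U*T = 4 fmas per k
        C21 = muladd(A2, B1, C21); C22 = muladd(A2, B2, C22)
    end
    @inbounds begin
        C[m,   n] = C11; C[m,   n+1] = C12
        C[m+1, n] = C21; C[m+1, n+1] = C22
    end
    return nothing
end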

So, @PetrKryslUCSD, @avx does almost the same thing with either loop, because it doesn’t pay too much attention to the actual order you wrote them in. The reason mulCAB1! is slower is that C .= 0 costs extra, and then, within the loops, loading values from C[m,n] is also slightly slower than simply initializing the accumulator to zero.

In both cases, it is able to vectorize the m loop, while @simd is only capable of doing that if the m loop is the most interior loop.

Wow, that’s fantastic!
Performance-wise, that approach is much simpler than trying to get a macro (which doesn’t see types!) to do something fast on its own.

The ratio of performance of your complex to real example was about 5.5.
I just released a new version of LoopVectorization (v"0.3.2") that improved the code generation for your mul_add_avx!, which now gives me a factor of about 4.2.
This is still behind the ideal 4, but much closer!
Mind rerunning your benchmarks?

The problem was that it did not realize it could place the load from C[i,j] in C[i,j] += factor * ΔCᵢⱼ after the loop, because I had naively been placing all loads prior to any interior loops.
This hurt because it then counted that as an additional occupied register (changing the values of U and T).
Adding a check to see if it is allowed to delay operations until after the loop, and then doing so if allowed, largely solved the problem.

OpenBLAS and MKL seem to have ratios of <4. IIRC, they default to 4m, so I’m guessing that’s because of BLAS overhead being amortized.

I wouldn’t say it aims to replace Cartesian, which aims to make it easier to write generic loops.
LoopVectorization aims to make loops you have written faster.

EDIT: I updated the opening post with more benchmarks.

12 Likes

Wow, thanks for the quick update! I installed the update and re-ran my benchmarks, finding negligible change in the Matrix{Float64} benchmarks (5.594 µs -> 5.603 µs), but the StructArray{ComplexF64, 2} benchmark went from 30.696 µs to 22.988 μs, so I’m now observing about a 4.1x cost for the complex matrix gemm relative to the real one.


Another thing that’s worth pointing out about the above gemm kernels is that they absolutely smoke LinearAlgebra.mul! for matrices of integers. The reason is that OpenBLAS doesn’t do integer matrix multiplication, so mul! will fall back on a generic kernel.

Using the same setup code from my post above, I find

A = rand(-100:100, M, K); B = rand(-100:100, K, N)
C1 = Matrix{eltype(A)}(undef, M, N);
C2 = similar(C1);

@btime mul_avx!($C1, $A, $B) 
@btime mul!(    $C2, $A, $B) 

gives

 28.842 μs (0 allocations: 0 bytes)
 65.560 μs (6 allocations: 336 bytes)

and for complex matrices

A  = StructArray(rand(-100:100, M, K) + im * rand(-100:100, M, K)); 
B  = StructArray(rand(-100:100, K, N) + im * rand(-100:100, K, N));
C1 = StructArray(Matrix{eltype(A)}(undef, M, N)); 
C2 = collect(similar(C1));

@btime mul_avx!($C1, $A, $B)                     
@btime mul!(    $C2, $(collect(A)), $(collect(B)))

gives

 116.938 μs (0 allocations: 0 bytes)
 232.860 μs (6 allocations: 336 bytes)

I’m sure that with an algorithmic improvement, the advantage for Complex{Int} would be even greater.

10 Likes

I tried - naively it turns out - to vectorise a piece of code I’ve long wanted to optimise. A MWE is the following:

using BenchmarkTools, LoopVectorization

maxdeg = 20
nbasis = 10_000
dim = 10
basis = rand(1:(maxdeg+1), (dim, nbasis))
coeffs = rand(nbasis)
P = rand(dim, maxdeg+1)

function mvp(P, basis, coeffs::Vector{T}) where {T}
   len_c = length(coeffs)
   len_P = size(P, 1)
   p = zero(T)
   for n = 1:len_c
      pn = coeffs[n]
      for a = 1:len_P
         pn *= P[a, basis[a, n]]
      end
      p += pn
   end
   return p
end

@btime mvp($P, $basis, $coeffs)

Using the @avx macro on either the inner or outer loop gives me the following error:

MethodError: Cannot `convert` an object of type Expr to an object of type Union{Int64, Symbol}

It has something to do with indexing via basis[a, n]. Can anybody explain to me (1) why my naive approach won’t work and (2) whether this code could still be vectorised?

1 Like

It can be vectorized using gather instructions.
These instructions are slow, which is one reason compilers don’t often use them.
I also didn’t add support to LoopVectorization for that yet, but that is something I very much want to support. In statistics, it’s common to write code like this when you have random effects tied to categorical variables.
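For reference, a gather is the SIMD analogue of an indexed load: one instruction loads from several non-contiguous addresses at once. In scalar terms it is just this (P_vec and inds are made-up illustrative values):

P_vec = rand(64)
inds  = [3, 17, 5, 42, 9, 28, 1, 36]   # one arbitrary index per SIMD lane
v = P_vec[inds]                        # scalar semantics of an 8-wide gather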

So I will work on supporting this with LoopVectorization shortly. I’ll update here (or on a related GitHub issue) once it’s supported.

LoopVectorization will use LLVM’s gather intrinsics. Note that only AVX2 and AVX512 provide hardware support/corresponding assembly instructions for gather (as far as x86 goes; GPUs support it).
The code will still work on computers without hardware support, but will almost certainly be slow, meaning you’d want at least AVX2.

Even with it, gather isn’t fast. But it would be well worth testing for each particular loop, so a convenient way to turn it off and on (by adding and removing @avx) would be great.

5 Likes

Thanks for the explanation. I look forward to trying this.

Your code works now (and is tested as of LoopVectorization v0.3.6), but it is slower on my computer (with avx512) than just adding @inbounds:

julia> using BenchmarkTools, LoopVectorization

julia> maxdeg = 20; dim = 10; nbasis = 1_000;

julia> P = rand(dim, maxdeg + 1);

julia> function mvp(P, basis, coeffs::Vector{T}) where {T}
          len_c = length(coeffs)
          len_P = size(P, 1)
          p = zero(T)
          for n = 1:len_c
             pn = coeffs[n]
             for a = 1:len_P
                pn *= P[a, basis[a, n]]
             end
             p += pn
          end
          return p
       end
mvp (generic function with 1 method)

julia> function mvpinbounds(P, basis, coeffs::Vector{T}) where {T}
          len_c = length(coeffs)
          len_P = size(P, 1)
          p = zero(T)
          @inbounds for n = 1:len_c
             pn = coeffs[n]
             for a = 1:len_P
                pn *= P[a, basis[a, n]]
             end
             p += pn
          end
          return p
       end
mvpinbounds (generic function with 1 method)

julia> function mvpavx(P, basis, coeffs::Vector{T}) where {T}
          len_c = length(coeffs)
          len_P = size(P, 1)
          p = zero(T)
          @avx for n = 1:len_c
             pn = coeffs[n]
             for a = 1:len_P
                pn *= P[a, basis[a, n]]
             end
             p += pn
          end
          return p
       end
mvpavx (generic function with 1 method)

julia> basis = rand(1:(maxdeg+1), (dim, nbasis));

julia> coeffs = rand(nbasis);

julia> @btime mvp($P, $basis, $coeffs)
  9.464 μs (0 allocations: 0 bytes)
0.8416355183275668

julia> @btime mvpinbounds($P, $basis, $coeffs)
  5.322 μs (0 allocations: 0 bytes)
0.8416355183275668

julia> @btime mvpavx($P, $basis, $coeffs)
  7.692 μs (0 allocations: 0 bytes)
0.8416355183275668

julia> nbasis = 10_000;

julia> basis = rand(1:(maxdeg+1), (dim, nbasis));

julia> coeffs = rand(nbasis);

julia> @btime mvp($P, $basis, $coeffs)
  97.394 μs (0 allocations: 0 bytes)
9.049892658664312

julia> @btime mvpinbounds($P, $basis, $coeffs)
  55.024 μs (0 allocations: 0 bytes)
9.049892658664312

julia> @btime mvpavx($P, $basis, $coeffs)
  71.441 μs (0 allocations: 0 bytes)
9.049892658664312

You can also confirm with @code_native or @code_llvm that the @inbounds version is not vectorized, while the @avx version is (but is slower anyway):

julia> @code_llvm debuginfo=:none mvpinbounds(P, basis, coeffs)

define double @julia_mvpinbounds_18548(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40), %jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40), %jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40)) {
top:
  %3 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*
  %4 = bitcast %jl_value_t addrspace(11)* %3 to %jl_array_t addrspace(11)*
  %5 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %4, i64 0, i32 1
  %6 = load i64, i64 addrspace(11)* %5, align 8
  %7 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
  %8 = bitcast %jl_value_t addrspace(11)* %7 to %jl_value_t addrspace(10)* addrspace(11)*
  %9 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %8, i64 3
  %10 = bitcast %jl_value_t addrspace(10)* addrspace(11)* %9 to i64 addrspace(11)*
  %11 = load i64, i64 addrspace(11)* %10, align 8
  %12 = icmp sgt i64 %6, 0
  %13 = select i1 %12, i64 %6, i64 0
  br i1 %12, label %L14.preheader, label %L59

L14.preheader:                                    ; preds = %top
  %14 = bitcast %jl_value_t addrspace(11)* %3 to double addrspace(13)* addrspace(11)*
  %15 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %14, align 8
  %16 = icmp sgt i64 %11, 0
  %17 = select i1 %16, i64 %11, i64 0
  %18 = addrspacecast %jl_value_t addrspace(10)* %1 to %jl_value_t addrspace(11)*
  %19 = bitcast %jl_value_t addrspace(11)* %18 to %jl_value_t addrspace(10)* addrspace(11)*
  %20 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %19, i64 3
  %21 = bitcast %jl_value_t addrspace(10)* addrspace(11)* %20 to i64 addrspace(11)*
  %22 = load i64, i64 addrspace(11)* %21, align 8
  %23 = bitcast %jl_value_t addrspace(11)* %18 to i64 addrspace(13)* addrspace(11)*
  %24 = load i64 addrspace(13)*, i64 addrspace(13)* addrspace(11)* %23, align 8
  %25 = bitcast %jl_value_t addrspace(11)* %7 to double addrspace(13)* addrspace(11)*
  %26 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %25, align 8
  br i1 %16, label %L29.preheader.us, label %L46

L29.preheader.us:                                 ; preds = %L14.preheader, %L46.us
  %value_phi3.us = phi double [ %43, %L46.us ], [ 0.000000e+00, %L14.preheader ]
  %value_phi4.us = phi i64 [ %45, %L46.us ], [ 1, %L14.preheader ]
  %27 = add nsw i64 %value_phi4.us, -1
  %28 = getelementptr inbounds double, double addrspace(13)* %15, i64 %27
  %29 = load double, double addrspace(13)* %28, align 8
  %30 = mul i64 %22, %27
  br label %L29.us

L29.us:                                           ; preds = %L29.us, %L29.preheader.us
  %value_phi9.us = phi double [ %40, %L29.us ], [ %29, %L29.preheader.us ]
  %value_phi10.us = phi i64 [ %42, %L29.us ], [ 1, %L29.preheader.us ]
  %31 = add nsw i64 %value_phi10.us, -1
  %32 = add i64 %31, %30
  %33 = getelementptr inbounds i64, i64 addrspace(13)* %24, i64 %32
  %34 = load i64, i64 addrspace(13)* %33, align 8
  %35 = add i64 %34, -1
  %36 = mul i64 %35, %11
  %37 = add i64 %31, %36
  %38 = getelementptr inbounds double, double addrspace(13)* %26, i64 %37
  %39 = load double, double addrspace(13)* %38, align 8
  %40 = fmul double %value_phi9.us, %39
  %41 = icmp eq i64 %value_phi10.us, %17
  %42 = add nuw i64 %value_phi10.us, 1
  br i1 %41, label %L46.us, label %L29.us

L46.us:                                           ; preds = %L29.us
  %43 = fadd double %value_phi3.us, %40
  %44 = icmp eq i64 %value_phi4.us, %13
  %45 = add nuw i64 %value_phi4.us, 1
  br i1 %44, label %L59, label %L29.preheader.us

L46:                                              ; preds = %L14.preheader, %L46
  %value_phi3 = phi double [ %49, %L46 ], [ 0.000000e+00, %L14.preheader ]
  %value_phi4 = phi i64 [ %51, %L46 ], [ 1, %L14.preheader ]
  %46 = add nsw i64 %value_phi4, -1
  %47 = getelementptr inbounds double, double addrspace(13)* %15, i64 %46
  %48 = load double, double addrspace(13)* %47, align 8
  %49 = fadd double %value_phi3, %48
  %50 = icmp eq i64 %value_phi4, %13
  %51 = add nuw i64 %value_phi4, 1
  br i1 %50, label %L59, label %L46

L59:                                              ; preds = %L46, %L46.us, %top
  %value_phi19 = phi double [ 0.000000e+00, %top ], [ %43, %L46.us ], [ %49, %L46 ]
  ret double %value_phi19
}

vs

;; julia> @code_llvm debuginfo=:none mvpavx(P, basis, coeffs)

define double @julia_mvpavx_18517(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40), %jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40), %jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40)) {
top:
  %3 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*
  %4 = bitcast %jl_value_t addrspace(11)* %3 to %jl_array_t addrspace(11)*
  %5 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %4, i64 0, i32 1
  %6 = load i64, i64 addrspace(11)* %5, align 8
  %7 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
  %8 = bitcast %jl_value_t addrspace(11)* %7 to %jl_value_t addrspace(10)* addrspace(11)*
  %9 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %8, i64 3
  %10 = bitcast %jl_value_t addrspace(10)* addrspace(11)* %9 to i64 addrspace(11)*
  %11 = load i64, i64 addrspace(11)* %10, align 8
  %12 = addrspacecast %jl_value_t addrspace(11)* %3 to %jl_value_t*
  %13 = bitcast %jl_value_t* %12 to i64*
  %14 = load i64, i64* %13, align 8
  %15 = addrspacecast %jl_value_t addrspace(10)* %1 to %jl_value_t addrspace(11)*
  %16 = addrspacecast %jl_value_t addrspace(11)* %15 to %jl_value_t*
  %17 = bitcast %jl_value_t* %16 to i64*
  %18 = load i64, i64* %17, align 8
  %19 = bitcast %jl_value_t addrspace(11)* %15 to %jl_value_t addrspace(10)* addrspace(11)*
  %20 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %19, i64 3
  %21 = bitcast %jl_value_t addrspace(10)* addrspace(11)* %20 to i64 addrspace(11)*
  %22 = load i64, i64 addrspace(11)* %21, align 8
  %23 = addrspacecast %jl_value_t addrspace(11)* %7 to %jl_value_t*
  %24 = bitcast %jl_value_t* %23 to i64*
  %25 = load i64, i64* %24, align 8
  %26 = trunc i64 %11 to i8
  %27 = and i8 %26, 7
  %notmask = shl nsw i8 -1, %27
  %28 = xor i8 %notmask, -1
  %29 = icmp sgt i64 %6, 0
  br i1 %29, label %L28.lr.ph, label %L179

L28.lr.ph:                                        ; preds = %top
  %30 = add i64 %11, -7
  %31 = icmp sgt i64 %30, 0
  %ptr.i = inttoptr i64 %18 to i64*
  %ie.i241 = insertelement <8 x i64> undef, i64 %11, i32 0
  %v.i242 = shufflevector <8 x i64> %ie.i241, <8 x i64> undef, <8 x i32> zeroinitializer
  %ptr.i233 = inttoptr i64 %25 to double*
  %ptr.i194282 = inttoptr i64 %14 to double*
  %mask.i222 = bitcast i8 %28 to <8 x i1>
  br i1 %31, label %L44.lr.ph.us, label %L93

L44.lr.ph.us:                                     ; preds = %L28.lr.ph, %L175.us
  %value_phi1361.us = phi double [ %40, %L175.us ], [ 0.000000e+00, %L28.lr.ph ]
  %value_phi355.us = phi i64 [ %41, %L175.us ], [ 0, %L28.lr.ph ]
  %.sroa.0117.0354.us = phi <8 x double> [ %.sroa.0113.0.us, %L175.us ], [ undef, %L28.lr.ph ]
  %.sroa.0117.0.vec.insert.us = insertelement <8 x double> %.sroa.0117.0354.us, double 1.000000e+00, i32 0
  %32 = mul i64 %value_phi355.us, %22
  br label %L44.us

L44.us:                                           ; preds = %L86.us, %L44.lr.ph.us
  %tindex_phi348.us = phi i8 [ 1, %L44.lr.ph.us ], [ 2, %L86.us ]
  %value_phi9347.us = phi <8 x double> [ undef, %L44.lr.ph.us ], [ %.sroa.0113.0.us, %L86.us ]
  %value_phi2346.us = phi i64 [ 0, %L44.lr.ph.us ], [ %34, %L86.us ]
  %.sroa.0117.1345.us = phi <8 x double> [ %.sroa.0117.0.vec.insert.us, %L44.lr.ph.us ], [ %.sroa.0113.0.us, %L86.us ]
  %33 = add i64 %value_phi2346.us, %32
  %offsetptr.i.us = getelementptr inbounds i64, i64* %ptr.i, i64 %33
  %ptr.i244.us = bitcast i64* %offsetptr.i.us to <8 x i64>*
  %res.i245.us = load <8 x i64>, <8 x i64>* %ptr.i244.us, align 8
  %res.i243.us = add nsw <8 x i64> %res.i245.us, <i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1>
  %res.i240.us = mul nsw <8 x i64> %res.i243.us, %v.i242
  %ie.i238.us = insertelement <8 x i64> undef, i64 %value_phi2346.us, i32 0
  %v.i239.us = shufflevector <8 x i64> %ie.i238.us, <8 x i64> undef, <8 x i32> zeroinitializer
  %res.i237.us = add nsw <8 x i64> %v.i239.us, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
  %res.i236.us = add nsw <8 x i64> %res.i237.us, %res.i240.us
  %offsetptr.i234.us = getelementptr inbounds double, double* %ptr.i233, <8 x i64> %res.i236.us
  %res.i232.us = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %offsetptr.i234.us, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x double> undef)
  switch i8 %tindex_phi348.us, label %L84 [
    i8 2, label %L86.us
    i8 1, label %L77.us
  ]

L77.us:                                           ; preds = %L44.us
  %v.i229.us = shufflevector <8 x double> %.sroa.0117.1345.us, <8 x double> undef, <8 x i32> zeroinitializer
  br label %L86.us

L86.us:                                           ; preds = %L77.us, %L44.us
  %v.i229.pn.us = phi <8 x double> [ %v.i229.us, %L77.us ], [ %value_phi9347.us, %L44.us ]
  %.sroa.0113.0.us = fmul reassoc nnan ninf nsz arcp contract <8 x double> %res.i232.us, %v.i229.pn.us
  %34 = add i64 %value_phi2346.us, 8
  %35 = icmp slt i64 %34, %30
  br i1 %35, label %L44.us, label %L93.us

L93.us:                                           ; preds = %L86.us
  %36 = icmp slt i64 %34, %11
  br i1 %36, label %L139.us, label %L175.us

L139.us:                                          ; preds = %L93.us
  %ie.i215.us = insertelement <8 x i64> undef, i64 %34, i32 0
  %v.i216.us = shufflevector <8 x i64> %ie.i215.us, <8 x i64> undef, <8 x i32> zeroinitializer
  %res.i214.us = add nsw <8 x i64> %v.i216.us, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
  %37 = mul i64 %value_phi355.us, %22
  %38 = add i64 %34, %37
  %offsetptr.i225.us = getelementptr inbounds i64, i64* %ptr.i, i64 %38
  %ptr.i221.us = bitcast i64* %offsetptr.i225.us to <8 x i64>*
  %res.i223.us = call <8 x i64> @llvm.masked.load.v8i64.p0v8i64(<8 x i64>* nonnull %ptr.i221.us, i32 8, <8 x i1> %mask.i222, <8 x i64> zeroinitializer)
  %res.i220.us = add nsw <8 x i64> %res.i223.us, <i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1>
  %res.i217.us = mul nsw <8 x i64> %res.i220.us, %v.i242
  %res.i213.us = add nsw <8 x i64> %res.i214.us, %res.i217.us
  %offsetptr.i211.us = getelementptr inbounds double, double* %ptr.i233, <8 x i64> %res.i213.us
  %res.i209.us = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %offsetptr.i211.us, i32 8, <8 x i1> %mask.i222, <8 x double> zeroinitializer)
  %value_phi17266.us = fmul reassoc nnan ninf nsz arcp contract <8 x double> %.sroa.0113.0.us, %res.i209.us
  %res.i202.us = select <8 x i1> %mask.i222, <8 x double> %value_phi17266.us, <8 x double> %.sroa.0113.0.us
  br label %L175.us

L175.us:                                          ; preds = %L93.us, %L139.us
  %value_phi20.ph.us = phi <8 x double> [ %res.i202.us, %L139.us ], [ %.sroa.0113.0.us, %L93.us ]
  %offsetptr.i195272.us = getelementptr inbounds double, double* %ptr.i194282, i64 %value_phi355.us
  %res.i193273.us = load double, double* %offsetptr.i195272.us, align 1
  %vec_4_1.i182.us = shufflevector <8 x double> %value_phi20.ph.us, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %vec_4_2.i183.us = shufflevector <8 x double> %value_phi20.ph.us, <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %vec_4.i184.us = fmul <4 x double> %vec_4_1.i182.us, %vec_4_2.i183.us
  %vec_2_1.i185.us = shufflevector <4 x double> %vec_4.i184.us, <4 x double> undef, <2 x i32> <i32 0, i32 1>
  %vec_2_2.i186.us = shufflevector <4 x double> %vec_4.i184.us, <4 x double> undef, <2 x i32> <i32 2, i32 3>
  %vec_2.i187.us = fmul <2 x double> %vec_2_1.i185.us, %vec_2_2.i186.us
  %vec_1_1.i188.us = shufflevector <2 x double> %vec_2.i187.us, <2 x double> undef, <1 x i32> zeroinitializer
  %vec_1_2.i189.us = shufflevector <2 x double> %vec_2.i187.us, <2 x double> undef, <1 x i32> <i32 1>
  %vec_1.i190.us = fmul <1 x double> %vec_1_1.i188.us, %vec_1_2.i189.us
  %res.i191.us = extractelement <1 x double> %vec_1.i190.us, i32 0
  %39 = fmul fast double %res.i191.us, %res.i193273.us
  %40 = fadd double %value_phi1361.us, %39
  %41 = add nuw nsw i64 %value_phi355.us, 1
  %exitcond369 = icmp eq i64 %41, %6
  br i1 %exitcond369, label %L331, label %L44.lr.ph.us

L84:                                              ; preds = %L44.us
  call void @jl_throw(%jl_value_t addrspace(12)* addrspacecast (%jl_value_t* inttoptr (i64 140677114703680 to %jl_value_t*) to %jl_value_t addrspace(12)*))
  unreachable

L93:                                              ; preds = %L28.lr.ph, %L175
  %value_phi1361 = phi double [ %45, %L175 ], [ 0.000000e+00, %L28.lr.ph ]
  %value_phi355 = phi i64 [ %46, %L175 ], [ 0, %L28.lr.ph ]
  %.sroa.0117.0354 = phi <8 x double> [ %.sroa.0117.0.vec.insert, %L175 ], [ undef, %L28.lr.ph ]
  %.sroa.097.0353 = phi <8 x double> [ %.sroa.097.1274, %L175 ], [ undef, %L28.lr.ph ]
  %.sroa.0117.0.vec.insert = insertelement <8 x double> %.sroa.0117.0354, double 1.000000e+00, i32 0
  %42 = icmp sgt i64 %11, 0
  br i1 %42, label %L164, label %L170

L164:                                             ; preds = %L93
  %43 = mul i64 %value_phi355, %22
  %offsetptr.i225 = getelementptr inbounds i64, i64* %ptr.i, i64 %43
  %ptr.i221 = bitcast i64* %offsetptr.i225 to <8 x i64>*
  %res.i223 = call <8 x i64> @llvm.masked.load.v8i64.p0v8i64(<8 x i64>* nonnull %ptr.i221, i32 8, <8 x i1> %mask.i222, <8 x i64> zeroinitializer)
  %res.i220 = add nsw <8 x i64> %res.i223, <i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1>
  %res.i217 = mul nsw <8 x i64> %res.i220, %v.i242
  %res.i213 = add nsw <8 x i64> %res.i217, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
  %offsetptr.i211 = getelementptr inbounds double, double* %ptr.i233, <8 x i64> %res.i213
  %res.i209 = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %offsetptr.i211, i32 8, <8 x i1> %mask.i222, <8 x double> zeroinitializer)
  %v.i205 = shufflevector <8 x double> %.sroa.0117.0.vec.insert, <8 x double> undef, <8 x i32> zeroinitializer
  %value_phi17 = fmul reassoc nnan ninf nsz arcp contract <8 x double> %v.i205, %res.i209
  %res.i198 = select <8 x i1> %mask.i222, <8 x double> %value_phi17, <8 x double> %v.i205
  %offsetptr.i195272 = getelementptr inbounds double, double* %ptr.i194282, i64 %value_phi355
  %res.i193273 = load double, double* %offsetptr.i195272, align 1
  %vec_4_1.i182 = shufflevector <8 x double> %res.i198, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %vec_4_2.i183 = shufflevector <8 x double> %res.i198, <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %vec_4.i184 = fmul <4 x double> %vec_4_1.i182, %vec_4_2.i183
  %vec_2_1.i185 = shufflevector <4 x double> %vec_4.i184, <4 x double> undef, <2 x i32> <i32 0, i32 1>
  %vec_2_2.i186 = shufflevector <4 x double> %vec_4.i184, <4 x double> undef, <2 x i32> <i32 2, i32 3>
  %vec_2.i187 = fmul <2 x double> %vec_2_1.i185, %vec_2_2.i186
  %vec_1_1.i188 = shufflevector <2 x double> %vec_2.i187, <2 x double> undef, <1 x i32> zeroinitializer
  %vec_1_2.i189 = shufflevector <2 x double> %vec_2.i187, <2 x double> undef, <1 x i32> <i32 1>
  %vec_1.i190 = fmul <1 x double> %vec_1_1.i188, %vec_1_2.i189
  %res.i191 = extractelement <1 x double> %vec_1.i190, i32 0
  %44 = fmul fast double %res.i191, %res.i193273
  br label %L175

L170:                                             ; preds = %L93
  %.sroa.097.0.vec.insert103 = insertelement <8 x double> %.sroa.097.0353, double 1.000000e+00, i32 0
  %offsetptr.i195283 = getelementptr inbounds double, double* %ptr.i194282, i64 %value_phi355
  %res.i193284 = load double, double* %offsetptr.i195283, align 1
  br label %L175

L175:                                             ; preds = %L170, %L164
  %.sroa.097.1274 = phi <8 x double> [ %res.i198, %L164 ], [ %.sroa.097.0.vec.insert103, %L170 ]
  %value_phi23 = phi double [ %44, %L164 ], [ %res.i193284, %L170 ]
  %45 = fadd double %value_phi1361, %value_phi23
  %46 = add nuw nsw i64 %value_phi355, 1
  %exitcond = icmp eq i64 %46, %6
  br i1 %exitcond, label %L331, label %L93

L179:                                             ; preds = %top
  %47 = icmp eq i64 %6, 0
  br i1 %47, label %L331, label %L183.preheader

L183.preheader:                                   ; preds = %L179
  %48 = add i64 %11, -7
  %49 = icmp sgt i64 %48, 0
  br i1 %49, label %L198.lr.ph, label %L247

L198.lr.ph:                                       ; preds = %L183.preheader
  %ptr.i179 = inttoptr i64 %18 to i64*
  %ie.i174 = insertelement <8 x i64> undef, i64 %11, i32 0
  %v.i175 = shufflevector <8 x i64> %ie.i174, <8 x i64> undef, <8 x i32> zeroinitializer
  %ptr.i166 = inttoptr i64 %25 to double*
  br label %L198

L198:                                             ; preds = %L198.lr.ph, %L240
  %tindex_phi41341 = phi i8 [ 1, %L198.lr.ph ], [ 2, %L240 ]
  %value_phi37340 = phi <8 x double> [ undef, %L198.lr.ph ], [ %.sroa.085.0, %L240 ]
  %value_phi30339 = phi i64 [ 0, %L198.lr.ph ], [ %50, %L240 ]
  %.sroa.089.0338 = phi <8 x double> [ <double 1.000000e+00, double undef, double undef, double undef, double undef, double undef, double undef, double undef>, %L198.lr.ph ], [ %.sroa.085.0, %L240 ]
  %offsetptr.i180 = getelementptr inbounds i64, i64* %ptr.i179, i64 %value_phi30339
  %ptr.i177 = bitcast i64* %offsetptr.i180 to <8 x i64>*
  %res.i178 = load <8 x i64>, <8 x i64>* %ptr.i177, align 8
  %res.i176 = add nsw <8 x i64> %res.i178, <i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1>
  %res.i173 = mul nsw <8 x i64> %res.i176, %v.i175
  %ie.i171 = insertelement <8 x i64> undef, i64 %value_phi30339, i32 0
  %v.i172 = shufflevector <8 x i64> %ie.i171, <8 x i64> undef, <8 x i32> zeroinitializer
  %res.i170 = add nsw <8 x i64> %v.i172, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
  %res.i169 = add nsw <8 x i64> %res.i170, %res.i173
  %offsetptr.i167 = getelementptr inbounds double, double* %ptr.i166, <8 x i64> %res.i169
  %res.i165 = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %offsetptr.i167, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x double> undef)
  switch i8 %tindex_phi41341, label %L238 [
    i8 2, label %L240
    i8 1, label %L231
  ]

L231:                                             ; preds = %L198
  %v.i162 = shufflevector <8 x double> %.sroa.089.0338, <8 x double> undef, <8 x i32> zeroinitializer
  br label %L240

L238:                                             ; preds = %L198
  call void @jl_throw(%jl_value_t addrspace(12)* addrspacecast (%jl_value_t* inttoptr (i64 140677114703680 to %jl_value_t*) to %jl_value_t addrspace(12)*))
  unreachable

L240:                                             ; preds = %L198, %L231
  %v.i162.pn = phi <8 x double> [ %v.i162, %L231 ], [ %value_phi37340, %L198 ]
  %.sroa.085.0 = fmul reassoc nnan ninf nsz arcp contract <8 x double> %res.i165, %v.i162.pn
  %50 = add i64 %value_phi30339, 8
  %51 = icmp slt i64 %50, %48
  br i1 %51, label %L198, label %L247

L247:                                             ; preds = %L240, %L183.preheader
  %.sroa.089.0.lcssa = phi <8 x double> [ <double 1.000000e+00, double undef, double undef, double undef, double undef, double undef, double undef, double undef>, %L183.preheader ], [ %.sroa.085.0, %L240 ]
  %value_phi30.lcssa = phi i64 [ 0, %L183.preheader ], [ %50, %L240 ]
  %value_phi37.lcssa = phi <8 x double> [ undef, %L183.preheader ], [ %.sroa.085.0, %L240 ]
  %tindex_phi41.lcssa = phi i8 [ 1, %L183.preheader ], [ 2, %L240 ]
  %52 = icmp slt i64 %value_phi30.lcssa, %11
  br i1 %52, label %L249, label %L247.L310_crit_edge

L247.L310_crit_edge:                              ; preds = %L247
  switch i8 %tindex_phi41.lcssa, label %L310 [
    i8 1, label %L322.thread
    i8 2, label %L310.thread
  ]

L249:                                             ; preds = %L247
  %ptr.i157 = inttoptr i64 %18 to i64*
  %offsetptr.i158 = getelementptr inbounds i64, i64* %ptr.i157, i64 %value_phi30.lcssa
  %ptr.i154 = bitcast i64* %offsetptr.i158 to <8 x i64>*
  %mask.i155 = bitcast i8 %28 to <8 x i1>
  %res.i156 = call <8 x i64> @llvm.masked.load.v8i64.p0v8i64(<8 x i64>* nonnull %ptr.i154, i32 8, <8 x i1> %mask.i155, <8 x i64> zeroinitializer)
  %res.i153 = add nsw <8 x i64> %res.i156, <i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1>
  %ie.i151 = insertelement <8 x i64> undef, i64 %11, i32 0
  %v.i152 = shufflevector <8 x i64> %ie.i151, <8 x i64> undef, <8 x i32> zeroinitializer
  %res.i150 = mul nsw <8 x i64> %res.i153, %v.i152
  %ie.i148 = insertelement <8 x i64> undef, i64 %value_phi30.lcssa, i32 0
  %v.i149 = shufflevector <8 x i64> %ie.i148, <8 x i64> undef, <8 x i32> zeroinitializer
  %res.i147 = add nsw <8 x i64> %v.i149, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
  %res.i146 = add nsw <8 x i64> %res.i147, %res.i150
  %ptr.i143 = inttoptr i64 %25 to double*
  %offsetptr.i144 = getelementptr inbounds double, double* %ptr.i143, <8 x i64> %res.i146
  %res.i142 = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %offsetptr.i144, i32 8, <8 x i1> %mask.i155, <8 x double> zeroinitializer)
  switch i8 %tindex_phi41.lcssa, label %L287 [
    i8 2, label %L293
    i8 1, label %L299
  ]

L287:                                             ; preds = %L249
  call void @jl_throw(%jl_value_t addrspace(12)* addrspacecast (%jl_value_t* inttoptr (i64 140677114703680 to %jl_value_t*) to %jl_value_t addrspace(12)*))
  unreachable

L293:                                             ; preds = %L249
  %value_phi49294 = fmul reassoc nnan ninf nsz arcp contract <8 x double> %value_phi37.lcssa, %res.i142
  %res.i136 = select <8 x i1> %mask.i155, <8 x double> %value_phi49294, <8 x double> %value_phi37.lcssa
  br label %L310.thread

L299:                                             ; preds = %L249
  %v.i139 = shufflevector <8 x double> %.sroa.089.0.lcssa, <8 x double> undef, <8 x i32> zeroinitializer
  %value_phi49 = fmul reassoc nnan ninf nsz arcp contract <8 x double> %v.i139, %res.i142
  %res.i134 = select <8 x i1> %mask.i155, <8 x double> %value_phi49, <8 x double> %v.i139
  br label %L310.thread

L310.thread:                                      ; preds = %L299, %L293, %L247.L310_crit_edge
  %value_phi52.ph = phi <8 x double> [ %res.i136, %L293 ], [ %res.i134, %L299 ], [ %value_phi37.lcssa, %L247.L310_crit_edge ]
  %ptr.i131299 = inttoptr i64 %14 to double*
  %res.i130301 = load double, double* %ptr.i131299, align 1
  %vec_4_1.i = shufflevector <8 x double> %value_phi52.ph, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %vec_4_2.i = shufflevector <8 x double> %value_phi52.ph, <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %vec_4.i = fmul <4 x double> %vec_4_1.i, %vec_4_2.i
  %vec_2_1.i = shufflevector <4 x double> %vec_4.i, <4 x double> undef, <2 x i32> <i32 0, i32 1>
  %vec_2_2.i = shufflevector <4 x double> %vec_4.i, <4 x double> undef, <2 x i32> <i32 2, i32 3>
  %vec_2.i = fmul <2 x double> %vec_2_1.i, %vec_2_2.i
  %vec_1_1.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> zeroinitializer
  %vec_1_2.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> <i32 1>
  %vec_1.i = fmul <1 x double> %vec_1_1.i, %vec_1_2.i
  %res.i = extractelement <1 x double> %vec_1.i, i32 0
  %53 = fmul fast double %res.i, %res.i130301
  br label %L329

L310:                                             ; preds = %L247.L310_crit_edge
  call void @jl_throw(%jl_value_t addrspace(12)* addrspacecast (%jl_value_t* inttoptr (i64 140677114703680 to %jl_value_t*) to %jl_value_t addrspace(12)*))
  unreachable

L329:                                             ; preds = %L322.thread, %L310.thread
  %value_phi55 = phi double [ %53, %L310.thread ], [ %56, %L322.thread ]
  %54 = fadd double %value_phi55, 0.000000e+00
  br label %L331

L331:                                             ; preds = %L175, %L175.us, %L179, %L329
  %value_phi56 = phi double [ %54, %L329 ], [ 0.000000e+00, %L179 ], [ %40, %L175.us ], [ %45, %L175 ]
  ret double %value_phi56

L322.thread:                                      ; preds = %L247.L310_crit_edge
  %ptr.i131308 = inttoptr i64 %14 to double*
  %res.i130310 = load double, double* %ptr.i131308, align 1
  %55 = extractelement <8 x double> %.sroa.089.0.lcssa, i32 0
  %56 = fmul fast double %res.i130310, %55
  br label %L329
}

specifically, look for the loop body:

L44.us:                                           ; preds = %L86.us, %L44.lr.ph.us
  %tindex_phi348.us = phi i8 [ 1, %L44.lr.ph.us ], [ 2, %L86.us ]
  %value_phi9347.us = phi <8 x double> [ undef, %L44.lr.ph.us ], [ %.sroa.0113.0.us, %L86.us ]
  %value_phi2346.us = phi i64 [ 0, %L44.lr.ph.us ], [ %34, %L86.us ]
  %.sroa.0117.1345.us = phi <8 x double> [ %.sroa.0117.0.vec.insert.us, %L44.lr.ph.us ], [ %.sroa.0113.0.us, %L86.us ]
  %33 = add i64 %value_phi2346.us, %32
  %offsetptr.i.us = getelementptr inbounds i64, i64* %ptr.i, i64 %33
  %ptr.i244.us = bitcast i64* %offsetptr.i.us to <8 x i64>*
  %res.i245.us = load <8 x i64>, <8 x i64>* %ptr.i244.us, align 8
  %res.i243.us = add nsw <8 x i64> %res.i245.us, <i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1>
  %res.i240.us = mul nsw <8 x i64> %res.i243.us, %v.i242
  %ie.i238.us = insertelement <8 x i64> undef, i64 %value_phi2346.us, i32 0
  %v.i239.us = shufflevector <8 x i64> %ie.i238.us, <8 x i64> undef, <8 x i32> zeroinitializer
  %res.i237.us = add nsw <8 x i64> %v.i239.us, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
  %res.i236.us = add nsw <8 x i64> %res.i237.us, %res.i240.us
  %offsetptr.i234.us = getelementptr inbounds double, double* %ptr.i233, <8 x i64> %res.i236.us
  %res.i232.us = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %offsetptr.i234.us, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x double> undef)
  switch i8 %tindex_phi348.us, label %L84 [
    i8 2, label %L86.us
    i8 1, label %L77.us
  ]

L77.us:                                           ; preds = %L44.us
  %v.i229.us = shufflevector <8 x double> %.sroa.0117.1345.us, <8 x double> undef, <8 x i32> zeroinitializer
  br label %L86.us

L86.us:                                           ; preds = %L77.us, %L44.us
  %v.i229.pn.us = phi <8 x double> [ %v.i229.us, %L77.us ], [ %value_phi9347.us, %L44.us ]
  %.sroa.0113.0.us = fmul reassoc nnan ninf nsz arcp contract <8 x double> %res.i232.us, %v.i229.pn.us
  %34 = add i64 %value_phi2346.us, 8
  %35 = icmp slt i64 %34, %30
  br i1 %35, label %L44.us, label %L93.us
2 Likes

Wow - thanks for testing this. I’m trying to understand this now. So you’re saying that this “gather” operation, when performed in every loop iteration, will make AVX essentially useless? That is, I need to figure out a representation that doesn’t require me to load basis[a, n] at each iteration, but instead computes it on the fly?

The following seems to help a little, but still gains less than a factor of 2:

function mvp_opt(tmp, P, basis, coeffs::Vector{T}) where {T}
   len_c = length(coeffs)
   len_P = size(P, 1)
   fill!(tmp, one(T))
   for a = 1:len_P
      @avx for n = 1:len_c
         tmp[n] *= P[a, basis[n, a]]
      end
   end
   return dot(coeffs, tmp)
end

tmp = zeros(nbasis)
basist = collect(basis')
@btime mvp_opt($tmp, $P, $basis, $coeffs)
1 Like

Thanks for this package! I don’t know anything about vectorization but I learned a lot from this thread.

I’m heavily using scatter/gather operations in my package FixedEffects
(here is the gather part and here is the scatter part). For now, I decompose the vector into nthreads parts and do the scatter/gather in each thread.

I guess that one alternative would be to vectorize these parts. Do you think it could benefit from it? I have just tried to experiment but @avx does not seem to support scatter yet.

I’m at work now, where I only have easy access to avx2 (although I could use aws).

I wouldn’t say it makes AVX useless. If you’re really curious, you can take a look through Agner Fog’s instruction tables. Roughly in the middle of page 269, it says VGATHERQPD has a reciprocal throughput (RT) of about 5 clock cycles (meaning the CPU could do about 200 of them every 1000 cycles).
You can think of reciprocal throughput as the average time per instruction. The actual number of cycles a single instruction takes will generally be longer, but a single core can normally work on multiple of them in parallel. If code can be run out of order, reciprocal throughput may be a better indicator. (Memory loads can be evaluated out of order, since we’re not writing to memory.)

For comparison, on page 267, VMOVUPS/D operating on v,m (any vector and memory) has an RT of 0.5 clock cycles – 10 times faster.
Additionally, all add/multiply/fused multiply add instructions have RTs of 0.5-1 cycle.
Gather is a steep price to pay. To gain a speedup, the rest of vectorization will have to speed things up enough to be worth it.
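A quick back-of-envelope comparison using the reciprocal throughputs quoted above (the numbers are taken from the text, not measured here):

rt_gather = 5.0                            # cycles per 8-wide vgatherqpd
rt_load   = 0.5                            # cycles per 8-wide vmovupd
doubles_per_cycle_gather = 8 / rt_gather   # ≈ 1.6 doubles loaded per cycle via gather
doubles_per_cycle_load   = 8 / rt_load     # = 16 doubles loaded per cycle via contiguous loads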

For example, running on an AWS instance (to get avx512 at work):

julia> using LoopVectorization, BenchmarkTools

julia> A = rand(100,100); B1 = similar(A); B2 = similar(A);

julia> @benchmark @. $B1 = $A + $A'
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.882 μs (0.00% GC)
  median time:      6.932 μs (0.00% GC)
  mean time:        6.986 μs (0.00% GC)
  maximum time:     20.675 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> @benchmark @avx @. $B2 = $A + $A'
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.915 μs (0.00% GC)
  median time:      3.947 μs (0.00% GC)
  mean time:        3.967 μs (0.00% GC)
  maximum time:     7.984 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

@avx uses vgatherqpd here, and does get a speedup.
However, it is slower on the avx2 computer I tried it on (7.5 vs 10 microseconds, instead of 6.9 vs 4).

In your example, it tried to vectorize the inner loop which had dim = 10. 10 is a little awkward for vectors of length 8.
Setting dim = 30 (again on AWS):

julia> maxdeg = 20; dim =30;

julia> nbasis = 1_000;

julia> P = rand(dim, maxdeg + 1);

julia> function mvp(P, basis, coeffs::Vector{T}) where {T}
          len_c = length(coeffs)
          len_P = size(P, 1)
          p = zero(T)
          for n = 1:len_c
             pn = coeffs[n]
             for a = 1:len_P
                pn *= P[a, basis[a, n]]
             end
             p += pn
          end
          return p
       end
mvp (generic function with 1 method)

julia> function mvpinbounds(P, basis, coeffs::Vector{T}) where {T}
          len_c = length(coeffs)
          len_P = size(P, 1)
          p = zero(T)
          @inbounds for n = 1:len_c
             pn = coeffs[n]
             for a = 1:len_P
                pn *= P[a, basis[a, n]]
             end
             p += pn
          end
          return p
       end
mvpinbounds (generic function with 1 method)

julia> function mvpavx(P, basis, coeffs::Vector{T}) where {T}
          len_c = length(coeffs)
          len_P = size(P, 1)
          p = zero(T)
          @avx for n = 1:len_c
             pn = coeffs[n]
             for a = 1:len_P
                pn *= P[a, basis[a, n]]
             end
             p += pn
          end
          return p
       end
mvpavx (generic function with 1 method)

julia> basis = rand(1:(maxdeg+1), (dim, nbasis));

julia> coeffs = rand(nbasis);

julia> @btime mvp($P, $basis, $coeffs)
  40.369 μs (0 allocations: 0 bytes)
8.855318486319671e-7

julia> @btime mvpinbounds($P, $basis, $coeffs)
  28.164 μs (0 allocations: 0 bytes)
8.855318486319671e-7

julia> @btime mvpavx($P, $basis, $coeffs)
  19.370 μs (0 allocations: 0 bytes)
8.855318486319674e-7

30 (4x8, 2 wasted) is less awkward than 10 (2x8, 6 “wasted”). 30/32 was useful, vs 10/16.

I think that’s why your second version is faster (len_c is much larger than len_P).
I also see about a 2x improvement when dim = 10. (Also, note you meant to use basist instead of basis as an argument.)

While I haven’t really been using it, the plumbing is there for LoopVectorization to use static sizes / size hints to make optimization decisions (I’ve not made it take that into account intelligently yet, but it should be simple).
What would be a nice API for passing “size hints” that it will use when optimizing (generated loops will remain dynamic)?
Knowing that one dimension is much longer, it should be able to figure out on its own that it wants to vectorize along that much longer dimension instead of the short one. It won’t be able to transpose matrices for you, but that shouldn’t be as big a deal, since you’re gathering from it anyway.

It may depend on architecture, but definitely worth trying in my opinion.
I think the gather would likely be faster. I’m not sure about scatter.

I haven’t tested scatter as much. Please file an issue with a small example. I’ll get it working and add it to the tests.

4 Likes

I’ve been following this whole thread with great interest and wondering how it relates to arrays which have static size information available in the type. The obvious example is StaticArrays which mostly takes the simple minded but quite general approach of “unroll everything in sight, and hope for the best”. From the perspective of StaticArrays we don’t know how a given array will be used within a larger computation (we don’t have knowledge of a nested set of loops like you do with @avx) so perhaps it’s hard to do better. But I do wonder whether we’re missing out on important optimizations.

I feel like there’s quite some overlap with HybridArrays.jl here, which has sizes for a subset of dimensions available in the type. To make this all work in general (with a really nice API), have you thought about pursuing a partially type-based solution and moving some of the logic to later in the compilation pipeline? I see that’s kind of what you’ve already done with the broadcast generated function. The @avx macro wouldn’t go away because you’d obviously need custom loop lowering, but it might be helpful to mix in some type information.

2 Likes

LoopVectorization is an extremely cool package. It looks like one can write loops for matrix-matrix multiplication in any order, and this package will fix it up.

Pure Julia, triple loop on regular double-precision arrays.
Without the @avx

With the @avx

Reference: https://github.com/PetrKryslUCSD/MatMulExperiments

4 Likes

It will be hard to actually support scatter using vectors of indices.
When trying to store a vector, you’ll have a vector of indices. It has to make sure that none of these indices conflict; otherwise it’ll try to write two values to the same place at the same time, and the answers will be wrong (e.g., in the += example that was linked).
Some architectures have conflict detection instructions, which could be used. However, LLVM doesn’t seem to provide an intrinsic (that I could find).
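Here is a small illustration of the conflict problem (the values are made up): the scalar loop accumulates both contributions to the duplicated index, whereas a naive SIMD version would gather the old y[3] once for both lanes, add, and scatter, so one of the two updates would be lost.

y   = zeros(4)
x   = [1.0, 2.0, 3.0, 4.0]
idx = [1, 3, 3, 2]            # note the duplicate index 3
for i ∈ eachindex(x)
    y[idx[i]] += x[i]         # correct scalar result: y == [1.0, 4.0, 5.0, 0.0]
end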

You can thank @Mason for adding a macro that wraps a generated function.
Currently, it simply turns an expression into a nested type object to pass to the generated function. The generated function then converts it back into an expression, and then uses the existing code to create a LoopSet object (which performs the analysis on loops) from the expression.

I’ll begin work shortly to extend this in two complementary ways:

  1. Use type information of the arguments to inform LoopSet.
  2. Define an interface for other libraries to be able to pass this information.

That interface will take the form of overloads of VectorizationBase.stridedpointer. This already has special overloads for transposes/adjoints and SubArrays whose index is Tuple{Int,Vararg} (because the latter don’t have unit stride on the first axis, and transposes/adjoints do have unit stride in a different dimension).

MArray and SArray would each have their own overload.

Taking advantage of size is made more difficult by the fact that size is not the same thing as loop bounds. Code could also special-case size and length calls when defining loop bounds, and check whether an array argument’s size is known.
Any ideas for a type-based approach?

I’d also like this to be extended to use a StructVector (that is, like StructArrays) for small isbits composites, constructing separate vectors via shuffles. Except when the input array was already a StructArray, in which case it will be faster / can skip the shuffling. Besides implementing all of that, LoopVectorization’s optimizer cannot handle it right now either; it has no concept of a number occupying multiple registers. Ideally it could figure out accurate estimates for instruction costs as well.

As demoed in the benchmarks (I’ll add more in the next few days, and add icc and ifort to the mix), it tends to outperform the compiler on many of these loops:

julia> using LoopVectorization, StaticArrays, BenchmarkTools, LinearAlgebra

julia> function AmulBavx1!(C, A, B)
           @avx for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
               Cₘₙ = zero(eltype(C))
               for k ∈ 1:size(A,2)
                   Cₘₙ += A[m,k] * B[k,n]
               end
               C[m,n] = Cₘₙ
           end
       end
AmulBavx1! (generic function with 1 method)

julia> A = @MMatrix rand(7,7);

julia> B = @MMatrix rand(7,7);

julia> C1 = similar(A); C2 = similar(A);

julia> @benchmark mul!($C1, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     66.800 ns (0.00% GC)
  median time:      68.009 ns (0.00% GC)
  mean time:        67.927 ns (0.00% GC)
  maximum time:     84.059 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     975

julia> @benchmark AmulBavx1!($C2, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.605 ns (0.00% GC)
  median time:      19.700 ns (0.00% GC)
  mean time:        19.725 ns (0.00% GC)
  maximum time:     31.941 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

@avx is fairly aggressive in unrolling the generated code itself. Julia/LLVM then do plenty of dead code elimination when the loops are short and statically sized.
We’re left with the fairly clean:

#julia> @code_native debuginfo=:none AmulBavx1!(C2, A, B)
	.text
	movq	%rsi, -8(%rsp)
	movq	(%rsi), %rax
	movq	8(%rsi), %rdx
	movq	16(%rsi), %rcx
	movb	$127, %sil
	kmovd	%esi, %k1
	vmovupd	(%rdx), %zmm1 {%k1} {z}
	vxorpd	%xmm0, %xmm0, %xmm0
	vxorpd	%xmm2, %xmm2, %xmm2
	vfmadd231pd	(%rcx){1to8}, %zmm1, %zmm2 # zmm2 = (zmm1 * mem) + zmm2
	vxorpd	%xmm3, %xmm3, %xmm3
	vfmadd231pd	56(%rcx){1to8}, %zmm1, %zmm3 # zmm3 = (zmm1 * mem) + zmm3
	vxorpd	%xmm4, %xmm4, %xmm4
	vfmadd231pd	112(%rcx){1to8}, %zmm1, %zmm4 # zmm4 = (zmm1 * mem) + zmm4
	vfmadd132pd	168(%rcx){1to8}, %zmm0, %zmm1 # zmm1 = (zmm1 * mem) + zmm0
	vmovupd	56(%rdx), %zmm5 {%k1} {z}
	vfmadd231pd	8(%rcx){1to8}, %zmm5, %zmm2 # zmm2 = (zmm5 * mem) + zmm2
	vfmadd231pd	64(%rcx){1to8}, %zmm5, %zmm3 # zmm3 = (zmm5 * mem) + zmm3
	vfmadd231pd	120(%rcx){1to8}, %zmm5, %zmm4 # zmm4 = (zmm5 * mem) + zmm4
	vfmadd231pd	176(%rcx){1to8}, %zmm5, %zmm1 # zmm1 = (zmm5 * mem) + zmm1
	vmovupd	112(%rdx), %zmm5 {%k1} {z}
	vfmadd231pd	16(%rcx){1to8}, %zmm5, %zmm2 # zmm2 = (zmm5 * mem) + zmm2
	vfmadd231pd	72(%rcx){1to8}, %zmm5, %zmm3 # zmm3 = (zmm5 * mem) + zmm3
	vfmadd231pd	128(%rcx){1to8}, %zmm5, %zmm4 # zmm4 = (zmm5 * mem) + zmm4
	vfmadd231pd	184(%rcx){1to8}, %zmm5, %zmm1 # zmm1 = (zmm5 * mem) + zmm1
	vmovupd	168(%rdx), %zmm5 {%k1} {z}
	vfmadd231pd	24(%rcx){1to8}, %zmm5, %zmm2 # zmm2 = (zmm5 * mem) + zmm2
	vfmadd231pd	80(%rcx){1to8}, %zmm5, %zmm3 # zmm3 = (zmm5 * mem) + zmm3
	vfmadd231pd	136(%rcx){1to8}, %zmm5, %zmm4 # zmm4 = (zmm5 * mem) + zmm4
	vfmadd231pd	192(%rcx){1to8}, %zmm5, %zmm1 # zmm1 = (zmm5 * mem) + zmm1
	vmovupd	224(%rdx), %zmm5 {%k1} {z}
	vfmadd231pd	32(%rcx){1to8}, %zmm5, %zmm2 # zmm2 = (zmm5 * mem) + zmm2
	vfmadd231pd	88(%rcx){1to8}, %zmm5, %zmm3 # zmm3 = (zmm5 * mem) + zmm3
	vfmadd231pd	144(%rcx){1to8}, %zmm5, %zmm4 # zmm4 = (zmm5 * mem) + zmm4
	vfmadd231pd	200(%rcx){1to8}, %zmm5, %zmm1 # zmm1 = (zmm5 * mem) + zmm1
	vmovupd	280(%rdx), %zmm5 {%k1} {z}
	vfmadd231pd	40(%rcx){1to8}, %zmm5, %zmm2 # zmm2 = (zmm5 * mem) + zmm2
	vfmadd231pd	96(%rcx){1to8}, %zmm5, %zmm3 # zmm3 = (zmm5 * mem) + zmm3
	vfmadd231pd	152(%rcx){1to8}, %zmm5, %zmm4 # zmm4 = (zmm5 * mem) + zmm4
	vfmadd231pd	208(%rcx){1to8}, %zmm5, %zmm1 # zmm1 = (zmm5 * mem) + zmm1
	vmovupd	336(%rdx), %zmm5 {%k1} {z}
	vfmadd231pd	48(%rcx){1to8}, %zmm5, %zmm2 # zmm2 = (zmm5 * mem) + zmm2
	vfmadd231pd	104(%rcx){1to8}, %zmm5, %zmm3 # zmm3 = (zmm5 * mem) + zmm3
	vfmadd231pd	160(%rcx){1to8}, %zmm5, %zmm4 # zmm4 = (zmm5 * mem) + zmm4
	vfmadd231pd	216(%rcx){1to8}, %zmm5, %zmm1 # zmm1 = (zmm5 * mem) + zmm1
	vmovupd	%zmm2, (%rax) {%k1}
	vmovupd	%zmm3, 56(%rax) {%k1}
	vmovupd	%zmm4, 112(%rax) {%k1}
	vmovupd	%zmm1, 168(%rax) {%k1}
	vmovupd	(%rdx), %zmm1 {%k1} {z}
	vxorpd	%xmm2, %xmm2, %xmm2
	vfmadd231pd	224(%rcx){1to8}, %zmm1, %zmm2 # zmm2 = (zmm1 * mem) + zmm2
	vxorpd	%xmm3, %xmm3, %xmm3
	vfmadd231pd	280(%rcx){1to8}, %zmm1, %zmm3 # zmm3 = (zmm1 * mem) + zmm3
	vfmadd231pd	336(%rcx){1to8}, %zmm1, %zmm0 # zmm0 = (zmm1 * mem) + zmm0
	vmovupd	56(%rdx), %zmm1 {%k1} {z}
	vfmadd231pd	232(%rcx){1to8}, %zmm1, %zmm2 # zmm2 = (zmm1 * mem) + zmm2
	vfmadd231pd	288(%rcx){1to8}, %zmm1, %zmm3 # zmm3 = (zmm1 * mem) + zmm3
	vfmadd231pd	344(%rcx){1to8}, %zmm1, %zmm0 # zmm0 = (zmm1 * mem) + zmm0
	vmovupd	112(%rdx), %zmm1 {%k1} {z}
	vfmadd231pd	240(%rcx){1to8}, %zmm1, %zmm2 # zmm2 = (zmm1 * mem) + zmm2
	vfmadd231pd	296(%rcx){1to8}, %zmm1, %zmm3 # zmm3 = (zmm1 * mem) + zmm3
	vfmadd231pd	352(%rcx){1to8}, %zmm1, %zmm0 # zmm0 = (zmm1 * mem) + zmm0
	vmovupd	168(%rdx), %zmm1 {%k1} {z}
	vfmadd231pd	248(%rcx){1to8}, %zmm1, %zmm2 # zmm2 = (zmm1 * mem) + zmm2
	vfmadd231pd	304(%rcx){1to8}, %zmm1, %zmm3 # zmm3 = (zmm1 * mem) + zmm3
	vfmadd231pd	360(%rcx){1to8}, %zmm1, %zmm0 # zmm0 = (zmm1 * mem) + zmm0
	vmovupd	224(%rdx), %zmm1 {%k1} {z}
	vfmadd231pd	256(%rcx){1to8}, %zmm1, %zmm2 # zmm2 = (zmm1 * mem) + zmm2
	vfmadd231pd	312(%rcx){1to8}, %zmm1, %zmm3 # zmm3 = (zmm1 * mem) + zmm3
	vfmadd231pd	368(%rcx){1to8}, %zmm1, %zmm0 # zmm0 = (zmm1 * mem) + zmm0
	vmovupd	280(%rdx), %zmm1 {%k1} {z}
	vfmadd231pd	264(%rcx){1to8}, %zmm1, %zmm2 # zmm2 = (zmm1 * mem) + zmm2
	vfmadd231pd	320(%rcx){1to8}, %zmm1, %zmm3 # zmm3 = (zmm1 * mem) + zmm3
	vfmadd231pd	376(%rcx){1to8}, %zmm1, %zmm0 # zmm0 = (zmm1 * mem) + zmm0
	vmovupd	336(%rdx), %zmm1 {%k1} {z}
	vfmadd231pd	272(%rcx){1to8}, %zmm1, %zmm2 # zmm2 = (zmm1 * mem) + zmm2
	vfmadd231pd	328(%rcx){1to8}, %zmm1, %zmm3 # zmm3 = (zmm1 * mem) + zmm3
	vfmadd231pd	384(%rcx){1to8}, %zmm1, %zmm0 # zmm0 = (zmm1 * mem) + zmm0
	vmovupd	%zmm2, 224(%rax) {%k1}
	vmovupd	%zmm3, 280(%rax) {%k1}
	vmovupd	%zmm0, 336(%rax) {%k1}
	movabsq	$jl_system_image_data, %rax
	vzeroupper
	retq
	nopl	(%rax)

[ It may be better to use fastmath so that xor -> fmadd can be replaced with a mul, but I’ve noticed that in earlier versions of LLVM, this would lead to lots of extraneous ‘vmovapd’ instructions. I haven’t tried in a while. ]

Letting LLVM take care of it in this way is a form of moving it later in the pipeline.
A question is, how much better could we do ourselves?
We’re hurt by the fact that it is the loop remainder / clean up code that does the heavy lifting. We could do a better job divvying up small numbers of iterations.

@PetrKryslUCSD I also like how the first graphic peaks at <2, while the second hovers around 10.
EDIT: Although it looks like you didn’t use @inbounds or @simd for the comparisons.

2 Likes

I added the six permutations of the matrix-matrix multiplication with @inbounds.

EDIT: Performance has also been evaluated for @simd.

4 Likes

@simd needs @inbounds to work, otherwise the branches (bounds checks) prevent the optimizations.
In this case, @inbounds is all that is required to vectorize the loop, because your mulCABjpi and mulCABpji loops don’t need any reordering to vectorize (thus @simd's permission isn’t required).

2 Likes

Good catch, thanks.