[ANN] LoopVectorization

Your code works now (and is tested as of LoopVectorization v0.3.6), but it is slower on my computer (with avx512) than just adding @inbounds:

julia> using BenchmarkTools, LoopVectorization

julia> maxdeg = 20; dim = 10; nbasis = 1_000;

julia> P = rand(dim, maxdeg + 1);

julia> function mvp(P, basis, coeffs::Vector{T}) where {T}
          len_c = length(coeffs)
          len_P = size(P, 1)
          p = zero(T)
          for n = 1:len_c
             pn = coeffs[n]
             for a = 1:len_P
                pn *= P[a, basis[a, n]]
             end
             p += pn
          end
          return p
       end
mvp (generic function with 1 method)

julia> function mvpinbounds(P, basis, coeffs::Vector{T}) where {T}
          len_c = length(coeffs)
          len_P = size(P, 1)
          p = zero(T)
          @inbounds for n = 1:len_c
             pn = coeffs[n]
             for a = 1:len_P
                pn *= P[a, basis[a, n]]
             end
             p += pn
          end
          return p
       end
mvpinbounds (generic function with 1 method)

julia> function mvpavx(P, basis, coeffs::Vector{T}) where {T}
          len_c = length(coeffs)
          len_P = size(P, 1)
          p = zero(T)
          @avx for n = 1:len_c
             pn = coeffs[n]
             for a = 1:len_P
                pn *= P[a, basis[a, n]]
             end
             p += pn
          end
          return p
       end
mvpavx (generic function with 1 method)

julia> basis = rand(1:(maxdeg+1), (dim, nbasis));

julia> coeffs = rand(nbasis);

julia> @btime mvp($P, $basis, $coeffs)
  9.464 μs (0 allocations: 0 bytes)
0.8416355183275668

julia> @btime mvpinbounds($P, $basis, $coeffs)
  5.322 μs (0 allocations: 0 bytes)
0.8416355183275668

julia> @btime mvpavx($P, $basis, $coeffs)
  7.692 μs (0 allocations: 0 bytes)
0.8416355183275668

julia> nbasis = 10_000;

julia> basis = rand(1:(maxdeg+1), (dim, nbasis));

julia> coeffs = rand(nbasis);

julia> @btime mvp($P, $basis, $coeffs)
  97.394 μs (0 allocations: 0 bytes)
9.049892658664312

julia> @btime mvpinbounds($P, $basis, $coeffs)
  55.024 μs (0 allocations: 0 bytes)
9.049892658664312

julia> @btime mvpavx($P, $basis, $coeffs)
  71.441 μs (0 allocations: 0 bytes)
9.049892658664312

You can also confirm with @code_native or @code_llvm that the @inbounds version is not vectorized, while the @avx version is (but is slower anyway):

julia> @code_llvm debuginfo=:none mvpinbounds(P, basis, coeffs)

define double @julia_mvpinbounds_18548(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40), %jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40), %jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40)) {
top:
  %3 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*
  %4 = bitcast %jl_value_t addrspace(11)* %3 to %jl_array_t addrspace(11)*
  %5 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %4, i64 0, i32 1
  %6 = load i64, i64 addrspace(11)* %5, align 8
  %7 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
  %8 = bitcast %jl_value_t addrspace(11)* %7 to %jl_value_t addrspace(10)* addrspace(11)*
  %9 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %8, i64 3
  %10 = bitcast %jl_value_t addrspace(10)* addrspace(11)* %9 to i64 addrspace(11)*
  %11 = load i64, i64 addrspace(11)* %10, align 8
  %12 = icmp sgt i64 %6, 0
  %13 = select i1 %12, i64 %6, i64 0
  br i1 %12, label %L14.preheader, label %L59

L14.preheader:                                    ; preds = %top
  %14 = bitcast %jl_value_t addrspace(11)* %3 to double addrspace(13)* addrspace(11)*
  %15 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %14, align 8
  %16 = icmp sgt i64 %11, 0
  %17 = select i1 %16, i64 %11, i64 0
  %18 = addrspacecast %jl_value_t addrspace(10)* %1 to %jl_value_t addrspace(11)*
  %19 = bitcast %jl_value_t addrspace(11)* %18 to %jl_value_t addrspace(10)* addrspace(11)*
  %20 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %19, i64 3
  %21 = bitcast %jl_value_t addrspace(10)* addrspace(11)* %20 to i64 addrspace(11)*
  %22 = load i64, i64 addrspace(11)* %21, align 8
  %23 = bitcast %jl_value_t addrspace(11)* %18 to i64 addrspace(13)* addrspace(11)*
  %24 = load i64 addrspace(13)*, i64 addrspace(13)* addrspace(11)* %23, align 8
  %25 = bitcast %jl_value_t addrspace(11)* %7 to double addrspace(13)* addrspace(11)*
  %26 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %25, align 8
  br i1 %16, label %L29.preheader.us, label %L46

L29.preheader.us:                                 ; preds = %L14.preheader, %L46.us
  %value_phi3.us = phi double [ %43, %L46.us ], [ 0.000000e+00, %L14.preheader ]
  %value_phi4.us = phi i64 [ %45, %L46.us ], [ 1, %L14.preheader ]
  %27 = add nsw i64 %value_phi4.us, -1
  %28 = getelementptr inbounds double, double addrspace(13)* %15, i64 %27
  %29 = load double, double addrspace(13)* %28, align 8
  %30 = mul i64 %22, %27
  br label %L29.us

L29.us:                                           ; preds = %L29.us, %L29.preheader.us
  %value_phi9.us = phi double [ %40, %L29.us ], [ %29, %L29.preheader.us ]
  %value_phi10.us = phi i64 [ %42, %L29.us ], [ 1, %L29.preheader.us ]
  %31 = add nsw i64 %value_phi10.us, -1
  %32 = add i64 %31, %30
  %33 = getelementptr inbounds i64, i64 addrspace(13)* %24, i64 %32
  %34 = load i64, i64 addrspace(13)* %33, align 8
  %35 = add i64 %34, -1
  %36 = mul i64 %35, %11
  %37 = add i64 %31, %36
  %38 = getelementptr inbounds double, double addrspace(13)* %26, i64 %37
  %39 = load double, double addrspace(13)* %38, align 8
  %40 = fmul double %value_phi9.us, %39
  %41 = icmp eq i64 %value_phi10.us, %17
  %42 = add nuw i64 %value_phi10.us, 1
  br i1 %41, label %L46.us, label %L29.us

L46.us:                                           ; preds = %L29.us
  %43 = fadd double %value_phi3.us, %40
  %44 = icmp eq i64 %value_phi4.us, %13
  %45 = add nuw i64 %value_phi4.us, 1
  br i1 %44, label %L59, label %L29.preheader.us

L46:                                              ; preds = %L14.preheader, %L46
  %value_phi3 = phi double [ %49, %L46 ], [ 0.000000e+00, %L14.preheader ]
  %value_phi4 = phi i64 [ %51, %L46 ], [ 1, %L14.preheader ]
  %46 = add nsw i64 %value_phi4, -1
  %47 = getelementptr inbounds double, double addrspace(13)* %15, i64 %46
  %48 = load double, double addrspace(13)* %47, align 8
  %49 = fadd double %value_phi3, %48
  %50 = icmp eq i64 %value_phi4, %13
  %51 = add nuw i64 %value_phi4, 1
  br i1 %50, label %L59, label %L46

L59:                                              ; preds = %L46, %L46.us, %top
  %value_phi19 = phi double [ 0.000000e+00, %top ], [ %43, %L46.us ], [ %49, %L46 ]
  ret double %value_phi19
}

vs

;; julia> @code_llvm debuginfo=:none mvpavx(P, basis, coeffs)

define double @julia_mvpavx_18517(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40), %jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40), %jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40)) {
top:
  %3 = addrspacecast %jl_value_t addrspace(10)* %2 to %jl_value_t addrspace(11)*
  %4 = bitcast %jl_value_t addrspace(11)* %3 to %jl_array_t addrspace(11)*
  %5 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %4, i64 0, i32 1
  %6 = load i64, i64 addrspace(11)* %5, align 8
  %7 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
  %8 = bitcast %jl_value_t addrspace(11)* %7 to %jl_value_t addrspace(10)* addrspace(11)*
  %9 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %8, i64 3
  %10 = bitcast %jl_value_t addrspace(10)* addrspace(11)* %9 to i64 addrspace(11)*
  %11 = load i64, i64 addrspace(11)* %10, align 8
  %12 = addrspacecast %jl_value_t addrspace(11)* %3 to %jl_value_t*
  %13 = bitcast %jl_value_t* %12 to i64*
  %14 = load i64, i64* %13, align 8
  %15 = addrspacecast %jl_value_t addrspace(10)* %1 to %jl_value_t addrspace(11)*
  %16 = addrspacecast %jl_value_t addrspace(11)* %15 to %jl_value_t*
  %17 = bitcast %jl_value_t* %16 to i64*
  %18 = load i64, i64* %17, align 8
  %19 = bitcast %jl_value_t addrspace(11)* %15 to %jl_value_t addrspace(10)* addrspace(11)*
  %20 = getelementptr inbounds %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %19, i64 3
  %21 = bitcast %jl_value_t addrspace(10)* addrspace(11)* %20 to i64 addrspace(11)*
  %22 = load i64, i64 addrspace(11)* %21, align 8
  %23 = addrspacecast %jl_value_t addrspace(11)* %7 to %jl_value_t*
  %24 = bitcast %jl_value_t* %23 to i64*
  %25 = load i64, i64* %24, align 8
  %26 = trunc i64 %11 to i8
  %27 = and i8 %26, 7
  %notmask = shl nsw i8 -1, %27
  %28 = xor i8 %notmask, -1
  %29 = icmp sgt i64 %6, 0
  br i1 %29, label %L28.lr.ph, label %L179

L28.lr.ph:                                        ; preds = %top
  %30 = add i64 %11, -7
  %31 = icmp sgt i64 %30, 0
  %ptr.i = inttoptr i64 %18 to i64*
  %ie.i241 = insertelement <8 x i64> undef, i64 %11, i32 0
  %v.i242 = shufflevector <8 x i64> %ie.i241, <8 x i64> undef, <8 x i32> zeroinitializer
  %ptr.i233 = inttoptr i64 %25 to double*
  %ptr.i194282 = inttoptr i64 %14 to double*
  %mask.i222 = bitcast i8 %28 to <8 x i1>
  br i1 %31, label %L44.lr.ph.us, label %L93

L44.lr.ph.us:                                     ; preds = %L28.lr.ph, %L175.us
  %value_phi1361.us = phi double [ %40, %L175.us ], [ 0.000000e+00, %L28.lr.ph ]
  %value_phi355.us = phi i64 [ %41, %L175.us ], [ 0, %L28.lr.ph ]
  %.sroa.0117.0354.us = phi <8 x double> [ %.sroa.0113.0.us, %L175.us ], [ undef, %L28.lr.ph ]
  %.sroa.0117.0.vec.insert.us = insertelement <8 x double> %.sroa.0117.0354.us, double 1.000000e+00, i32 0
  %32 = mul i64 %value_phi355.us, %22
  br label %L44.us

L44.us:                                           ; preds = %L86.us, %L44.lr.ph.us
  %tindex_phi348.us = phi i8 [ 1, %L44.lr.ph.us ], [ 2, %L86.us ]
  %value_phi9347.us = phi <8 x double> [ undef, %L44.lr.ph.us ], [ %.sroa.0113.0.us, %L86.us ]
  %value_phi2346.us = phi i64 [ 0, %L44.lr.ph.us ], [ %34, %L86.us ]
  %.sroa.0117.1345.us = phi <8 x double> [ %.sroa.0117.0.vec.insert.us, %L44.lr.ph.us ], [ %.sroa.0113.0.us, %L86.us ]
  %33 = add i64 %value_phi2346.us, %32
  %offsetptr.i.us = getelementptr inbounds i64, i64* %ptr.i, i64 %33
  %ptr.i244.us = bitcast i64* %offsetptr.i.us to <8 x i64>*
  %res.i245.us = load <8 x i64>, <8 x i64>* %ptr.i244.us, align 8
  %res.i243.us = add nsw <8 x i64> %res.i245.us, <i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1>
  %res.i240.us = mul nsw <8 x i64> %res.i243.us, %v.i242
  %ie.i238.us = insertelement <8 x i64> undef, i64 %value_phi2346.us, i32 0
  %v.i239.us = shufflevector <8 x i64> %ie.i238.us, <8 x i64> undef, <8 x i32> zeroinitializer
  %res.i237.us = add nsw <8 x i64> %v.i239.us, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
  %res.i236.us = add nsw <8 x i64> %res.i237.us, %res.i240.us
  %offsetptr.i234.us = getelementptr inbounds double, double* %ptr.i233, <8 x i64> %res.i236.us
  %res.i232.us = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %offsetptr.i234.us, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x double> undef)
  switch i8 %tindex_phi348.us, label %L84 [
    i8 2, label %L86.us
    i8 1, label %L77.us
  ]

L77.us:                                           ; preds = %L44.us
  %v.i229.us = shufflevector <8 x double> %.sroa.0117.1345.us, <8 x double> undef, <8 x i32> zeroinitializer
  br label %L86.us

L86.us:                                           ; preds = %L77.us, %L44.us
  %v.i229.pn.us = phi <8 x double> [ %v.i229.us, %L77.us ], [ %value_phi9347.us, %L44.us ]
  %.sroa.0113.0.us = fmul reassoc nnan ninf nsz arcp contract <8 x double> %res.i232.us, %v.i229.pn.us
  %34 = add i64 %value_phi2346.us, 8
  %35 = icmp slt i64 %34, %30
  br i1 %35, label %L44.us, label %L93.us

L93.us:                                           ; preds = %L86.us
  %36 = icmp slt i64 %34, %11
  br i1 %36, label %L139.us, label %L175.us

L139.us:                                          ; preds = %L93.us
  %ie.i215.us = insertelement <8 x i64> undef, i64 %34, i32 0
  %v.i216.us = shufflevector <8 x i64> %ie.i215.us, <8 x i64> undef, <8 x i32> zeroinitializer
  %res.i214.us = add nsw <8 x i64> %v.i216.us, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
  %37 = mul i64 %value_phi355.us, %22
  %38 = add i64 %34, %37
  %offsetptr.i225.us = getelementptr inbounds i64, i64* %ptr.i, i64 %38
  %ptr.i221.us = bitcast i64* %offsetptr.i225.us to <8 x i64>*
  %res.i223.us = call <8 x i64> @llvm.masked.load.v8i64.p0v8i64(<8 x i64>* nonnull %ptr.i221.us, i32 8, <8 x i1> %mask.i222, <8 x i64> zeroinitializer)
  %res.i220.us = add nsw <8 x i64> %res.i223.us, <i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1>
  %res.i217.us = mul nsw <8 x i64> %res.i220.us, %v.i242
  %res.i213.us = add nsw <8 x i64> %res.i214.us, %res.i217.us
  %offsetptr.i211.us = getelementptr inbounds double, double* %ptr.i233, <8 x i64> %res.i213.us
  %res.i209.us = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %offsetptr.i211.us, i32 8, <8 x i1> %mask.i222, <8 x double> zeroinitializer)
  %value_phi17266.us = fmul reassoc nnan ninf nsz arcp contract <8 x double> %.sroa.0113.0.us, %res.i209.us
  %res.i202.us = select <8 x i1> %mask.i222, <8 x double> %value_phi17266.us, <8 x double> %.sroa.0113.0.us
  br label %L175.us

L175.us:                                          ; preds = %L93.us, %L139.us
  %value_phi20.ph.us = phi <8 x double> [ %res.i202.us, %L139.us ], [ %.sroa.0113.0.us, %L93.us ]
  %offsetptr.i195272.us = getelementptr inbounds double, double* %ptr.i194282, i64 %value_phi355.us
  %res.i193273.us = load double, double* %offsetptr.i195272.us, align 1
  %vec_4_1.i182.us = shufflevector <8 x double> %value_phi20.ph.us, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %vec_4_2.i183.us = shufflevector <8 x double> %value_phi20.ph.us, <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %vec_4.i184.us = fmul <4 x double> %vec_4_1.i182.us, %vec_4_2.i183.us
  %vec_2_1.i185.us = shufflevector <4 x double> %vec_4.i184.us, <4 x double> undef, <2 x i32> <i32 0, i32 1>
  %vec_2_2.i186.us = shufflevector <4 x double> %vec_4.i184.us, <4 x double> undef, <2 x i32> <i32 2, i32 3>
  %vec_2.i187.us = fmul <2 x double> %vec_2_1.i185.us, %vec_2_2.i186.us
  %vec_1_1.i188.us = shufflevector <2 x double> %vec_2.i187.us, <2 x double> undef, <1 x i32> zeroinitializer
  %vec_1_2.i189.us = shufflevector <2 x double> %vec_2.i187.us, <2 x double> undef, <1 x i32> <i32 1>
  %vec_1.i190.us = fmul <1 x double> %vec_1_1.i188.us, %vec_1_2.i189.us
  %res.i191.us = extractelement <1 x double> %vec_1.i190.us, i32 0
  %39 = fmul fast double %res.i191.us, %res.i193273.us
  %40 = fadd double %value_phi1361.us, %39
  %41 = add nuw nsw i64 %value_phi355.us, 1
  %exitcond369 = icmp eq i64 %41, %6
  br i1 %exitcond369, label %L331, label %L44.lr.ph.us

L84:                                              ; preds = %L44.us
  call void @jl_throw(%jl_value_t addrspace(12)* addrspacecast (%jl_value_t* inttoptr (i64 140677114703680 to %jl_value_t*) to %jl_value_t addrspace(12)*))
  unreachable

L93:                                              ; preds = %L28.lr.ph, %L175
  %value_phi1361 = phi double [ %45, %L175 ], [ 0.000000e+00, %L28.lr.ph ]
  %value_phi355 = phi i64 [ %46, %L175 ], [ 0, %L28.lr.ph ]
  %.sroa.0117.0354 = phi <8 x double> [ %.sroa.0117.0.vec.insert, %L175 ], [ undef, %L28.lr.ph ]
  %.sroa.097.0353 = phi <8 x double> [ %.sroa.097.1274, %L175 ], [ undef, %L28.lr.ph ]
  %.sroa.0117.0.vec.insert = insertelement <8 x double> %.sroa.0117.0354, double 1.000000e+00, i32 0
  %42 = icmp sgt i64 %11, 0
  br i1 %42, label %L164, label %L170

L164:                                             ; preds = %L93
  %43 = mul i64 %value_phi355, %22
  %offsetptr.i225 = getelementptr inbounds i64, i64* %ptr.i, i64 %43
  %ptr.i221 = bitcast i64* %offsetptr.i225 to <8 x i64>*
  %res.i223 = call <8 x i64> @llvm.masked.load.v8i64.p0v8i64(<8 x i64>* nonnull %ptr.i221, i32 8, <8 x i1> %mask.i222, <8 x i64> zeroinitializer)
  %res.i220 = add nsw <8 x i64> %res.i223, <i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1>
  %res.i217 = mul nsw <8 x i64> %res.i220, %v.i242
  %res.i213 = add nsw <8 x i64> %res.i217, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
  %offsetptr.i211 = getelementptr inbounds double, double* %ptr.i233, <8 x i64> %res.i213
  %res.i209 = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %offsetptr.i211, i32 8, <8 x i1> %mask.i222, <8 x double> zeroinitializer)
  %v.i205 = shufflevector <8 x double> %.sroa.0117.0.vec.insert, <8 x double> undef, <8 x i32> zeroinitializer
  %value_phi17 = fmul reassoc nnan ninf nsz arcp contract <8 x double> %v.i205, %res.i209
  %res.i198 = select <8 x i1> %mask.i222, <8 x double> %value_phi17, <8 x double> %v.i205
  %offsetptr.i195272 = getelementptr inbounds double, double* %ptr.i194282, i64 %value_phi355
  %res.i193273 = load double, double* %offsetptr.i195272, align 1
  %vec_4_1.i182 = shufflevector <8 x double> %res.i198, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %vec_4_2.i183 = shufflevector <8 x double> %res.i198, <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %vec_4.i184 = fmul <4 x double> %vec_4_1.i182, %vec_4_2.i183
  %vec_2_1.i185 = shufflevector <4 x double> %vec_4.i184, <4 x double> undef, <2 x i32> <i32 0, i32 1>
  %vec_2_2.i186 = shufflevector <4 x double> %vec_4.i184, <4 x double> undef, <2 x i32> <i32 2, i32 3>
  %vec_2.i187 = fmul <2 x double> %vec_2_1.i185, %vec_2_2.i186
  %vec_1_1.i188 = shufflevector <2 x double> %vec_2.i187, <2 x double> undef, <1 x i32> zeroinitializer
  %vec_1_2.i189 = shufflevector <2 x double> %vec_2.i187, <2 x double> undef, <1 x i32> <i32 1>
  %vec_1.i190 = fmul <1 x double> %vec_1_1.i188, %vec_1_2.i189
  %res.i191 = extractelement <1 x double> %vec_1.i190, i32 0
  %44 = fmul fast double %res.i191, %res.i193273
  br label %L175

L170:                                             ; preds = %L93
  %.sroa.097.0.vec.insert103 = insertelement <8 x double> %.sroa.097.0353, double 1.000000e+00, i32 0
  %offsetptr.i195283 = getelementptr inbounds double, double* %ptr.i194282, i64 %value_phi355
  %res.i193284 = load double, double* %offsetptr.i195283, align 1
  br label %L175

L175:                                             ; preds = %L170, %L164
  %.sroa.097.1274 = phi <8 x double> [ %res.i198, %L164 ], [ %.sroa.097.0.vec.insert103, %L170 ]
  %value_phi23 = phi double [ %44, %L164 ], [ %res.i193284, %L170 ]
  %45 = fadd double %value_phi1361, %value_phi23
  %46 = add nuw nsw i64 %value_phi355, 1
  %exitcond = icmp eq i64 %46, %6
  br i1 %exitcond, label %L331, label %L93

L179:                                             ; preds = %top
  %47 = icmp eq i64 %6, 0
  br i1 %47, label %L331, label %L183.preheader

L183.preheader:                                   ; preds = %L179
  %48 = add i64 %11, -7
  %49 = icmp sgt i64 %48, 0
  br i1 %49, label %L198.lr.ph, label %L247

L198.lr.ph:                                       ; preds = %L183.preheader
  %ptr.i179 = inttoptr i64 %18 to i64*
  %ie.i174 = insertelement <8 x i64> undef, i64 %11, i32 0
  %v.i175 = shufflevector <8 x i64> %ie.i174, <8 x i64> undef, <8 x i32> zeroinitializer
  %ptr.i166 = inttoptr i64 %25 to double*
  br label %L198

L198:                                             ; preds = %L198.lr.ph, %L240
  %tindex_phi41341 = phi i8 [ 1, %L198.lr.ph ], [ 2, %L240 ]
  %value_phi37340 = phi <8 x double> [ undef, %L198.lr.ph ], [ %.sroa.085.0, %L240 ]
  %value_phi30339 = phi i64 [ 0, %L198.lr.ph ], [ %50, %L240 ]
  %.sroa.089.0338 = phi <8 x double> [ <double 1.000000e+00, double undef, double undef, double undef, double undef, double undef, double undef, double undef>, %L198.lr.ph ], [ %.sroa.085.0, %L240 ]
  %offsetptr.i180 = getelementptr inbounds i64, i64* %ptr.i179, i64 %value_phi30339
  %ptr.i177 = bitcast i64* %offsetptr.i180 to <8 x i64>*
  %res.i178 = load <8 x i64>, <8 x i64>* %ptr.i177, align 8
  %res.i176 = add nsw <8 x i64> %res.i178, <i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1>
  %res.i173 = mul nsw <8 x i64> %res.i176, %v.i175
  %ie.i171 = insertelement <8 x i64> undef, i64 %value_phi30339, i32 0
  %v.i172 = shufflevector <8 x i64> %ie.i171, <8 x i64> undef, <8 x i32> zeroinitializer
  %res.i170 = add nsw <8 x i64> %v.i172, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
  %res.i169 = add nsw <8 x i64> %res.i170, %res.i173
  %offsetptr.i167 = getelementptr inbounds double, double* %ptr.i166, <8 x i64> %res.i169
  %res.i165 = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %offsetptr.i167, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x double> undef)
  switch i8 %tindex_phi41341, label %L238 [
    i8 2, label %L240
    i8 1, label %L231
  ]

L231:                                             ; preds = %L198
  %v.i162 = shufflevector <8 x double> %.sroa.089.0338, <8 x double> undef, <8 x i32> zeroinitializer
  br label %L240

L238:                                             ; preds = %L198
  call void @jl_throw(%jl_value_t addrspace(12)* addrspacecast (%jl_value_t* inttoptr (i64 140677114703680 to %jl_value_t*) to %jl_value_t addrspace(12)*))
  unreachable

L240:                                             ; preds = %L198, %L231
  %v.i162.pn = phi <8 x double> [ %v.i162, %L231 ], [ %value_phi37340, %L198 ]
  %.sroa.085.0 = fmul reassoc nnan ninf nsz arcp contract <8 x double> %res.i165, %v.i162.pn
  %50 = add i64 %value_phi30339, 8
  %51 = icmp slt i64 %50, %48
  br i1 %51, label %L198, label %L247

L247:                                             ; preds = %L240, %L183.preheader
  %.sroa.089.0.lcssa = phi <8 x double> [ <double 1.000000e+00, double undef, double undef, double undef, double undef, double undef, double undef, double undef>, %L183.preheader ], [ %.sroa.085.0, %L240 ]
  %value_phi30.lcssa = phi i64 [ 0, %L183.preheader ], [ %50, %L240 ]
  %value_phi37.lcssa = phi <8 x double> [ undef, %L183.preheader ], [ %.sroa.085.0, %L240 ]
  %tindex_phi41.lcssa = phi i8 [ 1, %L183.preheader ], [ 2, %L240 ]
  %52 = icmp slt i64 %value_phi30.lcssa, %11
  br i1 %52, label %L249, label %L247.L310_crit_edge

L247.L310_crit_edge:                              ; preds = %L247
  switch i8 %tindex_phi41.lcssa, label %L310 [
    i8 1, label %L322.thread
    i8 2, label %L310.thread
  ]

L249:                                             ; preds = %L247
  %ptr.i157 = inttoptr i64 %18 to i64*
  %offsetptr.i158 = getelementptr inbounds i64, i64* %ptr.i157, i64 %value_phi30.lcssa
  %ptr.i154 = bitcast i64* %offsetptr.i158 to <8 x i64>*
  %mask.i155 = bitcast i8 %28 to <8 x i1>
  %res.i156 = call <8 x i64> @llvm.masked.load.v8i64.p0v8i64(<8 x i64>* nonnull %ptr.i154, i32 8, <8 x i1> %mask.i155, <8 x i64> zeroinitializer)
  %res.i153 = add nsw <8 x i64> %res.i156, <i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1>
  %ie.i151 = insertelement <8 x i64> undef, i64 %11, i32 0
  %v.i152 = shufflevector <8 x i64> %ie.i151, <8 x i64> undef, <8 x i32> zeroinitializer
  %res.i150 = mul nsw <8 x i64> %res.i153, %v.i152
  %ie.i148 = insertelement <8 x i64> undef, i64 %value_phi30.lcssa, i32 0
  %v.i149 = shufflevector <8 x i64> %ie.i148, <8 x i64> undef, <8 x i32> zeroinitializer
  %res.i147 = add nsw <8 x i64> %v.i149, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
  %res.i146 = add nsw <8 x i64> %res.i147, %res.i150
  %ptr.i143 = inttoptr i64 %25 to double*
  %offsetptr.i144 = getelementptr inbounds double, double* %ptr.i143, <8 x i64> %res.i146
  %res.i142 = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %offsetptr.i144, i32 8, <8 x i1> %mask.i155, <8 x double> zeroinitializer)
  switch i8 %tindex_phi41.lcssa, label %L287 [
    i8 2, label %L293
    i8 1, label %L299
  ]

L287:                                             ; preds = %L249
  call void @jl_throw(%jl_value_t addrspace(12)* addrspacecast (%jl_value_t* inttoptr (i64 140677114703680 to %jl_value_t*) to %jl_value_t addrspace(12)*))
  unreachable

L293:                                             ; preds = %L249
  %value_phi49294 = fmul reassoc nnan ninf nsz arcp contract <8 x double> %value_phi37.lcssa, %res.i142
  %res.i136 = select <8 x i1> %mask.i155, <8 x double> %value_phi49294, <8 x double> %value_phi37.lcssa
  br label %L310.thread

L299:                                             ; preds = %L249
  %v.i139 = shufflevector <8 x double> %.sroa.089.0.lcssa, <8 x double> undef, <8 x i32> zeroinitializer
  %value_phi49 = fmul reassoc nnan ninf nsz arcp contract <8 x double> %v.i139, %res.i142
  %res.i134 = select <8 x i1> %mask.i155, <8 x double> %value_phi49, <8 x double> %v.i139
  br label %L310.thread

L310.thread:                                      ; preds = %L299, %L293, %L247.L310_crit_edge
  %value_phi52.ph = phi <8 x double> [ %res.i136, %L293 ], [ %res.i134, %L299 ], [ %value_phi37.lcssa, %L247.L310_crit_edge ]
  %ptr.i131299 = inttoptr i64 %14 to double*
  %res.i130301 = load double, double* %ptr.i131299, align 1
  %vec_4_1.i = shufflevector <8 x double> %value_phi52.ph, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %vec_4_2.i = shufflevector <8 x double> %value_phi52.ph, <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %vec_4.i = fmul <4 x double> %vec_4_1.i, %vec_4_2.i
  %vec_2_1.i = shufflevector <4 x double> %vec_4.i, <4 x double> undef, <2 x i32> <i32 0, i32 1>
  %vec_2_2.i = shufflevector <4 x double> %vec_4.i, <4 x double> undef, <2 x i32> <i32 2, i32 3>
  %vec_2.i = fmul <2 x double> %vec_2_1.i, %vec_2_2.i
  %vec_1_1.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> zeroinitializer
  %vec_1_2.i = shufflevector <2 x double> %vec_2.i, <2 x double> undef, <1 x i32> <i32 1>
  %vec_1.i = fmul <1 x double> %vec_1_1.i, %vec_1_2.i
  %res.i = extractelement <1 x double> %vec_1.i, i32 0
  %53 = fmul fast double %res.i, %res.i130301
  br label %L329

L310:                                             ; preds = %L247.L310_crit_edge
  call void @jl_throw(%jl_value_t addrspace(12)* addrspacecast (%jl_value_t* inttoptr (i64 140677114703680 to %jl_value_t*) to %jl_value_t addrspace(12)*))
  unreachable

L329:                                             ; preds = %L322.thread, %L310.thread
  %value_phi55 = phi double [ %53, %L310.thread ], [ %56, %L322.thread ]
  %54 = fadd double %value_phi55, 0.000000e+00
  br label %L331

L331:                                             ; preds = %L175, %L175.us, %L179, %L329
  %value_phi56 = phi double [ %54, %L329 ], [ 0.000000e+00, %L179 ], [ %40, %L175.us ], [ %45, %L175 ]
  ret double %value_phi56

L322.thread:                                      ; preds = %L247.L310_crit_edge
  %ptr.i131308 = inttoptr i64 %14 to double*
  %res.i130310 = load double, double* %ptr.i131308, align 1
  %55 = extractelement <8 x double> %.sroa.089.0.lcssa, i32 0
  %56 = fmul fast double %res.i130310, %55
  br label %L329
}

Specifically, look for the loop body:

L44.us:                                           ; preds = %L86.us, %L44.lr.ph.us
  %tindex_phi348.us = phi i8 [ 1, %L44.lr.ph.us ], [ 2, %L86.us ]
  %value_phi9347.us = phi <8 x double> [ undef, %L44.lr.ph.us ], [ %.sroa.0113.0.us, %L86.us ]
  %value_phi2346.us = phi i64 [ 0, %L44.lr.ph.us ], [ %34, %L86.us ]
  %.sroa.0117.1345.us = phi <8 x double> [ %.sroa.0117.0.vec.insert.us, %L44.lr.ph.us ], [ %.sroa.0113.0.us, %L86.us ]
  %33 = add i64 %value_phi2346.us, %32
  %offsetptr.i.us = getelementptr inbounds i64, i64* %ptr.i, i64 %33
  %ptr.i244.us = bitcast i64* %offsetptr.i.us to <8 x i64>*
  %res.i245.us = load <8 x i64>, <8 x i64>* %ptr.i244.us, align 8
  %res.i243.us = add nsw <8 x i64> %res.i245.us, <i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1, i64 -1>
  %res.i240.us = mul nsw <8 x i64> %res.i243.us, %v.i242
  %ie.i238.us = insertelement <8 x i64> undef, i64 %value_phi2346.us, i32 0
  %v.i239.us = shufflevector <8 x i64> %ie.i238.us, <8 x i64> undef, <8 x i32> zeroinitializer
  %res.i237.us = add nsw <8 x i64> %v.i239.us, <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
  %res.i236.us = add nsw <8 x i64> %res.i237.us, %res.i240.us
  %offsetptr.i234.us = getelementptr inbounds double, double* %ptr.i233, <8 x i64> %res.i236.us
  %res.i232.us = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %offsetptr.i234.us, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x double> undef)
  switch i8 %tindex_phi348.us, label %L84 [
    i8 2, label %L86.us
    i8 1, label %L77.us
  ]

L77.us:                                           ; preds = %L44.us
  %v.i229.us = shufflevector <8 x double> %.sroa.0117.1345.us, <8 x double> undef, <8 x i32> zeroinitializer
  br label %L86.us

L86.us:                                           ; preds = %L77.us, %L44.us
  %v.i229.pn.us = phi <8 x double> [ %v.i229.us, %L77.us ], [ %value_phi9347.us, %L44.us ]
  %.sroa.0113.0.us = fmul reassoc nnan ninf nsz arcp contract <8 x double> %res.i232.us, %v.i229.pn.us
  %34 = add i64 %value_phi2346.us, 8
  %35 = icmp slt i64 %34, %30
  br i1 %35, label %L44.us, label %L93.us
3 Likes

Wow - thanks for testing this. I’m trying to understand this now. So you’re saying that this “gather” operation, when performed in every loop iteration, will make AVX essentially useless? That is, I need to figure out a representation that doesn’t require me to load basis[a, n] at each iteration, but instead computes it on the fly?

The following seems to help a little, but still gains less than a factor of 2:

using LinearAlgebra: dot  # needed for the dot call below

function mvp_opt(tmp, P, basis, coeffs::Vector{T}) where {T}
   len_c = length(coeffs)
   len_P = size(P, 1)
   fill!(tmp, one(T))
   for a = 1:len_P
      @avx for n = 1:len_c
         tmp[n] *= P[a, basis[n, a]]
      end
   end
   return dot(coeffs, tmp)
end

tmp = zeros(nbasis)
basist = collect(basis')
@btime mvp_opt($tmp, $P, $basis, $coeffs)
1 Like

Thanks for this package! I don’t know anything about vectorization but I learned a lot from this thread.

I’m heavily using scatter/gather operations in my package FixedEffects
(here is the gather part and here is the scatter part). For now, I decompose the vector into nthreads parts and do the scatter/gather in each thread.

I guess one alternative would be to vectorize these parts. Do you think they could benefit from it? I have just tried to experiment, but @avx does not seem to support scatter yet.

I’m at work now, where I only have easy access to avx2 (although I could use AWS).

I wouldn’t say it makes AVX useless. If you’re really curious, you can take a look through Agner Fog’s instruction tables. Roughly in the middle of page 269, it says VGATHERQPD has a reciprocal throughput (RT) of about 5 clock cycles (meaning the CPU could do about 200 of them every 1000 cycles).
You can think of reciprocal throughput as the average amount of time each instruction takes when many are in flight. The actual number of cycles a single instruction takes (its latency) will generally be longer, but a single core can normally work on several instructions in parallel. If code can be run out of order, reciprocal throughput may be the better indicator. (Memory loads can be evaluated out of order, since we’re not writing to memory.)

For comparison, on page 267, VMOVUPS/D operating on v,m (any vector and memory) has an RT of 0.5 clock cycles – 10 times faster.
Additionally, all add/multiply/fused-multiply-add instructions have RTs of 0.5-1 cycle.
Gather is a steep price to pay; for a net speedup, the rest of the vectorization has to gain enough to make up for it.
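To make that gather-vs-contiguous-load comparison concrete, here is the per-element arithmetic implied by the numbers above (a rough sketch, not a measurement):

# Reciprocal throughputs quoted above, treated as rough approximations:
gather_rt = 5.0   # cycles per vgatherqpd, which loads 8 doubles
load_rt   = 0.5   # cycles per contiguous vmovupd of 8 doubles
width     = 8
gather_rt / width, load_rt / width   # ≈ (0.625, 0.0625) cycles per element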

For example, running on an AWS instance (to get avx512 at work):

julia> using LoopVectorization, BenchmarkTools

julia> A = rand(100,100); B1 = similar(A); B2 = similar(A);

julia> @benchmark @. $B1 = $A + $A'
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.882 μs (0.00% GC)
  median time:      6.932 μs (0.00% GC)
  mean time:        6.986 μs (0.00% GC)
  maximum time:     20.675 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> @benchmark @avx @. $B2 = $A + $A'
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.915 μs (0.00% GC)
  median time:      3.947 μs (0.00% GC)
  mean time:        3.967 μs (0.00% GC)
  maximum time:     7.984 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

@avx uses vgatherqpd here, and does get a speedup.
However, it is slower on the avx2 computer I tried it on (7.5 μs without @avx vs 10 μs with, instead of 6.9 vs 4 here).

In your example, it tried to vectorize the inner loop which had dim = 10. 10 is a little awkward for vectors of length 8.
Setting dim = 30 (again on AWS):

julia> maxdeg = 20; dim = 30;

julia> nbasis = 1_000;

julia> P = rand(dim, maxdeg + 1);

julia> function mvp(P, basis, coeffs::Vector{T}) where {T}
          len_c = length(coeffs)
          len_P = size(P, 1)
          p = zero(T)
          for n = 1:len_c
             pn = coeffs[n]
             for a = 1:len_P
                pn *= P[a, basis[a, n]]
             end
             p += pn
          end
          return p
       end
mvp (generic function with 1 method)

julia> function mvpinbounds(P, basis, coeffs::Vector{T}) where {T}
          len_c = length(coeffs)
          len_P = size(P, 1)
          p = zero(T)
          @inbounds for n = 1:len_c
             pn = coeffs[n]
             for a = 1:len_P
                pn *= P[a, basis[a, n]]
             end
             p += pn
          end
          return p
       end
mvpinbounds (generic function with 1 method)

julia> function mvpavx(P, basis, coeffs::Vector{T}) where {T}
          len_c = length(coeffs)
          len_P = size(P, 1)
          p = zero(T)
          @avx for n = 1:len_c
             pn = coeffs[n]
             for a = 1:len_P
                pn *= P[a, basis[a, n]]
             end
             p += pn
          end
          return p
       end
mvpavx (generic function with 1 method)

julia> basis = rand(1:(maxdeg+1), (dim, nbasis));

julia> coeffs = rand(nbasis);

julia> @btime mvp($P, $basis, $coeffs)
  40.369 μs (0 allocations: 0 bytes)
8.855318486319671e-7

julia> @btime mvpinbounds($P, $basis, $coeffs)
  28.164 μs (0 allocations: 0 bytes)
8.855318486319671e-7

julia> @btime mvpavx($P, $basis, $coeffs)
  19.370 μs (0 allocations: 0 bytes)
8.855318486319674e-7

30 (4x8, with 2 lanes wasted) is less awkward than 10 (2x8, with 6 lanes “wasted”): 30/32 of the lanes were useful, vs 10/16.
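In code, the useful-lane fraction for 8-wide vectors (a tiny illustrative helper, not part of LoopVectorization):

# Fraction of vector lanes doing useful work for trip count `dim` and vector width `w`.
useful(dim, w = 8) = dim / (w * cld(dim, w))
useful(10), useful(30)   # (0.625, 0.9375)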

I think that’s why your second version is faster (len_c is much larger than len_P).
I also see about a 2x improvement when dim = 10. (Also, note you meant to use basist instead of basis as an argument.)

While I haven’t really been using it, the plumbing is there for LoopVectorization to use static sizes / size hints to make optimization decisions (I’ve not made it take that into account intelligently yet, but it should be simple).
What would be a nice API for passing “size hints” that it will use when optimizing (generated loops will remain dynamic)?
Knowing that one dimension is much longer, it should be able to figure out on its own that it wants to vectorize along that much longer dimension instead of the short one. It won’t be able to transpose matrices for you, but that shouldn’t be as big a deal, since you’re gathering from it anyway.

It may depend on architecture, but definitely worth trying in my opinion.
I think the gather would likely be faster. I’m not sure about scatter.

I haven’t tested scatter as much. Please file an issue with a small example. I’ll get it working and add it to the tests.

6 Likes

I’ve been following this whole thread with great interest and wondering how it relates to arrays which have static size information available in the type. The obvious example is StaticArrays, which mostly takes the simple-minded but quite general approach of “unroll everything in sight, and hope for the best”. From the perspective of StaticArrays we don’t know how a given array will be used within a larger computation (we don’t have knowledge of a nested set of loops like you do with @avx), so perhaps it’s hard to do better. But I do wonder whether we’re missing out on important optimizations.

I feel like there’s quite some overlap with HybridArrays.jl here, which has sizes for a subset of dimensions available in the type. To make this all work in general (with a really nice API), have you thought about pursuing a partially type-based solution and moving some of the logic to later in the compilation pipeline? I see that’s kind of what you’ve already done with the broadcast generated function. The @avx macro wouldn’t go away because you’d obviously need custom loop lowering, but it might be helpful to mix in some type information.

3 Likes

LoopVectorization is an extremely cool package. It looks like one can write loops for matrix-matrix multiplication in any order, and this package will fix it up.

Pure Julia, triple loop on regular double-precision arrays.

Without the @avx: [benchmark plot]

With the @avx: [benchmark plot]

Reference: GitHub - PetrKryslUCSD/MatMulExperiments

5 Likes

It will be hard to actually support scatter using vectors of indices.
When trying to store a vector, you’ll have a vector of indices. It will have to make sure that none of these indices conflict, otherwise it’ll try to write two values to the same place at the same time, and the answers will be wrong (eg, in the += example that was linked).
Some architectures have conflict detection instructions, which could be used. However, LLVM doesn’t seem to provide an intrinsic (that I could find).
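As a concrete illustration of the conflict problem (plain Julia, no @avx; it just shows why duplicate indices break a naive gather-add-scatter):

idx = [1, 2, 2, 3]   # index 2 appears twice
v   = ones(4)

acc = zeros(3)
for i in eachindex(idx)   # correct scalar accumulation
    acc[idx[i]] += v[i]
end
acc    # [1.0, 2.0, 1.0]

# What a vectorized gather-add-scatter would effectively do: both lanes targeting
# slot 2 read the same old value, so one of the two updates is lost.
acc2 = zeros(3)
acc2[idx] .= acc2[idx] .+ v
acc2   # [1.0, 1.0, 1.0] -- wrong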

You can thank @Mason for adding a macro that wraps a generated function.
Currently, it simply turns an expression into a nested type object to pass to the generated function. The generated function then converts it back into an expression, and then uses the existing code to create a LoopSet object (which performs the analysis on loops) from the expression.

I’ll begin work shortly to extend this in two complementary ways:

  1. Use type information of the arguments to inform LoopSet.
  2. Define an interface for other libraries to be able to pass this information.

That interface will take the form of overloads of VectorizationBase.stridedpointer. This already has special overloads for transposes/adjoints and for SubArrays whose index is Tuple{Int,Vararg} (because they don’t have unit stride on the first axis, and in the case of transposes/adjoints do have unit stride in a different dimension).

MArray and SArray would each have their own overload.

Taking advantage of size is made more difficult by the fact that size is not the same thing as loop bounds. The code could also special-case size and length calls when defining loop bounds, and check whether an array argument’s size is known.
Any ideas for a type-based approach?
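(To illustrate the size-vs-loop-bounds distinction with a toy example: the loop below is bounded by size(A, 1) - 1, so statically knowing size(A) only helps if the size call itself is recognized while building the loop bounds.)

function forward_diff!(d, A)
    # assumes length(d) >= size(A, 1) - 1
    @inbounds for i in 1:size(A, 1) - 1   # bound derived from, but not equal to, size
        d[i] = A[i + 1] - A[i]
    end
    return d
end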

I’d also like this to be extended to use a StructVector (that is, like StructArrays) for small isbits composites, constructing separate vectors via shuffles. Except when the input array was already a StructArray, in which case it will be faster / can skip the shuffling. Besides implementing all of that, LoopVectorization’s optimizer cannot handle it right now either; it has no concept of a number occupying multiple registers. Ideally it could figure out accurate estimates for instruction costs as well.

As demoed in the benchmarks (I’ll add more in the next few days, and add icc and ifort to the mix), it tends to outperform the compiler on many of these loops:

julia> using LoopVectorization, StaticArrays, BenchmarkTools, LinearAlgebra

julia> function AmulBavx1!(C, A, B)
           @avx for m ∈ 1:size(A,1), n ∈ 1:size(B,2)
               Cₘₙ = zero(eltype(C))
               for k ∈ 1:size(A,2)
                   Cₘₙ += A[m,k] * B[k,n]
               end
               C[m,n] = Cₘₙ
           end
       end
AmulBavx1! (generic function with 1 method)

julia> A = @MMatrix rand(7,7);

julia> B = @MMatrix rand(7,7);

julia> C1 = similar(A); C2 = similar(A);

julia> @benchmark mul!($C1, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     66.800 ns (0.00% GC)
  median time:      68.009 ns (0.00% GC)
  mean time:        67.927 ns (0.00% GC)
  maximum time:     84.059 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     975

julia> @benchmark AmulBavx1!($C2, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.605 ns (0.00% GC)
  median time:      19.700 ns (0.00% GC)
  mean time:        19.725 ns (0.00% GC)
  maximum time:     31.941 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     997

@avx is fairly aggressive in unrolling the generated code itself. Julia/LLVM then do plenty of dead code elimination when the loops are short and statically sized.
We’re left with the fairly clean:

#julia> @code_native debuginfo=:none AmulBavx1!(C2, A, B)
	.text
	movq	%rsi, -8(%rsp)
	movq	(%rsi), %rax
	movq	8(%rsi), %rdx
	movq	16(%rsi), %rcx
	movb	$127, %sil
	kmovd	%esi, %k1
	vmovupd	(%rdx), %zmm1 {%k1} {z}
	vxorpd	%xmm0, %xmm0, %xmm0
	vxorpd	%xmm2, %xmm2, %xmm2
	vfmadd231pd	(%rcx){1to8}, %zmm1, %zmm2 # zmm2 = (zmm1 * mem) + zmm2
	vxorpd	%xmm3, %xmm3, %xmm3
	vfmadd231pd	56(%rcx){1to8}, %zmm1, %zmm3 # zmm3 = (zmm1 * mem) + zmm3
	vxorpd	%xmm4, %xmm4, %xmm4
	vfmadd231pd	112(%rcx){1to8}, %zmm1, %zmm4 # zmm4 = (zmm1 * mem) + zmm4
	vfmadd132pd	168(%rcx){1to8}, %zmm0, %zmm1 # zmm1 = (zmm1 * mem) + zmm0
	vmovupd	56(%rdx), %zmm5 {%k1} {z}
	vfmadd231pd	8(%rcx){1to8}, %zmm5, %zmm2 # zmm2 = (zmm5 * mem) + zmm2
	vfmadd231pd	64(%rcx){1to8}, %zmm5, %zmm3 # zmm3 = (zmm5 * mem) + zmm3
	vfmadd231pd	120(%rcx){1to8}, %zmm5, %zmm4 # zmm4 = (zmm5 * mem) + zmm4
	vfmadd231pd	176(%rcx){1to8}, %zmm5, %zmm1 # zmm1 = (zmm5 * mem) + zmm1
	vmovupd	112(%rdx), %zmm5 {%k1} {z}
	vfmadd231pd	16(%rcx){1to8}, %zmm5, %zmm2 # zmm2 = (zmm5 * mem) + zmm2
	vfmadd231pd	72(%rcx){1to8}, %zmm5, %zmm3 # zmm3 = (zmm5 * mem) + zmm3
	vfmadd231pd	128(%rcx){1to8}, %zmm5, %zmm4 # zmm4 = (zmm5 * mem) + zmm4
	vfmadd231pd	184(%rcx){1to8}, %zmm5, %zmm1 # zmm1 = (zmm5 * mem) + zmm1
	vmovupd	168(%rdx), %zmm5 {%k1} {z}
	vfmadd231pd	24(%rcx){1to8}, %zmm5, %zmm2 # zmm2 = (zmm5 * mem) + zmm2
	vfmadd231pd	80(%rcx){1to8}, %zmm5, %zmm3 # zmm3 = (zmm5 * mem) + zmm3
	vfmadd231pd	136(%rcx){1to8}, %zmm5, %zmm4 # zmm4 = (zmm5 * mem) + zmm4
	vfmadd231pd	192(%rcx){1to8}, %zmm5, %zmm1 # zmm1 = (zmm5 * mem) + zmm1
	vmovupd	224(%rdx), %zmm5 {%k1} {z}
	vfmadd231pd	32(%rcx){1to8}, %zmm5, %zmm2 # zmm2 = (zmm5 * mem) + zmm2
	vfmadd231pd	88(%rcx){1to8}, %zmm5, %zmm3 # zmm3 = (zmm5 * mem) + zmm3
	vfmadd231pd	144(%rcx){1to8}, %zmm5, %zmm4 # zmm4 = (zmm5 * mem) + zmm4
	vfmadd231pd	200(%rcx){1to8}, %zmm5, %zmm1 # zmm1 = (zmm5 * mem) + zmm1
	vmovupd	280(%rdx), %zmm5 {%k1} {z}
	vfmadd231pd	40(%rcx){1to8}, %zmm5, %zmm2 # zmm2 = (zmm5 * mem) + zmm2
	vfmadd231pd	96(%rcx){1to8}, %zmm5, %zmm3 # zmm3 = (zmm5 * mem) + zmm3
	vfmadd231pd	152(%rcx){1to8}, %zmm5, %zmm4 # zmm4 = (zmm5 * mem) + zmm4
	vfmadd231pd	208(%rcx){1to8}, %zmm5, %zmm1 # zmm1 = (zmm5 * mem) + zmm1
	vmovupd	336(%rdx), %zmm5 {%k1} {z}
	vfmadd231pd	48(%rcx){1to8}, %zmm5, %zmm2 # zmm2 = (zmm5 * mem) + zmm2
	vfmadd231pd	104(%rcx){1to8}, %zmm5, %zmm3 # zmm3 = (zmm5 * mem) + zmm3
	vfmadd231pd	160(%rcx){1to8}, %zmm5, %zmm4 # zmm4 = (zmm5 * mem) + zmm4
	vfmadd231pd	216(%rcx){1to8}, %zmm5, %zmm1 # zmm1 = (zmm5 * mem) + zmm1
	vmovupd	%zmm2, (%rax) {%k1}
	vmovupd	%zmm3, 56(%rax) {%k1}
	vmovupd	%zmm4, 112(%rax) {%k1}
	vmovupd	%zmm1, 168(%rax) {%k1}
	vmovupd	(%rdx), %zmm1 {%k1} {z}
	vxorpd	%xmm2, %xmm2, %xmm2
	vfmadd231pd	224(%rcx){1to8}, %zmm1, %zmm2 # zmm2 = (zmm1 * mem) + zmm2
	vxorpd	%xmm3, %xmm3, %xmm3
	vfmadd231pd	280(%rcx){1to8}, %zmm1, %zmm3 # zmm3 = (zmm1 * mem) + zmm3
	vfmadd231pd	336(%rcx){1to8}, %zmm1, %zmm0 # zmm0 = (zmm1 * mem) + zmm0
	vmovupd	56(%rdx), %zmm1 {%k1} {z}
	vfmadd231pd	232(%rcx){1to8}, %zmm1, %zmm2 # zmm2 = (zmm1 * mem) + zmm2
	vfmadd231pd	288(%rcx){1to8}, %zmm1, %zmm3 # zmm3 = (zmm1 * mem) + zmm3
	vfmadd231pd	344(%rcx){1to8}, %zmm1, %zmm0 # zmm0 = (zmm1 * mem) + zmm0
	vmovupd	112(%rdx), %zmm1 {%k1} {z}
	vfmadd231pd	240(%rcx){1to8}, %zmm1, %zmm2 # zmm2 = (zmm1 * mem) + zmm2
	vfmadd231pd	296(%rcx){1to8}, %zmm1, %zmm3 # zmm3 = (zmm1 * mem) + zmm3
	vfmadd231pd	352(%rcx){1to8}, %zmm1, %zmm0 # zmm0 = (zmm1 * mem) + zmm0
	vmovupd	168(%rdx), %zmm1 {%k1} {z}
	vfmadd231pd	248(%rcx){1to8}, %zmm1, %zmm2 # zmm2 = (zmm1 * mem) + zmm2
	vfmadd231pd	304(%rcx){1to8}, %zmm1, %zmm3 # zmm3 = (zmm1 * mem) + zmm3
	vfmadd231pd	360(%rcx){1to8}, %zmm1, %zmm0 # zmm0 = (zmm1 * mem) + zmm0
	vmovupd	224(%rdx), %zmm1 {%k1} {z}
	vfmadd231pd	256(%rcx){1to8}, %zmm1, %zmm2 # zmm2 = (zmm1 * mem) + zmm2
	vfmadd231pd	312(%rcx){1to8}, %zmm1, %zmm3 # zmm3 = (zmm1 * mem) + zmm3
	vfmadd231pd	368(%rcx){1to8}, %zmm1, %zmm0 # zmm0 = (zmm1 * mem) + zmm0
	vmovupd	280(%rdx), %zmm1 {%k1} {z}
	vfmadd231pd	264(%rcx){1to8}, %zmm1, %zmm2 # zmm2 = (zmm1 * mem) + zmm2
	vfmadd231pd	320(%rcx){1to8}, %zmm1, %zmm3 # zmm3 = (zmm1 * mem) + zmm3
	vfmadd231pd	376(%rcx){1to8}, %zmm1, %zmm0 # zmm0 = (zmm1 * mem) + zmm0
	vmovupd	336(%rdx), %zmm1 {%k1} {z}
	vfmadd231pd	272(%rcx){1to8}, %zmm1, %zmm2 # zmm2 = (zmm1 * mem) + zmm2
	vfmadd231pd	328(%rcx){1to8}, %zmm1, %zmm3 # zmm3 = (zmm1 * mem) + zmm3
	vfmadd231pd	384(%rcx){1to8}, %zmm1, %zmm0 # zmm0 = (zmm1 * mem) + zmm0
	vmovupd	%zmm2, 224(%rax) {%k1}
	vmovupd	%zmm3, 280(%rax) {%k1}
	vmovupd	%zmm0, 336(%rax) {%k1}
	movabsq	$jl_system_image_data, %rax
	vzeroupper
	retq
	nopl	(%rax)

[ It may be better to use fastmath so that the xor + fmadd can be replaced with a mul, but I’ve noticed that in earlier versions of LLVM, this would lead to lots of extraneous ‘vmovapd’ instructions. I haven’t tried in a while. ]

Letting LLVM take care of it in this way is a form of moving it later in the pipeline.
A question is, how much better could we do ourselves?
We’re hurt by the fact that it is the loop remainder / clean up code that does the heavy lifting. We could do a better job divvying up small numbers of iterations.

@PetrKryslUCSD I also like how the first graphic peaks at <2, while the second hovers around 10.
EDIT: Although it looks like you didn’t use @inbounds or @simd for the comparisons.

2 Likes

I added the six permutations of the matrix-matrix multiplication with @inbounds.

EDIT: Performance has also been evaluated for @simd.

4 Likes

@simd needs @inbounds to work; otherwise, the branches (bounds checks) prevent the optimizations.
In this case, @inbounds is all that is required to vectorize the loop, because your mulCABjpi and mulCABpji loops don’t need any reordering to vectorize (thus @simd’s permission isn’t required).
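For reference, the usual combination looks like this (a generic reduction sketch, not code from the benchmark repo; @inbounds removes the branches, and @simd grants permission to reassociate the sum):

function dot_simd(x, y)
    s = zero(promote_type(eltype(x), eltype(y)))
    @inbounds @simd for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    return s
end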

2 Likes

Good catch, thanks.

Check out the updated benchmarks in the opening post for matrix-matrix multiplication and matrix-vector multiplication.

The same @avx loop didn’t need modification for transposing the argument arrays; LoopVectorization was able to infer whether the arrays were column or row major based on whether or not they were wrapped in an Adjoint/Transpose struct (ie, column major if they aren’t, row major if they are), and then optimize accordingly.

The macro now constructs a call to a generated function, providing access to this type information.

It can also be used to read size information by overloading maybestaticsize and/or maybestaticlength. This requires that you actually use size or length directly to define loop bounds; it will not read them from the arrays, as array size does not correspond to loop bounds in general.

For example

julia> using LoopVectorization, StaticArrays, BenchmarkTools, LinearAlgebra

julia> @inline function gemmavx!(C, A, B)
           @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
               Cᵢⱼ = 0.0
               for k ∈ 1:size(A,2)
                   Cᵢⱼ += A[i,k] * B[k,j]
               end
               C[i,j] = Cᵢⱼ
           end
       end
gemmavx! (generic function with 1 method)

julia> A = @MMatrix rand(15, 22);

julia> B = @MMatrix rand(22, 13);

julia> C = MMatrix{15,13,Float64}(undef);

julia> @benchmark mul!($C, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     277.336 ns (0.00% GC)
  median time:      318.701 ns (0.00% GC)
  mean time:        319.073 ns (0.00% GC)
  maximum time:     376.450 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     298

julia> @benchmark gemmavx!($C, $A, $B) # no size info
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     104.176 ns (0.00% GC)
  median time:      105.280 ns (0.00% GC)
  mean time:        105.585 ns (0.00% GC)
  maximum time:     152.010 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     937

julia> LoopVectorization.maybestaticsize(::MArray{<:Tuple{M,Vararg}}, ::Val{1}) where {M} = LoopVectorization.Static{M}()

julia> LoopVectorization.maybestaticsize(::MArray{<:Tuple{M,N,Vararg}}, ::Val{2}) where {M,N} = LoopVectorization.Static{N}()

julia> @benchmark gemmavx!($C, $A, $B) # size info
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     83.113 ns (0.00% GC)
  median time:      83.666 ns (0.00% GC)
  mean time:        84.118 ns (0.00% GC)
  maximum time:     113.457 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     964

Currently, LoopVectorization would normally use a 32x4 tile given AVX512. That’s less than ideal if you’re trying to calculate a 15x13 product. Given the static size information, it’ll use a 16x13 tile instead.

6 Likes

Thank you for updating the OP. After rereading it in detail I noticed under limitations that only Cartesian (rectangular) loop nests are supported. What would it take to support loops over subsets of a cube that are defined by some simple analytically computed bounds, e.g. a multi-dimensional simplex? Is there a deep fundamental difficulty in supporting this?

I plan to eventually add support for some shapes. Maybe some day I’ll look into implementing polyhedral modeling.
In the short term, I want to support loops operating on only one triangle of a matrix (a 2-simplex?), to support symmetric, upper, and lower triangular matrices.
I’d like it to support such operations as triangular/symmetric matrix multiplication (multiplied with general or other triangular/symmetric matrices), solving triangular systems of equations (and inverting triangular matrices), as well as some basic decompositions like the Cholesky decomposition.
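For concreteness, the kind of triangular loop nest I mean, written as plain Julia (a lower-triangular matrix-vector product; not something @avx handles today):

function trmv_lower!(y, L, x)
    n = size(L, 1)
    fill!(y, zero(eltype(y)))
    for j in 1:n, i in j:n   # i only runs over the lower triangle
        y[i] += L[i, j] * x[j]
    end
    return y
end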

What I’d need to be able to handle these includes:

  1. A representation of the loop bounds I can transform to swap loop orders and tile. This shouldn’t be difficult if restricted to simple patterns.
  2. Determine strategies for efficiently evaluating these loops.
  3. Code to model the cost of those evaluation strategies, so that it can pick which to use for a given set of loops.
  4. Code to produce a Julia expression following the strategy it picked.

I’ll start thinking about “2.”.

Moving forward, I’d like to try modifying LoopVectorization to be able to optimize the problems I’m working on, rather than optimize those problems directly.

I would also like it to be able to help with the sorts of loops encountered when fitting models like longitudinal model-based network meta analysis, which requires ragged arrays.

10 Likes

The feature I’m most interested in is cache efficiency modelling so that LoopVectorization can match BLAS for large arrays. You mention plans for it in the README; do you mind if I ask where it lies on your priority queue and how difficult you anticipate it to be?

I don’t want to be demanding and I hugely appreciate the astonishingly fast work you’ve put in so far. I just want to manage my expectations.

9 Likes

Unfortunately, it will probably be a while. My priorities at the moment:

  1. Update ProbabilityModels to use LoopVectorization, and release it.
  2. Add support for operating on triangular matrices.
  3. Memory efficiency.

This will probably take months.
I don’t think it will be too difficult, but will probably take at least a couple weeks of dedicated work. I suspect “within a few percent” of OpenBLAS is achievable (and it should be easy to beat on avx512 systems).
I am not much good at forecasting how long something will take, and tend to drastically undershoot.

I’d like to say “before the end of the year”, but I cannot make promises.

I’m paid to develop tools for statistical analysis. That affords me a little freedom to work now for a more convenient future, but I have to show some progress / cannot look like I am lost in the weeds for too long.

16 Likes

Is the macro safe for generic types, i.e. will it fall back to doing nothing if it hits a CuArray or an array of quaternions?

4 Likes

Generally speaking, no. It is not really designed to be a safe macro; for instance, it doesn’t perform bounds checking. Now that the backend of the macro is a generated function, it does have type-level information it can use to start providing smarter errors, but for the most part they’re not friendly or necessarily informative errors yet. However, it should usually error rather than segfault, though I’m not sure there are any strong guarantees yet.

Right, so there’s a 3.4× speedup for those medium-sized MMatrixes on my avx2 system, vs the simple unrolled mul!, even without static size information. That’s a huge improvement, which would eventually be great to expose directly from StaticArrays itself. What I’m not sure about is how to do that appropriately for the compilation target. (For example, doing this low-level performance stuff in the front/middle end might be problematic for GPU users of static arrays? I’m not really familiar with how CUDANative hooks into the compilation pipeline.)

Very nice, this makes a lot of sense :+1: By the way, writing maybestaticsize more generically using the StaticArrays.Size trait seems to work:

LoopVectorization.maybestaticsize(A::StaticArray, ::Val{N}) where {N} = LoopVectorization.Static{Size(A)[N]}()
3 Likes

I think that would be reasonable to implement.
When it is implemented, I guess it would be preferred to have array types opt in rather than opt out (i.e., require that they define certain methods whose fallback definitions result in opting out)?
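Concretely, the opt-in could look something like this (the trait name is hypothetical, not an existing LoopVectorization API):

can_avx(::Type{<:AbstractArray}) = false   # fallback: opt out
can_avx(::Type{<:Array}) = true            # opt in explicitly
can_avx(::Type{<:SubArray{T,N,P}}) where {T,N,P} = can_avx(P)   # inherit from the parent array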

The errors are really dumb at the moment (only helpful to me, who knows the internals well enough to know what went wrong), but I’d rather fix the error than make it more helpful ;).

[quote=“Mason, post:60, topic:32843”]
However it should usually error rather than segfault, but I’m not sure there are any strong guarantees yet.[/quote]

I’ll fix broadcasting to support dimensions of 1, since, as is, someone who doesn’t read the opening post closely will be in for a nasty surprise.
The check will result in a tiny performance penalty to broadcasting, but billions of times less than is lost to a crash.
When I do, I’ll also add a means of opting out of the check, other than just writing out the loops.

So far, most of the segfaults I’ve personally encountered have been caused by this issue, which looks like it may soon be fixed. If code was generated incorrectly, sometimes things don’t inline, and then a struct-wrapped vector gets passed through a function barrier, triggering the LLVM alignment assertion.

LoopVectorization itself should internally be safe. There are only 4 @inbounds used within the library, and they all involve static ranges/indices.

The generated code itself is unsafe, but hopefully the users took care to check bounds themselves.
I’ll start adding code for checking to my examples, to make it more clear that this is required.
It also always GC.@preserves all the arrays it takes a pointer to.
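As an example of the kind of user-side check I mean (a hedged sketch built around the mvpavx example from earlier in this thread; the wrapper name is made up):

function mvpavx_checked(P, basis, coeffs)
    # Validate everything the unchecked @avx kernel assumes, once, up front.
    size(basis, 2) >= length(coeffs) || throw(DimensionMismatch("basis has too few columns for coeffs"))
    size(basis, 1) >= size(P, 1)     || throw(DimensionMismatch("basis has too few rows for P"))
    lo, hi = extrema(basis)
    (1 <= lo && hi <= size(P, 2))    || throw(ArgumentError("basis entries must index columns of P"))
    return mvpavx(P, basis, coeffs)  # the unchecked kernel defined earlier in the thread
end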

Would be great if StaticArrays started taking advantage of this!
MArray and SizedArray certainly can, but the namesake SArray is harder.

NTuples lower to LLVM arrays, and I can’t get pointers to them. This means I can’t use LLVM intrinsics to efficiently load and store vectors.
LLVM is good at producing the correct instructions from a series of getindexes when the vector lengths are a power of 2, but does poorly otherwise, while the intrinsics work well regardless, via vmaskmov on AVX (or a bitmasked move on avx512).

Something to keep an eye on once Julia moves to LLVM 10 is the new matrix intrinsics. These are transpose, multiply, load, and store intrinsics for column-major matrices.

Another suggestion may be, now that this has merged, to create a new version of StaticArrays that drops support for older versions of Julia and, instead of wrapping an NTuple{N,T} (lowered to an LLVM array), wraps an NTuple{N,Core.VecElement{T}} (lowered to an LLVM vector). These vectors can be treated directly as matrices for the new multiply / transpose / load / store operations.
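A sketch of what that representation might look like (purely hypothetical, not how StaticArrays is currently defined):

# Hypothetical: back the static matrix with VecElements so the data lowers to an
# LLVM vector rather than an LLVM array.
struct VecSMatrix{M,N,T,L}
    data::NTuple{L,Core.VecElement{T}}   # L == M*N, stored column major
end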

Other vectors can be loaded via shufflevectors (although I wouldn’t bet on this always generating good code).
An alternative would be a tuple of vectors, so that each column is its own vector.

These require the “subtly change meaning of datatype align” PR to merge to avoid being inundated with crashes.
But if folks are willing to experiment with the internal representation of an SArray, there’s a lot of potential on the horizon. MArrays don’t need it since we have pointers.

I don’t know much about GPU compute, or where the parallelization happens when someone does array operations with StaticArrays, but I’d strongly suspect generating code optimized for the host CPU is not going to be doing LLVM or the GPU any favors when the GPU is the target.

That’s much nicer. I should get more comfortable with type inference and const prop in these situations.

4 Likes