LoopVectorization almost doubles execution time?

Hi everyone,
I am testing LoopVectorization in my program, and since it did not produce any change in speed, I boiled the problem down to an exemplary function. There I found something that puzzled me: although the function is basically a glorified scalar product, which has worked well for me with LoopVectorization before, I get an increase in runtime by a factor of ~1.5.

using LoopVectorization,BenchmarkTools

function testSum(indices,Vectors)
    A,B,C,D = Vectors
    sum = 0.
    for ksum in eachindex(indices)
        k = indices[ksum]
        sum += A[k] * B[k] +C[k] * D[k]
    end
    return sum
end

function testSum_avx(indices,Vectors)
    A,B,C,D = Vectors
    sum = 0.
    @turbo for ksum in eachindex(indices)
        k = indices[ksum]
        sum += A[k] * B[k] +C[k] * D[k]
    end
    return sum
end

N = 63
Vectors = Tuple(rand(N) for _ in 1:4);
indices = rand(1:N,500);

@btime testSum($indices,$Vectors)
@btime testSum_avx($indices,$Vectors)

which returns

  510.417 ns (0 allocations: 0 bytes)
  871.698 ns (0 allocations: 0 bytes)

(If I replace @turbo by Base.@simd I do get a speedup)
I don’t know if understanding this will solve my original problem, but at this point I’d like to learn more about the limitations of LoopVectorization. Does anyone have experience with getting the best performance out of the package that they can share? I’d be happy with an answer to any of these points:

  1. What causes the runtime increase here? It seems to be connected to finding the correct index into the vectors via the “indices” array, but I will need to do that in my more complicated original problem.
  2. If I do not gain any benefit from either @simd or @turbo, is there any way of figuring out whether my loop is even getting vectorized, or whether there is something that I am doing wrong?
  3. Are there any advanced tips on what I can do to get the most out of LV? I tried playing around with the unroll and inline options, but so far they haven’t done much.

Looking forward to hearing your ideas :slight_smile:


Looks like a bug. I would file an issue.


Thanks,
I have opened an issue: Increased execution time bug (Ryzen CPU only?) · Issue #299 · JuliaSIMD/LoopVectorization.jl · GitHub
So far it seems to me that this only affects Ryzen CPUs.
In any case, I am still interested in whether anyone has an answer to my other questions, i.e. regarding the equivalent run times between LoopVectorization and non-vectorized code and how to identify the reason.

Unrelated but why

for ksum in eachindex(indices)
    k = indices[ksum]

and not

for k in indices

?

Did you replace @turbo with @inbounds @simd or just @simd?
@inbounds @simd would be SIMD, but @simd by itself here would not.

You can use code_llvm or Cthulhu.jl to inspect code and see if it was vectorized.
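For example, something like this prints the LLVM IR for the concrete argument types (a minimal sketch, reusing the definitions and inputs from the first post; Cthulhu’s @descend is the interactive equivalent):

using InteractiveUtils # exports @code_llvm (already loaded in the REPL)

# Look for vector types such as <8 x double> and calls like llvm.masked.gather
# in the output to tell whether, and how, the loop was vectorized.
@code_llvm debuginfo=:none testSum(indices, Vectors)
@code_llvm debuginfo=:none testSum_avx(indices, Vectors)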
Here is what I get: the loop for testSum is filled with branches and is not SIMD:

L17:                                              ; preds = %idxend13, %L17.preheader
  %value_phi3 = phi i64 [ %65, %idxend13 ], [ 1, %L17.preheader ]
  %value_phi5 = phi double [ %64, %idxend13 ], [ 0.000000e+00, %L17.preheader ]
  %exitcond.not = icmp eq i64 %value_phi3, %39
  br i1 %exitcond.not, label %oob, label %idxend

L40:                                              ; preds = %idxend13, %top
  %value_phi17 = phi double [ 0.000000e+00, %top ], [ %64, %idxend13 ]
  ret double %value_phi17

oob:                                              ; preds = %L17
  %40 = alloca i64, align 8
  store i64 %39, i64* %40, align 8
  call void @jl_bounds_error_ints({}* %0, i64* nonnull %40, i64 1)
  unreachable

idxend:                                           ; preds = %L17
  %41 = add nsw i64 %value_phi3, -1
  %42 = getelementptr inbounds i64, i64* %18, i64 %41
  %43 = load i64, i64* %42, align 8
  %44 = add i64 %43, -1
  %45 = icmp ult i64 %44, %21
  br i1 %45, label %idxend7, label %oob6

oob6:                                             ; preds = %idxend
  %46 = alloca i64, align 8
  store i64 %43, i64* %46, align 8
  call void @jl_bounds_error_ints({}* %3, i64* nonnull %46, i64 1)
  unreachable

idxend7:                                          ; preds = %idxend
  %47 = icmp ult i64 %44, %24
  br i1 %47, label %idxend9, label %oob8

oob8:                                             ; preds = %idxend7
  %48 = alloca i64, align 8
  store i64 %43, i64* %48, align 8
  call void @jl_bounds_error_ints({}* %5, i64* nonnull %48, i64 1)
  unreachable

idxend9:                                          ; preds = %idxend7
  %49 = getelementptr inbounds double, double* %26, i64 %44
  %50 = load double, double* %49, align 8
  %51 = getelementptr inbounds double, double* %28, i64 %44
  %52 = load double, double* %51, align 8
  %53 = fmul double %50, %52
  %54 = icmp ult i64 %44, %31
  br i1 %54, label %idxend11, label %oob10

oob10:                                            ; preds = %idxend9
  %55 = alloca i64, align 8
  store i64 %43, i64* %55, align 8
  call void @jl_bounds_error_ints({}* %7, i64* nonnull %55, i64 1)
  unreachable

idxend11:                                         ; preds = %idxend9
  %56 = icmp ult i64 %44, %34
  br i1 %56, label %idxend13, label %oob12

oob12:                                            ; preds = %idxend11
  %57 = alloca i64, align 8
  store i64 %43, i64* %57, align 8
  call void @jl_bounds_error_ints({}* %9, i64* nonnull %57, i64 1)
  unreachable

idxend13:                                         ; preds = %idxend11
  %58 = getelementptr inbounds double, double* %36, i64 %44
  %59 = load double, double* %58, align 8
  %60 = getelementptr inbounds double, double* %38, i64 %44
  %61 = load double, double* %60, align 8
  %62 = fmul double %59, %61
  %63 = fadd double %53, %62
  %64 = fadd double %value_phi5, %63
  %.not.not20 = icmp eq i64 %value_phi3, %13
  %65 = add nuw nsw i64 %value_phi3, 1
  br i1 %.not.not20, label %L40, label %L17
}

The branches are for all the bounds checks.

This is testSum_avx:

L37:                                              ; preds = %L37, %top
  %value_phi464 = phi i64 [ %res.i30, %L37 ], [ 0, %top ]
  %value_phi63 = phi <8 x double> [ %res.i31, %L37 ], [ zeroinitializer, %top ]
  %res.i48 = shl nsw i64 %value_phi464, 3
  %ptr.1.i45 = getelementptr inbounds i8, i8* %18, i64 %res.i48
  %ptr.2.i46 = bitcast i8* %ptr.1.i45 to <8 x i64>*
  %res.i47 = load <8 x i64>, <8 x i64>* %ptr.2.i46, align 8
  %ptr.1.i43 = getelementptr inbounds double, double* %ptr.1.i59, <8 x i64> %res.i47
  %res.i44 = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %ptr.1.i43, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x double> undef)
  %ptr.1.i40 = getelementptr inbounds double, double* %ptr.1.i56, <8 x i64> %res.i47
  %res.i41 = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %ptr.1.i40, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x double> undef)
  %ptr.1.i37 = getelementptr inbounds double, double* %ptr.1.i53, <8 x i64> %res.i47
  %res.i38 = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %ptr.1.i37, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x double> undef)
  %ptr.1.i34 = getelementptr inbounds double, double* %ptr.1.i50, <8 x i64> %res.i47
  %res.i35 = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %ptr.1.i34, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x double> undef)
  %res.i32 = call reassoc nsz arcp contract afn <8 x double> @llvm.fma.v8f64(<8 x double> %res.i38, <8 x double> %res.i35, <8 x double> %value_phi63)
  %res.i31 = call reassoc nsz arcp contract afn <8 x double> @llvm.fma.v8f64(<8 x double> %res.i44, <8 x double> %res.i41, <8 x double> %res.i32)
  %res.i30 = add nuw nsw i64 %value_phi464, 8
  %.not = icmp eq i64 %res.i30, %24
  br i1 %.not, label %L47, label %L37

Highlights:

%res.i47 = load <8 x i64>, <8 x i64>* %ptr.2.i46, align 8

This loads an <8 x i64> (a SIMD vector) from `indices`. Loading 8 elements in a row like this is fast.
Then, there are 4 instances that look like

  %ptr.1.i43 = getelementptr inbounds double, double* %ptr.1.i59, <8 x i64> %res.i47
  %res.i44 = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> %ptr.1.i43, i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x double> undef)

That is, it takes a base pointer to the array and then uses getelementptr to index into it with the vector %res.i47. This returns a vector of pointers.
Then it uses gather to load a vector of <8 x double>.

gather is very slow. It varies by CPU, but gathering X elements tends to take as long as loading X elements one by one, meaning that if the slowest part of your code is the gathers, it’d be hard to get a speedup at all.
On my laptop, I get:

julia> versioninfo()
Julia Version 1.6.2-pre.0
Commit dd122918ce* (2021-04-23 21:21 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, tigerlake)
Environment:
  JULIA_NUM_THREADS = 8

julia> function testSum_inbounds(indices,Vectors)
           A,B,C,D = Vectors
           sum = 0.
           @inbounds for ksum in eachindex(indices)
               k = indices[ksum]
               sum += A[k] * B[k] +C[k] * D[k]
           end
           return sum
       end
testSum_inbounds (generic function with 1 method)

julia> function testSum_avx_u2(indices,Vectors)
           A,B,C,D = Vectors
           sum = 0.
           @turbo unroll=2 for ksum in eachindex(indices)
               k = indices[ksum]
               sum += A[k] * B[k] +C[k] * D[k]
           end
           return sum
       end
testSum_avx_u2 (generic function with 1 method)

julia> function testSum_avx(indices,Vectors)
           A,B,C,D = Vectors
           sum = 0.
           @turbo for ksum in eachindex(indices)
               k = indices[ksum]
               sum += A[k] * B[k] +C[k] * D[k]
           end
           return sum
       end
testSum_avx (generic function with 1 method)

julia> function testSum_simd(indices,Vectors)
           A,B,C,D = Vectors
           sum = 0.
           @inbounds @simd for ksum in eachindex(indices)
               k = indices[ksum]
               sum += A[k] * B[k] +C[k] * D[k]
           end
           return sum
       end
testSum_simd (generic function with 1 method)

julia> function testSum(indices,Vectors)
           A,B,C,D = Vectors
           sum = 0.
           for ksum in eachindex(indices)
               k = indices[ksum]
               sum += A[k] * B[k] +C[k] * D[k]
           end
           return sum
       end
testSum (generic function with 1 method)

julia> @btime testSum_inbounds($indices,$Vectors)
  421.201 ns (0 allocations: 0 bytes)
247.660070648417

julia> @btime testSum_avx($indices,$Vectors)
  304.891 ns (0 allocations: 0 bytes)
247.66007064841654

julia> @btime testSum_avx_u2($indices,$Vectors)
  299.310 ns (0 allocations: 0 bytes)
247.66007064841654

julia> @btime testSum_simd($indices,$Vectors)
  295.878 ns (0 allocations: 0 bytes)
247.66007064841656

julia> @btime testSum($indices,$Vectors)
  487.128 ns (0 allocations: 0 bytes)
247.660070648417

julia> function testSum_avx_u4(indices,Vectors)
           A,B,C,D = Vectors
           sum = 0.
           @turbo unroll=4 for ksum in eachindex(indices)
               k = indices[ksum]
               sum += A[k] * B[k] +C[k] * D[k]
           end
           return sum
       end
testSum_avx_u4 (generic function with 1 method)

julia> @btime testSum_avx_u4($indices,$Vectors)
  287.645 ns (0 allocations: 0 bytes)
247.66007064841654

So it looks like LoopVectorization might not be unrolling enough. By default, it chooses not to unroll at all, which yields a slower time than forcing it to unroll by 4.

@inbounds @simd does unroll by 4, and it looks like that choice is a bit better here.

@gustaphe LoopVectorization only supports AbstractRange{<:Integer} iterators, meaning the former works with @turbo but the latter does not.
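To spell that out with the loop from this thread (a sketch; the sum_over_* function names are only for illustration):

using LoopVectorization

# Works: eachindex(indices) is an AbstractUnitRange, which @turbo supports.
function sum_over_range(indices, A, B)
    s = 0.0
    @turbo for i in eachindex(indices)
        k = indices[i]
        s += A[k] * B[k]
    end
    return s
end

# Does not work with @turbo: `indices` is a Vector, not an AbstractRange,
# so this form of the loop has to stay plain Julia.
# function sum_over_elements(indices, A, B)
#     s = 0.0
#     @turbo for k in indices
#         s += A[k] * B[k]
#     end
#     return s
# end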


Regarding the performance of different architectures, here is a table giving the performance of 256-bit gathers, which use 4 x Int64 indices to load 4 x Float64.
The throughput figures give the reciprocal throughput. Lower is better; you can interpret this as the average number of clock cycles per completion of such an instruction when you’re executing a lot of them.
The latency gives the actual time for an individual instruction, but many instructions can be executed in parallel, hence the throughput is the most useful figure here.

Arch        RThroughput
Zen+        12
Zen2        9
Zen3        4
Haswell     8
Skylake     4
Tiger Lake  3

Zen+ CPUs, like the ones you tested on, are very bad here and really ought to avoid using any gather instructions at all. Haswell and Zen2 are bad, too.

But you can see the trend that this does keep getting better with newer architectures, so we can hope for further improvements in the future.

The fix in LoopVectorization will be to add an option to not vectorize code at all, and then have it avoid vectorizing such loops when we expect no speedup, or even a slowdown, due to the bad performance of the instructions involved.

For comparison, scalar loads have a reciprocal throughput of 0.5 on all of these architectures.
Four scalar loads thus come in at 4 * 0.5 = 2.
That is, 4 scalar loads are still 50% faster than the gather instruction on Tiger Lake, and at least 2x faster everywhere else.
On Zen+, they’re 6x faster.

Loading 8 at a time, like in the LLVM code I shared above (possible on CPUs with AVX-512), has a reciprocal throughput of 5, vs. the 8 * 0.5 = 4 expected from 8 scalar loads. But that is only 25% slower, which is starting to get into more reasonable territory.
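If you want to see the gather cost in isolation on your own machine, one quick check is to compare the indexed loop against one that reads the same data contiguously (a rough sketch; the dot4_* names and the sizes are purely illustrative):

using LoopVectorization, BenchmarkTools

# Contiguous loads: @turbo can use plain vector loads here.
function dot4_contiguous(A, B, C, D)
    s = 0.0
    @turbo for k in eachindex(A)
        s += A[k] * B[k] + C[k] * D[k]
    end
    return s
end

# Indexed loads: the indirection through `indices` forces gather instructions.
function dot4_gather(indices, A, B, C, D)
    s = 0.0
    @turbo for ksum in eachindex(indices)
        k = indices[ksum]
        s += A[k] * B[k] + C[k] * D[k]
    end
    return s
end

A, B, C, D = rand(500), rand(500), rand(500), rand(500)
indices = collect(1:500)   # same access pattern, just expressed through an index vector

@btime dot4_contiguous($A, $B, $C, $D)
@btime dot4_gather($indices, $A, $B, $C, $D)

Given the table above, I’d expect the gap between the two to be much larger on Zen+ than on Tiger Lake.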


Yes, that was what I meant, I should have been more specific here.

You can use code_llvm or Cthulhu.jl to inspect code and see if it was vectorized.

Thanks, I’ll try this and maybe figure out the issue behind the lack of performance gain that led me down this rabbit hole :slight_smile:. I hope that, at least on the (Skylake-X) CPU, I will be able to increase my program’s performance.

Zen+, like the CPUs you tested on, are very bad here and really ought to avoid using any gather instructions. Haswell and Zen2 are bad, too.

Thanks for the solution to my mystery and also for the further explanation! I believe this checks off all the questions on my list so far!