LoopVectorization: @turbo performs worse than @inbounds on trivial loop

Mark the @turbo function @inline:

julia> a = rand(64);

julia> b = rand(64);

julia> c = rand(64);

julia> @inline function vmul_turbo!(a, b, c)
           @turbo for i in eachindex(a)
               c[i] = a[i] * b[i]
           end
       end
vmul_turbo! (generic function with 1 method)

julia> @inline function vmul_inbounds_simd!(a, b, c)
           @inbounds @simd for i in eachindex(a)
               c[i] = a[i] * b[i]
           end
       end
vmul_inbounds_simd! (generic function with 1 method)

julia> @inline function vmul_inbounds!(a, b, c)
           @inbounds for i in eachindex(a)
               c[i] = a[i] * b[i]
           end
       end
vmul_inbounds! (generic function with 1 method)

julia> @btime vmul_turbo!($c, $a, $b)
  3.374 ns (0 allocations: 0 bytes)

julia> @btime vmul_inbounds_simd!($c, $a, $b)
  3.938 ns (0 allocations: 0 bytes)

julia> @btime vmul_inbounds!($c, $a, $b)
  3.942 ns (0 allocations: 0 bytes)

This is what I got before (without @inline):

julia> @btime vmul_inbounds!($c, $a, $b) # inlined speed: 3.942 ns (0 allocations: 0 bytes)
  3.938 ns (0 allocations: 0 bytes)

julia> @btime vmul_inbounds_simd!($c, $a, $b) # inlined speed: 3.938 ns (0 allocations: 0 bytes)
  3.944 ns (0 allocations: 0 bytes)

julia> @btime vmul_turbo!($c, $a, $b) # inlined speed: 3.374 ns (0 allocations: 0 bytes)
  11.054 ns (0 allocations: 0 bytes)

Basically, Julia’s compiler automatically inlines functions it estimates to be cheap.
However, it estimates @turbo code to be very expensive, because it thinks llvmcall (which @turbo relies on) is very expensive – even though the llvmcalls here normally compile down to a single instruction (or less!).

Hence, the functions without @turbo were inlined into your benchmark, but the @turbo version was not, adding function-call overhead to it.

For a “fair” comparison, you should add @inline (or maybe @noinline) to all the functions.
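
For example, a @noinline version of the comparison could look something like the following (the _noinline names are just mine for illustration; the point is only that every function then pays the same call overhead):

using BenchmarkTools, LoopVectorization

@noinline function vmul_turbo_noinline!(c, a, b)
    @turbo for i in eachindex(c)
        c[i] = a[i] * b[i]
    end
end

@noinline function vmul_inbounds_simd_noinline!(c, a, b)
    @inbounds @simd for i in eachindex(c)
        c[i] = a[i] * b[i]
    end
end

a = rand(64); b = rand(64); c = similar(a);

# Neither version gets inlined into the benchmark, so the call
# overhead is the same for both.
@btime vmul_turbo_noinline!($c, $a, $b)
@btime vmul_inbounds_simd_noinline!($c, $a, $b)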

Why are the timings for @inbounds and @inbounds @simd essentially the same, i.e. why do I not get any additional speedup from @simd? lscpu returns sse, sse2, and avx, so I’d expect some speedup for floating-point operations.

@simd tells the compiler it is okay to change the order of operations in order to SIMD your code, i.e. to assume that floating-point arithmetic is associative (even though it isn’t!).
This speeds up your code, but changes the answer. In a reduction (e.g. a sum) it normally makes the answer more accurate, because accumulating into several separate parallel accumulators reduces rounding error (while also being much faster).
However, the result then differs from what you literally wrote, so you must give the compiler permission to do it.
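
As a minimal sketch of where that permission matters, consider a sum reduction (mysum and mysum_simd are just illustrative names):

# Without @simd the compiler must add the elements strictly left to right,
# so it cannot vectorize the accumulation.
function mysum(x)
    s = zero(eltype(x))
    @inbounds for i in eachindex(x)
        s += x[i]
    end
    return s
end

# With @simd the compiler may reassociate the additions into several
# parallel accumulators, which vectorizes and typically also reduces
# the accumulated rounding error.
function mysum_simd(x)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    return s
end

x = rand(10_000);
mysum(x) ≈ mysum_simd(x)  # true, but the two results may differ in the last bits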

With your simple loop, however, the compiler can SIMD it without changing the answer at all.
Try:

julia> @code_llvm debuginfo=:none vmul_inbounds!(c, a, b)

Among the output, I get:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %19 = getelementptr inbounds double, double* %16, i64 %index
  %20 = bitcast double* %19 to <8 x double>*
  %wide.load = load <8 x double>, <8 x double>* %20, align 8
  %21 = getelementptr inbounds double, double* %19, i64 8
  %22 = bitcast double* %21 to <8 x double>*
  %wide.load17 = load <8 x double>, <8 x double>* %22, align 8
  %23 = getelementptr inbounds double, double* %19, i64 16
  %24 = bitcast double* %23 to <8 x double>*
  %wide.load18 = load <8 x double>, <8 x double>* %24, align 8
  %25 = getelementptr inbounds double, double* %19, i64 24
  %26 = bitcast double* %25 to <8 x double>*
  %wide.load19 = load <8 x double>, <8 x double>* %26, align 8
  %27 = getelementptr inbounds double, double* %17, i64 %index
  %28 = bitcast double* %27 to <8 x double>*
  %wide.load20 = load <8 x double>, <8 x double>* %28, align 8
  %29 = getelementptr inbounds double, double* %27, i64 8
  %30 = bitcast double* %29 to <8 x double>*
  %wide.load21 = load <8 x double>, <8 x double>* %30, align 8
  %31 = getelementptr inbounds double, double* %27, i64 16
  %32 = bitcast double* %31 to <8 x double>*
  %wide.load22 = load <8 x double>, <8 x double>* %32, align 8
  %33 = getelementptr inbounds double, double* %27, i64 24
  %34 = bitcast double* %33 to <8 x double>*
  %wide.load23 = load <8 x double>, <8 x double>* %34, align 8
  %35 = fmul <8 x double> %wide.load, %wide.load20
  %36 = fmul <8 x double> %wide.load17, %wide.load21
  %37 = fmul <8 x double> %wide.load18, %wide.load22
  %38 = fmul <8 x double> %wide.load19, %wide.load23
  %39 = getelementptr inbounds double, double* %18, i64 %index
  %40 = bitcast double* %39 to <8 x double>*
  store <8 x double> %35, <8 x double>* %40, align 8
  %41 = getelementptr inbounds double, double* %39, i64 8
  %42 = bitcast double* %41 to <8 x double>*
  store <8 x double> %36, <8 x double>* %42, align 8
  %43 = getelementptr inbounds double, double* %39, i64 16
  %44 = bitcast double* %43 to <8 x double>*
  store <8 x double> %37, <8 x double>* %44, align 8
  %45 = getelementptr inbounds double, double* %39, i64 24
  %46 = bitcast double* %45 to <8 x double>*
  store <8 x double> %38, <8 x double>* %46, align 8
  %index.next = add i64 %index, 32
  %47 = icmp eq i64 %index.next, %n.vec
  br i1 %47, label %middle.block, label %vector.body

Note all the <8 x double> vectors. Looking for these vector types in the IR is an easy way to confirm SIMD.
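
If you’d rather check programmatically than by eye, you can capture the IR as a string and search it for vector types; this is just the standard code_llvm reflection from InteractiveUtils (the regex is mine):

using InteractiveUtils

# Capture the LLVM IR for the Vector{Float64} method and look for
# SIMD vector types such as <8 x double>.
ir = sprint(code_llvm, vmul_inbounds!, Tuple{Vector{Float64},Vector{Float64},Vector{Float64}})
occursin(r"<\d+ x double>", ir)  # true if the loop was vectorized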

Also, as a final comment, I’d be interested in seeing the chart with @inline (preferably) or @noinline applied to each of the functions. I expect LoopVectorization to win overall, particularly whenever N % 16 is more than a couple.
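
A rough sketch of such a sweep (my own setup, just to illustrate how a chart like that could be regenerated with @inline applied; note I’ve normalized the argument order to (c, a, b) here):

using BenchmarkTools, LoopVectorization

@inline function vmul_turbo!(c, a, b)
    @turbo for i in eachindex(c)
        c[i] = a[i] * b[i]
    end
end

@inline function vmul_inbounds_simd!(c, a, b)
    @inbounds @simd for i in eachindex(c)
        c[i] = a[i] * b[i]
    end
end

Ns = 1:64  # covers all remainders mod 16
turbo_times = Float64[]
simd_times  = Float64[]
for N in Ns
    a, b, c = rand(N), rand(N), rand(N)
    push!(turbo_times, @belapsed vmul_turbo!($c, $a, $b))
    push!(simd_times,  @belapsed vmul_inbounds_simd!($c, $a, $b))
end
# turbo_times and simd_times can then be plotted against Ns.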
